TorchBridge Pixel Loom is a demonstration strategy that turns three separate worlds into a single continuous pipeline: a learning world that defines how a tiny neural network should behave, a compute world that applies that behavior at massive scale, and a graphics world that displays the results with minimal copying. The design is not about training a model in real time, and it is not primarily about trading signals. Instead, it is an engineering pattern that shows how a learning library can author numeric parameters, how a parallel compute engine can transform those parameters into per element decisions, and how a graphics engine can present the output in a window at interactive speed. The entire system is constructed as a bridge between libraries that were not originally meant to cooperate, and the code spends as much effort preventing conflicts as it spends performing work. The core mathematical relationship is that all three subsystems are manipulating the same kind of thing, a structured collection of numbers, but each subsystem wants that collection in a different form, with different rules about ownership, memory, and timing. The code is therefore a story about representation, transfer, and synchronization.

The first act is the integration discipline. The file begins by pulling in the learning library first. This is not cosmetic. LibTorch brings heavy template machinery and a long tail of macros and identifiers. Zorro brings its own macro definitions and naming habits, including the short identifier at, which collides with the at namespace that LibTorch uses throughout its tensor API. If the includes are reversed, Zorro's names are already active while the LibTorch headers are parsed, and the compiler can be pushed into confusing or contradictory interpretations. The chosen include order lets the learning library set the ground rules first, then brings Zorro in with at temporarily renamed to zorro_at while its header is parsed. The code then performs a cleanup sweep that undefines common macro landmines such as min, max, abs, ref, and swap, plus a few uppercase names (NTF, LOOKBACK, BINS), so that none of them can silently change the meaning of later code. This is a practical form of correctness. It is not about equations; it is about ensuring that function names and numeric operations mean what the author expects in every compilation unit.
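
The rename-then-undefine pattern can be tried in isolation. The sketch below is only an illustration: the "legacy header" is a hypothetical stand-in written inline, not the real zorro.h, and the names legacy_at, smaller, and element are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// --- Hypothetical stand-in for a legacy header (not the real zorro.h) ---
#define at legacy_at                         // rename while the "header" is parsed
#define min(a, b) (((a) < (b)) ? (a) : (b))  // function-like macro landmine
inline int at(int i) { return i * 2; }       // actually declares legacy_at(int)
#undef at                                    // 'at' is an ordinary identifier again
// --- end of stand-in header ---

// Cleanup sweep: remove the macro so std::min parses as a normal function call.
#ifdef min
#undef min
#endif

// After the cleanup, std::min and the member function at() both work as expected.
inline int smaller(int a, int b) { return std::min(a, b); }
inline int element(const std::vector<int>& v, std::size_t i) { return v.at(i); }
```

Without the `#undef min`, the call `std::min(a, b)` would be rewritten by the preprocessor before the compiler ever saw it, which is the kind of silent change of meaning the cleanup sweep is meant to prevent.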

The second act is the graphics world. OpenGL is used through a Win32 window and a WGL context. The graphics subsystem creates a pixel buffer object, which is a GPU backed container for raw pixel bytes. It also creates a texture that can be updated from that buffer. The pixel buffer object is sized for a full image in four channels per pixel. The texture is configured to display those pixels without filtering tricks that could blur or resample the data. The graphics pipeline here is deliberately simple: update the texture from the buffer, draw a full screen quad, and swap buffers. The goal is not advanced rendering but reliable presentation of a computed image.
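
As a small worked example of the buffer sizing, the sketch below computes the byte count for a tightly packed four channel image with one byte per channel. The helper name rgba8_bytes is hypothetical; doing the multiplication in std::size_t is a defensive choice for large images, not something the demo's default 640 by 480 window requires.

```cpp
#include <cstddef>

// Bytes needed for a tightly packed RGBA image (4 channels, 1 byte each).
// The product is computed in std::size_t so large images cannot overflow int.
inline std::size_t rgba8_bytes(int width, int height) {
  if (width <= 0 || height <= 0) return 0;
  return static_cast<std::size_t>(width) * static_cast<std::size_t>(height) * 4u;
}
```

For the default window this gives 640 * 480 * 4 = 1,228,800 bytes.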

The third act is the compute world. OpenCL is used as a parallel compute engine that can write to the same pixel buffer object that OpenGL uses. This is the most important relationship between OpenCL and OpenGL in the program. The code chooses a GPU device that explicitly advertises the cl_khr_gl_sharing extension, and it creates the OpenCL context against the current OpenGL context and device context. That extension makes it possible for the compute engine and the graphics engine to refer to the same underlying GPU memory without staging through the CPU. In symbolic terms, OpenGL owns a canvas, OpenCL is granted a pen that can draw directly onto that canvas, and the program carefully negotiates when the pen is allowed to touch the canvas. That negotiation is done through acquire and release calls. When compute begins, it acquires the shared object, which is like taking a lock. When compute ends, it releases the object, which is like returning the lock so graphics can read the new pixels. This lock style coordination is the heart of correctness. Without it, compute and graphics could act at the same time on the same memory, causing tearing, undefined behavior, or driver errors.
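
The lock analogy can be made concrete with a toy model. This is a sketch under stated assumptions: SharedBuffer, Owner, and run_frame are hypothetical names that only model who may touch the buffer, while the real program performs the hand-off with clEnqueueAcquireGLObjects and clEnqueueReleaseGLObjects on the command queue.

```cpp
// Toy model of the ownership hand-off (hypothetical types, no GPU involved).
enum class Owner { Graphics, Compute };

struct SharedBuffer {
  Owner owner = Owner::Graphics;  // OpenGL owns the buffer by default

  // Compute may take the buffer only while graphics holds it.
  bool acquire() {
    if (owner != Owner::Graphics) return false;
    owner = Owner::Compute;
    return true;
  }
  // Compute hands the buffer back so graphics can read the new pixels.
  bool release() {
    if (owner != Owner::Compute) return false;
    owner = Owner::Graphics;
    return true;
  }
  bool compute_may_write() const { return owner == Owner::Compute; }
  bool graphics_may_read() const { return owner == Owner::Graphics; }
};

// One frame in the order the text describes: acquire, write, release, present.
inline bool run_frame(SharedBuffer& buf) {
  if (!buf.acquire()) return false;
  bool wrote = buf.compute_may_write();     // the kernel would run here
  if (!buf.release()) return false;
  return wrote && buf.graphics_may_read();  // the texture upload would run here
}
```

The model rejects a second acquire without an intervening release, which mirrors the rule that the two engines never hold the buffer at the same time.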

The fourth act is the learning world. LibTorch is used here not as an online training engine but as an authoring tool for the weights and biases of a tiny neural network. The network is a small multilayer perceptron with two inputs, one hidden layer of sixteen units, and three outputs. It applies tanh after each layer so that its responses vary continuously rather than stepping abruptly. The random seed is fixed before the model is created, so the initial parameters are reproducible, and the model is then set into evaluation mode. A no gradient guard is used so that the library does not build training graphs or store extra history. The code then extracts the weight matrices and bias vectors from the layers, detaches them, ensures they are contiguous in memory, ensures they are on the CPU, and copies them into plain arrays. This conversion step is the key relationship between LibTorch and OpenCL. LibTorch stores parameters as tensors with rich metadata and potential device placement. OpenCL wants raw buffers of floats. The code therefore performs a change of representation: from a high level tensor world into a flat array world. That conversion is the mathematical handshake between the learning library and the compute kernel. It is also the moment where the author ensures the tiny network in OpenCL is using the same parameters as the model defined by LibTorch.
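
The flat layout matters because the kernel indexes the arrays by hand. A linear layer's weight has shape [outputs, inputs], and a contiguous copy stores it row by row. The sketch below illustrates that layout with nested vectors standing in for the tensor; flatten_row_major and flat_index are hypothetical helper names, not LibTorch calls.

```cpp
#include <cstddef>
#include <vector>

// Flatten a [rows][cols] matrix into one row-major array, the layout a
// contiguous 2-D float tensor exposes through its data pointer.
inline std::vector<float> flatten_row_major(const std::vector<std::vector<float>>& m) {
  std::vector<float> flat;
  for (const auto& row : m)
    flat.insert(flat.end(), row.begin(), row.end());
  return flat;
}

// Index of element (row j, column i) in the flat array.
inline std::size_t flat_index(std::size_t j, std::size_t i, std::size_t cols) {
  return j * cols + i;
}
```

With this layout, the weight from input i to hidden unit j sits at position j * NN_IN + i, which is the same expression the kernel uses when it reads W1.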

Once the weights are in plain arrays, OpenCL buffers are created for each parameter block. Those buffers are marked read only because the kernel will not modify them. The arrays from LibTorch are then uploaded into the OpenCL buffers using write commands on the command queue. At this point, the learning system has effectively published a set of numeric constants into the GPU compute system. The relationship is one way: LibTorch produces, OpenCL consumes. In this demo, the parameters are built once at initialization. In a more advanced version, parameters could be refreshed periodically to reflect training updates, but that is not the goal here.

The next relationship is between the OpenCL kernel and the pixel buffer. The kernel is launched across a two dimensional grid matching the image width and height. Each work item corresponds to one pixel. For each pixel, the kernel first maps pixel coordinates to a normalized coordinate space running from minus one to one and builds two input values from those coordinates and a moving phase variable. This phase acts like a clock signal that animates the image, ensuring that the pipeline is alive rather than static. The kernel then runs the tiny neural network forward pass. It computes hidden activations by combining the two inputs with the first layer weights and biases, applies tanh, then combines those hidden values with the second layer weights and biases, applies tanh again, and rescales the result from the range of minus one to one into the range of zero to one, producing three output channels. Those channels are then mixed with additional procedural components such as stripes and a vignette effect, clamped, and finally converted to bytes. The resulting four channel pixel, with alpha fixed at fully opaque, is written into the output buffer, which is the shared OpenGL pixel buffer object.
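
A CPU reference of the per pixel arithmetic is useful for checking the kernel against known values. The sketch below mirrors the coordinate mapping, the forward pass, and the byte conversion described above; the function names and the constants kIn, kHidden, and kOut are hypothetical, and the phase based input mixing and the stripe and vignette blending are deliberately left out.

```cpp
#include <cmath>

constexpr int kIn = 2, kHidden = 16, kOut = 3;  // same sizes as NN_IN, NN_H, NN_OUT

// Map a pixel index in [0, size-1] to a coordinate in [-1, 1].
inline float normalize_coord(int pix, int size) {
  return (static_cast<float>(pix) / static_cast<float>(size - 1)) * 2.0f - 1.0f;
}

// Forward pass: 2 inputs -> 16 tanh hidden units -> 3 outputs mapped to [0, 1].
// W1 is [kHidden][kIn] and W2 is [kOut][kHidden], both row-major.
inline void mlp_forward(float in0, float in1,
                        const float* W1, const float* b1,
                        const float* W2, const float* b2,
                        float* out) {
  float h[kHidden];
  for (int j = 0; j < kHidden; ++j) {
    float acc = b1[j] + in0 * W1[j * kIn + 0] + in1 * W1[j * kIn + 1];
    h[j] = std::tanh(acc);
  }
  for (int k = 0; k < kOut; ++k) {
    float acc = b2[k];
    for (int j = 0; j < kHidden; ++j) acc += h[j] * W2[k * kHidden + j];
    out[k] = 0.5f + 0.5f * std::tanh(acc);  // rescale tanh output to [0, 1]
  }
}

// Convert a channel value to a byte: clamp to [0, 1], then truncate.
inline unsigned char to_byte(float v) {
  if (v < 0.0f) v = 0.0f;
  if (v > 1.0f) v = 1.0f;
  return static_cast<unsigned char>(255.0f * v);
}
```

With all weights and biases set to zero, every output channel is exactly 0.5, which makes a convenient first sanity check before comparing against the GPU result.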

This is where the “mathematical relationship” between LibTorch and OpenCL becomes visible. LibTorch defines the numeric transformation embodied by the weights and biases. OpenCL executes that transformation at massive scale, once per pixel, on a GPU. The output is not just any calculation; it is the same functional shape that the LibTorch model represents, but applied in a different domain. Instead of being applied to training data, it is applied to spatial coordinates and a time phase. In other words, the neural network is used as a generative function, and OpenCL is the engine that evaluates that function for a whole image at once.

The relationship between OpenCL and OpenGL is equally structural. The output buffer is not copied back to the CPU. Instead, OpenGL updates a texture directly from the pixel buffer object, and the image is displayed. The shared object path avoids a major performance bottleneck. The acquire and release calls ensure that the buffer transitions cleanly between compute ownership and graphics ownership. The command queue finish call ensures that the compute kernel is fully complete before the graphics subsystem uploads and draws. That is a timing relationship, a synchronization contract that keeps the pipeline coherent frame after frame.

Finally, the code is embedded inside a Zorro oriented DLL entry. Zorro is not the star of this demo, but it provides a controlled host environment, lifecycle flags, and a consistent way to stop execution. The Win32 loop is allowed to close if Zorro signals exit, and a guard is added to avoid a known startup state combination that could trigger premature closure. The DLL main function forces a single cycle execution to prevent relaunch loops and then calls the WinMain routine once. This turns the whole program into a single run demonstration that can be launched from within a Zorro workflow, which is useful if the broader project is a trading system that wants a compute visualization or a diagnostic display.

In abstract terms, TorchBridge Pixel Loom is a three stage loom. LibTorch spins the thread by defining the tiny network parameters. OpenCL weaves the thread across a two dimensional fabric by evaluating the network for every pixel in parallel. OpenGL displays the woven fabric by texturing a screen aligned quad. The mathematics is not expressed through explicit formula writing; it is expressed through a consistent mapping of numeric state across subsystems: parameter tensors become float arrays, float arrays become device buffers, device buffers feed a kernel, the kernel writes pixels, and the pixels become an image without leaving the GPU. The value of the strategy is the pattern: it demonstrates how to connect a learning definition, a compute executor, and a graphics presenter into a single coherent system with clear ownership rules, minimal copying, and stable synchronization.

Code
// Mendb02.cpp
// Win32 + WGL(OpenGL) display + OpenCL compute (CL/GL sharing)
// + Tiny Neural Net inference per pixel (OpenCL kernel) using weights from LibTorch.

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif

#define _CRT_SECURE_NO_WARNINGS

// ============================================================
// 1) Include LibTorch FIRST, before zorro.h introduces its macros and names.
//    Public/shareable variant: no machine-specific include paths.
// ============================================================
#if defined(__has_include)
  #if __has_include(<torch/torch.h>) && __has_include(<torch/script.h>)
    #include <torch/torch.h>
    #include <torch/script.h>
  #else
    #error "LibTorch headers not found. Add LibTorch include paths to your build configuration."
  #endif
#else
  #include <torch/torch.h>
  #include <torch/script.h>
#endif

// (Optional) CUDA headers, included only when they are present.
// Keep them conditional so CPU-only LibTorch setups still compile.
#if defined(__has_include)
  #if __has_include(<torch/cuda.h>)
    #include <torch/cuda.h>
    #define HAVE_TORCH_CUDA_HEADER 1
  #else
    #define HAVE_TORCH_CUDA_HEADER 0
  #endif
  #if __has_include(<cuda_runtime_api.h>)
    #include <cuda_runtime_api.h>
    #define HAVE_CUDA_RUNTIME_API_HEADER 1
  #else
    #define HAVE_CUDA_RUNTIME_API_HEADER 0
  #endif
#else
  #define HAVE_TORCH_CUDA_HEADER 0
  #define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif

#if defined(__has_include)
  #if __has_include(<c10/cuda/CUDAGuard.h>) && __has_include(<c10/cuda/impl/cuda_cmake_macros.h>)
    #include <c10/cuda/CUDAGuard.h>
    #define HAVE_C10_CUDAGUARD 1
  #else
    #define HAVE_C10_CUDAGUARD 0
  #endif
#else
  #define HAVE_C10_CUDAGUARD 0
#endif

// ============================================================
// 2) Standard headers
// ============================================================
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>

// ============================================================
// 3) Include Zorro AFTER torch. While zorro.h is parsed, 'at' is renamed to
//    'zorro_at' so it cannot clash with LibTorch's 'at' namespace.
// ============================================================
#define at zorro_at
#ifdef LOG
#undef LOG
#endif
#include <zorro.h>
#undef at

// ============================================================
// 4) Cleanup macro landmines that the headers above may have left defined
// ============================================================
#ifdef min
#undef min
#endif
#ifdef max
#undef max
#endif
#ifdef ref
#undef ref
#endif
#ifdef swap
#undef swap
#endif
#ifdef abs
#undef abs
#endif

#ifdef NTF
#undef NTF
#endif
#ifdef LOOKBACK
#undef LOOKBACK
#endif
#ifdef BINS
#undef BINS
#endif

// ============================================================
// OpenCL + OpenGL includes (after the macro cleanup is safest)
// ============================================================
#include <CL/cl.h>
#include <CL/cl_gl.h>     // cl_khr_gl_sharing
#include <CL/cl_gl_ext.h> // CL_GL_CONTEXT_KHR / CL_WGL_HDC_KHR
#include <GL/gl.h>

#ifndef GL_RGBA8
#define GL_RGBA8 0x8058
#endif

// ------------------------- Globals -------------------------
static HWND   gHwnd = 0;
static HDC    gHdc  = 0;
static HGLRC  gHgl  = 0;

static int    gW = 640;
static int    gH = 480;
static float  gPhase = 0.0f;

static int read_env_int(const char* key, int fallback)
{
  const char* s = getenv(key);
  if(!s || !*s) return fallback;
  int v = atoi(s);
  return (v > 0) ? v : fallback;
}

// ------------------------- WinProc forward -------------------------
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);

// ===========================================================
//                Minimal OpenGL function loading
// ===========================================================

#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#endif
#ifndef GL_PIXEL_UNPACK_BUFFER
#define GL_PIXEL_UNPACK_BUFFER 0x88EC
#endif
#ifndef GL_DYNAMIC_DRAW
#define GL_DYNAMIC_DRAW 0x88E8
#endif

#ifndef APIENTRY
#define APIENTRY __stdcall
#endif
#ifndef APIENTRYP
#define APIENTRYP APIENTRY *
#endif

typedef void (APIENTRYP PFNGLGENBUFFERSPROC)(GLsizei, GLuint*);
typedef void (APIENTRYP PFNGLBINDBUFFERPROC)(GLenum, GLuint);
typedef void (APIENTRYP PFNGLBUFFERDATAPROC)(GLenum, ptrdiff_t, const void*, GLenum);
typedef void (APIENTRYP PFNGLDELETEBUFFERSPROC)(GLsizei, const GLuint*);

static PFNGLGENBUFFERSPROC    p_glGenBuffers    = 0;
static PFNGLBINDBUFFERPROC    p_glBindBuffer    = 0;
static PFNGLBUFFERDATAPROC    p_glBufferData    = 0;
static PFNGLDELETEBUFFERSPROC p_glDeleteBuffers = 0;

static void* gl_get_proc(const char* name)
{
  void* p = (void*)wglGetProcAddress(name);
  if(!p) {
    HMODULE ogl = GetModuleHandleA("opengl32.dll");
    if(ogl) p = (void*)GetProcAddress(ogl, name);
  }
  return p;
}

static int gl_load_ext()
{
  p_glGenBuffers    = (PFNGLGENBUFFERSPROC)gl_get_proc("glGenBuffers");
  p_glBindBuffer    = (PFNGLBINDBUFFERPROC)gl_get_proc("glBindBuffer");
  p_glBufferData    = (PFNGLBUFFERDATAPROC)gl_get_proc("glBufferData");
  p_glDeleteBuffers = (PFNGLDELETEBUFFERSPROC)gl_get_proc("glDeleteBuffers");

  if(!p_glGenBuffers || !p_glBindBuffer || !p_glBufferData || !p_glDeleteBuffers)
    return 0;
  return 1;
}

// ===========================================================
//                       OpenGL objects
// ===========================================================

static GLuint gPBO = 0;
static GLuint gTex = 0;

static void gl_release_all()
{
  if(gTex) {
    glDeleteTextures(1, &gTex);
    gTex = 0;
  }
  if(gPBO) {
    if(p_glDeleteBuffers) p_glDeleteBuffers(1, &gPBO);
    gPBO = 0;
  }

  if(gHgl) { wglMakeCurrent(NULL, NULL); wglDeleteContext(gHgl); gHgl = 0; }
  if(gHdc && gHwnd) { ReleaseDC(gHwnd, gHdc); gHdc = 0; }
}

static int gl_init_wgl(HWND hwnd)
{
  gHwnd = hwnd;
  gHdc = GetDC(hwnd);
  if(!gHdc) return 0;

  PIXELFORMATDESCRIPTOR pfd;
  ZeroMemory(&pfd, sizeof(pfd));
  pfd.nSize      = sizeof(pfd);
  pfd.nVersion   = 1;
  pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
  pfd.iPixelType = PFD_TYPE_RGBA;
  pfd.cColorBits = 32;
  pfd.cDepthBits = 16;
  pfd.iLayerType = PFD_MAIN_PLANE;

  int pf = ChoosePixelFormat(gHdc, &pfd);
  if(pf == 0) return 0;
  if(!SetPixelFormat(gHdc, pf, &pfd)) return 0;

  gHgl = wglCreateContext(gHdc);
  if(!gHgl) return 0;
  if(!wglMakeCurrent(gHdc, gHgl)) return 0;

  if(!gl_load_ext()) {
    printf("\nOpenGL buffer functions not available (need VBO/PBO support).");
    return 0;
  }

  glDisable(GL_DEPTH_TEST);
  glViewport(0, 0, gW, gH);

  // Create PBO for RGBA pixels
  p_glGenBuffers(1, &gPBO);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  p_glBufferData(GL_PIXEL_UNPACK_BUFFER, (ptrdiff_t)(gW * gH * 4), 0, GL_DYNAMIC_DRAW);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  // Create texture
  glGenTextures(1, &gTex);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, gW, gH, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  glBindTexture(GL_TEXTURE_2D, 0);

  return 1;
}

// ===========================================================
//                  Tiny NN (LibTorch -> weights)
// ===========================================================

#define NN_IN 2
#define NN_H 16
#define NN_OUT 3

struct TinyMLPImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
  TinyMLPImpl() {
    fc1 = register_module("fc1", torch::nn::Linear(NN_IN, NN_H));
    fc2 = register_module("fc2", torch::nn::Linear(NN_H, NN_OUT));
  }
  torch::Tensor forward(torch::Tensor x) {
    x = torch::tanh(fc1->forward(x));
    x = torch::tanh(fc2->forward(x));
    return x;
  }
};
TORCH_MODULE(TinyMLP);

static int build_weights_from_libtorch(float* W1, float* b1, float* W2, float* b2)
{
  if(!W1 || !b1 || !W2 || !b2) return 0;

  try {
    torch::NoGradGuard ng;
    torch::manual_seed(1);
    TinyMLP m;
    m->eval();

    auto w1  = m->fc1->weight.detach().contiguous().to(torch::kCPU);
    auto bb1 = m->fc1->bias.detach().contiguous().to(torch::kCPU);

    auto w2  = m->fc2->weight.detach().contiguous().to(torch::kCPU);
    auto bb2 = m->fc2->bias.detach().contiguous().to(torch::kCPU);

    memcpy(W1, w1.data_ptr<float>(),  sizeof(float)*NN_H*NN_IN);
    memcpy(b1, bb1.data_ptr<float>(), sizeof(float)*NN_H);
    memcpy(W2, w2.data_ptr<float>(),  sizeof(float)*NN_OUT*NN_H);
    memcpy(b2, bb2.data_ptr<float>(), sizeof(float)*NN_OUT);

    return 1;
  }
  catch(const c10::Error& e) {
    printf("\n[LibTorch] Error: %s", e.what());
    return 0;
  }
  catch(...) {
    printf("\n[LibTorch] Unknown error.");
    return 0;
  }
}

// ===========================================================
//                    OpenCL (GL sharing)
// ===========================================================

static int gCL_Ready = 0;

static cl_platform_id   gCL_Platform = 0;
static cl_device_id     gCL_Device   = 0;
static cl_context       gCL_Context  = 0;
static cl_command_queue gCL_Queue    = 0;
static cl_program       gCL_Program  = 0;

static cl_kernel        gCL_K_NN     = 0;

static cl_mem gCL_PBO = 0; // CL view of GL PBO

static cl_mem gCL_W1 = 0;
static cl_mem gCL_b1 = 0;
static cl_mem gCL_W2 = 0;
static cl_mem gCL_b2 = 0;

#define STR2(x) #x
#define XSTR(x) STR2(x)

static const char* gCL_Source =
"__kernel void nn_render(__global uchar4* out, int width, int height,              \n"
"  __global const float* W1, __global const float* b1,                             \n"
"  __global const float* W2, __global const float* b2, float phase)                \n"
"{                                                                                 \n"
"  int xpix = (int)get_global_id(0);                                               \n"
"  int ypix = (int)get_global_id(1);                                               \n"
"  if(xpix >= width || ypix >= height) return;                                     \n"
"                                                                                  \n"
"  float x = ((float)xpix / (float)(width  - 1)) * 2.0f - 1.0f;                    \n"
"  float y = ((float)ypix / (float)(height - 1)) * 2.0f - 1.0f;                    \n"
"  float in0 = 2.8f*x + 0.7f*sin(3.0f*y + phase);                                  \n"
"  float in1 = -2.8f*y + 0.7f*cos(3.0f*x - 1.3f*phase);                            \n"
"                                                                                  \n"
"  float h[" XSTR(NN_H) "];                                                        \n"
"  for(int j=0;j<" XSTR(NN_H) ";j++){                                              \n"
"    float acc = b1[j];                                                            \n"
"    acc += in0 * W1[j*" XSTR(NN_IN) " + 0];                                       \n"
"    acc += in1 * W1[j*" XSTR(NN_IN) " + 1];                                       \n"
"    h[j] = tanh(acc);                                                             \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float o[" XSTR(NN_OUT) "];                                                      \n"
"  for(int k=0;k<" XSTR(NN_OUT) ";k++){                                            \n"
"    float acc = b2[k];                                                            \n"
"    for(int j=0;j<" XSTR(NN_H) ";j++){                                            \n"
"      acc += h[j] * W2[k*" XSTR(NN_H) " + j];                                     \n"
"    }                                                                             \n"
"    float s = 0.5f + 0.5f*tanh(acc);                                              \n"
"    if(s<0) s=0; if(s>1) s=1;                                                     \n"
"    o[k] = s;                                                                     \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float radial = sqrt(x*x + y*y);                                                 \n"
"  float vignette = clamp(1.15f - radial, 0.0f, 1.0f);                             \n"
"  float stripe = 0.5f + 0.5f*sin(10.0f*(x + y) + phase);                          \n"
"  float rcol = clamp(0.70f*o[0] + 0.30f*stripe, 0.0f, 1.0f) * vignette;           \n"
"  float gcol = clamp(0.85f*o[1] + 0.15f*(1.0f - stripe), 0.0f, 1.0f) * vignette;  \n"
"  float bcol = clamp(0.75f*o[2] + 0.25f*(0.5f + 0.5f*cos(8.0f*x - phase)),0.0f,1.0f);\n"
"  uchar r = (uchar)(255.0f*rcol);                                                 \n"
"  uchar g = (uchar)(255.0f*gcol);                                                 \n"
"  uchar b = (uchar)(255.0f*bcol);                                                 \n"
"  out[ypix*width + xpix] = (uchar4)(r,g,b,255);                                   \n"
"}                                                                                 \n";

static void cl_release_all()
{
  if(gCL_b2) { clReleaseMemObject(gCL_b2); gCL_b2 = 0; }
  if(gCL_W2) { clReleaseMemObject(gCL_W2); gCL_W2 = 0; }
  if(gCL_b1) { clReleaseMemObject(gCL_b1); gCL_b1 = 0; }
  if(gCL_W1) { clReleaseMemObject(gCL_W1); gCL_W1 = 0; }

  if(gCL_PBO)    { clReleaseMemObject(gCL_PBO);    gCL_PBO = 0; }
  if(gCL_K_NN)   { clReleaseKernel(gCL_K_NN);      gCL_K_NN = 0; }
  if(gCL_Program){ clReleaseProgram(gCL_Program);  gCL_Program = 0; }
  if(gCL_Queue)  { clReleaseCommandQueue(gCL_Queue); gCL_Queue = 0; }
  if(gCL_Context){ clReleaseContext(gCL_Context);  gCL_Context = 0; }

  gCL_Device = 0;
  gCL_Platform = 0;
  gCL_Ready = 0;
}

static int cl_pick_device_with_glshare(cl_platform_id* outP, cl_device_id* outD)
{
  cl_uint nPlatforms = 0;
  if(clGetPlatformIDs(0, 0, &nPlatforms) != CL_SUCCESS || nPlatforms == 0)
    return 0;

  cl_platform_id platforms[8];
  if(nPlatforms > 8) nPlatforms = 8;
  if(clGetPlatformIDs(nPlatforms, platforms, &nPlatforms) != CL_SUCCESS)
    return 0;

  for(cl_uint p=0; p<nPlatforms; p++)
  {
    cl_uint nDev = 0;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 0, 0, &nDev) != CL_SUCCESS || nDev == 0)
      continue;

    cl_device_id devs[8];
    if(nDev > 8) nDev = 8;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, nDev, devs, &nDev) != CL_SUCCESS)
      continue;

    for(cl_uint d=0; d<nDev; d++)
    {
      char ext[8192];
      size_t sz = 0;
      if(clGetDeviceInfo(devs[d], CL_DEVICE_EXTENSIONS, sizeof(ext), ext, &sz) != CL_SUCCESS)
        continue;

      if(strstr(ext, "cl_khr_gl_sharing"))
      {
        *outP = platforms[p];
        *outD = devs[d];
        return 1;
      }
    }
  }

  return 0;
}

static int cl_init_glshare()
{
  cl_int err = CL_SUCCESS;

  cl_platform_id P = 0;
  cl_device_id   D = 0;

  if(!cl_pick_device_with_glshare(&P, &D)) {
    printf("\nOpenCL: no GPU device with cl_khr_gl_sharing found.");
    return 0;
  }

  gCL_Platform = P;
  gCL_Device   = D;

  cl_context_properties props[] = {
    CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
    CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
    CL_CONTEXT_PLATFORM, (cl_context_properties)gCL_Platform,
    0
  };

  gCL_Context = clCreateContext(props, 1, &gCL_Device, 0, 0, &err);
  if(err != CL_SUCCESS || !gCL_Context) { cl_release_all(); return 0; }

  gCL_Queue = clCreateCommandQueue(gCL_Context, gCL_Device, 0, &err);
  if(err != CL_SUCCESS || !gCL_Queue) { cl_release_all(); return 0; }

  gCL_Program = clCreateProgramWithSource(gCL_Context, 1, &gCL_Source, 0, &err);
  if(err != CL_SUCCESS || !gCL_Program) { cl_release_all(); return 0; }

  err = clBuildProgram(gCL_Program, 1, &gCL_Device, 0, 0, 0);
  if(err != CL_SUCCESS)
  {
    char logbuf[8192] = {0}; // zero-filled so a failed log query still prints a valid (empty) string
    size_t logsz = 0;
    clGetProgramBuildInfo(gCL_Program, gCL_Device, CL_PROGRAM_BUILD_LOG, sizeof(logbuf), logbuf, &logsz);
    printf("\nOpenCL build failed:\n%s", logbuf);
    cl_release_all();
    return 0;
  }

  gCL_K_NN = clCreateKernel(gCL_Program, "nn_render", &err);
  if(err != CL_SUCCESS || !gCL_K_NN) { cl_release_all(); return 0; }

  gCL_PBO = clCreateFromGLBuffer(gCL_Context, CL_MEM_WRITE_ONLY, gPBO, &err);
  if(err != CL_SUCCESS || !gCL_PBO) { cl_release_all(); return 0; }

  size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
  size_t bytesb1 = sizeof(float)*(size_t)NN_H;
  size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
  size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;

  gCL_W1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW1, 0, &err);
  gCL_b1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb1, 0, &err);
  gCL_W2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW2, 0, &err);
  gCL_b2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb2, 0, &err);
  if(err != CL_SUCCESS || !gCL_W1 || !gCL_b1 || !gCL_W2 || !gCL_b2) { cl_release_all(); return 0; }

  float hW1[NN_H*NN_IN];
  float hb1[NN_H];
  float hW2[NN_OUT*NN_H];
  float hb2[NN_OUT];

  if(!build_weights_from_libtorch(hW1, hb1, hW2, hb2)) {
    printf("\n[LibTorch] Failed to build weights.");
    cl_release_all();
    return 0;
  }

  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_TRUE, 0, bytesW1, hW1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_TRUE, 0, bytesb1, hb1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_TRUE, 0, bytesW2, hW2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_TRUE, 0, bytesb2, hb2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }

  gCL_Ready = 1;
  printf("\nOpenCL: GL-sharing enabled. NN kernel ready.");
  return 1;
}

// ===========================================================
//                      Render (CL -> GL)
// ===========================================================

static void RenderFrame()
{
  if(!gCL_Ready) return;

  size_t global[2] = { (size_t)gW, (size_t)gH };
  size_t local[2]  = { 16, 16 };

  cl_int err = CL_SUCCESS;

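  // CL/GL sharing requires pending GL work on the PBO (the previous frame's
  // texture upload) to be complete before OpenCL acquires the buffer.
  glFinish();

  err = clEnqueueAcquireGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);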
  if(err != CL_SUCCESS) return;

  int arg = 0;
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_PBO);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gW);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gH);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float),  &gPhase);

  err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, local, 0, 0, 0);
  if(err != CL_SUCCESS) {
    err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, 0, 0, 0, 0);
  }

  clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  clFinish(gCL_Queue);

  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, gW, gH, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  glClear(GL_COLOR_BUFFER_BIT);
  glEnable(GL_TEXTURE_2D);
  glBindTexture(GL_TEXTURE_2D, gTex);

  glBegin(GL_QUADS);
    glTexCoord2f(0,0); glVertex2f(-1,-1);
    glTexCoord2f(1,0); glVertex2f( 1,-1);
    glTexCoord2f(1,1); glVertex2f( 1, 1);
    glTexCoord2f(0,1); glVertex2f(-1, 1);
  glEnd();

  glBindTexture(GL_TEXTURE_2D, 0);
  SwapBuffers(gHdc);

  gPhase += 0.03f;
}

// ===========================================================
//                         WinMain
// ===========================================================

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
  // 0 means no auto-close; window stays until user closes it.
  const int maxSeconds = read_env_int("MENDB02_MAX_SECONDS", 0);
  ULONGLONG startTick = GetTickCount64();

  const char* szClass = "Mendb02NNCLGLClass";
  UnregisterClassA(szClass, hInst);

  WNDCLASSEXA wc;
  ZeroMemory(&wc, sizeof(wc));
  wc.cbSize = sizeof(wc);
  wc.style = CS_HREDRAW | CS_VREDRAW;
  wc.lpfnWndProc = WndProc;
  wc.hInstance = hInst;
  wc.hCursor = LoadCursor(NULL, IDC_ARROW);
  wc.lpszClassName = szClass;
  RegisterClassExA(&wc);

  RECT r;
  r.left=0; r.top=0; r.right=gW; r.bottom=gH;
  AdjustWindowRect(&r, WS_OVERLAPPEDWINDOW, FALSE);

  HWND hwnd = CreateWindowExA(
    0, szClass, "NN Render (LibTorch weights + OpenCL + OpenGL)",
    WS_OVERLAPPEDWINDOW,
    100, 100, (r.right-r.left), (r.bottom-r.top),
    0, 0, hInst, 0);

  if(!hwnd) return 0;

  ShowWindow(hwnd, SW_SHOW);
  UpdateWindow(hwnd);

  if(!gl_init_wgl(hwnd))
  {
    MessageBoxA(hwnd, "OpenGL init failed", "Error", MB_OK);
    gl_release_all();
    return 0;
  }

  if(!cl_init_glshare())
  {
    MessageBoxA(hwnd, "OpenCL GL-sharing init failed", "Error", MB_OK);
    cl_release_all();
    gl_release_all();
    return 0;
  }

  MSG msg;
  ZeroMemory(&msg, sizeof(msg));

  while(msg.message != WM_QUIT)
  {
    while(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
    }

    // Allow Zorro STOP to close this Win32 loop cleanly, but ignore
    // the sticky FIRSTINITRUN+EXITRUN combo seen at startup.
    if(is(EXITRUN) && !is(FIRSTINITRUN)) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    if(!IsWindow(hwnd))
      break;

    if(maxSeconds > 0 && (GetTickCount64() - startTick) >= (ULONGLONG)maxSeconds * 1000ULL) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    RenderFrame();
  }

  cl_release_all();
  gl_release_all();
  gHwnd = 0;
  return 0;
}

// ===========================================================
//                         Input
// ===========================================================

LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
  switch(msg)
  {
    case WM_CLOSE:
      DestroyWindow(hWnd);
      return 0;

    case WM_KEYDOWN:
      if(wParam == VK_ESCAPE || wParam == VK_F12) {
        PostMessage(hWnd, WM_CLOSE, 0, 0);
        return 0;
      }
      return 0;

    case WM_DESTROY:
      PostQuitMessage(0);
      return 0;
  }
  return DefWindowProc(hWnd, msg, wParam, lParam);
}

// ===========================================================
//                         Zorro DLL entry
// ===========================================================

DLLFUNC int main()
{
  // Force single-cycle execution in Zorro to avoid automatic relaunches.
  NumTotalCycles = 1;
  NumWFOCycles = 1;
  NumSampleCycles = 1;
  set(TESTNOW|OFF,ALLCYCLES|OFF,PARAMETERS|OFF,FACTORS|OFF,RULES|OFF);

  static int done = 0;
  if(is(FIRSTINITRUN))
    done = 0;

  if(done)
    return 0;

  (void)WinMain(GetModuleHandleA(NULL), NULL, GetCommandLineA(), SW_SHOWDEFAULT);
  done = 1;
  return quit("!Mendb02 finished");
}
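
The auto-close rule in `WinMain` (an environment variable in seconds, where 0 means "never") can be isolated into small portable helpers. The following is only a sketch: `parse_int_or`, `env_int_or`, and `auto_close_due` are hypothetical names, and `env_int_or` is a guess at what a helper like `read_env_int` might do, not the definition used in Mendb02.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: parse a decimal integer, or return `fallback` when the
// text is missing, empty, not fully numeric, or outside the range of int.
inline int parse_int_or(const char* text, int fallback)
{
  if(!text || !*text) return fallback;
  char* end = nullptr;
  const long v = std::strtol(text, &end, 10);
  if(end == text || *end != '\0') return fallback;   // no digits, or trailing junk
  if(v > INT_MAX || v < INT_MIN) return fallback;    // would not fit in int
  return static_cast<int>(v);
}

// Hypothetical stand-in for an environment lookup such as read_env_int().
inline int env_int_or(const char* name, int fallback)
{
  return parse_int_or(std::getenv(name), fallback);
}

// True once maxSeconds have elapsed since startMs; maxSeconds <= 0 disables
// the limit. The multiplication is done in 64 bits so it cannot overflow.
inline bool auto_close_due(std::uint64_t nowMs, std::uint64_t startMs, int maxSeconds)
{
  assert(nowMs >= startMs);
  if(maxSeconds <= 0) return false;
  return (nowMs - startMs) >= static_cast<std::uint64_t>(maxSeconds) * 1000ULL;
}
```

In the message loop this would correspond to something like `auto_close_due(GetTickCount64(), startTick, maxSeconds)`; keeping the arithmetic in a free function makes the boundary cases easy to check without opening a window.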


How to get Zorro to build and execute the code:

# Mendb02 (Public) Build Setup Guide

This guide explains the required directory/layout and build settings for `Mendb02.cpp` and `Mendb02_viz.cpp` without machine-specific paths.

## 1) Required Components

- Zorro 64-bit C++ strategy environment (with `zorro.h` and `ZorroDLL.cpp`)
- Visual Studio Build Tools (x64 C++ compiler)
- OpenCL SDK/runtime headers and libs
- OpenGL system libs (Windows)
- LibTorch C++ distribution (CPU or CUDA build)

## 2) Recommended Folder Variables

Define these paths in your build script (or IDE project settings):

- `ZORRO_ROOT` - Zorro installation root
- `LIBTORCH_ROOT` - LibTorch root folder containing `include/` and `lib/`
- `CUDA_ROOT` - CUDA toolkit root (only if using CUDA-enabled LibTorch)

Your source file can stay anywhere (for example `Strategy/Mendb02_viz.cpp`).

## 3) Include Directories

Add these include directories:

- `<ZORRO_ROOT>/include`
- `<LIBTORCH_ROOT>/include`
- `<LIBTORCH_ROOT>/include/torch/csrc/api/include`
- `<CUDA_ROOT>/include` (optional; needed for CUDA headers)

## 4) Library Directories

Add these library directories:

- `<LIBTORCH_ROOT>/lib`
- `<CUDA_ROOT>/lib/x64` (if CUDA build)

## 5) Link Libraries

Minimum Windows/OpenCL/OpenGL libs:

- `OpenCL.lib`
- `OpenGL32.lib`
- `User32.lib`
- `Gdi32.lib`

LibTorch libs (CPU-only setup):

- `c10.lib`
- `torch.lib`
- `torch_cpu.lib`

LibTorch libs (CUDA setup):

- `c10.lib`
- `c10_cuda.lib`
- `torch.lib`
- `torch_cpu.lib`
- `torch_cuda.lib`
- `cudart.lib`

## 6) Runtime DLLs

At runtime, all required DLLs must be discoverable by the Windows loader (either next to the strategy DLL or on `PATH`).

Typical requirement:

- All needed files from `<LIBTORCH_ROOT>/lib/*.dll`
- CUDA runtime DLLs (if CUDA-enabled build)
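
One way to catch a missing runtime DLL before Zorro reports a generic load error is a small preflight check. The sketch below is an assumption, not part of Mendb02: `missing_modules` and `can_load_dll` are hypothetical names, and the probe is passed in as a callback so the logic can be exercised without the real libraries. On Windows the probe uses `LoadLibraryA`, which also fails when a dependency of the named DLL cannot be resolved.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

// Returns every name for which `probe` reports failure, in input order.
inline std::vector<std::string> missing_modules(
    const std::vector<std::string>& names,
    const std::function<bool(const std::string&)>& probe)
{
  assert(probe && "probe must be callable");
  std::vector<std::string> missing;
  for(const std::string& n : names)
    if(!probe(n)) missing.push_back(n);
  return missing;
}

#ifdef _WIN32
// Windows probe: ask the loader to resolve the DLL by name, then release it.
inline bool can_load_dll(const std::string& name)
{
  HMODULE h = LoadLibraryA(name.c_str());
  if(!h) return false;
  FreeLibrary(h);
  return true;
}
#endif
```

A call such as `missing_modules({"c10.dll", "torch_cpu.dll", "torch.dll"}, can_load_dll)` would then list what the loader cannot find. Those DLL names are assumed from the `.lib` names in section 5; check them against the files actually present in `<LIBTORCH_ROOT>/lib`.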

## 7) Compile Flags (Typical)

Recommended flags for this code style:

- `/MD` (dynamic runtime)
- `/EHa`
- `/O2`
- `/std:c++17`
- `/permissive`
- `/D _WINDLL`
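
Some of these settings can be verified from inside the source file, so that a build routed through the wrong branch fails with a readable message. This is a minimal sketch, assuming MSVC's predefined macros `_MSVC_LANG`, `_WIN64`, and `_DLL`; the constants `kCppStandard` and `kTorchHeadersVisible` are hypothetical names, and on other compilers the MSVC-specific checks are simply skipped.

```cpp
// MSVC reports the /std: choice in _MSVC_LANG, because __cplusplus stays at
// 199711L unless /Zc:__cplusplus is also passed.
#if defined(_MSVC_LANG)
constexpr long kCppStandard = _MSVC_LANG;
#else
constexpr long kCppStandard = __cplusplus;
#endif

#if defined(_MSC_VER)
static_assert(kCppStandard >= 201703L, "build with /std:c++17 or newer");
#if !defined(_WIN64)
#error "64-bit Zorro strategies need the x64 compiler"
#endif
#if !defined(_DLL)
#error "expected /MD (dynamic C runtime)"
#endif
#endif

// Header visibility: true only when the LibTorch include paths are in effect.
// If the compiler lacks __has_include, the answer is unknown and reported as false.
#if defined(__has_include)
#if __has_include(<torch/torch.h>)
constexpr bool kTorchHeadersVisible = true;
#else
constexpr bool kTorchHeadersVisible = false;
#endif
#else
constexpr bool kTorchHeadersVisible = false;
#endif
```

A strategy file could follow this with `static_assert(kTorchHeadersVisible, "LibTorch include paths missing")` to turn a missing include path into a one-line diagnostic.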

## 8) Zorro compile64.bat Routing

If Zorro compiles strategies by filename dispatch, ensure each LibTorch-based file is routed to a LibTorch-enabled branch.

Example logic:

- `if /I "%SRCNAME%"=="Mendb02.cpp" goto :build_libtorch`
- `if /I "%SRCNAME%"=="Mendb02_viz.cpp" goto :build_libtorch`

If this mapping is missing, compilation may fail with missing `torch/torch.h`.

## 9) Common Failure Modes

- **C1083 (cannot open include file) or C1189 (`#error`) on torch headers**: missing LibTorch include paths
- **LNK1104 on output DLL**: target DLL is locked by running process
- **OpenCL GL-sharing init failed**: kernel compile/runtime mismatch, unsupported GL-sharing device, or context mismatch
- **Runtime DLL load error**: required LibTorch/CUDA DLLs not on loader path
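
For the GL-sharing failure, one quick diagnostic is to check whether the chosen device advertises the `cl_khr_gl_sharing` extension. The sketch below covers only the string test and assumes the extension list has already been fetched with `clGetDeviceInfo` and `CL_DEVICE_EXTENSIONS`; `has_cl_extension` is a hypothetical helper, not a function from Mendb02 or from OpenCL.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True when `extensions` (a space-separated list, the format used for
// CL_DEVICE_EXTENSIONS) contains `wanted` as a whole token, not as a substring.
inline bool has_cl_extension(const std::string& extensions, const std::string& wanted)
{
  assert(!wanted.empty());
  std::istringstream in(extensions);
  std::string token;
  while(in >> token)
    if(token == wanted) return true;
  return false;
}
```

Matching whole tokens avoids a false positive from a longer extension name that merely starts with the same text. A device that fails this test cannot share buffers or textures with OpenGL, so the context setup would need a different device or a copy-based fallback.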

## 10) Publish-Safe Notes

The public source variants intentionally avoid hardcoded local absolute paths. Keep all machine-specific paths in build scripts or environment variables.


Last edited by TipmyPip; Yesterday at 18:03.