This code is a hybrid visual computation system that turns an evolving internal state into animated graphics through the cooperation of several layers of software and hardware. At its core, it is not simply a graphics program and not simply a neural network demo. It is better understood as a symbolic machine for transforming synthetic randomness into structured visual behavior. The program opens a native Windows rendering window, builds an OpenGL drawing surface, attaches an OpenCL compute pipeline to that surface, initializes a small neural network through LibTorch, and then uses all of those pieces together to generate a continuously changing image. The result is an engine in which a neural system, a random process, and a graphics pipeline become different expressions of the same evolving state.

The first important idea in the design is separation of roles. LibTorch is used to define and initialize the neural network. It provides the model structure, the layer weights, and the tensor-compatible machinery for generating the initial parameters of the tiny multilayer perceptron. OpenCL is used as the numerical worker that evaluates the network at image scale. Rather than asking the CPU to run the network for every pixel, the program sends the weights and contextual inputs to an OpenCL kernel, which computes the color field in parallel on the graphics device. OpenGL then serves as the presentation layer. It does not perform the neural reasoning itself. Instead, it receives the finished pixel buffer and displays it as a texture on a full-screen quad. In symbolic terms, LibTorch provides the form of thought, OpenCL provides the act of distributed evaluation, and OpenGL provides the visible body of the result.

A second important idea is that the image is not based on market data or a fixed deterministic simulation. The network is conditioned by four synthetic state variables named regime, volatility, trend, and risk. They do not correspond to actual trading statistics in this version. Instead, they are abstract latent descriptors computed once per frame from the current random seed (refreshed every frame from the high-resolution performance counter), the animation phase, and summary statistics of the changing neural parameters. This gives the output a deeper structure than simple noise. Each frame is no longer only a reaction to pixel position and per-pixel jitter. It becomes the visual manifestation of a global internal condition. In that sense, the program behaves like a synthetic cognitive field, where each point in the image is influenced both by local coordinates and by a shared hidden mood.
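
The derivation of the four context values can be restated as a pure function, which makes it easier to check that every value stays within the range from -1 to 1. This is a minimal sketch that follows the formulas of compute_random_context in the listing; the names ParamStats, FrameContext, mix_bits, signed_unit, and derive_context are introduced here for illustration only, and the weight statistics are assumed to be computed elsewhere.

```cpp
#include <cmath>
#include <cstdint>

// Summary statistics of the current weights (assumed to be computed elsewhere).
struct ParamStats   { float meanW1, meanW2, absW1, absW2; };
struct FrameContext { float regime, volatility, trend, risk; };

// Integer bit mixer using the same constants as mix_u32 in the listing.
inline std::uint32_t mix_bits(std::uint32_t x) {
    x ^= x >> 16; x *= 2246822519u;
    x ^= x >> 13; x *= 3266489917u;
    x ^= x >> 16;
    return x;
}

// Map the low 24 bits of a hash to a float in [-1, 1].
inline float signed_unit(std::uint32_t h) {
    return static_cast<float>(h & 0x00FFFFFFu) / 16777215.0f * 2.0f - 1.0f;
}

// Combine slow oscillations, weight statistics, and seeded noise into
// four bounded context values.
inline FrameContext derive_context(const ParamStats& s, float phase, std::uint32_t seed) {
    const float n0 = signed_unit(mix_bits(seed ^ 0xA341316Cu));
    const float n1 = signed_unit(mix_bits(seed ^ 0xC8013EA4u));
    const float n2 = signed_unit(mix_bits(seed ^ 0xAD90777Du));
    const float n3 = signed_unit(mix_bits(seed ^ 0x7E95761Eu));

    FrameContext c;
    c.regime = std::tanh(0.9f * std::sin(0.31f * phase) + 0.6f * std::cos(0.17f * phase)
                         + 0.35f * s.meanW1 + 0.20f * n0);
    c.volatility = std::fmin(1.0f, std::fmax(-1.0f,
                         0.5f + 0.8f * s.absW1 + 0.6f * s.absW2 + 0.15f * n1));
    c.trend = std::tanh(1.4f * (s.meanW2 - s.meanW1)
                        + 0.5f * std::sin(0.12f * phase + 1.3f) + 0.20f * n2);
    c.risk = std::tanh(0.8f * c.regime - 0.6f * std::fabs(c.volatility)
                       + 0.5f * c.trend + 0.15f * n3);
    return c;
}
```

Because the function has no hidden state, two calls with the same statistics, phase, and seed return identical values. The frame-to-frame variation comes entirely from the inputs.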

The program begins with careful include ordering because it combines several libraries that can conflict with one another. LibTorch is included first, then the Windows and standard C headers, and only then Zorro, whose at identifier is temporarily renamed to zorro_at so that it does not collide with LibTorch's at namespace. Macros that might cause name collisions, such as min, max, ref, swap, and abs, are undefined afterward. This is a practical but very important architectural detail. It reflects that the file is meant to compile inside the Zorro ecosystem while also depending on modern machine learning and graphics toolchains. A small but important note is that to compile this successfully with Zorro64, local directories containing the needed DLL files and library dependencies must be available and correctly configured. In practice, this usually means that LibTorch runtime DLLs, OpenCL support, and any required graphics-related binaries must be reachable through local compiler and runtime paths. Without those local dependency directories, the code may fail to compile or link, or fail at load time, even if the source itself is valid.
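
The reason for the cleanup step can be shown with a small sketch. The macro below is a hypothetical stand-in for a function-like macro leaked by a legacy header; it is not copied from zorro.h or any other real header, and smaller_of is an illustrative name.

```cpp
// Library headers come first, while no colliding macros exist yet.
#include <algorithm>

// Hypothetical stand-in for a legacy header that leaks a function-like macro.
// (Illustrative only; not taken from any real header.)
#define min(a, b) (((a) < (b)) ? (a) : (b))

// Cleanup step: remove the macro so later code can use the std:: name again.
#ifdef min
#undef min
#endif

// Without the #undef above, the preprocessor would expand min(a, b) inside
// std::min(a, b) as a macro call, and this line would not compile.
inline int smaller_of(int a, int b) {
    return std::min(a, b);
}
```

The same reasoning applies in the other direction: if the macro were defined before the library header was included, the header itself could be rewritten by the preprocessor. That is why the library headers are included first.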

The OpenGL initialization part constructs the visible world. A Win32 window is created, a device context is obtained, and a WGL context is attached. The code then dynamically loads the OpenGL buffer functions needed for pixel buffer object support. This is essential because the rendered image is not drawn point by point through traditional immediate graphics commands. Instead, a pixel buffer object is allocated as a block of GPU-visible memory, and a texture is created to receive that data. The texture becomes the canvas that OpenGL displays every frame. This means the visible image is really the endpoint of a dataflow pipeline, not a manually painted scene.

The neural network is intentionally small. It has six input channels, one hidden layer of sixteen tanh units, and three outputs that are later interpreted as red, green, and blue color tendencies. The six inputs include the two spatial coordinates together with the four frame-level context values. This turns the network into a context-conditioned generator rather than a simple coordinate mapper. In abstract terms, the network represents a function from place and latent state to color. Since the weights are not trained on a dataset here, their meaning emerges from random initialization and subsequent internal evolution. That gives the system an experimental character. It is less like a classifier and more like a dynamic symbolic organ whose activity is made visible.
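
A CPU reference of the same forward pass is useful for checking the kernel against known inputs. The sketch below assumes the row-major weight layout used in the listing (row j of W1 holds the six input weights of hidden unit j) and the same output squashing as the kernel; TinyMlpParams and mlp_forward_cpu are names introduced here for illustration.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kIn = 6, kHidden = 16, kOut = 3;

// Plain-array parameters, zero-initialized by default.
struct TinyMlpParams {
    std::array<float, kHidden * kIn>  W1{};
    std::array<float, kHidden>        b1{};
    std::array<float, kOut * kHidden> W2{};
    std::array<float, kOut>           b2{};
};

// Evaluate the 6 -> 16 -> 3 network for one input vector.
// Each output is mapped to [0, 1] with 0.5 + 0.5 * tanh(acc), as in the kernel.
inline std::array<float, kOut> mlp_forward_cpu(const TinyMlpParams& p,
                                               const std::array<float, kIn>& in) {
    std::array<float, kHidden> h{};
    for (std::size_t j = 0; j < kHidden; ++j) {
        float acc = p.b1[j];
        for (std::size_t i = 0; i < kIn; ++i)
            acc += in[i] * p.W1[j * kIn + i];
        h[j] = std::tanh(acc);
    }

    std::array<float, kOut> out{};
    for (std::size_t k = 0; k < kOut; ++k) {
        float acc = p.b2[k];
        for (std::size_t j = 0; j < kHidden; ++j)
            acc += h[j] * p.W2[k * kHidden + j];
        out[k] = 0.5f + 0.5f * std::tanh(acc);
    }
    return out;
}
```

With all parameters at zero, every output is exactly 0.5, which corresponds to mid-gray before the procedural modulation is applied. A large positive output bias pushes the corresponding channel toward 1.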

LibTorch is used only on the host side to instantiate the network and extract its initial weights and biases. Once those parameters are copied into plain arrays, they are transferred into OpenCL buffers. From that point on, the network is evaluated inside the OpenCL kernel for every pixel. That division is mathematically elegant. The high-level neural definition exists in LibTorch, but the large-scale field evaluation is delegated to a massively parallel compute layer. This means the neural model has a dual existence: as a conceptual structure in LibTorch and as a numerical stencil in OpenCL. One defines the architecture, the other enacts it across space.

The OpenCL kernel is where the local and global levels meet. For each pixel, the kernel computes normalized coordinates, derives a deterministic jitter from the current seed and pixel position, then constructs the six neural inputs. The first two are simply position. The remaining four are the shared context channels, each modulated by a local oscillation; regime and volatility additionally receive a share of the per-pixel jitter, and every channel is clamped to the range from -1 to 1. This is a key conceptual improvement over a purely local shader. The frame now possesses a coherent internal theme because all pixels are influenced by the same synthetic regime, volatility, trend, and risk state. At the same time, local perturbations preserve texture and detail. This balance between shared condition and local variation is what gives the output the feeling of organized complexity.
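
The input construction can be mirrored on the CPU to inspect what a given pixel feeds into the network. The sketch below follows the arithmetic of the nn_render kernel in the listing; FrameCtx, clamp_unit, pixel_jitter, and build_inputs are illustrative names, and width and height are assumed to be greater than 1.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

struct FrameCtx { float regime, volatility, trend, risk; };

inline float clamp_unit(float v) {
    return v < -1.0f ? -1.0f : (v > 1.0f ? 1.0f : v);
}

// Integer hash of pixel position and seed, mapped to [-1, 1].
// Unsigned arithmetic wraps around, matching the uint math in the kernel.
inline float pixel_jitter(int xpix, int ypix, std::uint32_t seed) {
    std::uint32_t n = (static_cast<std::uint32_t>(xpix) * 1973u)
                    ^ (static_cast<std::uint32_t>(ypix) * 9277u)
                    ^ (seed * 26699u + 911u);
    n = (n << 13) ^ n;
    const std::uint32_t m = n * (n * n * 15731u + 789221u) + 1376312589u;
    return static_cast<float>(m & 0x00FFFFFFu) / 16777215.0f * 2.0f - 1.0f;
}

// Build the six network inputs for one pixel (assumes width > 1, height > 1).
inline std::array<float, 6> build_inputs(int xpix, int ypix, int width, int height,
                                         float phase, std::uint32_t seed,
                                         const FrameCtx& c) {
    const float x = static_cast<float>(xpix) / static_cast<float>(width - 1) * 2.0f - 1.0f;
    const float y = static_cast<float>(ypix) / static_cast<float>(height - 1) * 2.0f - 1.0f;
    const float j = pixel_jitter(xpix, ypix, seed);

    std::array<float, 6> in;
    in[0] = x;
    in[1] = y;
    in[2] = clamp_unit(c.regime + 0.20f * j + 0.15f * std::sin(phase + 3.0f * x));
    in[3] = clamp_unit(c.volatility + 0.25f * std::fabs(j) + 0.10f * std::cos(phase + 4.0f * y));
    in[4] = clamp_unit(c.trend + 0.15f * std::sin(2.0f * x - 1.5f * y + phase));
    in[5] = clamp_unit(c.risk + 0.10f * std::cos(3.0f * (x + y) - phase));
    return in;
}
```

The jitter is deterministic for a given seed and pixel, so the apparent randomness between frames comes only from the seed changing each frame.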

After the hidden layer and output layer are evaluated, the network outputs are transformed into color components. These are then mixed with stripes and oscillatory modulation, and the red and green channels are darkened toward the edges by a radial vignette. This means the final picture is not a raw neural output. It is a composition of neural activation and geometric post-processing. Symbolically, the network provides the semantic field, while the handcrafted spatial operators provide the visual grammar. Because the weights are never trained, the result is a blend of network-generated structure and procedural ornament.
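
The composition step can be written as a small CPU function for one pixel. This sketch follows the blending constants of the kernel in the listing, where the vignette multiplies only the red and green channels; clamp01 and compose_pixel are illustrative names.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

inline float clamp01(float v) {
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

// Blend the three network outputs (each in [0, 1]) with stripes, a cosine
// band, and a radial vignette. x and y are normalized to [-1, 1].
// Returns RGBA bytes.
inline std::array<std::uint8_t, 4> compose_pixel(const std::array<float, 3>& o,
                                                 float x, float y,
                                                 float phase, float jitter) {
    const float radial   = std::sqrt(x * x + y * y);
    const float vignette = clamp01(1.15f - radial);
    const float stripe   = 0.5f + 0.5f * std::sin(10.0f * (x + y) + phase + 2.0f * jitter);

    const float r = clamp01(0.70f * o[0] + 0.30f * stripe) * vignette;
    const float g = clamp01(0.85f * o[1] + 0.15f * (1.0f - stripe)) * vignette;
    // The blue channel is not multiplied by the vignette in the listing.
    const float b = clamp01(0.75f * o[2] + 0.25f * (0.5f + 0.5f * std::cos(8.0f * x - phase)));

    std::array<std::uint8_t, 4> px;
    px[0] = static_cast<std::uint8_t>(255.0f * r);
    px[1] = static_cast<std::uint8_t>(255.0f * g);
    px[2] = static_cast<std::uint8_t>(255.0f * b);
    px[3] = 255;
    return px;
}
```

At the image corners the radial distance exceeds 1.15, so red and green fall to zero there while blue remains visible.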

Another major feature is that the neural parameters themselves evolve over time. The code packs all weights and biases into a single parameter vector and updates them through a recurrence that combines neighbor coupling, slow drift, bounded noise, and weak mean-reversion-like corrections. This makes the network a dynamic object rather than a frozen model. Its internal configuration changes from frame to frame, and the random context variables are partly derived from the statistical properties of those changing parameters. In effect, the network influences the context, and the context influences how the network is evaluated. This creates a feedback loop. Even though there is no training objective in the usual sense, the system still exhibits a kind of endogenous evolution.
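
The actual recurrence lives in evolve_params_accumulated, whose body is not shown in this section, so the following is only a minimal sketch of the kind of update described above. The names EvolveConfig, hash_noise, and evolve_step and every coefficient are hypothetical; the vector theta is assumed to hold all weights and biases packed end to end.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical step sizes; the real values are not given in this section.
struct EvolveConfig {
    float coupling  = 0.05f;   // pull toward the average of the two neighbors
    float drift     = 0.002f;  // slow phase-driven drift
    float noise     = 0.01f;   // amplitude of bounded noise
    float reversion = 0.01f;   // weak pull back toward zero
    float limit     = 2.0f;    // hard bound on every parameter
};

// Deterministic noise in [-1, 1] from a seed and a parameter index.
inline float hash_noise(std::uint32_t seed, std::uint32_t index) {
    std::uint32_t x = seed ^ (index * 2654435761u);
    x ^= x >> 16; x *= 2246822519u;
    x ^= x >> 13; x *= 3266489917u;
    x ^= x >> 16;
    return static_cast<float>(x & 0x00FFFFFFu) / 16777215.0f * 2.0f - 1.0f;
}

// One update of the packed parameter vector.
inline void evolve_step(std::vector<float>& theta, float phase, std::uint32_t seed,
                        const EvolveConfig& cfg = EvolveConfig()) {
    const std::size_t n = theta.size();
    if (n == 0) return;
    const std::vector<float> prev(theta);  // read from a snapshot of the old state
    for (std::size_t i = 0; i < n; ++i) {
        const float left  = prev[(i + n - 1) % n];
        const float right = prev[(i + 1) % n];
        const float couple = cfg.coupling * (0.5f * (left + right) - prev[i]);
        const float drift  = cfg.drift * std::sin(phase + 0.1f * static_cast<float>(i));
        const float noise  = cfg.noise * hash_noise(seed, static_cast<std::uint32_t>(i));
        const float pull   = -cfg.reversion * prev[i];
        float v = prev[i] + couple + drift + noise + pull;
        if (v >  cfg.limit) v =  cfg.limit;
        if (v < -cfg.limit) v = -cfg.limit;
        theta[i] = v;
    }
}
```

Reading from a snapshot keeps the update independent of iteration order, and the final clamp guarantees that the parameters stay bounded no matter how many frames are rendered.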

That feedback loop is the deepest symbolic aspect of the code. The neural network is not merely being used as a function approximator. It becomes part of a self-modulating visual dynamical system. The parameter field changes over time, the derived context summarizes aspects of that changing field, the context is fed back into the neural inputs, and the resulting image becomes the visible trace of the current internal condition. This is why the code can be described as an engine of symbolic stochastic expression rather than just a renderer.

The WinMain loop keeps this whole machine alive. It pumps messages, maintains the window, allows graceful shutdown, and calls the rendering function repeatedly. Zorro is only used here as the hosting environment and lifecycle controller. The exported main function prevents repeated relaunches and ensures the visual process runs once as intended inside Zorro64. That makes the code both a standalone visual machine and a Zorro-compatible DLL-based component.

In summary, this program is a layered computational artwork and experimental system. It combines machine learning initialization, GPU parallel evaluation, graphics interop, procedural modulation, and self-evolving stochastic state. Its symbolic name, The Stochastic Prism Engine, fits because the code takes hidden random structure and refracts it into visible organized color. It is a prism not for light alone, but for latent computational state.

Code
// Mendb02.cpp
// Win32 + WGL(OpenGL) display + OpenCL compute (CL/GL sharing)
// + Tiny Neural Net inference per pixel (OpenCL kernel) using weights from LibTorch
// + Random-context-conditioned rendering: x, y, regime, volatility, trend, risk

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif

#define _CRT_SECURE_NO_WARNINGS

// ============================================================
// 1) Include LibTorch FIRST
// ============================================================
#if defined(__has_include)
  #if __has_include(<torch/torch.h>) && __has_include(<torch/script.h>)
    #include <torch/torch.h>
    #include <torch/script.h>
  #else
    #error "LibTorch headers not found. Add LibTorch include paths to your compiler settings."
  #endif
#else
  #include <torch/torch.h>
  #include <torch/script.h>
#endif

// Optional CUDA headers
#if defined(__has_include)
  #if __has_include(<torch/cuda.h>)
    #include <torch/cuda.h>
    #define HAVE_TORCH_CUDA_HEADER 1
  #else
    #define HAVE_TORCH_CUDA_HEADER 0
  #endif
  #if __has_include(<cuda_runtime_api.h>)
    #include <cuda_runtime_api.h>
    #define HAVE_CUDA_RUNTIME_API_HEADER 1
  #else
    #define HAVE_CUDA_RUNTIME_API_HEADER 0
  #endif
#else
  #define HAVE_TORCH_CUDA_HEADER 0
  #define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif

#if defined(__has_include)
  #if __has_include(<c10/cuda/CUDAGuard.h>) && __has_include(<c10/cuda/impl/cuda_cmake_macros.h>)
    #include <c10/cuda/CUDAGuard.h>
    #define HAVE_C10_CUDAGUARD 1
  #else
    #define HAVE_C10_CUDAGUARD 0
  #endif
#else
  #define HAVE_C10_CUDAGUARD 0
#endif

// ============================================================
// 2) Standard headers
// ============================================================
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>

// ============================================================
// 3) Include Zorro AFTER torch, rename Zorro's 'at'
// ============================================================
#define at zorro_at
#ifdef LOG
#undef LOG
#endif
#include <zorro.h>
#undef at

// ============================================================
// 4) Cleanup macro landmines
// ============================================================
#ifdef min
#undef min
#endif
#ifdef max
#undef max
#endif
#ifdef ref
#undef ref
#endif
#ifdef swap
#undef swap
#endif
#ifdef abs
#undef abs
#endif
#ifdef NTF
#undef NTF
#endif
#ifdef LOOKBACK
#undef LOOKBACK
#endif
#ifdef BINS
#undef BINS
#endif

// ============================================================
// OpenCL + OpenGL includes
// ============================================================
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <CL/cl_gl_ext.h>
#include <GL/gl.h>

#ifndef GL_RGBA8
#define GL_RGBA8 0x8058
#endif

// ------------------------- Globals -------------------------
static HWND   gHwnd = 0;
static HDC    gHdc  = 0;
static HGLRC  gHgl  = 0;

static int    gW = 640;
static int    gH = 480;
static float  gPhase = 0.0f;
static unsigned int gNoiseSeed = 1u;

struct RandomContext {
  float regime;
  float volatility;
  float trend;
  float risk;
};

static RandomContext gCtx = {0.0f, 0.0f, 0.0f, 0.0f};

static int read_env_int(const char* key, int fallback)
{
  const char* s = getenv(key);
  if(!s || !*s) return fallback;
  int v = atoi(s);
  return (v > 0) ? v : fallback;
}

// ------------------------- WinProc forward -------------------------
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);

// ===========================================================
//                Minimal OpenGL function loading
// ===========================================================

#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#endif
#ifndef GL_PIXEL_UNPACK_BUFFER
#define GL_PIXEL_UNPACK_BUFFER 0x88EC
#endif
#ifndef GL_DYNAMIC_DRAW
#define GL_DYNAMIC_DRAW 0x88E8
#endif

#ifndef APIENTRY
#define APIENTRY __stdcall
#endif
#ifndef APIENTRYP
#define APIENTRYP APIENTRY *
#endif

typedef void (APIENTRYP PFNGLGENBUFFERSPROC)(GLsizei, GLuint*);
typedef void (APIENTRYP PFNGLBINDBUFFERPROC)(GLenum, GLuint);
typedef void (APIENTRYP PFNGLBUFFERDATAPROC)(GLenum, ptrdiff_t, const void*, GLenum);
typedef void (APIENTRYP PFNGLDELETEBUFFERSPROC)(GLsizei, const GLuint*);

static PFNGLGENBUFFERSPROC    p_glGenBuffers    = 0;
static PFNGLBINDBUFFERPROC    p_glBindBuffer    = 0;
static PFNGLBUFFERDATAPROC    p_glBufferData    = 0;
static PFNGLDELETEBUFFERSPROC p_glDeleteBuffers = 0;

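static void* gl_get_proc(const char* name)
{
  void* p = (void*)wglGetProcAddress(name);
  // Some drivers report failure with 1, 2, 3 or -1 instead of NULL.
  if(p == 0 || p == (void*)1 || p == (void*)2 || p == (void*)3 || p == (void*)(intptr_t)-1) {
    p = 0;
    // Fall back to the core entry points exported by opengl32.dll.
    HMODULE ogl = GetModuleHandleA("opengl32.dll");
    if(ogl) p = (void*)GetProcAddress(ogl, name);
  }
  return p;
}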

static int gl_load_ext()
{
  p_glGenBuffers    = (PFNGLGENBUFFERSPROC)gl_get_proc("glGenBuffers");
  p_glBindBuffer    = (PFNGLBINDBUFFERPROC)gl_get_proc("glBindBuffer");
  p_glBufferData    = (PFNGLBUFFERDATAPROC)gl_get_proc("glBufferData");
  p_glDeleteBuffers = (PFNGLDELETEBUFFERSPROC)gl_get_proc("glDeleteBuffers");

  if(!p_glGenBuffers || !p_glBindBuffer || !p_glBufferData || !p_glDeleteBuffers)
    return 0;
  return 1;
}

// ===========================================================
//                       OpenGL objects
// ===========================================================

static GLuint gPBO = 0;
static GLuint gTex = 0;

static void gl_release_all()
{
  if(gTex) {
    glDeleteTextures(1, &gTex);
    gTex = 0;
  }
  if(gPBO) {
    if(p_glDeleteBuffers) p_glDeleteBuffers(1, &gPBO);
    gPBO = 0;
  }

  if(gHgl) { wglMakeCurrent(NULL, NULL); wglDeleteContext(gHgl); gHgl = 0; }
  if(gHdc && gHwnd) { ReleaseDC(gHwnd, gHdc); gHdc = 0; }
}

static int gl_init_wgl(HWND hwnd)
{
  gHwnd = hwnd;
  gHdc = GetDC(hwnd);
  if(!gHdc) return 0;

  PIXELFORMATDESCRIPTOR pfd;
  ZeroMemory(&pfd, sizeof(pfd));
  pfd.nSize      = sizeof(pfd);
  pfd.nVersion   = 1;
  pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
  pfd.iPixelType = PFD_TYPE_RGBA;
  pfd.cColorBits = 32;
  pfd.cDepthBits = 16;
  pfd.iLayerType = PFD_MAIN_PLANE;

  int pf = ChoosePixelFormat(gHdc, &pfd);
  if(pf == 0) return 0;
  if(!SetPixelFormat(gHdc, pf, &pfd)) return 0;

  gHgl = wglCreateContext(gHdc);
  if(!gHgl) return 0;
  if(!wglMakeCurrent(gHdc, gHgl)) return 0;

  if(!gl_load_ext()) {
    printf("\nOpenGL buffer functions not available (need VBO/PBO support).");
    return 0;
  }

  glDisable(GL_DEPTH_TEST);
  glViewport(0, 0, gW, gH);

  p_glGenBuffers(1, &gPBO);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  p_glBufferData(GL_PIXEL_UNPACK_BUFFER, (ptrdiff_t)(gW * gH * 4), 0, GL_DYNAMIC_DRAW);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  glGenTextures(1, &gTex);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, gW, gH, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  glBindTexture(GL_TEXTURE_2D, 0);

  return 1;
}

// ===========================================================
//                  Tiny NN (LibTorch -> weights)
// ===========================================================

#define NN_IN 6
#define NN_H 16
#define NN_OUT 3
#define NN_PARAM_COUNT (NN_H*NN_IN + NN_H + NN_OUT*NN_H + NN_OUT)

static float gHost_W1[NN_H*NN_IN];
static float gHost_b1[NN_H];
static float gHost_W2[NN_OUT*NN_H];
static float gHost_b2[NN_OUT];

struct TinyMLPImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
  TinyMLPImpl() {
    fc1 = register_module("fc1", torch::nn::Linear(NN_IN, NN_H));
    fc2 = register_module("fc2", torch::nn::Linear(NN_H, NN_OUT));
  }
  torch::Tensor forward(torch::Tensor x) {
    x = torch::tanh(fc1->forward(x));
    x = torch::tanh(fc2->forward(x));
    return x;
  }
};
TORCH_MODULE(TinyMLP);

static int build_weights_from_libtorch(float* W1, float* b1, float* W2, float* b2)
{
  if(!W1 || !b1 || !W2 || !b2) return 0;

  try {
    torch::NoGradGuard ng;
    torch::manual_seed((uint64_t)time(NULL) ^ (uint64_t)GetTickCount64());
    TinyMLP m;
    m->eval();

    auto w1  = m->fc1->weight.detach().contiguous().to(torch::kCPU);
    auto bb1 = m->fc1->bias.detach().contiguous().to(torch::kCPU);

    auto w2  = m->fc2->weight.detach().contiguous().to(torch::kCPU);
    auto bb2 = m->fc2->bias.detach().contiguous().to(torch::kCPU);

    memcpy(W1, w1.data_ptr<float>(),  sizeof(float)*NN_H*NN_IN);
    memcpy(b1, bb1.data_ptr<float>(), sizeof(float)*NN_H);
    memcpy(W2, w2.data_ptr<float>(),  sizeof(float)*NN_OUT*NN_H);
    memcpy(b2, bb2.data_ptr<float>(), sizeof(float)*NN_OUT);

    return 1;
  }
  catch(const c10::Error& e) {
    printf("\n[LibTorch] Error: %s", e.what());
    return 0;
  }
  catch(...) {
    printf("\n[LibTorch] Unknown error.");
    return 0;
  }
}

// ===========================================================
//                    OpenCL (GL sharing)
// ===========================================================

static int gCL_Ready = 0;

static cl_platform_id   gCL_Platform = 0;
static cl_device_id     gCL_Device   = 0;
static cl_context       gCL_Context  = 0;
static cl_command_queue gCL_Queue    = 0;
static cl_program       gCL_Program  = 0;

static cl_kernel        gCL_K_NN     = 0;

static cl_mem gCL_PBO = 0;
static cl_mem gCL_W1 = 0;
static cl_mem gCL_b1 = 0;
static cl_mem gCL_W2 = 0;
static cl_mem gCL_b2 = 0;

static void pack_params(float* theta);
static void unpack_params(const float* theta);
static void evolve_params_accumulated(float phase, unsigned int seed);

#define STR2(x) #x
#define XSTR(x) STR2(x)

static const char* gCL_Source =
"__kernel void nn_render(__global uchar4* out, int width, int height,              \n"
"  __global const float* W1, __global const float* b1,                             \n"
"  __global const float* W2, __global const float* b2,                             \n"
"  float phase, uint seed,                                                         \n"
"  float ctxRegime, float ctxVol, float ctxTrend, float ctxRisk)                   \n"
"{                                                                                 \n"
"  int xpix = (int)get_global_id(0);                                               \n"
"  int ypix = (int)get_global_id(1);                                               \n"
"  if(xpix >= width || ypix >= height) return;                                     \n"
"                                                                                  \n"
"  float x = ((float)xpix / (float)(width  - 1)) * 2.0f - 1.0f;                    \n"
"  float y = ((float)ypix / (float)(height - 1)) * 2.0f - 1.0f;                    \n"
"  uint n = (uint)(xpix*1973u) ^ (uint)(ypix*9277u) ^ (seed*26699u + 911u);        \n"
"  n = (n << 13) ^ n;                                                               \n"
"  uint m = (n * (n*n*15731u + 789221u) + 1376312589u);                            \n"
"  float jitter = ((float)(m & 0x00ffffffu) / 16777215.0f) * 2.0f - 1.0f;          \n"
"                                                                                  \n"
"  float in0 = x;                                                                  \n"
"  float in1 = y;                                                                  \n"
"  float in2 = clamp(ctxRegime + 0.20f*jitter + 0.15f*sin(phase + 3.0f*x), -1.0f, 1.0f); \n"
"  float in3 = clamp(ctxVol + 0.25f*fabs(jitter) + 0.10f*cos(phase + 4.0f*y), -1.0f, 1.0f); \n"
"  float in4 = clamp(ctxTrend + 0.15f*sin(2.0f*x - 1.5f*y + phase), -1.0f, 1.0f); \n"
"  float in5 = clamp(ctxRisk + 0.10f*cos(3.0f*(x+y) - phase), -1.0f, 1.0f);        \n"
"                                                                                  \n"
"  float h[" XSTR(NN_H) "];                                                        \n"
"  for(int j=0;j<" XSTR(NN_H) ";j++){                                              \n"
"    float acc = b1[j];                                                            \n"
"    acc += in0 * W1[j*" XSTR(NN_IN) " + 0];                                       \n"
"    acc += in1 * W1[j*" XSTR(NN_IN) " + 1];                                       \n"
"    acc += in2 * W1[j*" XSTR(NN_IN) " + 2];                                       \n"
"    acc += in3 * W1[j*" XSTR(NN_IN) " + 3];                                       \n"
"    acc += in4 * W1[j*" XSTR(NN_IN) " + 4];                                       \n"
"    acc += in5 * W1[j*" XSTR(NN_IN) " + 5];                                       \n"
"    h[j] = tanh(acc);                                                             \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float o[" XSTR(NN_OUT) "];                                                      \n"
"  for(int k=0;k<" XSTR(NN_OUT) ";k++){                                            \n"
"    float acc = b2[k];                                                            \n"
"    for(int j=0;j<" XSTR(NN_H) ";j++){                                            \n"
"      acc += h[j] * W2[k*" XSTR(NN_H) " + j];                                     \n"
"    }                                                                             \n"
"    float s = 0.5f + 0.5f*tanh(acc);                                              \n"
"    if(s<0) s=0; if(s>1) s=1;                                                     \n"
"    o[k] = s;                                                                     \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float radial = sqrt(x*x + y*y);                                                 \n"
"  float vignette = clamp(1.15f - radial, 0.0f, 1.0f);                             \n"
"  float stripe = 0.5f + 0.5f*sin(10.0f*(x + y) + phase + 2.0f*jitter);            \n"
"  float rcol = clamp(0.70f*o[0] + 0.30f*stripe, 0.0f, 1.0f) * vignette;           \n"
"  float gcol = clamp(0.85f*o[1] + 0.15f*(1.0f - stripe), 0.0f, 1.0f) * vignette;  \n"
"  float bcol = clamp(0.75f*o[2] + 0.25f*(0.5f + 0.5f*cos(8.0f*x - phase)),0.0f,1.0f);\n"
"  uchar r = (uchar)(255.0f*rcol);                                                 \n"
"  uchar g = (uchar)(255.0f*gcol);                                                 \n"
"  uchar b = (uchar)(255.0f*bcol);                                                 \n"
"  out[ypix*width + xpix] = (uchar4)(r,g,b,255);                                   \n"
"}                                                                                 \n";

static void cl_release_all()
{
  if(gCL_b2) { clReleaseMemObject(gCL_b2); gCL_b2 = 0; }
  if(gCL_W2) { clReleaseMemObject(gCL_W2); gCL_W2 = 0; }
  if(gCL_b1) { clReleaseMemObject(gCL_b1); gCL_b1 = 0; }
  if(gCL_W1) { clReleaseMemObject(gCL_W1); gCL_W1 = 0; }

  if(gCL_PBO)     { clReleaseMemObject(gCL_PBO);     gCL_PBO = 0; }
  if(gCL_K_NN)    { clReleaseKernel(gCL_K_NN);       gCL_K_NN = 0; }
  if(gCL_Program) { clReleaseProgram(gCL_Program);   gCL_Program = 0; }
  if(gCL_Queue)   { clReleaseCommandQueue(gCL_Queue);gCL_Queue = 0; }
  if(gCL_Context) { clReleaseContext(gCL_Context);   gCL_Context = 0; }

  gCL_Device = 0;
  gCL_Platform = 0;
  gCL_Ready = 0;
}

static int cl_pick_device_with_glshare(cl_platform_id* outP, cl_device_id* outD)
{
  cl_uint nPlatforms = 0;
  if(clGetPlatformIDs(0, 0, &nPlatforms) != CL_SUCCESS || nPlatforms == 0)
    return 0;

  cl_platform_id platforms[8];
  if(nPlatforms > 8) nPlatforms = 8;
  if(clGetPlatformIDs(nPlatforms, platforms, &nPlatforms) != CL_SUCCESS)
    return 0;

  for(cl_uint p=0; p<nPlatforms; p++)
  {
    cl_uint nDev = 0;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 0, 0, &nDev) != CL_SUCCESS || nDev == 0)
      continue;

    cl_device_id devs[8];
    if(nDev > 8) nDev = 8;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, nDev, devs, &nDev) != CL_SUCCESS)
      continue;

    for(cl_uint d=0; d<nDev; d++)
    {
      char ext[8192];
      size_t sz = 0;
      if(clGetDeviceInfo(devs[d], CL_DEVICE_EXTENSIONS, sizeof(ext), ext, &sz) != CL_SUCCESS)
        continue;

      if(strstr(ext, "cl_khr_gl_sharing"))
      {
        *outP = platforms[p];
        *outD = devs[d];
        return 1;
      }
    }
  }

  return 0;
}

static int cl_init_glshare()
{
  cl_int err = CL_SUCCESS;

  cl_platform_id P = 0;
  cl_device_id   D = 0;

  if(!cl_pick_device_with_glshare(&P, &D)) {
    printf("\nOpenCL: no GPU device with cl_khr_gl_sharing found.");
    return 0;
  }

  gCL_Platform = P;
  gCL_Device   = D;

  cl_context_properties props[] = {
    CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
    CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
    CL_CONTEXT_PLATFORM, (cl_context_properties)gCL_Platform,
    0
  };

  gCL_Context = clCreateContext(props, 1, &gCL_Device, 0, 0, &err);
  if(err != CL_SUCCESS || !gCL_Context) { cl_release_all(); return 0; }

  gCL_Queue = clCreateCommandQueue(gCL_Context, gCL_Device, 0, &err);
  if(err != CL_SUCCESS || !gCL_Queue) { cl_release_all(); return 0; }

  gCL_Program = clCreateProgramWithSource(gCL_Context, 1, &gCL_Source, 0, &err);
  if(err != CL_SUCCESS || !gCL_Program) { cl_release_all(); return 0; }

  err = clBuildProgram(gCL_Program, 1, &gCL_Device, 0, 0, 0);
  if(err != CL_SUCCESS)
  {
    char logbuf[8192] = {0}; // stays a valid empty string if the log query fails
    size_t logsz = 0;
    clGetProgramBuildInfo(gCL_Program, gCL_Device, CL_PROGRAM_BUILD_LOG, sizeof(logbuf), logbuf, &logsz);
    printf("\nOpenCL build failed:\n%s", logbuf);
    cl_release_all();
    return 0;
  }

  gCL_K_NN = clCreateKernel(gCL_Program, "nn_render", &err);
  if(err != CL_SUCCESS || !gCL_K_NN) { cl_release_all(); return 0; }

  gCL_PBO = clCreateFromGLBuffer(gCL_Context, CL_MEM_WRITE_ONLY, gPBO, &err);
  if(err != CL_SUCCESS || !gCL_PBO) { cl_release_all(); return 0; }

  size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
  size_t bytesb1 = sizeof(float)*(size_t)NN_H;
  size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
  size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;

  gCL_W1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW1, 0, &err);
  gCL_b1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb1, 0, &err);
  gCL_W2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW2, 0, &err);
  gCL_b2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb2, 0, &err);
  if(err != CL_SUCCESS || !gCL_W1 || !gCL_b1 || !gCL_W2 || !gCL_b2) { cl_release_all(); return 0; }

  if(!build_weights_from_libtorch(gHost_W1, gHost_b1, gHost_W2, gHost_b2)) {
    printf("\n[LibTorch] Failed to build weights.");
    cl_release_all();
    return 0;
  }

  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_TRUE, 0, bytesW1, gHost_W1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_TRUE, 0, bytesb1, gHost_b1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_TRUE, 0, bytesW2, gHost_W2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_TRUE, 0, bytesb2, gHost_b2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }

  gCL_Ready = 1;
  return 1;
}

// ===========================================================
//                  Random context helpers
// ===========================================================

static float clampf(float x, float lo, float hi)
{
  if(x < lo) return lo;
  if(x > hi) return hi;
  return x;
}

static float u32_to_unit(unsigned int x)
{
  return (float)(x & 0x00FFFFFFu) / 16777215.0f;
}

static unsigned int mix_u32(unsigned int x)
{
  x ^= x >> 16;
  x *= 2246822519u;
  x ^= x >> 13;
  x *= 3266489917u;
  x ^= x >> 16;
  return x;
}

static void compute_random_context(float phase, unsigned int seed)
{
  float meanW1 = 0.0f, meanb1 = 0.0f, meanW2 = 0.0f, meanb2 = 0.0f;
  float absW1  = 0.0f, absW2  = 0.0f;

  for(int i=0;i<NN_H*NN_IN;i++) {
    meanW1 += gHost_W1[i];
    absW1  += fabsf(gHost_W1[i]);
  }
  for(int i=0;i<NN_H;i++) {
    meanb1 += gHost_b1[i];
  }
  for(int i=0;i<NN_OUT*NN_H;i++) {
    meanW2 += gHost_W2[i];
    absW2  += fabsf(gHost_W2[i]);
  }
  for(int i=0;i<NN_OUT;i++) {
    meanb2 += gHost_b2[i];
  }

  meanW1 /= (float)(NN_H*NN_IN);
  meanb1 /= (float)NN_H;
  meanW2 /= (float)(NN_OUT*NN_H);
  meanb2 /= (float)NN_OUT;
  absW1  /= (float)(NN_H*NN_IN);
  absW2  /= (float)(NN_OUT*NN_H);

  unsigned int h0 = mix_u32(seed ^ 0xA341316Cu);
  unsigned int h1 = mix_u32(seed ^ 0xC8013EA4u);
  unsigned int h2 = mix_u32(seed ^ 0xAD90777Du);
  unsigned int h3 = mix_u32(seed ^ 0x7E95761Eu);

  float n0 = u32_to_unit(h0) * 2.0f - 1.0f;
  float n1 = u32_to_unit(h1) * 2.0f - 1.0f;
  float n2 = u32_to_unit(h2) * 2.0f - 1.0f;
  float n3 = u32_to_unit(h3) * 2.0f - 1.0f;

  gCtx.regime =
    tanhf(0.9f*sinf(0.31f*phase) + 0.6f*cosf(0.17f*phase) + 0.35f*meanW1 + 0.20f*n0);

  gCtx.volatility =
    clampf(0.5f + 0.8f*absW1 + 0.6f*absW2 + 0.15f*n1, -1.0f, 1.0f);

  gCtx.trend =
    tanhf(1.4f*(meanW2 - meanW1) + 0.5f*sinf(0.12f*phase + 1.3f) + 0.20f*n2);

  {
    float raw =
      0.8f*gCtx.regime - 0.6f*fabsf(gCtx.volatility) + 0.5f*gCtx.trend + 0.15f*n3;
    gCtx.risk = tanhf(raw);
  }
}

// ===========================================================
//                      Render (CL -> GL)
// ===========================================================

static void RenderFrame()
{
  if(!gCL_Ready) return;

  size_t global[2] = { (size_t)gW, (size_t)gH };
  size_t local[2]  = { 16, 16 };

  cl_int err = CL_SUCCESS;

  // Pending GL work on the shared buffer must finish before OpenCL acquires it.
  glFinish();
  err = clEnqueueAcquireGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  if(err != CL_SUCCESS) return;

  LARGE_INTEGER qpc;
  QueryPerformanceCounter(&qpc);
  gNoiseSeed = (unsigned int)(qpc.QuadPart ^ (qpc.QuadPart >> 32) ^ (LONGLONG)GetTickCount64());

  evolve_params_accumulated(gPhase, gNoiseSeed);
  compute_random_context(gPhase, gNoiseSeed);

  size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
  size_t bytesb1 = sizeof(float)*(size_t)NN_H;
  size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
  size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;

  // Upload the evolved parameters (non-blocking; clFinish() below completes them).
  struct { cl_mem buf; size_t bytes; const void* src; } uploads[4] = {
    { gCL_W1, bytesW1, gHost_W1 }, { gCL_b1, bytesb1, gHost_b1 },
    { gCL_W2, bytesW2, gHost_W2 }, { gCL_b2, bytesb2, gHost_b2 }
  };
  for(int i=0;i<4;i++) {
    err = clEnqueueWriteBuffer(gCL_Queue, uploads[i].buf, CL_FALSE, 0, uploads[i].bytes, uploads[i].src, 0, 0, 0);
    if(err != CL_SUCCESS) {
      // Hand the PBO back to OpenGL before bailing out.
      clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
      clFinish(gCL_Queue);
      return;
    }
  }

  // Argument indices must follow the order of the kernel's parameter list.
  int arg = 0;
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_PBO);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gW);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gH);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float),  &gPhase);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(unsigned int), &gNoiseSeed);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float), &gCtx.regime);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float), &gCtx.volatility);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float), &gCtx.trend);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float), &gCtx.risk);

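  // A fixed 16x16 work-group may be rejected (e.g. CL_INVALID_WORK_GROUP_SIZE)
  // when gW or gH is not a multiple of 16 or the device allows fewer than 256
  // work-items per group. In that case retry and let the runtime pick the size.
  err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, local, 0, 0, 0);
  if(err != CL_SUCCESS) {
    err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, 0, 0, 0, 0);
  }

  // Always hand the PBO back to OpenGL, even if the launch failed, and wait
  // for all queued OpenCL work before OpenGL touches the buffer.
  clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  clFinish(gCL_Queue);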

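  // With a buffer bound to GL_PIXEL_UNPACK_BUFFER, the last argument of
  // glTexSubImage2D is a byte offset into that buffer, so 0 means "start of
  // the PBO" rather than a null pointer.
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, gW, gH, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);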

  glClear(GL_COLOR_BUFFER_BIT);
  glEnable(GL_TEXTURE_2D);
  glBindTexture(GL_TEXTURE_2D, gTex);

  glBegin(GL_QUADS);
    glTexCoord2f(0,0); glVertex2f(-1,-1);
    glTexCoord2f(1,0); glVertex2f( 1,-1);
    glTexCoord2f(1,1); glVertex2f( 1, 1);
    glTexCoord2f(0,1); glVertex2f(-1, 1);
  glEnd();

  glBindTexture(GL_TEXTURE_2D, 0);
  SwapBuffers(gHdc);

  // Fixed step per rendered frame, so animation speed follows the frame rate.
  gPhase += 0.03f;
}

// ===========================================================
//                  Parameter evolution helpers
// ===========================================================

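// Flatten the host-side parameters into theta in the order W1, b1, W2, b2.
// unpack_params() below reads them back in the same order.
static void pack_params(float* theta)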
{
  int p = 0;
  for(int i=0;i<NN_H*NN_IN;i++) theta[p++] = gHost_W1[i];
  for(int i=0;i<NN_H;i++)       theta[p++] = gHost_b1[i];
  for(int i=0;i<NN_OUT*NN_H;i++)theta[p++] = gHost_W2[i];
  for(int i=0;i<NN_OUT;i++)     theta[p++] = gHost_b2[i];
}

static void unpack_params(const float* theta)
{
  int p = 0;
  for(int i=0;i<NN_H*NN_IN;i++) gHost_W1[i] = theta[p++];
  for(int i=0;i<NN_H;i++)       gHost_b1[i] = theta[p++];
  for(int i=0;i<NN_OUT*NN_H;i++)gHost_W2[i] = theta[p++];
  for(int i=0;i<NN_OUT;i++)     gHost_b2[i] = theta[p++];
}

static void evolve_params_accumulated(float phase, unsigned int seed)
{
  float theta[NN_PARAM_COUNT];
  float nextv[NN_PARAM_COUNT];
  pack_params(theta);

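  // One update step on a ring: each parameter is pulled toward its two
  // neighbours (wrapping at the ends), nudged by a phase-dependent drift and
  // a small hashed noise term, then clamped to [-3, 3].
  for(int i=0;i<NN_PARAM_COUNT;i++) {
    int l = (i == 0) ? (NN_PARAM_COUNT - 1) : (i - 1);
    int r = (i + 1) % NN_PARAM_COUNT;
    float coupled = 0.55f*theta[l] + 0.45f*theta[r];
    float drift = 0.015f*sinf(0.8f*phase + 0.17f*(float)i);

    // Low 16 bits of a per-index hash, mapped to [-0.01, 0.01].
    unsigned int h = mix_u32(seed ^ (unsigned int)(i*747796405u + 2891336453u));
    float noise = (((float)(h & 0xFFFFu) / 65535.0f) * 2.0f - 1.0f) * 0.010f;

    // The self and coupling gains sum to 1.004, slightly above 1, so the
    // clamp below keeps the values bounded.
    float v = 0.982f*theta[i] + 0.022f*coupled + drift + noise;
    if(v > 3.0f) v = 3.0f;
    if(v < -3.0f) v = -3.0f;
    nextv[i] = v;
  }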

  unpack_params(nextv);

  // Per-block means of the updated parameters.
  float mW1 = 0.0f, mb1 = 0.0f, mW2 = 0.0f, mb2 = 0.0f;
  for(int i=0;i<NN_H*NN_IN;i++) mW1 += gHost_W1[i];
  for(int i=0;i<NN_H;i++) mb1 += gHost_b1[i];
  for(int i=0;i<NN_OUT*NN_H;i++) mW2 += gHost_W2[i];
  for(int i=0;i<NN_OUT;i++) mb2 += gHost_b2[i];
  mW1 /= (float)(NN_H*NN_IN);
  mb1 /= (float)NN_H;
  mW2 /= (float)(NN_OUT*NN_H);
  mb2 /= (float)NN_OUT;

  // Couple the blocks in a cycle: each block's mean is pulled toward the next
  // one (W1 -> b1 -> W2 -> b2 -> W1). This runs after the clamp above, so
  // values can sit slightly outside [-3, 3] until the next frame's clamp.
  for(int i=0;i<NN_H*NN_IN;i++) gHost_W1[i] += 0.003f*(mb1 - mW1);
  for(int i=0;i<NN_H;i++) gHost_b1[i] += 0.004f*(mW2 - mb1);
  for(int i=0;i<NN_OUT*NN_H;i++) gHost_W2[i] += 0.003f*(mb2 - mW2);
  for(int i=0;i<NN_OUT;i++) gHost_b2[i] += 0.004f*(mW1 - mb2);
}

// ===========================================================
//                         WinMain
// ===========================================================

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
  const int maxSeconds = read_env_int("MENDB02_MAX_SECONDS", 0);
  ULONGLONG startTick = GetTickCount64();

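  // Drop any class left registered by an earlier run in the same process, so
  // that RegisterClassExA below does not fail on a repeated start.
  const char* szClass = "Mendb02NNCLGLClass";
  UnregisterClassA(szClass, hInst);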

  WNDCLASSEXA wc;
  ZeroMemory(&wc, sizeof(wc));
  wc.cbSize = sizeof(wc);
  wc.style = CS_HREDRAW | CS_VREDRAW;
  wc.lpfnWndProc = WndProc;
  wc.hInstance = hInst;
  wc.hCursor = LoadCursor(NULL, IDC_ARROW);
  wc.lpszClassName = szClass;
  RegisterClassExA(&wc);

  RECT r;
  r.left=0; r.top=0; r.right=gW; r.bottom=gH;
  AdjustWindowRect(&r, WS_OVERLAPPEDWINDOW, FALSE);

  HWND hwnd = CreateWindowExA(
    0, szClass, "NN Render (LibTorch weights + OpenCL + OpenGL)",
    WS_OVERLAPPEDWINDOW,
    100, 100, (r.right-r.left), (r.bottom-r.top),
    0, 0, hInst, 0);

  if(!hwnd) return 0;

  ShowWindow(hwnd, SW_SHOW);
  UpdateWindow(hwnd);

  if(!gl_init_wgl(hwnd))
  {
    MessageBoxA(hwnd, "OpenGL init failed", "Error", MB_OK);
    gl_release_all();
    return 0;
  }

  if(!cl_init_glshare())
  {
    MessageBoxA(hwnd, "OpenCL GL-sharing init failed", "Error", MB_OK);
    cl_release_all();
    gl_release_all();
    return 0;
  }

  MSG msg;
  ZeroMemory(&msg, sizeof(msg));

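  while(msg.message != WM_QUIT)
  {
    // Drain the whole queue each frame. This also removes the WM_QUIT posted
    // by WM_DESTROY, so it should not reach the host's own message loop.
    while(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
    }

    // Close the window when the host signals EXITRUN outside the initial run.
    if(is(EXITRUN) && !is(FIRSTINITRUN)) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    // The window is gone once WM_CLOSE has been processed; stop rendering.
    if(!IsWindow(hwnd))
      break;

    // Optional time limit from MENDB02_MAX_SECONDS (0 disables it).
    if(maxSeconds > 0 && (GetTickCount64() - startTick) >= (ULONGLONG)maxSeconds * 1000ULL) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    RenderFrame();
  }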

  cl_release_all();
  gl_release_all();
  gHwnd = 0;
  return 0;
}

// ===========================================================
//                         Input
// ===========================================================

LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
  switch(msg)
  {
    case WM_CLOSE:
      DestroyWindow(hWnd);
      return 0;

    case WM_KEYDOWN:
      if(wParam == VK_ESCAPE || wParam == VK_F12) {
        PostMessage(hWnd, WM_CLOSE, 0, 0);
        return 0;
      }
      return 0;

    case WM_DESTROY:
      PostQuitMessage(0);
      return 0;
  }
  return DefWindowProc(hWnd, msg, wParam, lParam);
}

// ===========================================================
//                         Zorro DLL entry
// ===========================================================

DLLFUNC int main()
{
  NumTotalCycles = 1;
  NumWFOCycles = 1;
  NumSampleCycles = 1;
  set(TESTNOW|OFF,ALLCYCLES|OFF,PARAMETERS|OFF,FACTORS|OFF,RULES|OFF);

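  // Guard against reopening the window if main() is entered again; a fresh
  // FIRSTINITRUN resets the guard.
  static int done = 0;
  if(is(FIRSTINITRUN))
    done = 0;

  if(done)
    return 0;

  // Blocks until the render window is closed.
  (void)WinMain(GetModuleHandleA(NULL), NULL, GetCommandLineA(), SW_SHOWDEFAULT);
  done = 1;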
  return 0;
}