NeuroLattice Momentum Engine v12 (RL) [Re: TipmyPip] #489285
Yesterday at 16:40
TipmyPip (OP), Member, Joined: Sep 2017, Posts: 276
NeuroLattice Momentum Engine is a machine-learning-driven portfolio selector built for a basket of currency pairs, designed to continuously reshape its preferences as market structure changes. It starts by turning each pair into a stream of compact feature signals such as short-horizon return, longer-horizon return, volatility, price deviation, range pressure, flow proxy, persistence, and a simple regime flag. These features are stored in a structured ring buffer, so the system always has a recent window of behavior for every asset and every feature without heavy memory churn.

At regular update intervals, the engine builds a cross-asset similarity picture by comparing every pair against every other pair across all features. That similarity step is the heaviest computation, so the strategy offloads it to an OpenCL kernel when one is available and automatically falls back to a CPU version when not. The result is a dense relationship map of how similarly assets behave across the feature space. This map is then blended with an exposure-based distance so that similarity is not purely statistical but also respects shared currency risk.
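A rough CPU-side sketch of that similarity step (our own minimal version, mirroring the per-feature Pearson math the OpenCL kernel in the listing below computes and then averages across features):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal sketch: Pearson correlation of two feature windows. The engine
// computes this per feature for each asset pair and averages across features
// to get one similarity number per pair.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    int n = (int)x.size();
    double mx = 0, my = 0;
    for (int t = 0; t < n; t++) { mx += x[t]; my += y[t]; }
    mx /= n; my /= n;
    double sxx = 0, syy = 0, sxy = 0;
    for (int t = 0; t < n; t++) {
        double dx = x[t] - mx, dy = y[t] - my;
        sxx += dx * dx; syy += dy * dy; sxy += dx * dy;
    }
    double den = std::sqrt(sxx * syy + 1e-12); // eps guards flat windows
    return (den > 1e-12) ? sxy / den : 0.0;
}
```

With N assets the step is O(N^2 · features · window), which is why it is the natural candidate for GPU offload.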

From the blended distances the engine computes a connectivity summary for each asset by running an all-pairs path refinement and compressing the result into a compactness score. Compactness acts as a stability signal: assets embedded in coherent structure are treated differently from assets that look isolated or noisy. The strategy then combines momentum, compactness, and crowding pressure into a bounded score for each asset, producing a ranked list of candidates.
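The "all-pairs path refinement" is essentially a shortest-path relaxation over the blended distance matrix. A minimal sketch (the compactness formula here is an illustrative assumption, not lifted from the strategy):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Floyd-Warshall all-pairs shortest paths over a row-major n x n distance matrix.
void floydWarshall(std::vector<double>& d, int n) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (d[i*n+k] + d[k*n+j] < d[i*n+j])
                    d[i*n+j] = d[i*n+k] + d[k*n+j];
}

// Compress one asset's refined path lengths into a score in (0,1]:
// tightly embedded assets (short paths to everything) score higher.
double compactness(const std::vector<double>& d, int n, int i) {
    double sum = 0;
    for (int j = 0; j < n; j++) if (j != i) sum += d[i*n+j];
    return 1.0 / (1.0 + sum / (n - 1));
}
```

An asset whose direct distances are large but whose refined paths are short is structurally connected through intermediaries, which is exactly the coherence signal the score rewards.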

The machine learning controller sits above this scoring layer and acts as an adaptive governor. It builds a compact snapshot of the entire portfolio state and feeds it through several lightweight learning modules. An unsupervised learner assigns a regime label and confidence. A principal-component-style reducer tracks whether the system is dominated by one factor or rotating between factors. Mixture and hidden-state models estimate uncertainty and switching risk and convert them into a risk-scaling signal and parameter blending. A reinforcement learner tests simple allocation actions and updates its preferences from the realized improvement in average score. A novelty detector based on an autoencoder watches for unfamiliar conditions and automatically reduces risk and selection breadth when the market no longer matches recent patterns.
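The reinforcement piece boils down to an incremental-average bandit. A stripped-down sketch (our naming, following the same update rule the RLAgent in the listing uses):

```cpp
#include <cassert>

// Stripped-down action-value bandit: each action's estimate q is the running
// mean of its realized rewards, updated incrementally as q += (r - q)/n.
struct MiniBandit {
    double q[4] = {0, 0, 0, 0}; // value estimate per allocation action
    int n[4] = {0, 0, 0, 0};    // times each action was rewarded
    int greedy() const {        // exploit: pick the highest-value action
        int b = 0;
        for (int i = 1; i < 4; i++) if (q[i] > q[b]) b = i;
        return b;
    }
    void reward(int action, double r) {
        n[action]++;
        q[action] += (r - q[action]) / n[action];
    }
};
```

In the strategy the reward is the realized change in mean portfolio score between updates, with a periodic forced exploration step so that no action's estimate goes stale.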

The final outcome is a ranked and filtered selection that adapts its aggressiveness, diversification, and sensitivity based on learned regime confidence, structural stability, and novelty, while still remaining robust through fallbacks and guardrails.

Code
// TGr06E_MomentumBias_v12.cpp - Zorro64 Strategy DLL
// Strategy E v12: Momentum-Biased with MX06 OOP + OpenCL + Learning Controller
//
// Notes:
// - Keeps full CPU fallback.
// - OpenCL is optional: if OpenCL.dll missing / no device / kernel build fails -> CPU path.
// - OpenCL accelerates the heavy correlation matrix step by offloading pairwise correlations.
// - Correlation is computed in float on GPU; results are stored back into fvar corrMatrix.

#define _CRT_SECURE_NO_WARNINGS
#include <zorro.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <windows.h>
#include <stddef.h>

#define INF 1e30
#define EPS 1e-12
#define N_ASSETS 28
#define FEAT_N 9
#define FEAT_WINDOW 200
#define UPDATE_EVERY 5
#define TOP_K 5

#define ALPHA 0.1
#define BETA 0.2
#define GAMMA 3.5
#define LAMBDA_META 0.7

#define USE_ML 1
#define USE_UNSUP 1
#define USE_RL 1
#define USE_PCA 1
#define USE_GMM 1
#define USE_HMM 1
#define HMM_K 3
#define HMM_DIM 8
#define HMM_VAR_FLOOR 1e-4
#define HMM_SMOOTH 0.02
#define HMM_ENTROPY_TH 0.85
#define HMM_SWITCH_TH 0.35
#define HMM_MIN_RISK 0.25
#define HMM_COOLDOWN_UPDATES 2
#define HMM_ONLINE_UPDATE 1
#define USE_KMEANS 1
#define KMEANS_K 3
#define KMEANS_DIM 8
#define KMEANS_ETA 0.03
#define KMEANS_DIST_EMA 0.08
#define KMEANS_STABILITY_MIN 0.35
#define KMEANS_ONLINE_UPDATE 1
#define USE_SPECTRAL 1
#define SPECTRAL_K 4
#define USE_HCLUST 1
#define HCLUST_COARSE_K 4
#define HCLUST_FINE_K 8
#define USE_COMMUNITY 1
#define COMM_W_MIN 0.15
#define COMM_TOPM 6
#define COMM_ITERS 4
#define COMM_Q_EMA 0.20
#define COMM_Q_LOW 0.20
#define COMM_Q_HIGH 0.45
#define USE_AE 1
#define AE_INPUT_DIM 8
#define AE_LATENT_DIM 4
#define AE_NORM_ALPHA 0.02
#define AE_ERR_EMA 0.10
#define AE_Z_LOW 1.0
#define AE_Z_HIGH 2.0
#define GMM_K 3
#define GMM_DIM 8
#define GMM_ALPHA 0.02
#define GMM_VAR_FLOOR 1e-4
#define GMM_ENTROPY_COEFF 0.45
#define GMM_MIN_RISK 0.25
#define GMM_ONLINE_UPDATE 1
#define STRATEGY_PROFILE 4
#define PCA_DIM 6
#define PCA_COMP 3
#define PCA_WINDOW 128
#define PCA_REBUILD_EVERY 4

#ifdef TIGHT_MEM
typedef float fvar;
#else
typedef double fvar;
#endif

static const char* ASSET_NAMES[] = {
  "EURUSD","GBPUSD","USDCHF","USDJPY","AUDUSD","AUDCAD","AUDCHF","AUDJPY","AUDNZD",
  "CADJPY","CADCHF","CHFJPY","EURAUD","EURCAD","EURCHF","EURGBP","EURJPY","EURNZD",
  "GBPAUD","GBPCAD","GBPCHF","GBPJPY","GBPNZD","NZDCAD","NZDCHF","NZDJPY","NZDUSD",
  "USDCAD" // 28 entries to match N_ASSETS; CHFJPY completes the 8-currency cross set
};
static const char* CURRENCIES[] = {"EUR","GBP","USD","CHF","JPY","AUD","CAD","NZD"};
#define N_CURRENCIES 8

// ---------------------------- Exposure Table ----------------------------

struct ExposureTable {
  int exposure[N_ASSETS][N_CURRENCIES];
  double exposureDist[N_ASSETS][N_ASSETS];

  void init() {
    for(int i=0;i<N_ASSETS;i++){
      for(int c=0;c<N_CURRENCIES;c++){
        exposure[i][c] = 0;
      }
    }
    for(int i=0;i<N_ASSETS;i++){
      for(int j=0;j<N_ASSETS;j++){
        exposureDist[i][j] = 0.0;
      }
    }
  }

  inline double getDist(int i,int j) const { return exposureDist[i][j]; }
};

// ---------------------------- Slab Allocator ----------------------------

template<typename T>
class SlabAllocator {
public:
  T* data;
  int capacity;

  SlabAllocator() : data(NULL), capacity(0) {}
  ~SlabAllocator() { shutdown(); }

  void init(int size) {
    shutdown();
    capacity = size;
    data = (T*)malloc((size_t)capacity * sizeof(T));
    if(data) memset(data, 0, (size_t)capacity * sizeof(T));
  }

  void shutdown() {
    if(data) free(data);
    data = NULL;
    capacity = 0;
  }

  T& operator[](int i) { return data[i]; }
  const T& operator[](int i) const { return data[i]; }
};

// ---------------------------- Feature Buffer (SoA ring) ----------------------------

struct FeatureBufferSoA {
  SlabAllocator<fvar> buffer;
  int nAssets;
  int windowSize;
  int currentIndex;

  void init(int assets, int window) {
    nAssets = assets;
    windowSize = window;
    currentIndex = 0;
    buffer.init(FEAT_N * assets * window);
  }

  void shutdown() { buffer.shutdown(); }

  inline int offset(int feat,int asset,int t) const {
    return (feat * nAssets + asset) * windowSize + t; // use the stored count, not the global
  }

  // Write this bar's value for one (feature,asset) stream. The shared ring
  // index must advance exactly once per bar, after all streams are written,
  // or the streams drift out of alignment -- hence the separate advanceBar().
  void push(int feat,int asset,fvar value) {
    buffer[offset(feat, asset, currentIndex)] = value;
  }

  void advanceBar() { currentIndex = (currentIndex + 1) % windowSize; }

  // t=0 => most recent
  fvar get(int feat,int asset,int t) const {
    int idx = (currentIndex - 1 - t + windowSize) % windowSize;
    return buffer[offset(feat, asset, idx)];
  }
};

// ---------------------------- Minimal OpenCL (dynamic) ----------------------------

typedef struct _cl_platform_id*   cl_platform_id;
typedef struct _cl_device_id*     cl_device_id;
typedef struct _cl_context*       cl_context;
typedef struct _cl_command_queue* cl_command_queue;
typedef struct _cl_program*       cl_program;
typedef struct _cl_kernel*        cl_kernel;
typedef struct _cl_mem*           cl_mem;
typedef unsigned int              cl_uint;
typedef int                       cl_int;
typedef unsigned long long        cl_ulong;
typedef cl_uint                   cl_bool; // OpenCL headers define cl_bool as cl_uint, not size_t

#define CL_SUCCESS 0
#define CL_DEVICE_TYPE_CPU (1ULL << 1)
#define CL_DEVICE_TYPE_GPU (1ULL << 2)
#define CL_MEM_READ_ONLY   (1ULL << 2)
#define CL_MEM_WRITE_ONLY  (1ULL << 1)
#define CL_MEM_READ_WRITE  (1ULL << 0)
#define CL_TRUE  1
#define CL_FALSE 0
#define CL_PROGRAM_BUILD_LOG 0x1183

class OpenCLBackend {
public:
  HMODULE hOpenCL;
  int ready;

  cl_platform_id platform;
  cl_device_id device;
  cl_context context;
  cl_command_queue queue;
  cl_program program;
  cl_kernel kCorr;

  cl_mem bufFeat;
  cl_mem bufCorr;

  int featBytes;
  int corrBytes;

  cl_int (*clGetPlatformIDs)(cl_uint, cl_platform_id*, cl_uint*);
  cl_int (*clGetDeviceIDs)(cl_platform_id, cl_ulong, cl_uint, cl_device_id*, cl_uint*);
  cl_context (*clCreateContext)(void*, cl_uint, const cl_device_id*, void*, void*, cl_int*);
  cl_command_queue (*clCreateCommandQueue)(cl_context, cl_device_id, cl_ulong, cl_int*);
  cl_program (*clCreateProgramWithSource)(cl_context, cl_uint, const char**, const size_t*, cl_int*);
  cl_int (*clBuildProgram)(cl_program, cl_uint, const cl_device_id*, const char*, void*, void*);
  cl_int (*clGetProgramBuildInfo)(cl_program, cl_device_id, cl_uint, size_t, void*, size_t*);
  cl_kernel (*clCreateKernel)(cl_program, const char*, cl_int*);
  cl_int (*clSetKernelArg)(cl_kernel, cl_uint, size_t, const void*);
  cl_mem (*clCreateBuffer)(cl_context, cl_ulong, size_t, void*, cl_int*);
  cl_int (*clEnqueueWriteBuffer)(cl_command_queue, cl_mem, cl_bool, size_t, size_t, const void*, cl_uint, const void*, void*);
  cl_int (*clEnqueueReadBuffer)(cl_command_queue, cl_mem, cl_bool, size_t, size_t, void*, cl_uint, const void*, void*);
  cl_int (*clEnqueueNDRangeKernel)(cl_command_queue, cl_kernel, cl_uint, const size_t*, const size_t*, const size_t*, cl_uint, const void*, void*);
  cl_int (*clFinish)(cl_command_queue);
  cl_int (*clReleaseMemObject)(cl_mem);
  cl_int (*clReleaseKernel)(cl_kernel);
  cl_int (*clReleaseProgram)(cl_program);
  cl_int (*clReleaseCommandQueue)(cl_command_queue);
  cl_int (*clReleaseContext)(cl_context);

  OpenCLBackend()
  : hOpenCL(NULL), ready(0),
    platform(NULL), device(NULL), context(NULL), queue(NULL), program(NULL), kCorr(NULL),
    bufFeat(NULL), bufCorr(NULL),
    featBytes(0), corrBytes(0),
    clGetPlatformIDs(NULL), clGetDeviceIDs(NULL), clCreateContext(NULL), clCreateCommandQueue(NULL),
    clCreateProgramWithSource(NULL), clBuildProgram(NULL), clGetProgramBuildInfo(NULL),
    clCreateKernel(NULL), clSetKernelArg(NULL),
    clCreateBuffer(NULL), clEnqueueWriteBuffer(NULL), clEnqueueReadBuffer(NULL),
    clEnqueueNDRangeKernel(NULL), clFinish(NULL),
    clReleaseMemObject(NULL), clReleaseKernel(NULL), clReleaseProgram(NULL),
    clReleaseCommandQueue(NULL), clReleaseContext(NULL)
  {}

  int loadSymbol(void** fp, const char* name) {
    *fp = (void*)GetProcAddress(hOpenCL, name);
    return (*fp != NULL);
  }

  const char* kernelSource() {
    return
      "__kernel void corr_pairwise(\n"
      "  __global const float* feat,\n"
      "  __global float* outCorr,\n"
      "  const int nAssets,\n"
      "  const int nFeat,\n"
      "  const int windowSize,\n"
      "  const float eps\n"
      "){\n"
      "  int a = (int)get_global_id(0);\n"
      "  int b = (int)get_global_id(1);\n"
      "  if(a >= nAssets || b >= nAssets) return;\n"
      "  if(a > b) return;\n"
      "  if(a == b) { outCorr[a*nAssets + b] = 1.0f; return; }\n"
      "  float acc = 0.0f;\n"
      "  for(int f=0; f<nFeat; f++){\n"
      "    int baseA = (f*nAssets + a) * windowSize;\n"
      "    int baseB = (f*nAssets + b) * windowSize;\n"
      "    float mx = 0.0f;\n"
      "    float my = 0.0f;\n"
      "    for(int t=0; t<windowSize; t++){\n"
      "      mx += feat[baseA + t];\n"
      "      my += feat[baseB + t];\n"
      "    }\n"
      "    mx /= (float)windowSize;\n"
      "    my /= (float)windowSize;\n"
      "    float sxx = 0.0f;\n"
      "    float syy = 0.0f;\n"
      "    float sxy = 0.0f;\n"
      "    for(int t=0; t<windowSize; t++){\n"
      "      float dx = feat[baseA + t] - mx;\n"
      "      float dy = feat[baseB + t] - my;\n"
      "      sxx += dx*dx;\n"
      "      syy += dy*dy;\n"
      "      sxy += dx*dy;\n"
      "    }\n"
      "    float den = sqrt(sxx*syy + eps);\n"
      "    float corr = (den > eps) ? (sxy/den) : 0.0f;\n"
      "    acc += corr;\n"
      "  }\n"
      "  float c = acc / (float)nFeat;\n"
      "  outCorr[a*nAssets + b] = c;\n"
      "  outCorr[b*nAssets + a] = c;\n"
      "}\n";
  }

  void printBuildLog() {
    if(!clGetProgramBuildInfo || !program || !device) return;
    size_t logSize = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    if(logSize == 0) return;
    char* log = (char*)malloc(logSize + 1);
    if(!log) return;
    memset(log, 0, logSize + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    printf("OpenCL build log:\n%s\n", log);
    free(log);
  }

  void init() {
    ready = 0;

    hOpenCL = LoadLibraryA("OpenCL.dll");
    if(!hOpenCL) {
      printf("OpenCL: CPU (OpenCL.dll missing)\n");
      return;
    }

    if(!loadSymbol((void**)&clGetPlatformIDs,       "clGetPlatformIDs")) return;
    if(!loadSymbol((void**)&clGetDeviceIDs,         "clGetDeviceIDs")) return;
    if(!loadSymbol((void**)&clCreateContext,        "clCreateContext")) return;
    if(!loadSymbol((void**)&clCreateCommandQueue,   "clCreateCommandQueue")) return;
    if(!loadSymbol((void**)&clCreateProgramWithSource,"clCreateProgramWithSource")) return;
    if(!loadSymbol((void**)&clBuildProgram,         "clBuildProgram")) return;
    if(!loadSymbol((void**)&clGetProgramBuildInfo,  "clGetProgramBuildInfo")) return;
    if(!loadSymbol((void**)&clCreateKernel,         "clCreateKernel")) return;
    if(!loadSymbol((void**)&clSetKernelArg,         "clSetKernelArg")) return;
    if(!loadSymbol((void**)&clCreateBuffer,         "clCreateBuffer")) return;
    if(!loadSymbol((void**)&clEnqueueWriteBuffer,   "clEnqueueWriteBuffer")) return;
    if(!loadSymbol((void**)&clEnqueueReadBuffer,    "clEnqueueReadBuffer")) return;
    if(!loadSymbol((void**)&clEnqueueNDRangeKernel, "clEnqueueNDRangeKernel")) return;
    if(!loadSymbol((void**)&clFinish,               "clFinish")) return;
    if(!loadSymbol((void**)&clReleaseMemObject,     "clReleaseMemObject")) return;
    if(!loadSymbol((void**)&clReleaseKernel,        "clReleaseKernel")) return;
    if(!loadSymbol((void**)&clReleaseProgram,       "clReleaseProgram")) return;
    if(!loadSymbol((void**)&clReleaseCommandQueue,  "clReleaseCommandQueue")) return;
    if(!loadSymbol((void**)&clReleaseContext,       "clReleaseContext")) return;

    cl_uint nPlat = 0;
    if(clGetPlatformIDs(0, NULL, &nPlat) != CL_SUCCESS || nPlat == 0) {
      printf("OpenCL: CPU (no platform)\n");
      return;
    }
    clGetPlatformIDs(1, &platform, NULL);

    cl_uint nDev = 0;
    cl_int ok = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, &nDev);
    if(ok != CL_SUCCESS || nDev == 0) {
      ok = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, &nDev);
      if(ok != CL_SUCCESS || nDev == 0) {
        printf("OpenCL: CPU (no device)\n");
        return;
      }
    }

    cl_int err = 0;
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    if(err != CL_SUCCESS || !context) {
      printf("OpenCL: CPU (context fail)\n");
      return;
    }

    queue = clCreateCommandQueue(context, device, 0, &err);
    if(err != CL_SUCCESS || !queue) {
      printf("OpenCL: CPU (queue fail)\n");
      return;
    }

    const char* src = kernelSource();
    program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
    if(err != CL_SUCCESS || !program) {
      printf("OpenCL: CPU (program fail)\n");
      return;
    }

    err = clBuildProgram(program, 1, &device, "", NULL, NULL);
    if(err != CL_SUCCESS) {
      printf("OpenCL: CPU (build fail)\n");
      printBuildLog();
      return;
    }

    kCorr = clCreateKernel(program, "corr_pairwise", &err);
    if(err != CL_SUCCESS || !kCorr) {
      printf("OpenCL: CPU (kernel fail)\n");
      printBuildLog();
      return;
    }

    featBytes = FEAT_N * N_ASSETS * FEAT_WINDOW * (int)sizeof(float);
    corrBytes = N_ASSETS * N_ASSETS * (int)sizeof(float);

    bufFeat = clCreateBuffer(context, CL_MEM_READ_ONLY, (size_t)featBytes, NULL, &err);
    if(err != CL_SUCCESS || !bufFeat) {
      printf("OpenCL: CPU (bufFeat fail)\n");
      return;
    }

    bufCorr = clCreateBuffer(context, CL_MEM_WRITE_ONLY, (size_t)corrBytes, NULL, &err);
    if(err != CL_SUCCESS || !bufCorr) {
      printf("OpenCL: CPU (bufCorr fail)\n");
      return;
    }

    ready = 1;
    printf("OpenCL: READY (kernel+buffers)\n");
  }

  void shutdown() {
    if(bufCorr) { clReleaseMemObject(bufCorr); bufCorr = NULL; }
    if(bufFeat) { clReleaseMemObject(bufFeat); bufFeat = NULL; }
    if(kCorr) { clReleaseKernel(kCorr); kCorr = NULL; }
    if(program) { clReleaseProgram(program); program = NULL; }
    if(queue) { clReleaseCommandQueue(queue); queue = NULL; }
    if(context) { clReleaseContext(context); context = NULL; }
    if(hOpenCL) { FreeLibrary(hOpenCL); hOpenCL = NULL; }
    ready = 0;
  }

  int computeCorrelationMatrixCL(const float* featLinear, float* outCorr, int nAssets, int nFeat, int windowSize) {
    if(!ready) return 0;
    if(!featLinear || !outCorr) return 0;

    cl_int err = clEnqueueWriteBuffer(queue, bufFeat, CL_TRUE, 0, (size_t)featBytes, featLinear, 0, NULL, NULL);
    if(err != CL_SUCCESS) return 0;

    float eps = 1e-12f;
    err = CL_SUCCESS;
    err |= clSetKernelArg(kCorr, 0, sizeof(cl_mem), &bufFeat);
    err |= clSetKernelArg(kCorr, 1, sizeof(cl_mem), &bufCorr);
    err |= clSetKernelArg(kCorr, 2, sizeof(int), &nAssets);
    err |= clSetKernelArg(kCorr, 3, sizeof(int), &nFeat);
    err |= clSetKernelArg(kCorr, 4, sizeof(int), &windowSize);
    err |= clSetKernelArg(kCorr, 5, sizeof(float), &eps);
    if(err != CL_SUCCESS) return 0;

    size_t global[2];
    global[0] = (size_t)nAssets;
    global[1] = (size_t)nAssets;

    err = clEnqueueNDRangeKernel(queue, kCorr, 2, NULL, global, NULL, 0, NULL, NULL);
    if(err != CL_SUCCESS) return 0;

    err = clFinish(queue);
    if(err != CL_SUCCESS) return 0;

    err = clEnqueueReadBuffer(queue, bufCorr, CL_TRUE, 0, (size_t)corrBytes, outCorr, 0, NULL, NULL);
    if(err != CL_SUCCESS) return 0;

    return 1;
  }
};

// ---------------------------- Learning Layer ----------------------------

struct LearningSnapshot {
  double meanScore;
  double meanCompactness;
  double meanVol;
  int regime;
  double regimeConfidence;
};

class UnsupervisedModel {
public:
  double centroids[3][3];
  int counts[3];
  int initialized;

  UnsupervisedModel() : initialized(0) {
    memset(centroids, 0, sizeof(centroids));
    memset(counts, 0, sizeof(counts));
  }

  void init() {
    initialized = 0;
    memset(centroids, 0, sizeof(centroids));
    memset(counts, 0, sizeof(counts));
  }

  // online k-means over (score, compactness, vol); confidence grows with
  // the margin between the best and second-best centroid distances
  void update(const LearningSnapshot& s, int* regimeOut, double* confOut) {
    double x0 = s.meanScore, x1 = s.meanCompactness, x2 = s.meanVol;
    if(!initialized) {
      for(int k=0;k<3;k++) {
        centroids[k][0] = x0 + 0.01*(k-1);
        centroids[k][1] = x1 + 0.01*(1-k);
        centroids[k][2] = x2 + 0.005*(k-1);
        counts[k] = 1;
      }
      initialized = 1;
    }
    int best = 0;
    double bestDist = INF, secondDist = INF;
    for(int k=0;k<3;k++) {
      double d0 = x0 - centroids[k][0];
      double d1 = x1 - centroids[k][1];
      double d2 = x2 - centroids[k][2];
      double dist = d0*d0 + d1*d1 + d2*d2;
      if(dist < bestDist) { secondDist = bestDist; bestDist = dist; best = k; }
      else if(dist < secondDist) secondDist = dist;
    }
    counts[best]++;
    double lr = 1.0/(double)counts[best];
    centroids[best][0] += lr*(x0 - centroids[best][0]);
    centroids[best][1] += lr*(x1 - centroids[best][1]);
    centroids[best][2] += lr*(x2 - centroids[best][2]);
    *regimeOut = best;
    *confOut = 1.0/(1.0 + sqrt(fabs(secondDist - bestDist) + EPS));
  }
};

class RLAgent {
public:
  double q[4];
  int n[4];
  int lastAction;
  double lastMeanScore;

  RLAgent() : lastAction(0), lastMeanScore(0) {
    for(int i=0;i<4;i++) { q[i]=0; n[i]=0; }
  }

  void init() {
    lastAction = 0;
    lastMeanScore = 0;
    for(int i=0;i<4;i++) { q[i]=0; n[i]=0; }
  }

  // every 10th update explores round-robin; otherwise greedy on q
  int chooseAction(int updateCount) {
    if((updateCount % 10) == 0) return updateCount % 4;
    int b = 0;
    for(int i=1;i<4;i++) if(q[i] > q[b]) b = i;
    return b;
  }

  // reward = realized change in mean score; incremental-average update
  void updateReward(double newMeanScore) {
    double r = newMeanScore - lastMeanScore;
    n[lastAction]++;
    q[lastAction] += (r - q[lastAction]) / (double)n[lastAction];
    lastMeanScore = newMeanScore;
  }
};

class PCAModel {
public:
  double hist[PCA_WINDOW][PCA_DIM];
  double mean[PCA_DIM];
  double stdev[PCA_DIM];
  double latent[PCA_COMP];
  double explainedVar[PCA_COMP];
  int writeIdx;
  int count;
  int rebuildEvery;
  int updates;
  double dom;
  double rot;
  double prevExplained0;

  PCAModel() : writeIdx(0), count(0), rebuildEvery(PCA_REBUILD_EVERY), updates(0), dom(0), rot(0), prevExplained0(0) {
    memset(hist, 0, sizeof(hist));
    memset(mean, 0, sizeof(mean));
    memset(stdev, 0, sizeof(stdev));
    memset(latent, 0, sizeof(latent));
    memset(explainedVar, 0, sizeof(explainedVar));
  }

  void init() {
    writeIdx = 0;
    count = 0;
    updates = 0;
    dom = 0;
    rot = 0;
    prevExplained0 = 0;
    memset(hist, 0, sizeof(hist));
    memset(mean, 0, sizeof(mean));
    memset(stdev, 0, sizeof(stdev));
    memset(latent, 0, sizeof(latent));
    memset(explainedVar, 0, sizeof(explainedVar));
  }

  void pushSnapshot(const double x[PCA_DIM]) {
    for(int d=0; d<PCA_DIM; d++) hist[writeIdx][d] = x[d];
    writeIdx = (writeIdx + 1) % PCA_WINDOW;
    if(count < PCA_WINDOW) count++;
  }

  void rebuildStats() {
    if(count <= 0) return;
    for(int d=0; d<PCA_DIM; d++) {
      double m = 0;
      for(int i=0; i<count; i++) m += hist[i][d];
      m /= (double)count;
      mean[d] = m;

      double v = 0;
      for(int i=0; i<count; i++) {
        double dd = hist[i][d] - m;
        v += dd * dd;
      }
      v /= (double)count;
      stdev[d] = sqrt(v + EPS);
    }
  }

  void update(const LearningSnapshot& snap, int regime, double conf) {
    double x[PCA_DIM];
    x[0] = snap.meanScore;
    x[1] = snap.meanCompactness;
    x[2] = snap.meanVol;
    x[3] = (double)regime / 2.0;
    x[4] = conf;
    x[5] = snap.meanScore - snap.meanCompactness;

    pushSnapshot(x);
    updates++;
    if((updates % rebuildEvery) == 0 || count < 4) rebuildStats();

    double z[PCA_DIM];
    for(int d=0; d<PCA_DIM; d++) z[d] = (x[d] - mean[d]) / (stdev[d] + EPS);

    latent[0] = 0.60*z[0] + 0.30*z[1] + 0.10*z[2];
    latent[1] = 0.25*z[0] - 0.45*z[1] + 0.20*z[2] + 0.10*z[4];
    latent[2] = 0.20*z[2] + 0.50*z[3] - 0.30*z[5];

    double a0 = fabs(latent[0]);
    double a1 = fabs(latent[1]);
    double a2 = fabs(latent[2]);
    double sumA = a0 + a1 + a2 + EPS;

    explainedVar[0] = a0 / sumA;
    explainedVar[1] = a1 / sumA;
    explainedVar[2] = a2 / sumA;

    dom = explainedVar[0];
    rot = fabs(explainedVar[0] - prevExplained0);
    prevExplained0 = explainedVar[0];
  }
};

class GMMRegimeModel {
public:
  double pi[GMM_K];
  double mu[GMM_K][GMM_DIM];
  double var[GMM_K][GMM_DIM];
  double p[GMM_K];
  double entropy;
  double conf;
  int bestRegime;
  int initialized;

  GMMRegimeModel() : entropy(0), conf(0), bestRegime(0), initialized(0) {
    memset(pi, 0, sizeof(pi));
    memset(mu, 0, sizeof(mu));
    memset(var, 0, sizeof(var));
    memset(p, 0, sizeof(p));
  }

  void init() {
    initialized = 0;
    entropy = 0;
    conf = 0;
    bestRegime = 0;
    for(int k=0;k<GMM_K;k++) {
      pi[k] = 1.0 / (double)GMM_K;
      for(int d=0; d<GMM_DIM; d++) {
        mu[k][d] = 0.02 * (k - 1);
        var[k][d] = 1.0;
      }
      p[k] = 1.0 / (double)GMM_K;
    }
    initialized = 1;
  }

  static double gaussianDiag(const double* x, const double* m, const double* v) {
    double logp = 0;
    for(int d=0; d<GMM_DIM; d++) {
      double vv = v[d];
      if(vv < GMM_VAR_FLOOR) vv = GMM_VAR_FLOOR;
      double z = x[d] - m[d];
      logp += -0.5 * (z*z / vv + log(vv + EPS));
    }
    if(logp < -80.0) logp = -80.0;
    return exp(logp);
  }

  void infer(const double x[GMM_DIM]) {
    if(!initialized) init();
    double sum = 0;
    for(int k=0;k<GMM_K;k++) {
      double g = gaussianDiag(x, mu[k], var[k]);
      p[k] = pi[k] * g;
      sum += p[k];
    }
    if(sum < EPS) {
      for(int k=0;k<GMM_K;k++) p[k] = 1.0 / (double)GMM_K;
    } else {
      for(int k=0;k<GMM_K;k++) p[k] /= sum;
    }

    bestRegime = 0;
    conf = p[0];
    for(int k=1;k<GMM_K;k++) {
      if(p[k] > conf) {
        conf = p[k];
        bestRegime = k;
      }
    }

    entropy = 0;
    for(int k=0;k<GMM_K;k++) entropy -= p[k] * log(p[k] + EPS);

#if GMM_ONLINE_UPDATE
    // lightweight incremental update (EM-like with forgetting)
    for(int k=0;k<GMM_K;k++) {
      double w = GMM_ALPHA * p[k];
      pi[k] = (1.0 - GMM_ALPHA) * pi[k] + w;
      for(int d=0; d<GMM_DIM; d++) {
        double diff = x[d] - mu[k][d];
        mu[k][d] += w * diff;
        var[k][d] = (1.0 - w) * var[k][d] + w * diff * diff;
        if(var[k][d] < GMM_VAR_FLOOR) var[k][d] = GMM_VAR_FLOOR;
      }
    }
#endif
  }
};


class HMMRegimeModel {
public:
  double A[HMM_K][HMM_K];
  double mu[HMM_K][HMM_DIM];
  double var[HMM_K][HMM_DIM];
  double posterior[HMM_K];
  double entropy;
  double conf;
  double switchProb;
  int regime;
  int initialized;

  HMMRegimeModel() : entropy(0), conf(0), switchProb(0), regime(0), initialized(0) {
    memset(A, 0, sizeof(A));
    memset(mu, 0, sizeof(mu));
    memset(var, 0, sizeof(var));
    memset(posterior, 0, sizeof(posterior));
  }

  void init() {
    for(int i=0;i<HMM_K;i++) {
      for(int j=0;j<HMM_K;j++) A[i][j] = (i==j) ? 0.90 : 0.10/(double)(HMM_K-1);
      for(int d=0; d<HMM_DIM; d++) {
        mu[i][d] = 0.03 * (i - 1);
        var[i][d] = 1.0;
      }
      posterior[i] = 1.0/(double)HMM_K;
    }
    regime = 0;
    conf = posterior[0];
    entropy = 0;
    switchProb = 0;
    initialized = 1;
  }

  static double emissionDiag(const double* x, const double* m, const double* v) {
    double logp = 0;
    for(int d=0; d<HMM_DIM; d++) {
      double vv = v[d];
      if(vv < HMM_VAR_FLOOR) vv = HMM_VAR_FLOOR;
      double z = x[d] - m[d];
      logp += -0.5 * (z*z / vv + log(vv + EPS));
    }
    if(logp < -80.0) logp = -80.0;
    return exp(logp);
  }

  void filter(const double obs[HMM_DIM]) {
    if(!initialized) init();

    double pred[HMM_K];
    for(int j=0;j<HMM_K;j++) {
      pred[j] = 0;
      for(int i=0;i<HMM_K;i++) pred[j] += posterior[i] * A[i][j];
    }

    double alpha[HMM_K];
    double sum = 0;
    for(int k=0;k<HMM_K;k++) {
      double emit = emissionDiag(obs, mu[k], var[k]);
      alpha[k] = pred[k] * emit;
      sum += alpha[k];
    }
    if(sum < EPS) {
      for(int k=0;k<HMM_K;k++) alpha[k] = 1.0/(double)HMM_K;
    } else {
      for(int k=0;k<HMM_K;k++) alpha[k] /= sum;
    }

    for(int k=0;k<HMM_K;k++) posterior[k] = alpha[k];

    regime = 0;
    conf = posterior[0];
    for(int k=1;k<HMM_K;k++) if(posterior[k] > conf) { conf = posterior[k]; regime = k; }

    entropy = 0;
    for(int k=0;k<HMM_K;k++) entropy -= posterior[k] * log(posterior[k] + EPS);

    switchProb = 1.0 - A[regime][regime];
    if(switchProb < 0) switchProb = 0;
    if(switchProb > 1) switchProb = 1;

#if HMM_ONLINE_UPDATE
    for(int k=0;k<HMM_K;k++) {
      double w = HMM_SMOOTH * posterior[k];
      for(int d=0; d<HMM_DIM; d++) {
        double diff = obs[d] - mu[k][d];
        mu[k][d] += w * diff;
        var[k][d] = (1.0 - w) * var[k][d] + w * diff * diff;
        if(var[k][d] < HMM_VAR_FLOOR) var[k][d] = HMM_VAR_FLOOR;
      }
    }
#endif
  }
};

class KMeansRegimeModel {
public:
  double centroids[KMEANS_K][KMEANS_DIM];
  double distEma;
  double distVarEma;
  int initialized;
  int regime;
  double dist;
  double stability;

  KMeansRegimeModel() : distEma(0), distVarEma(1), initialized(0), regime(0), dist(0), stability(0) {
    memset(centroids, 0, sizeof(centroids));
  }

  void init() {
    distEma = 0;
    distVarEma = 1;
    initialized = 0;
    regime = 0;
    dist = 0;
    stability = 0;
    memset(centroids, 0, sizeof(centroids));
  }

  void seed(const double x[KMEANS_DIM]) {
    for(int k=0;k<KMEANS_K;k++) {
      for(int d=0; d<KMEANS_DIM; d++) {
        centroids[k][d] = x[d] + 0.03 * (k - 1);
      }
    }
    initialized = 1;
  }

  static double clampRange(double x, double lo, double hi) {
    if(x < lo) return lo;
    if(x > hi) return hi;
    return x;
  }

  void predictAndUpdate(const double x[KMEANS_DIM]) {
    if(!initialized) seed(x);

    int best = 0;
    double bestDist = INF;
    for(int k=0;k<KMEANS_K;k++) {
      double s = 0;
      for(int d=0; d<KMEANS_DIM; d++) {
        double z = x[d] - centroids[k][d];
        s += z * z;
      }
      double dk = sqrt(s + EPS);
      if(dk < bestDist) {
        bestDist = dk;
        best = k;
      }
    }

    regime = best;
    dist = bestDist;

    distEma = (1.0 - KMEANS_DIST_EMA) * distEma + KMEANS_DIST_EMA * dist;
    double dd = dist - distEma;
    distVarEma = (1.0 - KMEANS_DIST_EMA) * distVarEma + KMEANS_DIST_EMA * dd * dd;
    double distStd = sqrt(distVarEma + EPS);
    double zDist = (dist - distEma) / (distStd + EPS);
    stability = clampRange(1.0 / (1.0 + exp(zDist)), 0.0, 1.0);

#if KMEANS_ONLINE_UPDATE
    for(int d=0; d<KMEANS_DIM; d++) {
      centroids[best][d] += KMEANS_ETA * (x[d] - centroids[best][d]);
    }
#endif
  }
};


class SpectralClusterModel {
public:
  int clusterId[N_ASSETS];
  int nClusters;

  void init() {
    nClusters = SPECTRAL_K;
    for(int i=0;i<N_ASSETS;i++) clusterId[i] = i % SPECTRAL_K;
  }

  void update(const fvar* distMatrix) {
    if(!distMatrix) return;
    // lightweight deterministic clustering surrogate from distance rows
    for(int i=0;i<N_ASSETS;i++) {
      double sig = 0;
      for(int j=0;j<N_ASSETS;j++) {
        if(i == j) continue;
        double d = (double)distMatrix[i*N_ASSETS + j];
        if(d < INF) sig += d;
      }
      int cid = (int)fmod(fabs(sig * 1000.0), (double)SPECTRAL_K);
      if(cid < 0) cid = 0;
      if(cid >= SPECTRAL_K) cid = SPECTRAL_K - 1;
      clusterId[i] = cid;
    }
  }
};


class HierarchicalClusteringModel {
public:
  int clusterCoarse[N_ASSETS];
  int clusterFine[N_ASSETS];
  int nCoarse;
  int nFine;

  int leftChild[2*N_ASSETS];
  int rightChild[2*N_ASSETS];
  int nodeSize[2*N_ASSETS];
  double nodeHeight[2*N_ASSETS];
  double nodeDist[2*N_ASSETS][2*N_ASSETS];
  int rootNode;

  void init() {
    nCoarse = HCLUST_COARSE_K;
    nFine = HCLUST_FINE_K;
    rootNode = N_ASSETS - 1;
    for(int i=0;i<N_ASSETS;i++) {
      clusterCoarse[i] = i % HCLUST_COARSE_K;
      clusterFine[i] = i % HCLUST_FINE_K;
    }
  }

  void collectLeaves(int node, int clusterId, int* out) {
    int stack[2*N_ASSETS];
    int sp = 0;
    stack[sp++] = node;
    while(sp > 0) {
      int cur = stack[--sp];
      if(cur < N_ASSETS) {
        out[cur] = clusterId;
      } else {
        if(leftChild[cur] >= 0) stack[sp++] = leftChild[cur];
        if(rightChild[cur] >= 0) stack[sp++] = rightChild[cur];
      }
    }
  }

  void cutByK(int K, int* out) {
    for(int i=0;i<N_ASSETS;i++) out[i] = -1;
    if(K <= 1) {
      for(int i=0;i<N_ASSETS;i++) out[i] = 0;
      return;
    }

    int clusters[2*N_ASSETS];
    int count = 1;
    clusters[0] = rootNode;

    while(count < K) {
      int bestPos = -1;
      double bestHeight = -1;
      for(int i=0;i<count;i++) {
        int node = clusters[i];
        if(node >= N_ASSETS && nodeHeight[node] > bestHeight) {
          bestHeight = nodeHeight[node];
          bestPos = i;
        }
      }
      if(bestPos < 0) break;
      int node = clusters[bestPos];
      int l = leftChild[node];
      int r = rightChild[node];
      clusters[bestPos] = l;
      clusters[count++] = r;
    }

    for(int c=0;c<count;c++) {
      collectLeaves(clusters[c], c, out);
    }
    for(int i=0;i<N_ASSETS;i++) if(out[i] < 0) out[i] = 0;
  }

  void update(const fvar* distMatrix) {
    if(!distMatrix) return;

    int totalNodes = 2 * N_ASSETS;
    for(int i=0;i<totalNodes;i++) {
      leftChild[i] = -1;
      rightChild[i] = -1;
      nodeSize[i] = (i < N_ASSETS) ? 1 : 0;
      nodeHeight[i] = 0;
      for(int j=0;j<totalNodes;j++) nodeDist[i][j] = INF;
    }

    for(int i=0;i<N_ASSETS;i++) {
      for(int j=0;j<N_ASSETS;j++) {
        if(i == j) nodeDist[i][j] = 0;
        else {
          double d = (double)distMatrix[i*N_ASSETS + j];
          if(d < 0 || d >= INF) d = 1.0;
          nodeDist[i][j] = d;
        }
      }
    }

    int active[2*N_ASSETS];
    int nActive = N_ASSETS;
    for(int i=0;i<N_ASSETS;i++) active[i] = i;
    int nextNode = N_ASSETS;

    while(nActive > 1 && nextNode < 2*N_ASSETS) {
      int ai = 0, aj = 1;
      double best = INF;
      for(int i=0;i<nActive;i++) {
        for(int j=i+1;j<nActive;j++) {
          int a = active[i], b = active[j];
          if(nodeDist[a][b] < best) {
            best = nodeDist[a][b];
            ai = i; aj = j;
          }
        }
      }

      int a = active[ai];
      int b = active[aj];
      int m = nextNode++;

      leftChild[m] = a;
      rightChild[m] = b;
      nodeHeight[m] = best;
      nodeSize[m] = nodeSize[a] + nodeSize[b];

      for(int i=0;i<nActive;i++) {
        if(i == ai || i == aj) continue;
        int k = active[i];
        double da = nodeDist[a][k];
        double db = nodeDist[b][k];
        // UPGMA: size-weighted average linkage to the merged node
        double dm = (nodeSize[a] * da + nodeSize[b] * db) / (double)(nodeSize[a] + nodeSize[b]);
        nodeDist[m][k] = dm;
        nodeDist[k][m] = dm;
      }
      nodeDist[m][m] = 0;

      if(aj < ai) { int t=ai; ai=aj; aj=t; }
      for(int i=aj;i<nActive-1;i++) active[i] = active[i+1];
      nActive--;
      for(int i=ai;i<nActive-1;i++) active[i] = active[i+1];
      nActive--;
      active[nActive++] = m;
    }

    rootNode = active[0];

    int kc = HCLUST_COARSE_K;
    if(kc < 1) kc = 1;
    if(kc > N_ASSETS) kc = N_ASSETS;
    int kf = HCLUST_FINE_K;
    if(kf < 1) kf = 1;
    if(kf > N_ASSETS) kf = N_ASSETS;

    cutByK(kc, clusterCoarse);
    cutByK(kf, clusterFine);
    nCoarse = kc;
    nFine = kf;
  }
};


class CommunityDetectionModel {
public:
  int communityId[N_ASSETS];
  int clusterCoarse[N_ASSETS];
  int clusterFine[N_ASSETS];
  int nCommunities;
  fvar modularityQ;
  fvar qSmooth;

  void init() {
    nCommunities = 1;
    modularityQ = 0;
    qSmooth = 0;
    for(int i=0;i<N_ASSETS;i++) {
      communityId[i] = 0;
      clusterCoarse[i] = i % HCLUST_COARSE_K;
      clusterFine[i] = i % HCLUST_FINE_K;
    }
  }

  static int argmaxLabel(const fvar w[N_ASSETS], const int label[N_ASSETS], int node) {
    fvar acc[N_ASSETS];
    for(int i=0;i<N_ASSETS;i++) acc[i] = 0;
    for(int j=0;j<N_ASSETS;j++) {
      if(j == node) continue;
      int l = label[j];
      if(l < 0 || l >= N_ASSETS) continue;
      acc[l] += w[j];
    }
    int best = label[node];
    fvar bestV = -1;
    for(int l=0;l<N_ASSETS;l++) {
      if(acc[l] > bestV) { bestV = acc[l]; best = l; }
    }
    return best;
  }

  void update(const fvar* corrMatrix, const fvar* distMatrix) {
    if(!corrMatrix || !distMatrix) return;

    fvar W[N_ASSETS][N_ASSETS];
    fvar degree[N_ASSETS];
    int label[N_ASSETS];

    for(int i=0;i<N_ASSETS;i++) {
      degree[i] = 0;
      label[i] = i;
      for(int j=0;j<N_ASSETS;j++) {
        if(i == j) W[i][j] = 0;
        else {
          fvar w = (fvar)fabs((double)corrMatrix[i*N_ASSETS + j]);
          if(w < (fvar)COMM_W_MIN) w = 0;
          W[i][j] = w;
          degree[i] += w;
        }
      }
    }

    // Optional top-M pruning for determinism/noise control
    for(int i=0;i<N_ASSETS;i++) {
      int keep[N_ASSETS];
      for(int j=0;j<N_ASSETS;j++) keep[j] = 0;
      for(int k=0;k<COMM_TOPM;k++) {
        int best = -1;
        fvar bestW = 0;
        for(int j=0;j<N_ASSETS;j++) {
          if(i==j || keep[j]) continue;
          if(W[i][j] > bestW) { bestW = W[i][j]; best = j; }
        }
        if(best >= 0) keep[best] = 1;
      }
      for(int j=0;j<N_ASSETS;j++) if(i!=j && !keep[j]) W[i][j] = 0;
    }

    // synchronous label propagation: each node adopts the label with max incident weight
    for(int it=0; it<COMM_ITERS; it++) {
      for(int i=0;i<N_ASSETS;i++) {
        label[i] = argmaxLabel(W[i], label, i);
      }
    }

    // compress labels
    int map[N_ASSETS];
    for(int i=0;i<N_ASSETS;i++) map[i] = -1;
    int nLab = 0;
    for(int i=0;i<N_ASSETS;i++) {
      int l = label[i];
      if(l < 0 || l >= N_ASSETS) l = 0;
      if(map[l] < 0) map[l] = nLab++;
      communityId[i] = map[l];
    }
    if(nLab < 1) nLab = 1;
    nCommunities = nLab;

    // modularity approximation
    fvar m2 = 0;
    for(int i=0;i<N_ASSETS;i++) for(int j=0;j<N_ASSETS;j++) m2 += W[i][j];
    if(m2 < (fvar)EPS) {
      modularityQ = 0;
    } else {
      fvar q = 0;
      for(int i=0;i<N_ASSETS;i++) {
        for(int j=0;j<N_ASSETS;j++) {
          if(communityId[i] == communityId[j]) {
            q += W[i][j] - (degree[i] * degree[j] / m2);
          }
        }
      }
      modularityQ = q / m2;
    }

    qSmooth = (fvar)(1.0 - COMM_Q_EMA) * qSmooth + (fvar)COMM_Q_EMA * modularityQ;

    for(int i=0;i<N_ASSETS;i++) {
      int c = communityId[i];
      if(c < 0) c = 0;
      clusterCoarse[i] = c % HCLUST_COARSE_K;
      clusterFine[i] = c % HCLUST_FINE_K;
    }
  }
};


class AutoencoderModel {
public:
  double mu[AE_INPUT_DIM];
  double sigma[AE_INPUT_DIM];
  double W1[AE_LATENT_DIM][AE_INPUT_DIM];
  double W2[AE_INPUT_DIM][AE_LATENT_DIM];
  int initialized;

  void init() {
    initialized = 1;
    for(int i=0;i<AE_INPUT_DIM;i++) {
      mu[i] = 0;
      sigma[i] = 1;
    }
    for(int z=0;z<AE_LATENT_DIM;z++) {
      for(int d=0;d<AE_INPUT_DIM;d++) {
        // deterministic init; encoder and decoder weights are tied
        double w = sin((double)(z+1)*(d+1)) * 0.05;
        W1[z][d] = w;
        W2[d][z] = w;
      }
    }
  }

  static double act(double x) {
    if(x > 4) x = 4;
    if(x < -4) x = -4;
    return tanh(x);
  }

  double infer(const double xIn[AE_INPUT_DIM]) {
    if(!initialized) init();

    double x[AE_INPUT_DIM];
    for(int d=0;d<AE_INPUT_DIM;d++) x[d] = (xIn[d] - mu[d]) / (sigma[d] + EPS);

    double z[AE_LATENT_DIM];
    for(int k=0;k<AE_LATENT_DIM;k++) {
      double s = 0;
      for(int d=0;d<AE_INPUT_DIM;d++) s += W1[k][d] * x[d];
      z[k] = act(s);
    }

    double recon[AE_INPUT_DIM];
    for(int d=0;d<AE_INPUT_DIM;d++) {
      double s = 0;
      for(int k=0;k<AE_LATENT_DIM;k++) s += W2[d][k] * z[k];
      recon[d] = act(s);
    }

    double err = 0;
    for(int d=0;d<AE_INPUT_DIM;d++) {
      double e = x[d] - recon[d];
      err += e*e;
    }
    err /= (double)AE_INPUT_DIM;

    for(int d=0;d<AE_INPUT_DIM;d++) {
      mu[d] = (1.0 - AE_NORM_ALPHA) * mu[d] + AE_NORM_ALPHA * xIn[d];
      double dv = xIn[d] - mu[d];
      sigma[d] = (1.0 - AE_NORM_ALPHA) * sigma[d] + AE_NORM_ALPHA * sqrt(dv*dv + EPS);
      if(sigma[d] < 1e-5) sigma[d] = 1e-5;
    }
    return err;
  }
};

class NoveltyController {
public:
  double errEma;
  double errVar;
  double zRecon;
  int regime;
  double riskScale;

  void init() {
    errEma = 0;
    errVar = 1;
    zRecon = 0;
    regime = 0;
    riskScale = 1.0;
  }

  static double clampRange(double x, double lo, double hi) {
    if(x < lo) return lo;
    if(x > hi) return hi;
    return x;
  }

  void update(double reconError) {
    errEma = (1.0 - AE_ERR_EMA) * errEma + AE_ERR_EMA * reconError;
    double d = reconError - errEma;
    errVar = (1.0 - AE_ERR_EMA) * errVar + AE_ERR_EMA * d*d;
    double errStd = sqrt(errVar + EPS);
    zRecon = (reconError - errEma) / (errStd + EPS);

    if(zRecon >= AE_Z_HIGH) { regime = 2; riskScale = 0.20; }
    else if(zRecon >= AE_Z_LOW) { regime = 1; riskScale = 0.60; }
    else { regime = 0; riskScale = 1.00; }

    riskScale = clampRange(riskScale, 0.20, 1.00);
  }

  void apply(int* topK, double* scoreScale) {
    if(regime == 2) {
      if(*topK > 3) *topK -= 2;
      *scoreScale *= 0.60;
    } else if(regime == 1) {
      if(*topK > 3) *topK -= 1;
      *scoreScale *= 0.85;
    }
    if(*topK < 1) *topK = 1;
    if(*topK > TOP_K) *topK = TOP_K;
    *scoreScale = clampRange(*scoreScale, 0.10, 2.00);
  }
};

class StrategyController {
public:
  UnsupervisedModel unsup;
  RLAgent rl;
  PCAModel pca;
  GMMRegimeModel gmm;
  HMMRegimeModel hmm;
  KMeansRegimeModel kmeans;
  int dynamicTopK;
  double scoreScale;
  int regime;
  double adaptiveGamma;
  double adaptiveAlpha;
  double adaptiveBeta;
  double adaptiveLambda;
  double riskScale;
  int cooldown;

  StrategyController()
  : dynamicTopK(TOP_K), scoreScale(1.0), regime(0),
    adaptiveGamma(1.0), adaptiveAlpha(1.0), adaptiveBeta(1.0), adaptiveLambda(1.0), riskScale(1.0), cooldown(0) {}

  static double clampRange(double x, double lo, double hi) {
    if(x < lo) return lo;
    if(x > hi) return hi;
    return x;
  }

  void init() {
    unsup.init();
    rl.init();
    pca.init();
    gmm.init();
    hmm.init();
    kmeans.init();
    dynamicTopK = TOP_K;
    scoreScale = 1.0;
    regime = 0;
    adaptiveGamma = 1.0;
    adaptiveAlpha = 1.0;
    adaptiveBeta = 1.0;
    adaptiveLambda = 1.0;
    riskScale = 1.0;
    cooldown = 0;
  }

  void buildGMMState(const LearningSnapshot& snap, int reg, double conf, double x[GMM_DIM]) {
    x[0] = snap.meanScore;
    x[1] = snap.meanCompactness;
    x[2] = snap.meanVol;
    x[3] = pca.dom;
    x[4] = pca.rot;
    x[5] = (double)reg / 2.0;
    x[6] = conf;
    x[7] = snap.meanScore - snap.meanCompactness;
  }

  void buildHMMObs(const LearningSnapshot& snap, int reg, double conf, double x[HMM_DIM]) {
    x[0] = pca.latent[0];
    x[1] = pca.latent[1];
    x[2] = pca.latent[2];
    x[3] = snap.meanVol;
    x[4] = snap.meanScore;
    x[5] = snap.meanCompactness;
    x[6] = (double)reg / 2.0;
    x[7] = conf;
  }

  void buildKMeansState(const LearningSnapshot& snap, int reg, double conf, double x[KMEANS_DIM]) {
    x[0] = pca.latent[0];
    x[1] = pca.latent[1];
    x[2] = pca.latent[2];
    x[3] = snap.meanVol;
    x[4] = snap.meanScore;
    x[5] = snap.meanCompactness;
    x[6] = (double)reg / 2.0;
    x[7] = conf;
  }

  void onUpdate(const LearningSnapshot& snap, fvar* scores, int nScores, int updateCount) {
#if USE_ML
    double unsupConf = 0;
    unsup.update(snap, &regime, &unsupConf);
#if USE_PCA
    pca.update(snap, regime, unsupConf);
#else
    pca.dom = 0.5;
    pca.rot = 0.0;
#endif

#if USE_GMM
    double gx[GMM_DIM];
    buildGMMState(snap, regime, unsupConf, gx);
    gmm.infer(gx);
#if USE_HMM
    double hx[HMM_DIM];
    buildHMMObs(snap, regime, unsupConf, hx);
    hmm.filter(hx);
#if USE_KMEANS
    double kx[KMEANS_DIM];
    buildKMeansState(snap, regime, unsupConf, kx);
    kmeans.predictAndUpdate(kx);
#endif
#endif
    // regime presets: [gamma, alpha, beta, lambda]
    const double presets[GMM_K][4] = {
      {1.05, 1.00, 0.95, 1.00},
      {0.95, 1.05, 1.05, 0.95},
      {1.00, 0.95, 1.10, 1.05}
    };
    adaptiveGamma = 0;
    adaptiveAlpha = 0;
    adaptiveBeta  = 0;
    adaptiveLambda = 0;
    for(int k=0;k<GMM_K;k++) {
#if USE_HMM
      adaptiveGamma += hmm.posterior[k] * presets[k][0];
      adaptiveAlpha += hmm.posterior[k] * presets[k][1];
      adaptiveBeta  += hmm.posterior[k] * presets[k][2];
      adaptiveLambda += hmm.posterior[k] * presets[k][3];
#else
      adaptiveGamma += gmm.p[k] * presets[k][0];
      adaptiveAlpha += gmm.p[k] * presets[k][1];
      adaptiveBeta  += gmm.p[k] * presets[k][2];
      adaptiveLambda += gmm.p[k] * presets[k][3];
#endif
    }
#if USE_HMM
    double entNorm = hmm.entropy / log((double)HMM_K + EPS);
    riskScale = clampRange(1.0 - 0.45 * entNorm, HMM_MIN_RISK, 1.0);
    if(hmm.entropy > HMM_ENTROPY_TH || hmm.switchProb > HMM_SWITCH_TH) cooldown = HMM_COOLDOWN_UPDATES;
    else if(cooldown > 0) cooldown--;
#else
    double entNorm = gmm.entropy / log((double)GMM_K + EPS);
    riskScale = clampRange(1.0 - GMM_ENTROPY_COEFF * entNorm, GMM_MIN_RISK, 1.0);
#endif
#else
    adaptiveGamma = 1.0 + 0.35 * pca.dom - 0.25 * pca.rot;
    adaptiveAlpha = 1.0 + 0.30 * pca.dom;
    adaptiveBeta  = 1.0 + 0.25 * pca.rot;
    adaptiveLambda = 1.0 + 0.20 * pca.dom - 0.20 * pca.rot;
    riskScale = 1.0;
#endif

    adaptiveGamma = clampRange(adaptiveGamma, 0.80, 1.40);
    adaptiveAlpha = clampRange(adaptiveAlpha, 0.85, 1.35);
    adaptiveBeta  = clampRange(adaptiveBeta, 0.85, 1.35);
    adaptiveLambda = clampRange(adaptiveLambda, 0.85, 1.25);

#if USE_KMEANS
    const double kmPreset[KMEANS_K][4] = {
      {1.02, 1.00, 0.98, 1.00},
      {1.08, 0.96, 0.95, 1.02},
      {0.94, 1.08, 1.08, 0.92}
    };
    int kr = kmeans.regime;
    if(kr < 0) kr = 0;
    if(kr >= KMEANS_K) kr = KMEANS_K - 1;
    double wkm = clampRange(kmeans.stability, 0.0, 1.0);
    adaptiveGamma = (1.0 - wkm) * adaptiveGamma + wkm * kmPreset[kr][0];
    adaptiveAlpha = (1.0 - wkm) * adaptiveAlpha + wkm * kmPreset[kr][1];
    adaptiveBeta  = (1.0 - wkm) * adaptiveBeta  + wkm * kmPreset[kr][2];
    adaptiveLambda = (1.0 - wkm) * adaptiveLambda + wkm * kmPreset[kr][3];
    if(kmeans.stability < KMEANS_STABILITY_MIN) {
      riskScale *= 0.85;
      if(cooldown < 1) cooldown = 1;
    }
#endif

    rl.updateReward(snap.meanScore);
    rl.lastAction = rl.chooseAction(updateCount);

    int baseTopK = TOP_K;
    if(rl.lastAction == 0) baseTopK = TOP_K - 2;
    else if(rl.lastAction == 1) baseTopK = TOP_K;
    else if(rl.lastAction == 2) baseTopK = TOP_K;
    else baseTopK = TOP_K - 1;

    double profileBias[5] = {1.00, 0.98, 0.99, 0.97, 1.02};
    scoreScale = (1.0 + 0.06 * (adaptiveGamma - 1.0) + 0.04 * (adaptiveAlpha - 1.0) - 0.04 * (adaptiveBeta - 1.0))
               * profileBias[STRATEGY_PROFILE] * riskScale;

    if(pca.dom > 0.60) baseTopK -= 1;
    if(pca.rot > 0.15) baseTopK -= 1;
#if USE_HMM
    if(hmm.regime == 2) baseTopK -= 1;
    if(cooldown > 0) baseTopK -= 1;
#if USE_KMEANS
    if(kmeans.regime == 2) baseTopK -= 1;
#endif
#elif USE_GMM
    if(gmm.bestRegime == 2) baseTopK -= 1;
#endif

    dynamicTopK = baseTopK;
    if(dynamicTopK < 1) dynamicTopK = 1;
    if(dynamicTopK > TOP_K) dynamicTopK = TOP_K;

    for(int i=0; i<nScores; i++) {
      double s = (double)scores[i] * scoreScale;
      if(s > 1.0) s = 1.0;
      if(s < 0.0) s = 0.0;
      scores[i] = (fvar)s;
    }
#else
    (void)snap; (void)scores; (void)nScores; (void)updateCount;
#endif
  }
};

// ---------------------------- Strategy ----------------------------

class MomentumBiasStrategy {
public:
  ExposureTable exposureTable;
  FeatureBufferSoA featSoA;
  OpenCLBackend openCL;

  SlabAllocator<fvar> corrMatrix;
  SlabAllocator<fvar> distMatrix;
  SlabAllocator<fvar> compactness;
  SlabAllocator<fvar> momentum;
  SlabAllocator<fvar> scores;

  SlabAllocator<float> featLinear;
  SlabAllocator<float> corrLinear;

  int barCount;
  int updateCount;
  StrategyController controller;
  HierarchicalClusteringModel hclust;
  CommunityDetectionModel comm;
  AutoencoderModel ae;
  NoveltyController novelty;

  MomentumBiasStrategy() : barCount(0), updateCount(0) {}

  void init() {
    printf("MomentumBias_v12: Initializing...\n");
    exposureTable.init();
    featSoA.init(N_ASSETS, FEAT_WINDOW);
    corrMatrix.init(N_ASSETS * N_ASSETS);
    distMatrix.init(N_ASSETS * N_ASSETS);
    compactness.init(N_ASSETS);
    momentum.init(N_ASSETS);
    scores.init(N_ASSETS);
    featLinear.init(FEAT_N * N_ASSETS * FEAT_WINDOW);
    corrLinear.init(N_ASSETS * N_ASSETS);
    openCL.init();
    printf("MomentumBias_v12: Ready (OpenCL=%d)\n", openCL.ready);
    controller.init();
    hclust.init();
    comm.init();
    ae.init();
    novelty.init();

    barCount = 0;
    updateCount = 0;
  }

  void shutdown() {
    printf("MomentumBias_v12: Shutting down...\n");

    openCL.shutdown();

    featSoA.shutdown();
    corrMatrix.shutdown();
    distMatrix.shutdown();
    compactness.shutdown();
    momentum.shutdown();
    scores.shutdown();

    featLinear.shutdown();
    corrLinear.shutdown();
  }

  void computeFeatures(int assetIdx) {
    asset((char*)ASSET_NAMES[assetIdx]);

    vars C = series(priceClose(0));
    vars V = series(Volatility(C, 20));

    if(Bar < 50) return;

    fvar r1 = (fvar)log(C[0] / C[1]);
    fvar rN = (fvar)log(C[0] / C[12]);
    fvar vol = (fvar)V[0];
    fvar zscore = (fvar)((C[0] - C[50]) / (V[0] * 20.0 + EPS));
    fvar rangeP = (fvar)((C[0] - C[50]) / (C[0] + EPS));
    fvar flow = (fvar)(r1 * vol);
    fvar regime = (fvar)((vol > 0.001) ? 1.0 : 0.0);
    fvar volOfVol = (fvar)(vol * vol);
    fvar persistence = (fvar)fabs(r1);

    featSoA.push(0, assetIdx, r1);
    featSoA.push(1, assetIdx, rN);
    featSoA.push(2, assetIdx, vol);
    featSoA.push(3, assetIdx, zscore);
    featSoA.push(4, assetIdx, rangeP);
    featSoA.push(5, assetIdx, flow);
    featSoA.push(6, assetIdx, regime);
    featSoA.push(7, assetIdx, volOfVol);
    featSoA.push(8, assetIdx, persistence);
  }

  void computeCorrelationMatrixCPU() {
    for(int i=0;i<N_ASSETS*N_ASSETS;i++) corrMatrix[i] = 0;

    for(int f=0; f<FEAT_N; f++){
      for(int a=0; a<N_ASSETS; a++){
        for(int b=a+1; b<N_ASSETS; b++){
          fvar mx = 0, my = 0;
          for(int t=0; t<FEAT_WINDOW; t++){
            mx += featSoA.get(f,a,t);
            my += featSoA.get(f,b,t);
          }
          mx /= (fvar)FEAT_WINDOW;
          my /= (fvar)FEAT_WINDOW;

          fvar sxx = 0, syy = 0, sxy = 0;
          for(int t=0; t<FEAT_WINDOW; t++){
            fvar dx = featSoA.get(f,a,t) - mx;
            fvar dy = featSoA.get(f,b,t) - my;
            sxx += dx*dx;
            syy += dy*dy;
            sxy += dx*dy;
          }

          fvar den = (fvar)sqrt((double)(sxx*syy + (fvar)EPS));
          fvar corr = 0;
          if(den > (fvar)EPS) corr = sxy / den;
          else corr = 0;

          int idx = a*N_ASSETS + b;
          corrMatrix[idx] += corr / (fvar)FEAT_N;
          corrMatrix[b*N_ASSETS + a] = corrMatrix[idx];
        }
      }
    }
  }

  void buildFeatLinear() {
    int idx = 0;
    for(int f=0; f<FEAT_N; f++){
      for(int a=0; a<N_ASSETS; a++){
        for(int t=0; t<FEAT_WINDOW; t++){
          featLinear[idx] = (float)featSoA.get(f, a, t);
          idx++;
        }
      }
    }
  }

  void computeCorrelationMatrix() {
    if(openCL.ready) {
      buildFeatLinear();

      for(int i=0;i<N_ASSETS*N_ASSETS;i++) corrLinear[i] = 0.0f;

      int ok = openCL.computeCorrelationMatrixCL(
        featLinear.data,
        corrLinear.data,
        N_ASSETS,
        FEAT_N,
        FEAT_WINDOW
      );

      if(ok) {
        for(int i=0;i<N_ASSETS*N_ASSETS;i++) corrMatrix[i] = (fvar)0;

        for(int a=0; a<N_ASSETS; a++){
          corrMatrix[a*N_ASSETS + a] = (fvar)1.0;
          for(int b=a+1; b<N_ASSETS; b++){
            float c = corrLinear[a*N_ASSETS + b];
            corrMatrix[a*N_ASSETS + b] = (fvar)c;
            corrMatrix[b*N_ASSETS + a] = (fvar)c;
          }
        }
        return;
      }

      printf("OpenCL: runtime fail -> CPU fallback\n");
      openCL.ready = 0;
    }

    computeCorrelationMatrixCPU();
  }

  void computeDistanceMatrix() {
    for(int i=0;i<N_ASSETS;i++){
      for(int j=0;j<N_ASSETS;j++){
        if(i == j) {
          distMatrix[i*N_ASSETS + j] = (fvar)0;
        } else {
          fvar corrDist = (fvar)1.0 - (fvar)fabs((double)corrMatrix[i*N_ASSETS + j]);
          fvar expDist  = (fvar)exposureTable.getDist(i, j);
          fvar blended = (fvar)LAMBDA_META * corrDist + (fvar)(1.0 - (double)LAMBDA_META) * expDist;
          distMatrix[i*N_ASSETS + j] = blended;
        }
      }
    }
  }

  void floydWarshall() {
    fvar d[N_ASSETS][N_ASSETS];

    for(int i=0;i<N_ASSETS;i++){
      for(int j=0;j<N_ASSETS;j++){
        d[i][j] = distMatrix[i*N_ASSETS + j];
        if(i == j) d[i][j] = (fvar)0;
        if(d[i][j] < (fvar)0) d[i][j] = (fvar)INF;
      }
    }

    for(int k=0;k<N_ASSETS;k++){
      for(int i=0;i<N_ASSETS;i++){
        for(int j=0;j<N_ASSETS;j++){
          if(d[i][k] < (fvar)INF && d[k][j] < (fvar)INF) {
            fvar nk = d[i][k] + d[k][j];
            if(nk < d[i][j]) d[i][j] = nk;
          }
        }
      }
    }

    for(int i=0;i<N_ASSETS;i++){
      fvar w = 0;
      for(int j=i+1;j<N_ASSETS;j++){
        if(d[i][j] < (fvar)INF) w += d[i][j];
      }
      if(w > (fvar)0) compactness[i] = (fvar)(1.0 / (1.0 + (double)w));
      else compactness[i] = (fvar)0;
      momentum[i] = featSoA.get(1, i, 0);
    }
  }

  void computeScores() {
    for(int i=0;i<N_ASSETS;i++){
      fvar coupling = 0;
      int count = 0;

      for(int j=0;j<N_ASSETS;j++){
        if(i != j && distMatrix[i*N_ASSETS + j] < (fvar)INF) {
          coupling += compactness[j];
          count++;
        }
      }

      fvar pCouple = 0;
      if(count > 0) pCouple = coupling / (fvar)count;
      else pCouple = (fvar)0;

      fvar rawScore = (fvar)GAMMA * momentum[i] + (fvar)ALPHA * compactness[i] - (fvar)BETA * pCouple;

      if(rawScore > (fvar)30) rawScore = (fvar)30;
      if(rawScore < (fvar)-30) rawScore = (fvar)-30;

      scores[i] = (fvar)(1.0 / (1.0 + exp(-(double)rawScore)));
    }
  }

  LearningSnapshot buildSnapshot() {
    LearningSnapshot s;
    s.meanScore = 0; s.meanCompactness = 0; s.meanVol = 0;
    for(int i=0;i<N_ASSETS;i++) {
      s.meanScore += (double)scores[i];
      s.meanCompactness += (double)compactness[i];
      s.meanVol += (double)featSoA.get(2, i, 0);
    }
    s.meanScore /= (double)N_ASSETS;
    s.meanCompactness /= (double)N_ASSETS;
    s.meanVol /= (double)N_ASSETS;
    s.regime = 0;
    s.regimeConfidence = 0;
    return s;
  }

  void onBar() {
    barCount++;

    for(int i=0;i<N_ASSETS;i++) computeFeatures(i);

    if(barCount % UPDATE_EVERY == 0) {
      updateCount++;

      computeCorrelationMatrix();
      computeDistanceMatrix();
#if USE_COMMUNITY
      hclust.update(distMatrix.data);
      comm.update(corrMatrix.data, distMatrix.data);
#endif
      floydWarshall();
      computeScores();
      controller.onUpdate(buildSnapshot(), scores.data, N_ASSETS, updateCount);
#if USE_AE
      double aeState[AE_INPUT_DIM];
      double ms=0, mc=0, mv=0;
      for(int i=0;i<N_ASSETS;i++){ ms += (double)scores[i]; mc += (double)compactness[i]; mv += (double)featSoA.get(2, i, 0); }
      ms /= (double)N_ASSETS; mc /= (double)N_ASSETS; mv /= (double)N_ASSETS;
      aeState[0] = ms;
      aeState[1] = mc;
      aeState[2] = mv;
      aeState[3] = controller.scoreScale;
      aeState[4] = (double)controller.dynamicTopK;
      aeState[5] = (double)barCount / (double)(LookBack + 1);
      aeState[6] = (double)updateCount / 1000.0;
      aeState[7] = (double)openCL.ready;
      double reconErr = ae.infer(aeState);
      novelty.update(reconErr);
      novelty.apply(&controller.dynamicTopK, &controller.scoreScale);
      for(int i=0;i<N_ASSETS;i++){
        double s = (double)scores[i] * novelty.riskScale;
        if(s > 1.0) s = 1.0;
        if(s < 0.0) s = 0.0;
        scores[i] = (fvar)s;
      }
#endif
      printTopK();
    }
  }

  void printTopK() {
    int indices[N_ASSETS];
    for(int i=0;i<N_ASSETS;i++) indices[i] = i;

    int topN = controller.dynamicTopK;
#if USE_COMMUNITY
    if(comm.qSmooth < (fvar)COMM_Q_LOW && topN > 2) topN--;
    if(comm.qSmooth > (fvar)COMM_Q_HIGH && topN < TOP_K) topN++;
#endif
    for(int i=0;i<topN;i++){
      for(int j=i+1;j<N_ASSETS;j++){
        if(scores[indices[j]] > scores[indices[i]]) {
          int tmp = indices[i];
          indices[i] = indices[j];
          indices[j] = tmp;
        }
      }
    }

    if(updateCount % 10 == 0) {
      printf("===MomentumBias_v12 Top-K(update#%d,OpenCL=%d)===\n",
        updateCount, openCL.ready);
#if USE_COMMUNITY
      printf(" communities=%d Q=%.4f\n", comm.nCommunities, (double)comm.qSmooth);
#endif

      int selected[N_ASSETS];
      int selCount = 0;
#if USE_COMMUNITY
      int coarseUsed[HCLUST_COARSE_K];
      int fineTake[HCLUST_FINE_K];
      int fineCap = (topN + HCLUST_FINE_K - 1) / HCLUST_FINE_K;
      for(int c=0;c<HCLUST_COARSE_K;c++) coarseUsed[c] = 0;
      for(int c=0;c<HCLUST_FINE_K;c++) fineTake[c] = 0;

      for(int i=0;i<topN;i++){
        int idx = indices[i];
        int cid = comm.clusterCoarse[idx];
        if(cid < 0 || cid >= HCLUST_COARSE_K) cid = 0;
        if(coarseUsed[cid]) continue;
        coarseUsed[cid] = 1;
        selected[selCount++] = idx;
        int fid = comm.clusterFine[idx];
        if(fid < 0 || fid >= HCLUST_FINE_K) fid = 0;
        fineTake[fid]++;
      }

      for(int i=0;i<topN && selCount<topN;i++){
        int idx = indices[i];
        int dup = 0;
        for(int k=0;k<selCount;k++) if(selected[k]==idx){ dup=1; break; }
        if(dup) continue;
        int fid = comm.clusterFine[idx];
        if(fid < 0 || fid >= HCLUST_FINE_K) fid = 0;
        if(fineTake[fid] >= fineCap) continue;
        selected[selCount++] = idx;
        fineTake[fid]++;
      }
#else
      for(int i=0;i<topN;i++) selected[selCount++] = indices[i];
#endif
      for(int i=0;i<selCount;i++){
        int idx = selected[i];
        printf(" %d.%s: score=%.4f, M=%.4f, C=%.4f\n", i+1, ASSET_NAMES[idx], (double)scores[idx], (double)momentum[idx], (double)compactness[idx]);
      }
    }
  }
};

// ---------------------------- Zorro DLL entry ----------------------------

static MomentumBiasStrategy* S = NULL;

DLLFUNC void run()
{
  if(is(INITRUN)) {
    BarPeriod = 60;
    LookBack = max(LookBack, FEAT_WINDOW + 50);

    asset((char*)ASSET_NAMES[0]);

    if(!S) {
      S = new MomentumBiasStrategy();
      S->init();
    }
  }

  if(is(EXITRUN)) {
    if(S) {
      S->shutdown();
      delete S;
      S = NULL;
    }
    return;
  }

  if(!S || Bar < LookBack)
    return;

  S->onBar();
}

Neural Prism Renderer [Re: TipmyPip] #489286
7 hours ago
Joined: Sep 2017
Posts: 276
TipmyPip Online OP
Member
Neural Prism Renderer is a hybrid engine that blends a trading platform runtime with a real-time graphics and compute pipeline. It is built as a Windows dynamic library that can be launched from an automated evaluation host, yet it behaves like a standalone visual instrument once running. The program opens a window, creates a hardware-accelerated drawing context, and prepares a streaming pixel surface that can be updated every frame without copying through the CPU. It then sets up a compute backend that can share that pixel surface directly, allowing the compute side to paint the image while the graphics side simply presents it.

At its core is a tiny neural network whose parameters are created through a deep-learning runtime. The network is not trained interactively in this file; instead, it is instantiated and its weights are extracted in a safe, deterministic way. Those weights are packed into simple arrays and uploaded to the compute backend. This separation is intentional: the deep-learning runtime is used as a reliable source of tensor layout and parameter creation, while the compute kernel remains lightweight and portable.
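The "pack weights into simple arrays" step can be sketched without LibTorch itself; the `PackedParams` struct, `packParams` helper, and layer shapes below are illustrative stand-ins for the real tensors, showing only the contiguous-buffer-plus-offsets idea used for the upload:

```cpp
#include <cstddef>
#include <vector>

// Pack several row-major parameter tensors into one contiguous upload
// buffer, recording where each tensor starts. The compute backend then
// receives a single flat array plus an offset table.
struct PackedParams {
    std::vector<float> data;          // one flat buffer for the device
    std::vector<std::size_t> offsets; // start index of each tensor
};

PackedParams packParams(const std::vector<std::vector<float>>& tensors) {
    PackedParams p;
    for (const auto& t : tensors) {
        p.offsets.push_back(p.data.size()); // remember where this tensor begins
        p.data.insert(p.data.end(), t.begin(), t.end());
    }
    return p;
}
```

For example, a 2x3 weight matrix followed by a 2-element bias would produce offsets {0, 6} and a buffer of 8 floats, which the kernel indexes with the same offsets.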

Each frame, the compute kernel visits every pixel as if it were a tiny sensor on a grid. For each pixel it forms a small coordinate input, pushes that input through the neural layers, and converts the output into color channels. The result is a neural field rendered as an image. The compute kernel writes directly into a shared buffer that the graphics system can map into a texture, so the presentation step is fast and consistent.
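A CPU reference of what such a per-pixel kernel computes might look like the sketch below; `renderNeuralField`, the layer sizes, and the fixed weights are assumptions for illustration, not the network the renderer actually uploads:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

static float actTanh(float x) { return std::tanh(x); }

// For every pixel: map coordinates to [-1,1], push them through a tiny
// fixed MLP, and squash the outputs into RGBA bytes. An OpenCL kernel
// would do the same work with one work-item per pixel, writing into the
// shared buffer that the graphics side maps into a texture.
void renderNeuralField(std::vector<unsigned char>& rgba, int w, int h) {
    rgba.assign((std::size_t)w * h * 4, 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float u = 2.0f * x / (w - 1) - 1.0f; // normalized coords
            float v = 2.0f * y / (h - 1) - 1.0f;
            float hid[4]; // hidden layer, 4 units, illustrative weights
            for (int k = 0; k < 4; ++k)
                hid[k] = actTanh(0.9f * u * (k + 1) - 0.7f * v * (4 - k));
            std::size_t o = 4 * ((std::size_t)y * w + x);
            for (int c = 0; c < 3; ++c) { // output layer -> 3 channels
                float s = 0;
                for (int k = 0; k < 4; ++k)
                    s += hid[k] * 0.5f * (float)((k + c) % 3 - 1);
                float col = 0.5f + 0.5f * actTanh(s); // into [0,1]
                rgba[o + c] = (unsigned char)(col * 255.0f);
            }
            rgba[o + 3] = 255; // opaque alpha
        }
    }
}
```

Because the field is a pure function of coordinates and weights, the image is deterministic frame to frame unless the weights are re-uploaded.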

The code is also an integration blueprint. It carefully orders includes to prevent macro collisions between the trading platform headers and the deep-learning headers. It cleans up common macro landmines that can silently corrupt builds. It offers controlled shutdown paths, respects user input to close the window, and can be configured to auto-exit after a chosen time. Finally, it wraps the whole interactive loop inside a single-cycle execution mode so the evaluation host does not relaunch it repeatedly. In abstract terms, it is a bridge between model parameters, parallel compute, and visual feedback, packaged to coexist with an automated trading research environment.

Code
// Mendb02.cpp
// Win32 + WGL(OpenGL) display + OpenCL compute (CL/GL sharing)
// + Tiny Neural Net inference per pixel (OpenCL kernel) using weights from LibTorch.

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif

#define _CRT_SECURE_NO_WARNINGS

// ============================================================
// 1) Include LibTorch FIRST (like your working file)
//    Public/shareable variant: no machine-specific include paths.
// ============================================================
#if defined(__has_include)
  #if __has_include(<torch/torch.h>) && __has_include(<torch/script.h>)
    #include <torch/torch.h>
    #include <torch/script.h>
  #else
    #error "LibTorch headers not found. Add LibTorch include paths to your build configuration."
  #endif
#else
  #include <torch/torch.h>
  #include <torch/script.h>
#endif

// (Optional) CUDA headers (safe pattern used by your working file)
// Keep them conditional so CPU-only LibTorch setups still compile.
#if defined(__has_include)
  #if __has_include(<torch/cuda.h>)
    #include <torch/cuda.h>
    #define HAVE_TORCH_CUDA_HEADER 1
  #else
    #define HAVE_TORCH_CUDA_HEADER 0
  #endif
  #if __has_include(<cuda_runtime_api.h>)
    #include <cuda_runtime_api.h>
    #define HAVE_CUDA_RUNTIME_API_HEADER 1
  #else
    #define HAVE_CUDA_RUNTIME_API_HEADER 0
  #endif
#else
  #define HAVE_TORCH_CUDA_HEADER 0
  #define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif

#if defined(__has_include)
  #if __has_include(<c10/cuda/CUDAGuard.h>) && __has_include(<c10/cuda/impl/cuda_cmake_macros.h>)
    #include <c10/cuda/CUDAGuard.h>
    #define HAVE_C10_CUDAGUARD 1
  #else
    #define HAVE_C10_CUDAGUARD 0
  #endif
#else
  #define HAVE_C10_CUDAGUARD 0
#endif

// ============================================================
// 2) Standard headers
// ============================================================
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>

// ============================================================
// 3) Include Zorro AFTER torch, rename Zorro's 'at' to avoid conflict
//    (exact pattern from your working file)
// ============================================================
#define at zorro_at
#ifdef LOG
#undef LOG
#endif
#include <zorro.h>
#undef at

// ============================================================
// 4) Cleanup macro landmines (exact style from your working file)
// ============================================================
#ifdef min
#undef min
#endif
#ifdef max
#undef max
#endif
#ifdef ref
#undef ref
#endif
#ifdef swap
#undef swap
#endif
#ifdef abs
#undef abs
#endif

#ifdef NTF
#undef NTF
#endif
#ifdef LOOKBACK
#undef LOOKBACK
#endif
#ifdef BINS
#undef BINS
#endif

// ============================================================
// OpenCL + OpenGL includes (after the macro cleanup is safest)
// ============================================================
#include <CL/cl.h>
#include <CL/cl_gl.h>     // cl_khr_gl_sharing
#include <CL/cl_gl_ext.h> // CL_GL_CONTEXT_KHR / CL_WGL_HDC_KHR
#include <GL/gl.h>

#ifndef GL_RGBA8
#define GL_RGBA8 0x8058
#endif

// ------------------------- Globals -------------------------
static HWND   gHwnd = 0;
static HDC    gHdc  = 0;
static HGLRC  gHgl  = 0;

static int    gW = 640;
static int    gH = 480;

static int read_env_int(const char* key, int fallback)
{
  const char* s = getenv(key);
  if(!s || !*s) return fallback;
  int v = atoi(s);
  return (v > 0) ? v : fallback;
}

// ------------------------- WinProc forward -------------------------
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);

// ===========================================================
//                Minimal OpenGL function loading
// ===========================================================

#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#endif
#ifndef GL_PIXEL_UNPACK_BUFFER
#define GL_PIXEL_UNPACK_BUFFER 0x88EC
#endif
#ifndef GL_DYNAMIC_DRAW
#define GL_DYNAMIC_DRAW 0x88E8
#endif

#ifndef APIENTRY
#define APIENTRY __stdcall
#endif
#ifndef APIENTRYP
#define APIENTRYP APIENTRY *
#endif

typedef void (APIENTRYP PFNGLGENBUFFERSPROC)(GLsizei, GLuint*);
typedef void (APIENTRYP PFNGLBINDBUFFERPROC)(GLenum, GLuint);
typedef void (APIENTRYP PFNGLBUFFERDATAPROC)(GLenum, ptrdiff_t, const void*, GLenum);
typedef void (APIENTRYP PFNGLDELETEBUFFERSPROC)(GLsizei, const GLuint*);

static PFNGLGENBUFFERSPROC    p_glGenBuffers    = 0;
static PFNGLBINDBUFFERPROC    p_glBindBuffer    = 0;
static PFNGLBUFFERDATAPROC    p_glBufferData    = 0;
static PFNGLDELETEBUFFERSPROC p_glDeleteBuffers = 0;

static void* gl_get_proc(const char* name)
{
  void* p = (void*)wglGetProcAddress(name);
  if(!p) {
    HMODULE ogl = GetModuleHandleA("opengl32.dll");
    if(ogl) p = (void*)GetProcAddress(ogl, name);
  }
  return p;
}

static int gl_load_ext()
{
  p_glGenBuffers    = (PFNGLGENBUFFERSPROC)gl_get_proc("glGenBuffers");
  p_glBindBuffer    = (PFNGLBINDBUFFERPROC)gl_get_proc("glBindBuffer");
  p_glBufferData    = (PFNGLBUFFERDATAPROC)gl_get_proc("glBufferData");
  p_glDeleteBuffers = (PFNGLDELETEBUFFERSPROC)gl_get_proc("glDeleteBuffers");

  if(!p_glGenBuffers || !p_glBindBuffer || !p_glBufferData || !p_glDeleteBuffers)
    return 0;
  return 1;
}

// ===========================================================
//                       OpenGL objects
// ===========================================================

static GLuint gPBO = 0;
static GLuint gTex = 0;

static void gl_release_all()
{
  if(gTex) {
    glDeleteTextures(1, &gTex);
    gTex = 0;
  }
  if(gPBO) {
    if(p_glDeleteBuffers) p_glDeleteBuffers(1, &gPBO);
    gPBO = 0;
  }

  if(gHgl) { wglMakeCurrent(NULL, NULL); wglDeleteContext(gHgl); gHgl = 0; }
  if(gHdc && gHwnd) { ReleaseDC(gHwnd, gHdc); gHdc = 0; }
}

static int gl_init_wgl(HWND hwnd)
{
  gHwnd = hwnd;
  gHdc = GetDC(hwnd);
  if(!gHdc) return 0;

  PIXELFORMATDESCRIPTOR pfd;
  ZeroMemory(&pfd, sizeof(pfd));
  pfd.nSize      = sizeof(pfd);
  pfd.nVersion   = 1;
  pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
  pfd.iPixelType = PFD_TYPE_RGBA;
  pfd.cColorBits = 32;
  pfd.cDepthBits = 16;
  pfd.iLayerType = PFD_MAIN_PLANE;

  int pf = ChoosePixelFormat(gHdc, &pfd);
  if(pf == 0) return 0;
  if(!SetPixelFormat(gHdc, pf, &pfd)) return 0;

  gHgl = wglCreateContext(gHdc);
  if(!gHgl) return 0;
  if(!wglMakeCurrent(gHdc, gHgl)) return 0;

  if(!gl_load_ext()) {
    printf("\nOpenGL buffer functions not available (need VBO/PBO support).");
    return 0;
  }

  glDisable(GL_DEPTH_TEST);
  glViewport(0, 0, gW, gH);

  // Create PBO for RGBA pixels
  p_glGenBuffers(1, &gPBO);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  p_glBufferData(GL_PIXEL_UNPACK_BUFFER, (ptrdiff_t)(gW * gH * 4), 0, GL_DYNAMIC_DRAW);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  // Create texture
  glGenTextures(1, &gTex);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, gW, gH, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  glBindTexture(GL_TEXTURE_2D, 0);

  return 1;
}

// ===========================================================
//                  Tiny NN (LibTorch -> weights)
// ===========================================================

#define NN_IN 2
#define NN_H 16
#define NN_OUT 3

struct TinyMLPImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
  TinyMLPImpl() {
    fc1 = register_module("fc1", torch::nn::Linear(NN_IN, NN_H));
    fc2 = register_module("fc2", torch::nn::Linear(NN_H, NN_OUT));
  }
  torch::Tensor forward(torch::Tensor x) {
    x = torch::tanh(fc1->forward(x));
    x = torch::tanh(fc2->forward(x));
    return x;
  }
};
TORCH_MODULE(TinyMLP);

static int build_weights_from_libtorch(float* W1, float* b1, float* W2, float* b2)
{
  if(!W1 || !b1 || !W2 || !b2) return 0;

  try {
    torch::NoGradGuard ng;
    torch::manual_seed(1);
    TinyMLP m;
    m->eval();

    auto w1  = m->fc1->weight.detach().contiguous().to(torch::kCPU);
    auto bb1 = m->fc1->bias.detach().contiguous().to(torch::kCPU);

    auto w2  = m->fc2->weight.detach().contiguous().to(torch::kCPU);
    auto bb2 = m->fc2->bias.detach().contiguous().to(torch::kCPU);

    memcpy(W1, w1.data_ptr<float>(),  sizeof(float)*NN_H*NN_IN);
    memcpy(b1, bb1.data_ptr<float>(), sizeof(float)*NN_H);
    memcpy(W2, w2.data_ptr<float>(),  sizeof(float)*NN_OUT*NN_H);
    memcpy(b2, bb2.data_ptr<float>(), sizeof(float)*NN_OUT);

    return 1;
  }
  catch(const c10::Error& e) {
    printf("\n[LibTorch] Error: %s", e.what());
    return 0;
  }
  catch(...) {
    printf("\n[LibTorch] Unknown error.");
    return 0;
  }
}

// ===========================================================
//                    OpenCL (GL sharing)
// ===========================================================

static int gCL_Ready = 0;

static cl_platform_id   gCL_Platform = 0;
static cl_device_id     gCL_Device   = 0;
static cl_context       gCL_Context  = 0;
static cl_command_queue gCL_Queue    = 0;
static cl_program       gCL_Program  = 0;

static cl_kernel        gCL_K_NN     = 0;

static cl_mem gCL_PBO = 0; // CL view of GL PBO

static cl_mem gCL_W1 = 0;
static cl_mem gCL_b1 = 0;
static cl_mem gCL_W2 = 0;
static cl_mem gCL_b2 = 0;

#define STR2(x) #x
#define XSTR(x) STR2(x)

static const char* gCL_Source =
"__kernel void nn_render(__global uchar4* out, int width, int height,              \n"
"  __global const float* W1, __global const float* b1,                             \n"
"  __global const float* W2, __global const float* b2)                             \n"
"{                                                                                 \n"
"  int xpix = (int)get_global_id(0);                                               \n"
"  int ypix = (int)get_global_id(1);                                               \n"
"  if(xpix >= width || ypix >= height) return;                                     \n"
"                                                                                  \n"
"  float x = ((float)xpix / (float)(width  - 1)) * 2.0f - 1.0f;                    \n"
"  float y = ((float)ypix / (float)(height - 1)) * 2.0f - 1.0f;                    \n"
"  float in0 = x;                                                                  \n"
"  float in1 = -y;                                                                 \n"
"                                                                                  \n"
"  float h[" XSTR(NN_H) "];                                                        \n"
"  for(int j=0;j<" XSTR(NN_H) ";j++){                                              \n"
"    float acc = b1[j];                                                            \n"
"    acc += in0 * W1[j*" XSTR(NN_IN) " + 0];                                       \n"
"    acc += in1 * W1[j*" XSTR(NN_IN) " + 1];                                       \n"
"    h[j] = tanh(acc);                                                             \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float o[" XSTR(NN_OUT) "];                                                      \n"
"  for(int k=0;k<" XSTR(NN_OUT) ";k++){                                            \n"
"    float acc = b2[k];                                                            \n"
"    for(int j=0;j<" XSTR(NN_H) ";j++){                                            \n"
"      acc += h[j] * W2[k*" XSTR(NN_H) " + j];                                     \n"
"    }                                                                             \n"
"    float s = 0.5f + 0.5f*tanh(acc);                                              \n"
"    if(s<0) s=0; if(s>1) s=1;                                                     \n"
"    o[k] = s;                                                                     \n"
"  }                                                                               \n"
"                                                                                  \n"
"  uchar r = (uchar)(255.0f*o[0]);                                                 \n"
"  uchar g = (uchar)(255.0f*o[1]);                                                 \n"
"  uchar b = (uchar)(255.0f*o[2]);                                                 \n"
"  out[ypix*width + xpix] = (uchar4)(r,g,b,255);                                   \n"
"}                                                                                 \n";

static void cl_release_all()
{
  if(gCL_b2) { clReleaseMemObject(gCL_b2); gCL_b2 = 0; }
  if(gCL_W2) { clReleaseMemObject(gCL_W2); gCL_W2 = 0; }
  if(gCL_b1) { clReleaseMemObject(gCL_b1); gCL_b1 = 0; }
  if(gCL_W1) { clReleaseMemObject(gCL_W1); gCL_W1 = 0; }

  if(gCL_PBO)    { clReleaseMemObject(gCL_PBO);    gCL_PBO = 0; }
  if(gCL_K_NN)   { clReleaseKernel(gCL_K_NN);      gCL_K_NN = 0; }
  if(gCL_Program){ clReleaseProgram(gCL_Program);  gCL_Program = 0; }
  if(gCL_Queue)  { clReleaseCommandQueue(gCL_Queue); gCL_Queue = 0; }
  if(gCL_Context){ clReleaseContext(gCL_Context);  gCL_Context = 0; }

  gCL_Device = 0;
  gCL_Platform = 0;
  gCL_Ready = 0;
}

static int cl_pick_device_with_glshare(cl_platform_id* outP, cl_device_id* outD)
{
  cl_uint nPlatforms = 0;
  if(clGetPlatformIDs(0, 0, &nPlatforms) != CL_SUCCESS || nPlatforms == 0)
    return 0;

  cl_platform_id platforms[8];
  if(nPlatforms > 8) nPlatforms = 8;
  if(clGetPlatformIDs(nPlatforms, platforms, &nPlatforms) != CL_SUCCESS)
    return 0;

  for(cl_uint p=0; p<nPlatforms; p++)
  {
    cl_uint nDev = 0;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 0, 0, &nDev) != CL_SUCCESS || nDev == 0)
      continue;

    cl_device_id devs[8];
    if(nDev > 8) nDev = 8;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, nDev, devs, &nDev) != CL_SUCCESS)
      continue;

    for(cl_uint d=0; d<nDev; d++)
    {
      char ext[8192];
      size_t sz = 0;
      if(clGetDeviceInfo(devs[d], CL_DEVICE_EXTENSIONS, sizeof(ext), ext, &sz) != CL_SUCCESS)
        continue;

      if(strstr(ext, "cl_khr_gl_sharing"))
      {
        *outP = platforms[p];
        *outD = devs[d];
        return 1;
      }
    }
  }

  return 0;
}

static int cl_init_glshare()
{
  cl_int err = CL_SUCCESS;

  cl_platform_id P = 0;
  cl_device_id   D = 0;

  if(!cl_pick_device_with_glshare(&P, &D)) {
    printf("\nOpenCL: no GPU device with cl_khr_gl_sharing found.");
    return 0;
  }

  gCL_Platform = P;
  gCL_Device   = D;

  cl_context_properties props[] = {
    CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
    CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
    CL_CONTEXT_PLATFORM, (cl_context_properties)gCL_Platform,
    0
  };

  gCL_Context = clCreateContext(props, 1, &gCL_Device, 0, 0, &err);
  if(err != CL_SUCCESS || !gCL_Context) { cl_release_all(); return 0; }

  gCL_Queue = clCreateCommandQueue(gCL_Context, gCL_Device, 0, &err);
  if(err != CL_SUCCESS || !gCL_Queue) { cl_release_all(); return 0; }

  gCL_Program = clCreateProgramWithSource(gCL_Context, 1, &gCL_Source, 0, &err);
  if(err != CL_SUCCESS || !gCL_Program) { cl_release_all(); return 0; }

  err = clBuildProgram(gCL_Program, 1, &gCL_Device, 0, 0, 0);
  if(err != CL_SUCCESS)
  {
    char logbuf[8192];
    size_t logsz = 0;
    clGetProgramBuildInfo(gCL_Program, gCL_Device, CL_PROGRAM_BUILD_LOG, sizeof(logbuf), logbuf, &logsz);
    printf("\nOpenCL build failed:\n%s", logbuf);
    cl_release_all();
    return 0;
  }

  gCL_K_NN = clCreateKernel(gCL_Program, "nn_render", &err);
  if(err != CL_SUCCESS || !gCL_K_NN) { cl_release_all(); return 0; }

  gCL_PBO = clCreateFromGLBuffer(gCL_Context, CL_MEM_WRITE_ONLY, gPBO, &err);
  if(err != CL_SUCCESS || !gCL_PBO) { cl_release_all(); return 0; }

  size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
  size_t bytesb1 = sizeof(float)*(size_t)NN_H;
  size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
  size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;

  gCL_W1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW1, 0, &err);
  gCL_b1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb1, 0, &err);
  gCL_W2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW2, 0, &err);
  gCL_b2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb2, 0, &err);
  if(err != CL_SUCCESS || !gCL_W1 || !gCL_b1 || !gCL_W2 || !gCL_b2) { cl_release_all(); return 0; }

  float hW1[NN_H*NN_IN];
  float hb1[NN_H];
  float hW2[NN_OUT*NN_H];
  float hb2[NN_OUT];

  if(!build_weights_from_libtorch(hW1, hb1, hW2, hb2)) {
    printf("\n[LibTorch] Failed to build weights.");
    cl_release_all();
    return 0;
  }

  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_TRUE, 0, bytesW1, hW1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_TRUE, 0, bytesb1, hb1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_TRUE, 0, bytesW2, hW2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_TRUE, 0, bytesb2, hb2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }

  gCL_Ready = 1;
  printf("\nOpenCL: GL-sharing enabled. NN kernel ready.");
  return 1;
}

// ===========================================================
//                      Render (CL -> GL)
// ===========================================================

static void RenderFrame()
{
  if(!gCL_Ready) return;

  size_t global[2] = { (size_t)gW, (size_t)gH };
  size_t local[2]  = { 16, 16 };

  cl_int err = CL_SUCCESS;

  err = clEnqueueAcquireGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  if(err != CL_SUCCESS) return;

  int arg = 0;
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_PBO);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gW);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gH);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b2);

  err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, local, 0, 0, 0);
  if(err != CL_SUCCESS) {
    err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, 0, 0, 0, 0);
  }

  clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  clFinish(gCL_Queue);

  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, gW, gH, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  glClear(GL_COLOR_BUFFER_BIT);
  glEnable(GL_TEXTURE_2D);
  glBindTexture(GL_TEXTURE_2D, gTex);

  glBegin(GL_QUADS);
    glTexCoord2f(0,0); glVertex2f(-1,-1);
    glTexCoord2f(1,0); glVertex2f( 1,-1);
    glTexCoord2f(1,1); glVertex2f( 1, 1);
    glTexCoord2f(0,1); glVertex2f(-1, 1);
  glEnd();

  glBindTexture(GL_TEXTURE_2D, 0);
  SwapBuffers(gHdc);
}

// ===========================================================
//                         WinMain
// ===========================================================

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
  // 0 means no auto-close; window stays until user closes it.
  const int maxSeconds = read_env_int("MENDB02_MAX_SECONDS", 0);
  ULONGLONG startTick = GetTickCount64();

  const char* szClass = "Mendb02NNCLGLClass";
  UnregisterClassA(szClass, hInst);

  WNDCLASSEXA wc;
  ZeroMemory(&wc, sizeof(wc));
  wc.cbSize = sizeof(wc);
  wc.style = CS_HREDRAW | CS_VREDRAW;
  wc.lpfnWndProc = WndProc;
  wc.hInstance = hInst;
  wc.hCursor = LoadCursor(NULL, IDC_ARROW);
  wc.lpszClassName = szClass;
  RegisterClassExA(&wc);

  RECT r;
  r.left=0; r.top=0; r.right=gW; r.bottom=gH;
  AdjustWindowRect(&r, WS_OVERLAPPEDWINDOW, FALSE);

  HWND hwnd = CreateWindowExA(
    0, szClass, "NN Render (LibTorch weights + OpenCL + OpenGL)",
    WS_OVERLAPPEDWINDOW,
    100, 100, (r.right-r.left), (r.bottom-r.top),
    0, 0, hInst, 0);

  if(!hwnd) return 0;

  ShowWindow(hwnd, SW_SHOW);
  UpdateWindow(hwnd);

  if(!gl_init_wgl(hwnd))
  {
    MessageBoxA(hwnd, "OpenGL init failed", "Error", MB_OK);
    gl_release_all();
    return 0;
  }

  if(!cl_init_glshare())
  {
    MessageBoxA(hwnd, "OpenCL GL-sharing init failed", "Error", MB_OK);
    cl_release_all();
    gl_release_all();
    return 0;
  }

  MSG msg;
  ZeroMemory(&msg, sizeof(msg));

  while(msg.message != WM_QUIT)
  {
    while(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
    }

    // Allow Zorro STOP to close this Win32 loop cleanly, but ignore
    // the sticky FIRSTINITRUN+EXITRUN combo seen at startup.
    if(is(EXITRUN) && !is(FIRSTINITRUN)) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    if(!IsWindow(hwnd))
      break;

    if(maxSeconds > 0 && (GetTickCount64() - startTick) >= (ULONGLONG)maxSeconds * 1000ULL) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    RenderFrame();
  }

  cl_release_all();
  gl_release_all();
  gHwnd = 0;
  return 0;
}

// ===========================================================
//                         Input
// ===========================================================

LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
  switch(msg)
  {
    case WM_CLOSE:
      DestroyWindow(hWnd);
      return 0;

    case WM_KEYDOWN:
      if(wParam == VK_ESCAPE || wParam == VK_F12) {
        PostMessage(hWnd, WM_CLOSE, 0, 0);
        return 0;
      }
      return 0;

    case WM_DESTROY:
      PostQuitMessage(0);
      return 0;
  }
  return DefWindowProc(hWnd, msg, wParam, lParam);
}

// ===========================================================
//                         Zorro DLL entry
// ===========================================================

DLLFUNC int main()
{
  // Force single-cycle execution in Zorro to avoid automatic relaunches.
  NumTotalCycles = 1;
  NumWFOCycles = 1;
  NumSampleCycles = 1;
  set(TESTNOW|OFF,ALLCYCLES|OFF,PARAMETERS|OFF,FACTORS|OFF,RULES|OFF);

  static int done = 0;
  if(is(FIRSTINITRUN))
    done = 0;

  if(done)
    return 0;

  (void)WinMain(GetModuleHandleA(NULL), NULL, GetCommandLineA(), SW_SHOWDEFAULT);
  done = 1;
  return quit("!Mendb02 finished");
}

Last edited by TipmyPip; 5 hours ago.
TorchBridge Pixel Loom [Re: TipmyPip] #489287
6 hours ago
Joined: Sep 2017
Posts: 276
TipmyPip Online OP
Member
TorchBridge Pixel Loom is a demonstration strategy that turns three separate worlds into a single continuous pipeline: a learning world that defines how a tiny neural network should behave, a compute world that applies that behavior at massive scale, and a graphics world that displays the results with minimal copying. The design is not about training a model in real time, and it is not primarily about trading signals. Instead, it is an engineering pattern that shows how a learning library can author numeric parameters, how a parallel compute engine can transform those parameters into per element decisions, and how a graphics engine can present the output in a window at interactive speed. The entire system is constructed as a bridge between libraries that were not originally meant to cooperate, and the code spends as much effort preventing conflicts as it spends performing work. The core mathematical relationship is that all three subsystems are manipulating the same kind of thing, a structured collection of numbers, but each subsystem wants that collection in a different form, with different rules about ownership, memory, and timing. The code is therefore a story about representation, transfer, and synchronization.

The first act is the integration discipline. The file begins by pulling in the learning library first. This is not cosmetic. LibTorch brings heavy template machinery and a long tail of macros and identifiers. Zorro brings its own macro definitions and naming habits, including the short identifier at, which collides with LibTorch's central at:: namespace; the code renames it to zorro_at around the Zorro include. If the includes are reversed, the compiler can be pushed into confusing or contradictory interpretations. The chosen include order makes the learning library set the ground rules first, then invites Zorro into that environment after the most dangerous names are neutralized. The code then performs a cleanup sweep that removes common macro landmines such as min, max, abs, and other short identifiers that can silently change the meaning of later code. This is a practical form of mathematical correctness. It is not about equations; it is about ensuring that function names and numeric operations mean what the author expects in every compilation unit.
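The macro landmine problem can be shown in isolation. The sketch below is standalone (it involves neither Zorro nor LibTorch); it simulates a hostile header defining min as a function-like macro, the way windows.h does without NOMINMAX, and applies the same #ifdef/#undef sweep the strategy file uses:

```cpp
// Standalone illustration of the macro-landmine problem: a header that
// defines min as a function-like macro breaks std::min, because the
// preprocessor rewrites the call site before the compiler ever sees it.
#include <algorithm>
#include <cassert>

#define min(a,b) (((a) < (b)) ? (a) : (b))  // simulate a hostile header

// Cleanup sweep, same pattern as in the strategy file:
#ifdef min
#undef min
#endif

int smaller(int a, int b) {
  // Compiles only because 'min' was undefined above; with the macro
  // still active, "std::min(a, b)" would expand into invalid code.
  return std::min(a, b);
}
```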

The second act is the graphics world. OpenGL is used through a Win32 window and a WGL context. The graphics subsystem creates a pixel buffer object, which is a GPU backed container for raw pixel bytes. It also creates a texture that can be updated from that buffer. The pixel buffer object is sized for a full image in four channels per pixel. The texture is configured to display those pixels without filtering tricks that could blur or resample the data. The graphics pipeline here is deliberately simple: update the texture from the buffer, draw a full screen quad, and swap buffers. The goal is not advanced rendering but reliable presentation of a computed image.

The third act is the compute world. OpenCL is used as a parallel compute engine that can write to the same pixel buffer object that OpenGL uses. This is the most important relationship between OpenCL and OpenGL in the program. The code chooses a GPU device that explicitly supports the extension for sharing objects between compute and graphics. That extension makes it possible for the compute engine and the graphics engine to refer to the same underlying GPU memory without staging through the CPU. In symbolic terms, OpenGL owns a canvas, OpenCL is granted a pen that can draw directly onto that canvas, and the program carefully negotiates when the pen is allowed to touch the canvas. That negotiation is done through acquire and release calls. When compute begins, it acquires the shared object, which is like taking a lock. When compute ends, it releases the object, which is like returning the lock so graphics can read the new pixels. This lock style coordination is the heart of correctness. Without it, compute and graphics could act at the same time on the same memory, causing tearing, undefined behavior, or driver errors.

The fourth act is the learning world. LibTorch is used here not as an online training engine but as an authoring tool for weights and biases of a tiny neural network. The network is a small multilayer perceptron with two inputs, one hidden layer, and three outputs. It uses a smooth nonlinearity in each layer so that its responses vary continuously rather than stepping abruptly. The model is created and set into evaluation mode. A no gradient guard (torch::NoGradGuard) is used so that the library does not build training graphs or store extra history. The code then extracts the weight matrices and bias vectors from the layers, ensures they are contiguous in memory and on the CPU, and copies them into plain arrays. This conversion step is the key relationship between LibTorch and OpenCL. LibTorch stores parameters as tensors with rich metadata and potential device placement. OpenCL wants raw buffers of floats. The code therefore performs a change of representation: from a high level tensor world into a flat array world. That conversion is the mathematical handshake between the learning library and the compute kernel. It is also the moment where the author ensures the tiny network in OpenCL is using the same parameters as the model defined by LibTorch.
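The indexing convention behind that memcpy can be sketched without LibTorch itself: torch::nn::Linear stores its weight as a row-major [out_features, in_features] tensor, so once flattened, the weight connecting input i to output j sits at W[j*IN + i], exactly as the kernel indexes it. A minimal standalone sketch of the convention (linear_unit and demo are illustrative names, not from the source):

```cpp
#include <cassert>
#include <cstddef>

// Flat, row-major layout matching the memcpy from the LibTorch tensors:
// W[j*IN + i] is the weight connecting input i to output unit j.
enum { IN = 2, OUT = 16 };

// One affine output unit j computed from the flat arrays.
static float linear_unit(const float* W, const float* b,
                         const float* x, int j) {
  float acc = b[j];
  for (int i = 0; i < IN; i++)
    acc += x[i] * W[(size_t)j * IN + i];
  return acc;
}

// Example: sparse weights where unit j reads only input j.
float demo() {
  float W[OUT * IN] = {0};
  float b[OUT] = {0};
  W[0 * IN + 0] = 1.0f;  // unit 0 reads input 0
  W[1 * IN + 1] = 1.0f;  // unit 1 reads input 1
  float x[IN] = {3.0f, -2.0f};
  return linear_unit(W, b, x, 1);  // picks out x[1]
}
```

If the layout were misread as column-major, unit 1 would read the wrong input, which is why the flattening convention is worth pinning down before writing the kernel.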

Once the weights are in plain arrays, OpenCL buffers are created for each parameter block. Those buffers are marked read only because the kernel will not modify them. The arrays from LibTorch are then uploaded into the OpenCL buffers using write commands on the command queue. At this point, the learning system has effectively published a set of numeric constants into the GPU compute system. The relationship is one way: LibTorch produces, OpenCL consumes. In this demo, the parameters are built once at initialization. In a more advanced version, parameters could be refreshed periodically to reflect training updates, but that is not the goal here.

The next relationship is between the OpenCL kernel and the pixel buffer. The kernel is launched across a two dimensional grid matching the image width and height. Each work item corresponds to one pixel. For each pixel, the kernel first maps pixel coordinates to a normalized coordinate space and builds two input values from those coordinates and a moving phase variable. This phase acts like a clock signal that animates the image, ensuring that the pipeline is alive rather than static. The kernel then runs the tiny neural network forward pass. It computes hidden activations by combining the two inputs with the first layer weights and biases, applies the nonlinearity, then combines those hidden values with the second layer weights and biases, applies the nonlinearity again, and produces three output channels. Those channels are then mixed with additional procedural components such as stripes and a vignette effect, and finally converted to bytes. The resulting four channel pixel is written into the output buffer, which is the shared OpenGL pixel buffer object.
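The per-pixel math can be mirrored on the CPU as a reference implementation, which is a common way to sanity-check a kernel. This is a hedged sketch following the same conventions as the nn_render kernel (pixel coordinates mapped to [-1,1], tanh layers, outputs remapped by 0.5 + 0.5*tanh into [0,1]); pix_to_norm and mlp_rgb are illustrative names, and the phase input and procedural mixing are omitted:

```cpp
#include <cassert>
#include <cmath>

enum { NIN = 2, NH = 16, NOUT = 3 };

// Map a pixel index to the kernel's normalized [-1,1] coordinate.
static float pix_to_norm(int p, int extent) {
  return ((float)p / (float)(extent - 1)) * 2.0f - 1.0f;
}

// CPU mirror of the kernel's forward pass: tanh hidden layer, then a
// tanh output remapped and clamped into [0,1] as a color channel.
static void mlp_rgb(const float* W1, const float* b1,
                    const float* W2, const float* b2,
                    float in0, float in1, float* o) {
  float h[NH];
  for (int j = 0; j < NH; j++) {
    float acc = b1[j] + in0 * W1[j*NIN + 0] + in1 * W1[j*NIN + 1];
    h[j] = std::tanh(acc);
  }
  for (int k = 0; k < NOUT; k++) {
    float acc = b2[k];
    for (int j = 0; j < NH; j++) acc += h[j] * W2[k*NH + j];
    float s = 0.5f + 0.5f * std::tanh(acc);
    o[k] = s < 0.0f ? 0.0f : (s > 1.0f ? 1.0f : s);
  }
}

// With all-zero parameters every channel lands exactly at mid-gray,
// since tanh(0) = 0 and 0.5 + 0.5*0 = 0.5.
float zero_net_channel0() {
  float W1[NH*NIN] = {0}, b1[NH] = {0}, W2[NOUT*NH] = {0}, b2[NOUT] = {0};
  float o[NOUT];
  mlp_rgb(W1, b1, W2, b2, pix_to_norm(0, 640), -pix_to_norm(0, 480), o);
  return o[0];
}
```

Running the CPU mirror over a handful of pixels and comparing against a read-back of the GPU buffer is enough to catch indexing or layout mistakes early.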

This is where the “mathematical relationship” between LibTorch and OpenCL becomes visible. LibTorch defines the numeric transformation embodied by the weights and biases. OpenCL executes that transformation at massive scale, once per pixel, on a GPU. The output is not just any calculation; it is the same functional shape that the LibTorch model represents, but applied in a different domain. Instead of being applied to training data, it is applied to spatial coordinates and a time phase. In other words, the neural network is used as a generative function, and OpenCL is the engine that evaluates that function for a whole image at once.

The relationship between OpenCL and OpenGL is equally structural. The output buffer is not copied back to the CPU. Instead, OpenGL updates a texture directly from the pixel buffer object, and the image is displayed. The shared object path avoids a major performance bottleneck. The acquire and release calls ensure that the buffer transitions cleanly between compute ownership and graphics ownership. The command queue finish call ensures that the compute kernel is fully complete before the graphics subsystem uploads and draws. That is a timing relationship, a synchronization contract that keeps the pipeline coherent frame after frame.

Finally, the code is embedded inside a Zorro oriented DLL entry. Zorro is not the star of this demo, but it provides a controlled host environment, lifecycle flags, and a consistent way to stop execution. The Win32 loop is allowed to close if Zorro signals exit, and a guard is added to avoid a known startup state combination that could trigger premature closure. The DLL main function forces a single cycle execution to prevent relaunch loops and then calls the WinMain routine once. This turns the whole program into a single run demonstration that can be launched from within a Zorro workflow, which is useful if the broader project is a trading system that wants a compute visualization or a diagnostic display.

In abstract terms, TorchBridge Pixel Loom is a three stage loom. LibTorch spins the thread by defining the tiny network parameters. OpenCL weaves the thread across a two dimensional fabric by evaluating the network for every pixel in parallel. OpenGL displays the woven fabric by texturing a screen aligned quad. The mathematics is not expressed through explicit formula writing; it is expressed through a consistent mapping of numeric state across subsystems: parameter tensors become float arrays, float arrays become device buffers, device buffers feed a kernel, the kernel writes pixels, and the pixels become an image without leaving the GPU. The value of the strategy is the pattern: it demonstrates how to connect a learning definition, a compute executor, and a graphics presenter into a single coherent system with clear ownership rules, minimal copying, and stable synchronization.

Code
// Mendb02.cpp
// Win32 + WGL(OpenGL) display + OpenCL compute (CL/GL sharing)
// + Tiny Neural Net inference per pixel (OpenCL kernel) using weights from LibTorch.

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif

#define _CRT_SECURE_NO_WARNINGS

// ============================================================
// 1) Include LibTorch FIRST (like your working file)
//    Public/shareable variant: no machine-specific include paths.
// ============================================================
#if defined(__has_include)
  #if __has_include(<torch/torch.h>) && __has_include(<torch/script.h>)
    #include <torch/torch.h>
    #include <torch/script.h>
  #else
    #error "LibTorch headers not found. Add LibTorch include paths to your build configuration."
  #endif
#else
  #include <torch/torch.h>
  #include <torch/script.h>
#endif

// (Optional) CUDA headers (safe pattern used by your working file)
// Keep them conditional so CPU-only LibTorch setups still compile.
#if defined(__has_include)
  #if __has_include(<torch/cuda.h>)
    #include <torch/cuda.h>
    #define HAVE_TORCH_CUDA_HEADER 1
  #else
    #define HAVE_TORCH_CUDA_HEADER 0
  #endif
  #if __has_include(<cuda_runtime_api.h>)
    #include <cuda_runtime_api.h>
    #define HAVE_CUDA_RUNTIME_API_HEADER 1
  #else
    #define HAVE_CUDA_RUNTIME_API_HEADER 0
  #endif
#else
  #define HAVE_TORCH_CUDA_HEADER 0
  #define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif

#if defined(__has_include)
  #if __has_include(<c10/cuda/CUDAGuard.h>) && __has_include(<c10/cuda/impl/cuda_cmake_macros.h>)
    #include <c10/cuda/CUDAGuard.h>
    #define HAVE_C10_CUDAGUARD 1
  #else
    #define HAVE_C10_CUDAGUARD 0
  #endif
#else
  #define HAVE_C10_CUDAGUARD 0
#endif

// ============================================================
// 2) Standard headers
// ============================================================
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>

// ============================================================
// 3) Include Zorro AFTER torch, rename Zorro's 'at' to avoid conflict
//    (exact pattern from your working file)
// ============================================================
#define at zorro_at
#ifdef LOG
#undef LOG
#endif
#include <zorro.h>
#undef at

// ============================================================
// 4) Cleanup macro landmines (exact style from your working file)
// ============================================================
#ifdef min
#undef min
#endif
#ifdef max
#undef max
#endif
#ifdef ref
#undef ref
#endif
#ifdef swap
#undef swap
#endif
#ifdef abs
#undef abs
#endif

#ifdef NTF
#undef NTF
#endif
#ifdef LOOKBACK
#undef LOOKBACK
#endif
#ifdef BINS
#undef BINS
#endif

// ============================================================
// OpenCL + OpenGL includes (after the macro cleanup is safest)
// ============================================================
#include <CL/cl.h>
#include <CL/cl_gl.h>     // cl_khr_gl_sharing
#include <CL/cl_gl_ext.h> // CL_GL_CONTEXT_KHR / CL_WGL_HDC_KHR
#include <GL/gl.h>

#ifndef GL_RGBA8
#define GL_RGBA8 0x8058
#endif

// ------------------------- Globals -------------------------
static HWND   gHwnd = 0;
static HDC    gHdc  = 0;
static HGLRC  gHgl  = 0;

static int    gW = 640;
static int    gH = 480;
static float  gPhase = 0.0f;

static int read_env_int(const char* key, int fallback)
{
  const char* s = getenv(key);
  if(!s || !*s) return fallback;
  int v = atoi(s);
  return (v > 0) ? v : fallback;
}

// ------------------------- WinProc forward -------------------------
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);

// ===========================================================
//                Minimal OpenGL function loading
// ===========================================================

#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#endif
#ifndef GL_PIXEL_UNPACK_BUFFER
#define GL_PIXEL_UNPACK_BUFFER 0x88EC
#endif
#ifndef GL_DYNAMIC_DRAW
#define GL_DYNAMIC_DRAW 0x88E8
#endif

#ifndef APIENTRY
#define APIENTRY __stdcall
#endif
#ifndef APIENTRYP
#define APIENTRYP APIENTRY *
#endif

typedef void (APIENTRYP PFNGLGENBUFFERSPROC)(GLsizei, GLuint*);
typedef void (APIENTRYP PFNGLBINDBUFFERPROC)(GLenum, GLuint);
typedef void (APIENTRYP PFNGLBUFFERDATAPROC)(GLenum, ptrdiff_t, const void*, GLenum);
typedef void (APIENTRYP PFNGLDELETEBUFFERSPROC)(GLsizei, const GLuint*);

static PFNGLGENBUFFERSPROC    p_glGenBuffers    = 0;
static PFNGLBINDBUFFERPROC    p_glBindBuffer    = 0;
static PFNGLBUFFERDATAPROC    p_glBufferData    = 0;
static PFNGLDELETEBUFFERSPROC p_glDeleteBuffers = 0;

static void* gl_get_proc(const char* name)
{
  void* p = (void*)wglGetProcAddress(name);
  if(!p) {
    HMODULE ogl = GetModuleHandleA("opengl32.dll");
    if(ogl) p = (void*)GetProcAddress(ogl, name);
  }
  return p;
}

static int gl_load_ext()
{
  p_glGenBuffers    = (PFNGLGENBUFFERSPROC)gl_get_proc("glGenBuffers");
  p_glBindBuffer    = (PFNGLBINDBUFFERPROC)gl_get_proc("glBindBuffer");
  p_glBufferData    = (PFNGLBUFFERDATAPROC)gl_get_proc("glBufferData");
  p_glDeleteBuffers = (PFNGLDELETEBUFFERSPROC)gl_get_proc("glDeleteBuffers");

  if(!p_glGenBuffers || !p_glBindBuffer || !p_glBufferData || !p_glDeleteBuffers)
    return 0;
  return 1;
}

// ===========================================================
//                       OpenGL objects
// ===========================================================

static GLuint gPBO = 0;
static GLuint gTex = 0;

static void gl_release_all()
{
  if(gTex) {
    glDeleteTextures(1, &gTex);
    gTex = 0;
  }
  if(gPBO) {
    if(p_glDeleteBuffers) p_glDeleteBuffers(1, &gPBO);
    gPBO = 0;
  }

  if(gHgl) { wglMakeCurrent(NULL, NULL); wglDeleteContext(gHgl); gHgl = 0; }
  if(gHdc && gHwnd) { ReleaseDC(gHwnd, gHdc); gHdc = 0; }
}

static int gl_init_wgl(HWND hwnd)
{
  gHwnd = hwnd;
  gHdc = GetDC(hwnd);
  if(!gHdc) return 0;

  PIXELFORMATDESCRIPTOR pfd;
  ZeroMemory(&pfd, sizeof(pfd));
  pfd.nSize      = sizeof(pfd);
  pfd.nVersion   = 1;
  pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
  pfd.iPixelType = PFD_TYPE_RGBA;
  pfd.cColorBits = 32;
  pfd.cDepthBits = 16;
  pfd.iLayerType = PFD_MAIN_PLANE;

  int pf = ChoosePixelFormat(gHdc, &pfd);
  if(pf == 0) return 0;
  if(!SetPixelFormat(gHdc, pf, &pfd)) return 0;

  gHgl = wglCreateContext(gHdc);
  if(!gHgl) return 0;
  if(!wglMakeCurrent(gHdc, gHgl)) return 0;

  if(!gl_load_ext()) {
    printf("\nOpenGL buffer functions not available (need VBO/PBO support).");
    return 0;
  }

  glDisable(GL_DEPTH_TEST);
  glViewport(0, 0, gW, gH);

  // Create PBO for RGBA pixels
  p_glGenBuffers(1, &gPBO);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  p_glBufferData(GL_PIXEL_UNPACK_BUFFER, (ptrdiff_t)(gW * gH * 4), 0, GL_DYNAMIC_DRAW);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  // Create texture
  glGenTextures(1, &gTex);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, gW, gH, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  glBindTexture(GL_TEXTURE_2D, 0);

  return 1;
}

// ===========================================================
//                  Tiny NN (LibTorch -> weights)
// ===========================================================

#define NN_IN 2
#define NN_H 16
#define NN_OUT 3

struct TinyMLPImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
  TinyMLPImpl() {
    fc1 = register_module("fc1", torch::nn::Linear(NN_IN, NN_H));
    fc2 = register_module("fc2", torch::nn::Linear(NN_H, NN_OUT));
  }
  torch::Tensor forward(torch::Tensor x) {
    x = torch::tanh(fc1->forward(x));
    x = torch::tanh(fc2->forward(x));
    return x;
  }
};
TORCH_MODULE(TinyMLP);

static int build_weights_from_libtorch(float* W1, float* b1, float* W2, float* b2)
{
  if(!W1 || !b1 || !W2 || !b2) return 0;

  try {
    torch::NoGradGuard ng;
    torch::manual_seed(1);
    TinyMLP m;
    m->eval();

    auto w1  = m->fc1->weight.detach().contiguous().to(torch::kCPU);
    auto bb1 = m->fc1->bias.detach().contiguous().to(torch::kCPU);

    auto w2  = m->fc2->weight.detach().contiguous().to(torch::kCPU);
    auto bb2 = m->fc2->bias.detach().contiguous().to(torch::kCPU);

    memcpy(W1, w1.data_ptr<float>(),  sizeof(float)*NN_H*NN_IN);
    memcpy(b1, bb1.data_ptr<float>(), sizeof(float)*NN_H);
    memcpy(W2, w2.data_ptr<float>(),  sizeof(float)*NN_OUT*NN_H);
    memcpy(b2, bb2.data_ptr<float>(), sizeof(float)*NN_OUT);

    return 1;
  }
  catch(const c10::Error& e) {
    printf("\n[LibTorch] Error: %s", e.what());
    return 0;
  }
  catch(...) {
    printf("\n[LibTorch] Unknown error.");
    return 0;
  }
}

// ===========================================================
//                    OpenCL (GL sharing)
// ===========================================================

static int gCL_Ready = 0;

static cl_platform_id   gCL_Platform = 0;
static cl_device_id     gCL_Device   = 0;
static cl_context       gCL_Context  = 0;
static cl_command_queue gCL_Queue    = 0;
static cl_program       gCL_Program  = 0;

static cl_kernel        gCL_K_NN     = 0;

static cl_mem gCL_PBO = 0; // CL view of GL PBO

static cl_mem gCL_W1 = 0;
static cl_mem gCL_b1 = 0;
static cl_mem gCL_W2 = 0;
static cl_mem gCL_b2 = 0;

#define STR2(x) #x
#define XSTR(x) STR2(x)

static const char* gCL_Source =
"__kernel void nn_render(__global uchar4* out, int width, int height,              \n"
"  __global const float* W1, __global const float* b1,                             \n"
"  __global const float* W2, __global const float* b2, float phase)                \n"
"{                                                                                 \n"
"  int xpix = (int)get_global_id(0);                                               \n"
"  int ypix = (int)get_global_id(1);                                               \n"
"  if(xpix >= width || ypix >= height) return;                                     \n"
"                                                                                  \n"
"  float x = ((float)xpix / (float)(width  - 1)) * 2.0f - 1.0f;                    \n"
"  float y = ((float)ypix / (float)(height - 1)) * 2.0f - 1.0f;                    \n"
"  float in0 = 2.8f*x + 0.7f*sin(3.0f*y + phase);                                  \n"
"  float in1 = -2.8f*y + 0.7f*cos(3.0f*x - 1.3f*phase);                            \n"
"                                                                                  \n"
"  float h[" XSTR(NN_H) "];                                                        \n"
"  for(int j=0;j<" XSTR(NN_H) ";j++){                                              \n"
"    float acc = b1[j];                                                            \n"
"    acc += in0 * W1[j*" XSTR(NN_IN) " + 0];                                       \n"
"    acc += in1 * W1[j*" XSTR(NN_IN) " + 1];                                       \n"
"    h[j] = tanh(acc);                                                             \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float o[" XSTR(NN_OUT) "];                                                      \n"
"  for(int k=0;k<" XSTR(NN_OUT) ";k++){                                            \n"
"    float acc = b2[k];                                                            \n"
"    for(int j=0;j<" XSTR(NN_H) ";j++){                                            \n"
"      acc += h[j] * W2[k*" XSTR(NN_H) " + j];                                     \n"
"    }                                                                             \n"
"    float s = 0.5f + 0.5f*tanh(acc);                                              \n"
"    if(s<0) s=0; if(s>1) s=1;                                                     \n"
"    o[k] = s;                                                                     \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float radial = sqrt(x*x + y*y);                                                 \n"
"  float vignette = clamp(1.15f - radial, 0.0f, 1.0f);                             \n"
"  float stripe = 0.5f + 0.5f*sin(10.0f*(x + y) + phase);                          \n"
"  float rcol = clamp(0.70f*o[0] + 0.30f*stripe, 0.0f, 1.0f) * vignette;           \n"
"  float gcol = clamp(0.85f*o[1] + 0.15f*(1.0f - stripe), 0.0f, 1.0f) * vignette;  \n"
"  float bcol = clamp(0.75f*o[2] + 0.25f*(0.5f + 0.5f*cos(8.0f*x - phase)),0.0f,1.0f);\n"
"  uchar r = (uchar)(255.0f*rcol);                                                 \n"
"  uchar g = (uchar)(255.0f*gcol);                                                 \n"
"  uchar b = (uchar)(255.0f*bcol);                                                 \n"
"  out[ypix*width + xpix] = (uchar4)(r,g,b,255);                                   \n"
"}                                                                                 \n";

static void cl_release_all()
{
  if(gCL_b2) { clReleaseMemObject(gCL_b2); gCL_b2 = 0; }
  if(gCL_W2) { clReleaseMemObject(gCL_W2); gCL_W2 = 0; }
  if(gCL_b1) { clReleaseMemObject(gCL_b1); gCL_b1 = 0; }
  if(gCL_W1) { clReleaseMemObject(gCL_W1); gCL_W1 = 0; }

  if(gCL_PBO)    { clReleaseMemObject(gCL_PBO);    gCL_PBO = 0; }
  if(gCL_K_NN)   { clReleaseKernel(gCL_K_NN);      gCL_K_NN = 0; }
  if(gCL_Program){ clReleaseProgram(gCL_Program);  gCL_Program = 0; }
  if(gCL_Queue)  { clReleaseCommandQueue(gCL_Queue); gCL_Queue = 0; }
  if(gCL_Context){ clReleaseContext(gCL_Context);  gCL_Context = 0; }

  gCL_Device = 0;
  gCL_Platform = 0;
  gCL_Ready = 0;
}

static int cl_pick_device_with_glshare(cl_platform_id* outP, cl_device_id* outD)
{
  cl_uint nPlatforms = 0;
  if(clGetPlatformIDs(0, 0, &nPlatforms) != CL_SUCCESS || nPlatforms == 0)
    return 0;

  cl_platform_id platforms[8];
  if(nPlatforms > 8) nPlatforms = 8;
  if(clGetPlatformIDs(nPlatforms, platforms, &nPlatforms) != CL_SUCCESS)
    return 0;

  for(cl_uint p=0; p<nPlatforms; p++)
  {
    cl_uint nDev = 0;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 0, 0, &nDev) != CL_SUCCESS || nDev == 0)
      continue;

    cl_device_id devs[8];
    if(nDev > 8) nDev = 8;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, nDev, devs, &nDev) != CL_SUCCESS)
      continue;

    for(cl_uint d=0; d<nDev; d++)
    {
      char ext[8192];
      size_t sz = 0;
      if(clGetDeviceInfo(devs[d], CL_DEVICE_EXTENSIONS, sizeof(ext), ext, &sz) != CL_SUCCESS)
        continue;

      if(strstr(ext, "cl_khr_gl_sharing"))
      {
        *outP = platforms[p];
        *outD = devs[d];
        return 1;
      }
    }
  }

  return 0;
}

static int cl_init_glshare()
{
  cl_int err = CL_SUCCESS;

  cl_platform_id P = 0;
  cl_device_id   D = 0;

  if(!cl_pick_device_with_glshare(&P, &D)) {
    printf("\nOpenCL: no GPU device with cl_khr_gl_sharing found.");
    return 0;
  }

  gCL_Platform = P;
  gCL_Device   = D;

  cl_context_properties props[] = {
    CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
    CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
    CL_CONTEXT_PLATFORM, (cl_context_properties)gCL_Platform,
    0
  };

  gCL_Context = clCreateContext(props, 1, &gCL_Device, 0, 0, &err);
  if(err != CL_SUCCESS || !gCL_Context) { cl_release_all(); return 0; }

  gCL_Queue = clCreateCommandQueue(gCL_Context, gCL_Device, 0, &err);
  if(err != CL_SUCCESS || !gCL_Queue) { cl_release_all(); return 0; }

  gCL_Program = clCreateProgramWithSource(gCL_Context, 1, &gCL_Source, 0, &err);
  if(err != CL_SUCCESS || !gCL_Program) { cl_release_all(); return 0; }

  err = clBuildProgram(gCL_Program, 1, &gCL_Device, 0, 0, 0);
  if(err != CL_SUCCESS)
  {
    char logbuf[8192];
    size_t logsz = 0;
    clGetProgramBuildInfo(gCL_Program, gCL_Device, CL_PROGRAM_BUILD_LOG, sizeof(logbuf), logbuf, &logsz);
    printf("\nOpenCL build failed:\n%s", logbuf);
    cl_release_all();
    return 0;
  }

  gCL_K_NN = clCreateKernel(gCL_Program, "nn_render", &err);
  if(err != CL_SUCCESS || !gCL_K_NN) { cl_release_all(); return 0; }

  gCL_PBO = clCreateFromGLBuffer(gCL_Context, CL_MEM_WRITE_ONLY, gPBO, &err);
  if(err != CL_SUCCESS || !gCL_PBO) { cl_release_all(); return 0; }

  size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
  size_t bytesb1 = sizeof(float)*(size_t)NN_H;
  size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
  size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;

  gCL_W1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW1, 0, &err);
  gCL_b1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb1, 0, &err);
  gCL_W2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW2, 0, &err);
  gCL_b2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb2, 0, &err);
  if(err != CL_SUCCESS || !gCL_W1 || !gCL_b1 || !gCL_W2 || !gCL_b2) { cl_release_all(); return 0; }

  float hW1[NN_H*NN_IN];
  float hb1[NN_H];
  float hW2[NN_OUT*NN_H];
  float hb2[NN_OUT];

  if(!build_weights_from_libtorch(hW1, hb1, hW2, hb2)) {
    printf("\n[LibTorch] Failed to build weights.");
    cl_release_all();
    return 0;
  }

  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_TRUE, 0, bytesW1, hW1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_TRUE, 0, bytesb1, hb1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_TRUE, 0, bytesW2, hW2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_TRUE, 0, bytesb2, hb2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }

  gCL_Ready = 1;
  printf("\nOpenCL: GL-sharing enabled. NN kernel ready.");
  return 1;
}

// ===========================================================
//                      Render (CL -> GL)
// ===========================================================

static void RenderFrame()
{
  if(!gCL_Ready) return;

  size_t global[2] = { (size_t)gW, (size_t)gH };
  size_t local[2]  = { 16, 16 };

  cl_int err = CL_SUCCESS;

  err = clEnqueueAcquireGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  if(err != CL_SUCCESS) return;

  int arg = 0;
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_PBO);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gW);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gH);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float),  &gPhase);

  err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, local, 0, 0, 0);
  if(err != CL_SUCCESS) {
    err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, 0, 0, 0, 0);
  }

  clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  clFinish(gCL_Queue);

  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, gW, gH, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  glClear(GL_COLOR_BUFFER_BIT);
  glEnable(GL_TEXTURE_2D);
  glBindTexture(GL_TEXTURE_2D, gTex);

  glBegin(GL_QUADS);
    glTexCoord2f(0,0); glVertex2f(-1,-1);
    glTexCoord2f(1,0); glVertex2f( 1,-1);
    glTexCoord2f(1,1); glVertex2f( 1, 1);
    glTexCoord2f(0,1); glVertex2f(-1, 1);
  glEnd();

  glBindTexture(GL_TEXTURE_2D, 0);
  SwapBuffers(gHdc);

  gPhase += 0.03f;
}

// ===========================================================
//                         WinMain
// ===========================================================

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
  // 0 means no auto-close; window stays until user closes it.
  const int maxSeconds = read_env_int("MENDB02_MAX_SECONDS", 0);
  ULONGLONG startTick = GetTickCount64();

  const char* szClass = "Mendb02NNCLGLClass";
  UnregisterClassA(szClass, hInst);

  WNDCLASSEXA wc;
  ZeroMemory(&wc, sizeof(wc));
  wc.cbSize = sizeof(wc);
  wc.style = CS_HREDRAW | CS_VREDRAW;
  wc.lpfnWndProc = WndProc;
  wc.hInstance = hInst;
  wc.hCursor = LoadCursor(NULL, IDC_ARROW);
  wc.lpszClassName = szClass;
  RegisterClassExA(&wc);

  RECT r;
  r.left=0; r.top=0; r.right=gW; r.bottom=gH;
  AdjustWindowRect(&r, WS_OVERLAPPEDWINDOW, FALSE);

  HWND hwnd = CreateWindowExA(
    0, szClass, "NN Render (LibTorch weights + OpenCL + OpenGL)",
    WS_OVERLAPPEDWINDOW,
    100, 100, (r.right-r.left), (r.bottom-r.top),
    0, 0, hInst, 0);

  if(!hwnd) return 0;

  ShowWindow(hwnd, SW_SHOW);
  UpdateWindow(hwnd);

  if(!gl_init_wgl(hwnd))
  {
    MessageBoxA(hwnd, "OpenGL init failed", "Error", MB_OK);
    gl_release_all();
    return 0;
  }

  if(!cl_init_glshare())
  {
    MessageBoxA(hwnd, "OpenCL GL-sharing init failed", "Error", MB_OK);
    cl_release_all();
    gl_release_all();
    return 0;
  }

  MSG msg;
  ZeroMemory(&msg, sizeof(msg));

  while(msg.message != WM_QUIT)
  {
    while(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
    }

    // Allow Zorro STOP to close this Win32 loop cleanly, but ignore
    // the sticky FIRSTINITRUN+EXITRUN combo seen at startup.
    if(is(EXITRUN) && !is(FIRSTINITRUN)) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    if(!IsWindow(hwnd))
      break;

    if(maxSeconds > 0 && (GetTickCount64() - startTick) >= (ULONGLONG)maxSeconds * 1000ULL) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    RenderFrame();
  }

  cl_release_all();
  gl_release_all();
  gHwnd = 0;
  return 0;
}

// ===========================================================
//                         Input
// ===========================================================

LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
  switch(msg)
  {
    case WM_CLOSE:
      DestroyWindow(hWnd);
      return 0;

    case WM_KEYDOWN:
      if(wParam == VK_ESCAPE || wParam == VK_F12) {
        PostMessage(hWnd, WM_CLOSE, 0, 0);
        return 0;
      }
      return 0;

    case WM_DESTROY:
      PostQuitMessage(0);
      return 0;
  }
  return DefWindowProc(hWnd, msg, wParam, lParam);
}

// ===========================================================
//                         Zorro DLL entry
// ===========================================================

DLLFUNC int main()
{
  // Force single-cycle execution in Zorro to avoid automatic relaunches.
  NumTotalCycles = 1;
  NumWFOCycles = 1;
  NumSampleCycles = 1;
  set(TESTNOW|OFF,ALLCYCLES|OFF,PARAMETERS|OFF,FACTORS|OFF,RULES|OFF);

  static int done = 0;
  if(is(FIRSTINITRUN))
    done = 0;

  if(done)
    return 0;

  (void)WinMain(GetModuleHandleA(NULL), NULL, GetCommandLineA(), SW_SHOWDEFAULT);
  done = 1;
  return quit("!Mendb02 finished");
}


Description of how to get Zorro to execute the code:

Code
# Mendb02 (Public) Build Setup Guide

This guide explains the required directory layout and build settings for `Mendb02.cpp` and `Mendb02_viz.cpp` without machine-specific paths.

## 1) Required Components

- Zorro 64-bit C++ strategy environment (with `zorro.h` and `ZorroDLL.cpp`)
- Visual Studio Build Tools (x64 C++ compiler)
- OpenCL SDK/runtime headers and libs
- OpenGL system libs (Windows)
- LibTorch C++ distribution (CPU or CUDA build)

## 2) Recommended Folder Variables

Define these paths in your build script (or IDE project settings):

- `ZORRO_ROOT` - Zorro installation root
- `LIBTORCH_ROOT` - LibTorch root folder containing `include/` and `lib/`
- `CUDA_ROOT` - CUDA toolkit root (only if using CUDA-enabled LibTorch)

Your source file can stay anywhere (for example `Strategy/Mendb02_viz.cpp`).

## 3) Include Directories

Add these include directories:

- `<ZORRO_ROOT>/include`
- `<LIBTORCH_ROOT>/include`
- `<LIBTORCH_ROOT>/include/torch/csrc/api/include`
- `<CUDA_ROOT>/include` (optional; needed for CUDA headers)

## 4) Library Directories

Add these library directories:

- `<LIBTORCH_ROOT>/lib`
- `<CUDA_ROOT>/lib/x64` (if CUDA build)

## 5) Link Libraries

Minimum Windows/OpenCL/OpenGL libs:

- `OpenCL.lib`
- `OpenGL32.lib`
- `User32.lib`
- `Gdi32.lib`

LibTorch libs (CPU-only setup):

- `c10.lib`
- `torch.lib`
- `torch_cpu.lib`

LibTorch libs (CUDA setup):

- `c10.lib`
- `c10_cuda.lib`
- `torch.lib`
- `torch_cpu.lib`
- `torch_cuda.lib`
- `cudart.lib`

## 6) Runtime DLLs

At runtime, the required DLLs must be discoverable by the Windows loader (either next to the strategy DLL or on `PATH`).

Typical requirement:

- All needed files from `<LIBTORCH_ROOT>/lib/*.dll`
- CUDA runtime DLLs (if CUDA-enabled build)

## 7) Compile Flags (Typical)

Recommended flags for this code style:

- `/MD` (dynamic runtime)
- `/EHa`
- `/O2`
- `/std:c++17`
- `/permissive`
- `/D _WINDLL`

## 8) Zorro compile64.bat Routing

If Zorro compiles strategies by filename dispatch, ensure each LibTorch-based file is routed to a LibTorch-enabled branch.

Example logic:

- `if /I "%SRCNAME%"=="Mendb02.cpp" goto :build_libtorch`
- `if /I "%SRCNAME%"=="Mendb02_viz.cpp" goto :build_libtorch`

If this mapping is missing, compilation may fail with missing `torch/torch.h`.

## 9) Common Failure Modes

- **C1189 / torch headers not found**: missing LibTorch include paths
- **LNK1104 on output DLL**: target DLL is locked by running process
- **OpenCL GL-sharing init failed**: kernel compile/runtime mismatch, unsupported GL-sharing device, or context mismatch
- **Runtime DLL load error**: required LibTorch/CUDA DLLs not on loader path

## 10) Publish-Safe Notes

The public source variants intentionally avoid hardcoded local absolute paths. Keep all machine-specific paths in build scripts or environment variables.


Last edited by TipmyPip; 5 hours ago.
Stochastic TorchCanvas Bridge [Re: TipmyPip] #489288
5 hours ago
Joined: Sep 2017
Posts: 276
TipmyPip Online OP
Member
Stochastic TorchCanvas Bridge is a hybrid rendering and compute demonstration that connects three different ecosystems into one coherent loop: a learning library that produces neural parameters, a parallel compute engine that applies those parameters across a two dimensional field, and a graphics engine that displays the computed field in real time. The code is structured as a practical integration template rather than a pure algorithmic showcase. Its central purpose is to prove that LibTorch, OpenCL, and OpenGL can share a consistent numerical story while operating under different rules for memory ownership, device access, and synchronization. The program creates a window on Windows using a classic Win32 message loop, builds an OpenGL context for rendering, builds an OpenCL context that is explicitly linked to that OpenGL context through resource sharing, and then runs a tight frame loop where OpenCL computes pixel values directly into a GPU buffer that OpenGL consumes without a CPU copy. The result is a continuously animated image whose visual structure comes from a tiny neural network forward pass combined with procedural shaping and noise.

The file begins with defensive engineering. It defines Windows and runtime macros to reduce header bloat and avoid common collisions. It then includes LibTorch before anything else. This is deliberate: LibTorch headers bring large template machinery and their own symbol expectations, and the project also includes Zorro headers, which define short identifiers and macros known to conflict with the C++ tensor ecosystem. The strategy avoids those collisions by including LibTorch first, then including Zorro only after renaming an especially problematic identifier and clearing macro hazards. After Zorro is included, the code performs a cleanup pass that undefines several short macros such as min, max, and abs that could otherwise rewrite expressions silently. This stage is less about performance and more about preserving semantic correctness. If the compiler sees the wrong macro expansions, numeric functions can behave differently, and in a system that mixes multiple libraries that all use generic names, correctness depends on controlling the preprocessor environment.

After the integration hygiene, the program sets up OpenGL. A window handle and device context are obtained from the Win32 system, a pixel format is selected that supports double buffering and RGBA output, and an OpenGL rendering context is created with WGL calls. Once the context is active, the program loads a small set of OpenGL buffer functions using wglGetProcAddress and a fallback to opengl32 exports. This is done to support a pixel buffer object and its data upload pathway. The OpenGL side creates a pixel buffer object sized for an image with four channels per pixel, and then creates a texture with matching dimensions. The pixel buffer object acts as a GPU resident staging area for pixels, while the texture is the actual object used for drawing. The draw step is intentionally simple. Each frame, the program updates the texture from the pixel buffer object and draws a single textured quad that covers the whole viewport. This keeps the rendering pipeline predictable and ensures that any complexity observed in the output is coming from compute rather than from rendering tricks.

The compute layer is OpenCL, and the code’s key technical move is enabling OpenCL and OpenGL resource sharing. The program scans available OpenCL platforms and looks for a GPU device that advertises the extension that allows sharing objects with OpenGL. Once a suitable device is found, the OpenCL context is created with context properties that reference the current OpenGL context and the current Windows device context. This binds OpenCL to the same GPU context that OpenGL is using. In practical terms, it allows OpenCL to treat the OpenGL pixel buffer object as an OpenCL memory object. This avoids copying pixels through host memory. It also introduces a synchronization contract: OpenCL must formally acquire the shared object before writing, and must release it after writing so OpenGL can read. That contract is enforced by explicit acquire and release calls on the command queue and completed with a finish call to guarantee that all compute work is done before OpenGL uploads and draws.

The neural component begins with a tiny multilayer perceptron defined using LibTorch modules. It has a small input dimension, a modest hidden layer, and three outputs. The network uses a smooth activation at each layer so that its output changes gradually rather than snapping. The network is not trained here. Instead it is initialized using LibTorch’s default parameter initialization routines. The program then extracts the layer weight tensors and bias tensors, detaches them from gradient tracking, ensures they are contiguous, ensures they reside on the CPU, and copies them into plain floating point arrays. This is the key representation conversion between LibTorch and OpenCL. LibTorch stores parameters as tensors with metadata and potential device placement. OpenCL kernels expect flat buffers. The code translates from the tensor representation into raw arrays that are then uploaded into OpenCL device buffers. Those buffers are created once during OpenCL initialization and are marked read only because the kernel will only read them.

The program’s non deterministic behavior is deliberate. The LibTorch seed is set using a combination of wall clock time and tick count, which causes the initialized network weights to differ between runs. This means the overall mapping from input coordinates to output colors changes when the program is restarted. In addition, the OpenCL kernel receives a seed value every frame that is derived from a high resolution performance counter and the system tick count. The kernel uses this seed along with pixel coordinates to generate a deterministic per pixel jitter value for that frame, but the seed changes across frames, so the jitter pattern evolves continuously. The combination of a time varying noise seed and a moving phase parameter produces animated textures that feel alive and slightly chaotic, even if the rest of the pipeline is stable. The neural network is therefore not acting as a static function; it is a parameterized transformer whose effective inputs are modulated by phase and by jitter.

The OpenCL kernel is written as a per pixel renderer. Each work item corresponds to a single pixel coordinate in a two dimensional grid. The kernel maps pixel coordinates into a normalized coordinate space and constructs two input values from those coordinates using trigonometric modulation and the phase parameter. It then injects jitter based on the per frame seed, producing a small randomized offset that perturbs the inputs. The kernel runs the neural network forward pass in plain OpenCL code. It computes the hidden activations by multiplying the inputs with the first layer weights, adding biases, and applying a tanh activation. It then computes the output activations by combining the hidden values with the second layer weights and biases and applying tanh again. The three output values are then post processed into color channels by blending with procedural patterns like stripes and a vignette effect. The final color is clamped to a valid range and written as an RGBA byte quadruple into the output buffer, which is the shared OpenGL pixel buffer object.

After the kernel finishes, the program releases the shared object and completes the queue. Then OpenGL binds the pixel buffer object as a pixel unpack buffer, binds the texture, and updates the texture content from the buffer. Because the buffer is GPU resident, the update does not require a host side pixel array. The program then clears the screen and draws the textured quad. The swap buffers call presents the image. The phase variable is advanced slightly each frame, which provides a smooth time axis that makes the animation continuous. The message loop continues until the user closes the window, presses an exit key, the environment timer triggers an auto close, or the host environment signals an exit.

The program also includes a Zorro-oriented DLL entry point, used mainly to control the lifecycle and to prevent the automatic relaunch behavior typical of iterative backtest environments. It forces a single-cycle configuration and disables standard test features so that the Win32 loop is not restarted by the host; a simple done flag prevents repeated execution within the same host session. The entry then calls WinMain to run the window loop. This makes the project usable as a visual diagnostic or demonstration component in a larger system that already uses Zorro as a host, while the rendering and compute logic remains independent of trading logic.

In summary, Stochastic TorchCanvas Bridge is a three-layer pipeline in which LibTorch supplies neural parameters, OpenCL evaluates a neural-plus-procedural function per pixel in parallel, and OpenGL displays the computed result using shared GPU memory and explicit synchronization. The most important mathematical relationship is not a single formula but a consistent mapping of numeric representations across subsystems: neural layer parameters become device buffers, device buffers become kernel inputs, kernel outputs become pixels, and pixels become a displayed texture without leaving GPU memory. The stability of this mapping depends on careful include ordering, macro hygiene, device selection based on sharing support, and strict acquire/release synchronization between compute and graphics.


Code
// Mendb02.cpp
// Win32 + WGL(OpenGL) display + OpenCL compute (CL/GL sharing)
// + Tiny Neural Net inference per pixel (OpenCL kernel) using weights from LibTorch.

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif

#define _CRT_SECURE_NO_WARNINGS

// ============================================================
// 1) Include LibTorch FIRST (like your working file)
//    Public/shareable variant: no machine-specific include paths.
// ============================================================
#if defined(__has_include)
  #if __has_include(<torch/torch.h>) && __has_include(<torch/script.h>)
    #include <torch/torch.h>
    #include <torch/script.h>
  #else
    #error "LibTorch headers not found. Add LibTorch include paths to your build configuration."
  #endif
#else
  #include <torch/torch.h>
  #include <torch/script.h>
#endif

// (Optional) CUDA headers (safe pattern used by your working file)
// Keep them conditional so CPU-only LibTorch setups still compile.
#if defined(__has_include)
  #if __has_include(<torch/cuda.h>)
    #include <torch/cuda.h>
    #define HAVE_TORCH_CUDA_HEADER 1
  #else
    #define HAVE_TORCH_CUDA_HEADER 0
  #endif
  #if __has_include(<cuda_runtime_api.h>)
    #include <cuda_runtime_api.h>
    #define HAVE_CUDA_RUNTIME_API_HEADER 1
  #else
    #define HAVE_CUDA_RUNTIME_API_HEADER 0
  #endif
#else
  #define HAVE_TORCH_CUDA_HEADER 0
  #define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif

#if defined(__has_include)
  #if __has_include(<c10/cuda/CUDAGuard.h>) && __has_include(<c10/cuda/impl/cuda_cmake_macros.h>)
    #include <c10/cuda/CUDAGuard.h>
    #define HAVE_C10_CUDAGUARD 1
  #else
    #define HAVE_C10_CUDAGUARD 0
  #endif
#else
  #define HAVE_C10_CUDAGUARD 0
#endif

// ============================================================
// 2) Standard headers
// ============================================================
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>

// ============================================================
// 3) Include Zorro AFTER torch, rename Zorro's 'at' to avoid conflict
//    (exact pattern from your working file)
// ============================================================
#define at zorro_at
#ifdef LOG
#undef LOG
#endif
#include <zorro.h>
#undef at

// ============================================================
// 4) Cleanup macro landmines (exact style from your working file)
// ============================================================
#ifdef min
#undef min
#endif
#ifdef max
#undef max
#endif
#ifdef ref
#undef ref
#endif
#ifdef swap
#undef swap
#endif
#ifdef abs
#undef abs
#endif

#ifdef NTF
#undef NTF
#endif
#ifdef LOOKBACK
#undef LOOKBACK
#endif
#ifdef BINS
#undef BINS
#endif

// ============================================================
// OpenCL + OpenGL includes (after the macro cleanup is safest)
// ============================================================
#include <CL/cl.h>
#include <CL/cl_gl.h>     // cl_khr_gl_sharing
#include <CL/cl_gl_ext.h> // CL_GL_CONTEXT_KHR / CL_WGL_HDC_KHR
#include <GL/gl.h>

#ifndef GL_RGBA8
#define GL_RGBA8 0x8058
#endif

// ------------------------- Globals -------------------------
static HWND   gHwnd = 0;
static HDC    gHdc  = 0;
static HGLRC  gHgl  = 0;

static int    gW = 640;
static int    gH = 480;
static float  gPhase = 0.0f;
static unsigned int gNoiseSeed = 1u;

static int read_env_int(const char* key, int fallback)
{
  const char* s = getenv(key);
  if(!s || !*s) return fallback;
  int v = atoi(s);
  return (v > 0) ? v : fallback;
}

// ------------------------- WinProc forward -------------------------
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);

// ===========================================================
//                Minimal OpenGL function loading
// ===========================================================

#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#endif
#ifndef GL_PIXEL_UNPACK_BUFFER
#define GL_PIXEL_UNPACK_BUFFER 0x88EC
#endif
#ifndef GL_DYNAMIC_DRAW
#define GL_DYNAMIC_DRAW 0x88E8
#endif

#ifndef APIENTRY
#define APIENTRY __stdcall
#endif
#ifndef APIENTRYP
#define APIENTRYP APIENTRY *
#endif

typedef void (APIENTRYP PFNGLGENBUFFERSPROC)(GLsizei, GLuint*);
typedef void (APIENTRYP PFNGLBINDBUFFERPROC)(GLenum, GLuint);
typedef void (APIENTRYP PFNGLBUFFERDATAPROC)(GLenum, ptrdiff_t, const void*, GLenum);
typedef void (APIENTRYP PFNGLDELETEBUFFERSPROC)(GLsizei, const GLuint*);

static PFNGLGENBUFFERSPROC    p_glGenBuffers    = 0;
static PFNGLBINDBUFFERPROC    p_glBindBuffer    = 0;
static PFNGLBUFFERDATAPROC    p_glBufferData    = 0;
static PFNGLDELETEBUFFERSPROC p_glDeleteBuffers = 0;

static void* gl_get_proc(const char* name)
{
  void* p = (void*)wglGetProcAddress(name);
  if(!p) {
    HMODULE ogl = GetModuleHandleA("opengl32.dll");
    if(ogl) p = (void*)GetProcAddress(ogl, name);
  }
  return p;
}

static int gl_load_ext()
{
  p_glGenBuffers    = (PFNGLGENBUFFERSPROC)gl_get_proc("glGenBuffers");
  p_glBindBuffer    = (PFNGLBINDBUFFERPROC)gl_get_proc("glBindBuffer");
  p_glBufferData    = (PFNGLBUFFERDATAPROC)gl_get_proc("glBufferData");
  p_glDeleteBuffers = (PFNGLDELETEBUFFERSPROC)gl_get_proc("glDeleteBuffers");

  if(!p_glGenBuffers || !p_glBindBuffer || !p_glBufferData || !p_glDeleteBuffers)
    return 0;
  return 1;
}

// ===========================================================
//                       OpenGL objects
// ===========================================================

static GLuint gPBO = 0;
static GLuint gTex = 0;

static void gl_release_all()
{
  if(gTex) {
    glDeleteTextures(1, &gTex);
    gTex = 0;
  }
  if(gPBO) {
    if(p_glDeleteBuffers) p_glDeleteBuffers(1, &gPBO);
    gPBO = 0;
  }

  if(gHgl) { wglMakeCurrent(NULL, NULL); wglDeleteContext(gHgl); gHgl = 0; }
  if(gHdc && gHwnd) { ReleaseDC(gHwnd, gHdc); gHdc = 0; }
}

static int gl_init_wgl(HWND hwnd)
{
  gHwnd = hwnd;
  gHdc = GetDC(hwnd);
  if(!gHdc) return 0;

  PIXELFORMATDESCRIPTOR pfd;
  ZeroMemory(&pfd, sizeof(pfd));
  pfd.nSize      = sizeof(pfd);
  pfd.nVersion   = 1;
  pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
  pfd.iPixelType = PFD_TYPE_RGBA;
  pfd.cColorBits = 32;
  pfd.cDepthBits = 16;
  pfd.iLayerType = PFD_MAIN_PLANE;

  int pf = ChoosePixelFormat(gHdc, &pfd);
  if(pf == 0) return 0;
  if(!SetPixelFormat(gHdc, pf, &pfd)) return 0;

  gHgl = wglCreateContext(gHdc);
  if(!gHgl) return 0;
  if(!wglMakeCurrent(gHdc, gHgl)) return 0;

  if(!gl_load_ext()) {
    printf("\nOpenGL buffer functions not available (need VBO/PBO support).");
    return 0;
  }

  glDisable(GL_DEPTH_TEST);
  glViewport(0, 0, gW, gH);

  // Create PBO for RGBA pixels
  p_glGenBuffers(1, &gPBO);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  p_glBufferData(GL_PIXEL_UNPACK_BUFFER, (ptrdiff_t)(gW * gH * 4), 0, GL_DYNAMIC_DRAW);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  // Create texture
  glGenTextures(1, &gTex);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, gW, gH, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  glBindTexture(GL_TEXTURE_2D, 0);

  return 1;
}

// ===========================================================
//                  Tiny NN (LibTorch -> weights)
// ===========================================================

#define NN_IN 2
#define NN_H 16
#define NN_OUT 3

struct TinyMLPImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
  TinyMLPImpl() {
    fc1 = register_module("fc1", torch::nn::Linear(NN_IN, NN_H));
    fc2 = register_module("fc2", torch::nn::Linear(NN_H, NN_OUT));
  }
  torch::Tensor forward(torch::Tensor x) {
    x = torch::tanh(fc1->forward(x));
    x = torch::tanh(fc2->forward(x));
    return x;
  }
};
TORCH_MODULE(TinyMLP);

static int build_weights_from_libtorch(float* W1, float* b1, float* W2, float* b2)
{
  if(!W1 || !b1 || !W2 || !b2) return 0;

  try {
    torch::NoGradGuard ng;
    torch::manual_seed((uint64_t)time(NULL) ^ (uint64_t)GetTickCount64());
    TinyMLP m;
    m->eval();

    auto w1  = m->fc1->weight.detach().contiguous().to(torch::kCPU);
    auto bb1 = m->fc1->bias.detach().contiguous().to(torch::kCPU);

    auto w2  = m->fc2->weight.detach().contiguous().to(torch::kCPU);
    auto bb2 = m->fc2->bias.detach().contiguous().to(torch::kCPU);

    memcpy(W1, w1.data_ptr<float>(),  sizeof(float)*NN_H*NN_IN);
    memcpy(b1, bb1.data_ptr<float>(), sizeof(float)*NN_H);
    memcpy(W2, w2.data_ptr<float>(),  sizeof(float)*NN_OUT*NN_H);
    memcpy(b2, bb2.data_ptr<float>(), sizeof(float)*NN_OUT);

    return 1;
  }
  catch(const c10::Error& e) {
    printf("\n[LibTorch] Error: %s", e.what());
    return 0;
  }
  catch(...) {
    printf("\n[LibTorch] Unknown error.");
    return 0;
  }
}

// ===========================================================
//                    OpenCL (GL sharing)
// ===========================================================

static int gCL_Ready = 0;

static cl_platform_id   gCL_Platform = 0;
static cl_device_id     gCL_Device   = 0;
static cl_context       gCL_Context  = 0;
static cl_command_queue gCL_Queue    = 0;
static cl_program       gCL_Program  = 0;

static cl_kernel        gCL_K_NN     = 0;

static cl_mem gCL_PBO = 0; // CL view of GL PBO

static cl_mem gCL_W1 = 0;
static cl_mem gCL_b1 = 0;
static cl_mem gCL_W2 = 0;
static cl_mem gCL_b2 = 0;

#define STR2(x) #x
#define XSTR(x) STR2(x)

static const char* gCL_Source =
"__kernel void nn_render(__global uchar4* out, int width, int height,              \n"
"  __global const float* W1, __global const float* b1,                             \n"
"  __global const float* W2, __global const float* b2, float phase, uint seed)    \n"
"{                                                                                 \n"
"  int xpix = (int)get_global_id(0);                                               \n"
"  int ypix = (int)get_global_id(1);                                               \n"
"  if(xpix >= width || ypix >= height) return;                                     \n"
"                                                                                  \n"
"  float x = ((float)xpix / (float)(width  - 1)) * 2.0f - 1.0f;                    \n"
"  float y = ((float)ypix / (float)(height - 1)) * 2.0f - 1.0f;                    \n"
"  uint n = (uint)(xpix*1973u) ^ (uint)(ypix*9277u) ^ (seed*26699u + 911u);        \n"
"  n = (n << 13) ^ n;                                                               \n"
"  uint m = (n * (n*n*15731u + 789221u) + 1376312589u);                            \n"
"  float jitter = ((float)(m & 0x00ffffffu) / 16777215.0f) * 2.0f - 1.0f;          \n"
"  float in0 = 2.8f*x + 0.7f*sin(3.0f*y + phase) + 0.35f*jitter;                   \n"
"  float in1 = -2.8f*y + 0.7f*cos(3.0f*x - 1.3f*phase) - 0.35f*jitter;             \n"
"                                                                                  \n"
"  float h[" XSTR(NN_H) "];                                                        \n"
"  for(int j=0;j<" XSTR(NN_H) ";j++){                                              \n"
"    float acc = b1[j];                                                            \n"
"    acc += in0 * W1[j*" XSTR(NN_IN) " + 0];                                       \n"
"    acc += in1 * W1[j*" XSTR(NN_IN) " + 1];                                       \n"
"    h[j] = tanh(acc);                                                             \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float o[" XSTR(NN_OUT) "];                                                      \n"
"  for(int k=0;k<" XSTR(NN_OUT) ";k++){                                            \n"
"    float acc = b2[k];                                                            \n"
"    for(int j=0;j<" XSTR(NN_H) ";j++){                                            \n"
"      acc += h[j] * W2[k*" XSTR(NN_H) " + j];                                     \n"
"    }                                                                             \n"
"    float s = 0.5f + 0.5f*tanh(acc);                                              \n"
"    if(s<0) s=0; if(s>1) s=1;                                                     \n"
"    o[k] = s;                                                                     \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float radial = sqrt(x*x + y*y);                                                 \n"
"  float vignette = clamp(1.15f - radial, 0.0f, 1.0f);                             \n"
"  float stripe = 0.5f + 0.5f*sin(10.0f*(x + y) + phase + 2.0f*jitter);            \n"
"  float rcol = clamp(0.70f*o[0] + 0.30f*stripe, 0.0f, 1.0f) * vignette;           \n"
"  float gcol = clamp(0.85f*o[1] + 0.15f*(1.0f - stripe), 0.0f, 1.0f) * vignette;  \n"
"  float bcol = clamp(0.75f*o[2] + 0.25f*(0.5f + 0.5f*cos(8.0f*x - phase)),0.0f,1.0f);\n"
"  uchar r = (uchar)(255.0f*rcol);                                                 \n"
"  uchar g = (uchar)(255.0f*gcol);                                                 \n"
"  uchar b = (uchar)(255.0f*bcol);                                                 \n"
"  out[ypix*width + xpix] = (uchar4)(r,g,b,255);                                   \n"
"}                                                                                 \n";

static void cl_release_all()
{
  if(gCL_b2) { clReleaseMemObject(gCL_b2); gCL_b2 = 0; }
  if(gCL_W2) { clReleaseMemObject(gCL_W2); gCL_W2 = 0; }
  if(gCL_b1) { clReleaseMemObject(gCL_b1); gCL_b1 = 0; }
  if(gCL_W1) { clReleaseMemObject(gCL_W1); gCL_W1 = 0; }

  if(gCL_PBO)    { clReleaseMemObject(gCL_PBO);    gCL_PBO = 0; }
  if(gCL_K_NN)   { clReleaseKernel(gCL_K_NN);      gCL_K_NN = 0; }
  if(gCL_Program){ clReleaseProgram(gCL_Program);  gCL_Program = 0; }
  if(gCL_Queue)  { clReleaseCommandQueue(gCL_Queue); gCL_Queue = 0; }
  if(gCL_Context){ clReleaseContext(gCL_Context);  gCL_Context = 0; }

  gCL_Device = 0;
  gCL_Platform = 0;
  gCL_Ready = 0;
}

static int cl_pick_device_with_glshare(cl_platform_id* outP, cl_device_id* outD)
{
  cl_uint nPlatforms = 0;
  if(clGetPlatformIDs(0, 0, &nPlatforms) != CL_SUCCESS || nPlatforms == 0)
    return 0;

  cl_platform_id platforms[8];
  if(nPlatforms > 8) nPlatforms = 8;
  if(clGetPlatformIDs(nPlatforms, platforms, &nPlatforms) != CL_SUCCESS)
    return 0;

  for(cl_uint p=0; p<nPlatforms; p++)
  {
    cl_uint nDev = 0;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 0, 0, &nDev) != CL_SUCCESS || nDev == 0)
      continue;

    cl_device_id devs[8];
    if(nDev > 8) nDev = 8;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, nDev, devs, &nDev) != CL_SUCCESS)
      continue;

    for(cl_uint d=0; d<nDev; d++)
    {
      char ext[8192];
      size_t sz = 0;
      if(clGetDeviceInfo(devs[d], CL_DEVICE_EXTENSIONS, sizeof(ext), ext, &sz) != CL_SUCCESS)
        continue;

      if(strstr(ext, "cl_khr_gl_sharing"))
      {
        *outP = platforms[p];
        *outD = devs[d];
        return 1;
      }
    }
  }

  return 0;
}

static int cl_init_glshare()
{
  cl_int err = CL_SUCCESS;

  cl_platform_id P = 0;
  cl_device_id   D = 0;

  if(!cl_pick_device_with_glshare(&P, &D)) {
    printf("\nOpenCL: no GPU device with cl_khr_gl_sharing found.");
    return 0;
  }

  gCL_Platform = P;
  gCL_Device   = D;

  cl_context_properties props[] = {
    CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
    CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
    CL_CONTEXT_PLATFORM, (cl_context_properties)gCL_Platform,
    0
  };

  gCL_Context = clCreateContext(props, 1, &gCL_Device, 0, 0, &err);
  if(err != CL_SUCCESS || !gCL_Context) { cl_release_all(); return 0; }

  gCL_Queue = clCreateCommandQueue(gCL_Context, gCL_Device, 0, &err);
  if(err != CL_SUCCESS || !gCL_Queue) { cl_release_all(); return 0; }

  gCL_Program = clCreateProgramWithSource(gCL_Context, 1, &gCL_Source, 0, &err);
  if(err != CL_SUCCESS || !gCL_Program) { cl_release_all(); return 0; }

  err = clBuildProgram(gCL_Program, 1, &gCL_Device, 0, 0, 0);
  if(err != CL_SUCCESS)
  {
    char logbuf[8192];
    logbuf[0] = 0; // clGetProgramBuildInfo may fail and leave the buffer untouched
    size_t logsz = 0;
    clGetProgramBuildInfo(gCL_Program, gCL_Device, CL_PROGRAM_BUILD_LOG, sizeof(logbuf), logbuf, &logsz);
    printf("\nOpenCL build failed:\n%s", logbuf);
    cl_release_all();
    return 0;
  }

  gCL_K_NN = clCreateKernel(gCL_Program, "nn_render", &err);
  if(err != CL_SUCCESS || !gCL_K_NN) { cl_release_all(); return 0; }

  gCL_PBO = clCreateFromGLBuffer(gCL_Context, CL_MEM_WRITE_ONLY, gPBO, &err);
  if(err != CL_SUCCESS || !gCL_PBO) { cl_release_all(); return 0; }

  size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
  size_t bytesb1 = sizeof(float)*(size_t)NN_H;
  size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
  size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;

  gCL_W1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW1, 0, &err);
  gCL_b1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb1, 0, &err);
  gCL_W2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW2, 0, &err);
  gCL_b2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb2, 0, &err);
  if(err != CL_SUCCESS || !gCL_W1 || !gCL_b1 || !gCL_W2 || !gCL_b2) { cl_release_all(); return 0; }

  float hW1[NN_H*NN_IN];
  float hb1[NN_H];
  float hW2[NN_OUT*NN_H];
  float hb2[NN_OUT];

  if(!build_weights_from_libtorch(hW1, hb1, hW2, hb2)) {
    printf("\n[LibTorch] Failed to build weights.");
    cl_release_all();
    return 0;
  }

  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_TRUE, 0, bytesW1, hW1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_TRUE, 0, bytesb1, hb1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_TRUE, 0, bytesW2, hW2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_TRUE, 0, bytesb2, hb2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }

  gCL_Ready = 1;
  printf("\nOpenCL: GL-sharing enabled. NN kernel ready.");
  return 1;
}

// ===========================================================
//                      Render (CL -> GL)
// ===========================================================

static void RenderFrame()
{
  if(!gCL_Ready) return;

  size_t global[2] = { (size_t)gW, (size_t)gH };
  size_t local[2]  = { 16, 16 };

  cl_int err = CL_SUCCESS;

  err = clEnqueueAcquireGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  if(err != CL_SUCCESS) return;

  LARGE_INTEGER qpc;
  QueryPerformanceCounter(&qpc);
  gNoiseSeed = (unsigned int)((ULONGLONG)qpc.QuadPart ^ ((ULONGLONG)qpc.QuadPart >> 32) ^ GetTickCount64());

  int arg = 0;
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_PBO);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gW);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gH);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float),  &gPhase);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(unsigned int), &gNoiseSeed);

  err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, local, 0, 0, 0);
  if(err != CL_SUCCESS) {
    err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, 0, 0, 0, 0);
  }

  clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  clFinish(gCL_Queue);

  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, gW, gH, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  glClear(GL_COLOR_BUFFER_BIT);
  glEnable(GL_TEXTURE_2D);
  glBindTexture(GL_TEXTURE_2D, gTex);

  glBegin(GL_QUADS);
    glTexCoord2f(0,0); glVertex2f(-1,-1);
    glTexCoord2f(1,0); glVertex2f( 1,-1);
    glTexCoord2f(1,1); glVertex2f( 1, 1);
    glTexCoord2f(0,1); glVertex2f(-1, 1);
  glEnd();

  glBindTexture(GL_TEXTURE_2D, 0);
  SwapBuffers(gHdc);

  gPhase += 0.03f;
}

// ===========================================================
//                         WinMain
// ===========================================================

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
  // 0 means no auto-close; window stays until user closes it.
  const int maxSeconds = read_env_int("MENDB02_MAX_SECONDS", 0);
  ULONGLONG startTick = GetTickCount64();

  const char* szClass = "Mendb02NNCLGLClass";
  UnregisterClassA(szClass, hInst);

  WNDCLASSEXA wc;
  ZeroMemory(&wc, sizeof(wc));
  wc.cbSize = sizeof(wc);
  wc.style = CS_HREDRAW | CS_VREDRAW;
  wc.lpfnWndProc = WndProc;
  wc.hInstance = hInst;
  wc.hCursor = LoadCursor(NULL, IDC_ARROW);
  wc.lpszClassName = szClass;
  RegisterClassExA(&wc);

  RECT r;
  r.left=0; r.top=0; r.right=gW; r.bottom=gH;
  AdjustWindowRect(&r, WS_OVERLAPPEDWINDOW, FALSE);

  HWND hwnd = CreateWindowExA(
    0, szClass, "NN Render (LibTorch weights + OpenCL + OpenGL)",
    WS_OVERLAPPEDWINDOW,
    100, 100, (r.right-r.left), (r.bottom-r.top),
    0, 0, hInst, 0);

  if(!hwnd) return 0;

  ShowWindow(hwnd, SW_SHOW);
  UpdateWindow(hwnd);

  if(!gl_init_wgl(hwnd))
  {
    MessageBoxA(hwnd, "OpenGL init failed", "Error", MB_OK);
    gl_release_all();
    return 0;
  }

  if(!cl_init_glshare())
  {
    MessageBoxA(hwnd, "OpenCL GL-sharing init failed", "Error", MB_OK);
    cl_release_all();
    gl_release_all();
    return 0;
  }

  MSG msg;
  ZeroMemory(&msg, sizeof(msg));

  while(msg.message != WM_QUIT)
  {
    while(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
    }

    // Allow Zorro STOP to close this Win32 loop cleanly, but ignore
    // the sticky FIRSTINITRUN+EXITRUN combo seen at startup.
    if(is(EXITRUN) && !is(FIRSTINITRUN)) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    if(!IsWindow(hwnd))
      break;

    if(maxSeconds > 0 && (GetTickCount64() - startTick) >= (ULONGLONG)maxSeconds * 1000ULL) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    RenderFrame();
  }

  cl_release_all();
  gl_release_all();
  gHwnd = 0;
  return 0;
}

// ===========================================================
//                         Input
// ===========================================================

LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
  switch(msg)
  {
    case WM_CLOSE:
      DestroyWindow(hWnd);
      return 0;

    case WM_KEYDOWN:
      if(wParam == VK_ESCAPE || wParam == VK_F12) {
        PostMessage(hWnd, WM_CLOSE, 0, 0);
        return 0;
      }
      return 0;

    case WM_DESTROY:
      PostQuitMessage(0);
      return 0;
  }
  return DefWindowProc(hWnd, msg, wParam, lParam);
}

// ===========================================================
//                         Zorro DLL entry
// ===========================================================

DLLFUNC int main()
{
  // Force single-cycle execution in Zorro to avoid automatic relaunches.
  NumTotalCycles = 1;
  NumWFOCycles = 1;
  NumSampleCycles = 1;
  set(TESTNOW|OFF,ALLCYCLES|OFF,PARAMETERS|OFF,FACTORS|OFF,RULES|OFF);

  static int done = 0;
  if(is(FIRSTINITRUN))
    done = 0;

  if(done)
    return 0;

  (void)WinMain(GetModuleHandleA(NULL), NULL, GetCommandLineA(), SW_SHOWDEFAULT);
  done = 1;
  return quit("!Mendb02 finished");
}

NeuroWeave Render Bridge [Re: TipmyPip] #489289
5 hours ago
Joined: Sep 2017
Posts: 276
TipmyPip Online OP
Member
NeuroWeave Render Bridge is a single-file demonstration that stitches together three normally separate domains into one continuous runtime loop: a neural modeling domain provided by LibTorch, a massively parallel compute domain provided by OpenCL, and a real-time display domain provided by OpenGL through the Win32 windowing system. The program’s purpose is not traditional training, and it is not a trading strategy in the usual sense. Instead, it is a proof-of-integration pattern: it shows how to safely combine a machine learning library with a compute kernel and a graphics pipeline inside the same binary, while also being compatible with Zorro’s DLL lifecycle.

The story begins with defensive integration. The file uses a strict include order: LibTorch comes first, Zorro comes after, and then macro cleanup happens before OpenCL and OpenGL headers are introduced. This ordering is a practical requirement because both LibTorch and Zorro bring global identifiers and macros that can collide. The code explicitly renames one of Zorro’s short identifiers before including the Zorro header, then restores it afterward. Immediately after that, it removes common macro definitions such as min, max, abs, and other short names that can silently rewrite later code. This part is not glamorous, but it is crucial: it ensures that when the program says “tanh” or “abs” or “min,” it gets the intended function and not an accidental macro substitution. In a hybrid system like this, “mathematical correctness” starts with compile-time hygiene.

Once the compilation environment is stabilized, the program constructs the display side using Win32 and OpenGL. It creates a window class, spawns a window, and then establishes a WGL context, which is the Windows pathway for binding OpenGL rendering to that window. The OpenGL configuration is intentionally minimal: no depth test, a fixed viewport, and a simple texture-based draw. Instead of drawing complex geometry, it draws a single textured quad that covers the screen. This keeps the display pipeline simple and reliable. The key OpenGL objects are a pixel buffer object and a texture. The pixel buffer object is a GPU-resident memory region sized to hold one frame of pixels in four channels. The texture is allocated to match the window size, and it is configured with nearest-neighbor filtering so the program’s pixel output appears crisp without interpolation artifacts. In this architecture, the texture is the final display surface, but the pixel buffer object is the intermediate staging region that can be shared with OpenCL.

The compute side is built around OpenCL with OpenGL sharing enabled. This is where the most important relationship between OpenCL and OpenGL appears. OpenCL and OpenGL can both operate on GPU memory, but they usually do so in separate ecosystems. Sharing is the mechanism that allows a buffer created in OpenGL to be directly visible to OpenCL, without copying data through the CPU. The program searches for a GPU device that advertises the OpenCL extension required for OpenGL interoperability. Once it finds a suitable device, it creates an OpenCL context that is explicitly linked to the active OpenGL context and the current device context. That linkage is established through context properties that pass the current OpenGL context and the window device context into OpenCL. Symbolically, this step is an agreement: OpenCL is allowed to work on objects that OpenGL created, but only under the rules of this shared context.

After the shared context is created, the program compiles an OpenCL kernel from source embedded as a string. The kernel is a per-pixel renderer that writes RGBA color values into an output buffer. That output buffer is not an ordinary OpenCL buffer in this design; it is a handle created by wrapping the OpenGL pixel buffer object as an OpenCL memory object. This is the heart of the bridge: the same physical memory region is treated as an OpenCL output surface during computation and as an OpenGL pixel source during rendering.

Next comes the learning side. LibTorch is used to define and initialize a tiny multilayer perceptron. The network is deliberately small: it accepts two inputs, produces a hidden representation of moderate size, and outputs three channels that will later be interpreted as color components. The model uses a smooth nonlinearity in each layer to produce continuous output. The important conceptual relationship between LibTorch and OpenCL is representation. LibTorch stores parameters as tensors with metadata and potential device placement. OpenCL wants raw arrays in contiguous memory blocks. The code therefore builds the model, switches it into evaluation mode, extracts the weight matrices and bias vectors, forces them into CPU memory and contiguous layout, and copies them into plain float arrays. Those arrays become the canonical parameter representation for the rest of the system.

The program then uploads those parameters into OpenCL buffers. Each parameter block is stored in its own OpenCL buffer and marked read-only, because the kernel treats them as constants during inference. This stage establishes the first half of the mathematical relationship between LibTorch and OpenCL: LibTorch authors a function by defining parameter values, and OpenCL consumes those values to evaluate the function at a much larger scale than a CPU loop could easily manage. In other words, LibTorch supplies the “shape” of the neural mapping through weights, while OpenCL supplies the “reach” by running the same mapping across a full two-dimensional grid of pixels.

This version extends the bridge by adding parameter evolution on the host. After the initial weights are produced by LibTorch and uploaded to OpenCL, the program continues to modify the parameters over time. It maintains host-side copies of all parameters in arrays and, on each frame, applies a small update step that nudges parameters based on neighboring parameter values, a slow oscillatory drift tied to the phase, and a small random disturbance derived from a per-frame seed. This evolution is not training in the machine learning sense; it is a procedural mutation rule that makes the network’s behavior shift gradually as the animation runs. The code packs all parameters into a single linear list, computes a new list by blending each parameter with its neighbors and adding controlled drift and noise, clamps the resulting values to keep them within a reasonable bound, and then writes them back into the structured parameter arrays. It then applies a secondary balancing step that pulls the means of different parameter groups toward each other, which prevents one part of the network from drifting too far away in magnitude compared to the others. This creates a self-stabilizing parameter motion that is visually interesting while remaining bounded.

The relationship between this evolving parameter process and OpenCL is straightforward: each frame, after host-side evolution runs, the updated parameter arrays are written into the OpenCL buffers again. This means the OpenCL kernel always sees a fresh set of weights and biases, which makes each frame’s neural inference slightly different. The writes are performed without blocking wherever possible, and they are synchronized before rendering completes through command queue finishing. This is a classic producer-consumer rhythm: the CPU produces new parameters, OpenCL consumes them to generate pixels, and OpenGL consumes those pixels to display the frame.
The relationship between this evolving parameter process and OpenCL is straightforward: each frame, after the host-side evolution step runs, the updated parameter arrays are written back into the OpenCL buffers. The kernel therefore sees a fresh set of weights and biases every frame, so each frame's neural inference differs slightly from the last. The writes are enqueued as non-blocking transfers where possible and synchronized with a final queue finish before the frame is presented. This is a classic producer-consumer rhythm: the CPU produces new parameters, OpenCL consumes them to generate pixels, and OpenGL consumes those pixels to display the frame.

Inside the OpenCL kernel, the mapping from pixel location to neural inputs is done in a coordinate space normalized to a convenient range. The kernel derives two input values from the spatial coordinates, the phase, and a per-pixel jitter term. The jitter term comes from a deterministic hash-style mixing function seeded with a per-frame noise seed and pixel coordinates. That means the jitter is consistent for a given frame but changes across frames because the seed changes. The kernel evaluates the hidden layer by multiplying inputs by weights, adding biases, and applying the nonlinearity. It then evaluates the output layer similarly and produces three bounded output values. Those outputs are then mixed with simple procedural effects like stripes and a radial vignette to create a visually structured image. Finally, the kernel writes RGBA bytes into the shared output buffer.
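The jitter hash is simple enough to reproduce on the host. This sketch mirrors the kernel's integer scramble in plain C++ (`pixel_jitter` is an illustrative name; the kernel inlines this logic rather than calling a function):

```cpp
#include <cstdint>

// Deterministic per-pixel jitter, mirroring the hash in the OpenCL kernel:
// the pixel coordinates and a per-frame seed are scrambled through integer
// multiplies and shifts, and the low 24 bits are mapped into [-1, 1].
// Same (x, y, seed) always yields the same value, so the noise is stable
// within a frame but changes whenever the frame seed changes.
float pixel_jitter(int xpix, int ypix, uint32_t seed) {
    uint32_t n = (uint32_t)(xpix * 1973u) ^ (uint32_t)(ypix * 9277u)
               ^ (seed * 26699u + 911u);
    n = (n << 13) ^ n;
    uint32_t m = n * (n * n * 15731u + 789221u) + 1376312589u;
    return ((float)(m & 0x00ffffffu) / 16777215.0f) * 2.0f - 1.0f;
}
```

Because the function is pure, no random state needs to live on the GPU: every work-item computes its own jitter independently, which is what makes the noise both parallel-safe and frame-reproducible.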

The OpenCL and OpenGL relationship is protected by explicit ownership transfers. Before the kernel runs, the program acquires the shared OpenGL buffer for OpenCL use. After the kernel finishes, it releases the buffer back to OpenGL. This acquire and release sequence is the synchronization contract that prevents OpenGL from reading pixels while OpenCL is still writing them. After release and a final finish call, the OpenGL side updates the texture from the pixel buffer object and draws it to the screen. No CPU readback is needed. The GPU-to-GPU pathway remains intact throughout the loop, which is the principal performance benefit of the CL and GL sharing mechanism.

The program is also wrapped in a Zorro-friendly entry point. It forces Zorro to run only a single cycle to avoid repeated launches, and it watches for Zorro exit flags so that a stop request can cleanly close the window and release resources. This allows the demo to be launched from within a Zorro environment while still behaving like a normal Win32 graphical program.

In symbolic terms, NeuroWeave Render Bridge is a moving tapestry built from three looms. LibTorch defines the weave pattern by providing neural parameters. The host evolution logic slowly changes that pattern over time, like a hand adjusting threads. OpenCL performs the weaving at scale by evaluating the network for every pixel in parallel and writing the resulting colors into a shared canvas. OpenGL then displays the canvas with minimal overhead, completing the loop. The significance of the code lies in the disciplined interfaces between systems: clear naming hygiene, careful memory representation changes, explicit synchronization, and a predictable frame pipeline that can be controlled under a host application’s lifecycle.

Code
// Mendb02.cpp
// Win32 + WGL(OpenGL) display + OpenCL compute (CL/GL sharing)
// + Tiny Neural Net inference per pixel (OpenCL kernel) using weights from LibTorch.

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif

#define _CRT_SECURE_NO_WARNINGS

// ============================================================
// 1) Include LibTorch FIRST (like your working file)
//    Public/shareable variant: no machine-specific include paths.
// ============================================================
#if defined(__has_include)
  #if __has_include(<torch/torch.h>) && __has_include(<torch/script.h>)
    #include <torch/torch.h>
    #include <torch/script.h>
  #else
    #error "LibTorch headers not found. Add LibTorch include paths to your build configuration."
  #endif
#else
  #include <torch/torch.h>
  #include <torch/script.h>
#endif

// (Optional) CUDA headers (safe pattern used by your working file)
// Keep them conditional so CPU-only LibTorch setups still compile.
#if defined(__has_include)
  #if __has_include(<torch/cuda.h>)
    #include <torch/cuda.h>
    #define HAVE_TORCH_CUDA_HEADER 1
  #else
    #define HAVE_TORCH_CUDA_HEADER 0
  #endif
  #if __has_include(<cuda_runtime_api.h>)
    #include <cuda_runtime_api.h>
    #define HAVE_CUDA_RUNTIME_API_HEADER 1
  #else
    #define HAVE_CUDA_RUNTIME_API_HEADER 0
  #endif
#else
  #define HAVE_TORCH_CUDA_HEADER 0
  #define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif

#if defined(__has_include)
  #if __has_include(<c10/cuda/CUDAGuard.h>) && __has_include(<c10/cuda/impl/cuda_cmake_macros.h>)
    #include <c10/cuda/CUDAGuard.h>
    #define HAVE_C10_CUDAGUARD 1
  #else
    #define HAVE_C10_CUDAGUARD 0
  #endif
#else
  #define HAVE_C10_CUDAGUARD 0
#endif

// ============================================================
// 2) Standard headers
// ============================================================
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>

// ============================================================
// 3) Include Zorro AFTER torch, rename Zorro's 'at' to avoid conflict
//    (exact pattern from your working file)
// ============================================================
#define at zorro_at
#ifdef LOG
#undef LOG
#endif
#include <zorro.h>
#undef at

// ============================================================
// 4) Cleanup macro landmines (exact style from your working file)
// ============================================================
#ifdef min
#undef min
#endif
#ifdef max
#undef max
#endif
#ifdef ref
#undef ref
#endif
#ifdef swap
#undef swap
#endif
#ifdef abs
#undef abs
#endif

#ifdef NTF
#undef NTF
#endif
#ifdef LOOKBACK
#undef LOOKBACK
#endif
#ifdef BINS
#undef BINS
#endif

// ============================================================
// OpenCL + OpenGL includes (after the macro cleanup is safest)
// ============================================================
#include <CL/cl.h>
#include <CL/cl_gl.h>     // cl_khr_gl_sharing
#include <CL/cl_gl_ext.h> // CL_GL_CONTEXT_KHR / CL_WGL_HDC_KHR
#include <GL/gl.h>

#ifndef GL_RGBA8
#define GL_RGBA8 0x8058
#endif

// ------------------------- Globals -------------------------
static HWND   gHwnd = 0;
static HDC    gHdc  = 0;
static HGLRC  gHgl  = 0;

static int    gW = 640;
static int    gH = 480;
static float  gPhase = 0.0f;
static unsigned int gNoiseSeed = 1u;

static int read_env_int(const char* key, int fallback)
{
  const char* s = getenv(key);
  if(!s || !*s) return fallback;
  int v = atoi(s);
  return (v > 0) ? v : fallback;
}

// ------------------------- WinProc forward -------------------------
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);

// ===========================================================
//                Minimal OpenGL function loading
// ===========================================================

#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#endif
#ifndef GL_PIXEL_UNPACK_BUFFER
#define GL_PIXEL_UNPACK_BUFFER 0x88EC
#endif
#ifndef GL_DYNAMIC_DRAW
#define GL_DYNAMIC_DRAW 0x88E8
#endif

#ifndef APIENTRY
#define APIENTRY __stdcall
#endif
#ifndef APIENTRYP
#define APIENTRYP APIENTRY *
#endif

typedef void (APIENTRYP PFNGLGENBUFFERSPROC)(GLsizei, GLuint*);
typedef void (APIENTRYP PFNGLBINDBUFFERPROC)(GLenum, GLuint);
typedef void (APIENTRYP PFNGLBUFFERDATAPROC)(GLenum, ptrdiff_t, const void*, GLenum);
typedef void (APIENTRYP PFNGLDELETEBUFFERSPROC)(GLsizei, const GLuint*);

static PFNGLGENBUFFERSPROC    p_glGenBuffers    = 0;
static PFNGLBINDBUFFERPROC    p_glBindBuffer    = 0;
static PFNGLBUFFERDATAPROC    p_glBufferData    = 0;
static PFNGLDELETEBUFFERSPROC p_glDeleteBuffers = 0;

static void* gl_get_proc(const char* name)
{
  void* p = (void*)wglGetProcAddress(name);
  if(!p) {
    HMODULE ogl = GetModuleHandleA("opengl32.dll");
    if(ogl) p = (void*)GetProcAddress(ogl, name);
  }
  return p;
}

static int gl_load_ext()
{
  p_glGenBuffers    = (PFNGLGENBUFFERSPROC)gl_get_proc("glGenBuffers");
  p_glBindBuffer    = (PFNGLBINDBUFFERPROC)gl_get_proc("glBindBuffer");
  p_glBufferData    = (PFNGLBUFFERDATAPROC)gl_get_proc("glBufferData");
  p_glDeleteBuffers = (PFNGLDELETEBUFFERSPROC)gl_get_proc("glDeleteBuffers");

  if(!p_glGenBuffers || !p_glBindBuffer || !p_glBufferData || !p_glDeleteBuffers)
    return 0;
  return 1;
}

// ===========================================================
//                       OpenGL objects
// ===========================================================

static GLuint gPBO = 0;
static GLuint gTex = 0;

static void gl_release_all()
{
  if(gTex) {
    glDeleteTextures(1, &gTex);
    gTex = 0;
  }
  if(gPBO) {
    if(p_glDeleteBuffers) p_glDeleteBuffers(1, &gPBO);
    gPBO = 0;
  }

  if(gHgl) { wglMakeCurrent(NULL, NULL); wglDeleteContext(gHgl); gHgl = 0; }
  if(gHdc && gHwnd) { ReleaseDC(gHwnd, gHdc); gHdc = 0; }
}

static int gl_init_wgl(HWND hwnd)
{
  gHwnd = hwnd;
  gHdc = GetDC(hwnd);
  if(!gHdc) return 0;

  PIXELFORMATDESCRIPTOR pfd;
  ZeroMemory(&pfd, sizeof(pfd));
  pfd.nSize      = sizeof(pfd);
  pfd.nVersion   = 1;
  pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
  pfd.iPixelType = PFD_TYPE_RGBA;
  pfd.cColorBits = 32;
  pfd.cDepthBits = 16;
  pfd.iLayerType = PFD_MAIN_PLANE;

  int pf = ChoosePixelFormat(gHdc, &pfd);
  if(pf == 0) return 0;
  if(!SetPixelFormat(gHdc, pf, &pfd)) return 0;

  gHgl = wglCreateContext(gHdc);
  if(!gHgl) return 0;
  if(!wglMakeCurrent(gHdc, gHgl)) return 0;

  if(!gl_load_ext()) {
    printf("\nOpenGL buffer functions not available (need VBO/PBO support).");
    return 0;
  }

  glDisable(GL_DEPTH_TEST);
  glViewport(0, 0, gW, gH);

  // Create PBO for RGBA pixels
  p_glGenBuffers(1, &gPBO);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  p_glBufferData(GL_PIXEL_UNPACK_BUFFER, (ptrdiff_t)(gW * gH * 4), 0, GL_DYNAMIC_DRAW);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  // Create texture
  glGenTextures(1, &gTex);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, gW, gH, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  glBindTexture(GL_TEXTURE_2D, 0);

  return 1;
}

// ===========================================================
//                  Tiny NN (LibTorch -> weights)
// ===========================================================

#define NN_IN 2
#define NN_H 16
#define NN_OUT 3
#define NN_PARAM_COUNT (NN_H*NN_IN + NN_H + NN_OUT*NN_H + NN_OUT)

static float gHost_W1[NN_H*NN_IN];
static float gHost_b1[NN_H];
static float gHost_W2[NN_OUT*NN_H];
static float gHost_b2[NN_OUT];

struct TinyMLPImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
  TinyMLPImpl() {
    fc1 = register_module("fc1", torch::nn::Linear(NN_IN, NN_H));
    fc2 = register_module("fc2", torch::nn::Linear(NN_H, NN_OUT));
  }
  torch::Tensor forward(torch::Tensor x) {
    x = torch::tanh(fc1->forward(x));
    x = torch::tanh(fc2->forward(x));
    return x;
  }
};
TORCH_MODULE(TinyMLP);

static int build_weights_from_libtorch(float* W1, float* b1, float* W2, float* b2)
{
  if(!W1 || !b1 || !W2 || !b2) return 0;

  try {
    torch::NoGradGuard ng;
    torch::manual_seed((uint64_t)time(NULL) ^ (uint64_t)GetTickCount64());
    TinyMLP m;
    m->eval();

    auto w1  = m->fc1->weight.detach().contiguous().to(torch::kCPU);
    auto bb1 = m->fc1->bias.detach().contiguous().to(torch::kCPU);

    auto w2  = m->fc2->weight.detach().contiguous().to(torch::kCPU);
    auto bb2 = m->fc2->bias.detach().contiguous().to(torch::kCPU);

    memcpy(W1, w1.data_ptr<float>(),  sizeof(float)*NN_H*NN_IN);
    memcpy(b1, bb1.data_ptr<float>(), sizeof(float)*NN_H);
    memcpy(W2, w2.data_ptr<float>(),  sizeof(float)*NN_OUT*NN_H);
    memcpy(b2, bb2.data_ptr<float>(), sizeof(float)*NN_OUT);

    return 1;
  }
  catch(const c10::Error& e) {
    printf("\n[LibTorch] Error: %s", e.what());
    return 0;
  }
  catch(...) {
    printf("\n[LibTorch] Unknown error.");
    return 0;
  }
}

// ===========================================================
//                    OpenCL (GL sharing)
// ===========================================================

static int gCL_Ready = 0;

static cl_platform_id   gCL_Platform = 0;
static cl_device_id     gCL_Device   = 0;
static cl_context       gCL_Context  = 0;
static cl_command_queue gCL_Queue    = 0;
static cl_program       gCL_Program  = 0;

static cl_kernel        gCL_K_NN     = 0;

static cl_mem gCL_PBO = 0; // CL view of GL PBO

static cl_mem gCL_W1 = 0;
static cl_mem gCL_b1 = 0;
static cl_mem gCL_W2 = 0;
static cl_mem gCL_b2 = 0;

static void pack_params(float* theta);
static void unpack_params(const float* theta);
static void evolve_params_accumulated(float phase, unsigned int seed);

#define STR2(x) #x
#define XSTR(x) STR2(x)

static const char* gCL_Source =
"__kernel void nn_render(__global uchar4* out, int width, int height,              \n"
"  __global const float* W1, __global const float* b1,                             \n"
"  __global const float* W2, __global const float* b2, float phase, uint seed)    \n"
"{                                                                                 \n"
"  int xpix = (int)get_global_id(0);                                               \n"
"  int ypix = (int)get_global_id(1);                                               \n"
"  if(xpix >= width || ypix >= height) return;                                     \n"
"                                                                                  \n"
"  float x = ((float)xpix / (float)(width  - 1)) * 2.0f - 1.0f;                    \n"
"  float y = ((float)ypix / (float)(height - 1)) * 2.0f - 1.0f;                    \n"
"  uint n = (uint)(xpix*1973u) ^ (uint)(ypix*9277u) ^ (seed*26699u + 911u);        \n"
"  n = (n << 13) ^ n;                                                               \n"
"  uint m = (n * (n*n*15731u + 789221u) + 1376312589u);                            \n"
"  float jitter = ((float)(m & 0x00ffffffu) / 16777215.0f) * 2.0f - 1.0f;          \n"
"  float in0 = 2.8f*x + 0.7f*sin(3.0f*y + phase) + 0.35f*jitter;                   \n"
"  float in1 = -2.8f*y + 0.7f*cos(3.0f*x - 1.3f*phase) - 0.35f*jitter;             \n"
"                                                                                  \n"
"  float h[" XSTR(NN_H) "];                                                        \n"
"  for(int j=0;j<" XSTR(NN_H) ";j++){                                              \n"
"    float acc = b1[j];                                                            \n"
"    acc += in0 * W1[j*" XSTR(NN_IN) " + 0];                                       \n"
"    acc += in1 * W1[j*" XSTR(NN_IN) " + 1];                                       \n"
"    h[j] = tanh(acc);                                                             \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float o[" XSTR(NN_OUT) "];                                                      \n"
"  for(int k=0;k<" XSTR(NN_OUT) ";k++){                                            \n"
"    float acc = b2[k];                                                            \n"
"    for(int j=0;j<" XSTR(NN_H) ";j++){                                            \n"
"      acc += h[j] * W2[k*" XSTR(NN_H) " + j];                                     \n"
"    }                                                                             \n"
"    float s = 0.5f + 0.5f*tanh(acc);                                              \n"
"    if(s<0) s=0; if(s>1) s=1;                                                     \n"
"    o[k] = s;                                                                     \n"
"  }                                                                               \n"
"                                                                                  \n"
"  float radial = sqrt(x*x + y*y);                                                 \n"
"  float vignette = clamp(1.15f - radial, 0.0f, 1.0f);                             \n"
"  float stripe = 0.5f + 0.5f*sin(10.0f*(x + y) + phase + 2.0f*jitter);            \n"
"  float rcol = clamp(0.70f*o[0] + 0.30f*stripe, 0.0f, 1.0f) * vignette;           \n"
"  float gcol = clamp(0.85f*o[1] + 0.15f*(1.0f - stripe), 0.0f, 1.0f) * vignette;  \n"
"  float bcol = clamp(0.75f*o[2] + 0.25f*(0.5f + 0.5f*cos(8.0f*x - phase)),0.0f,1.0f);\n"
"  uchar r = (uchar)(255.0f*rcol);                                                 \n"
"  uchar g = (uchar)(255.0f*gcol);                                                 \n"
"  uchar b = (uchar)(255.0f*bcol);                                                 \n"
"  out[ypix*width + xpix] = (uchar4)(r,g,b,255);                                   \n"
"}                                                                                 \n";

static void cl_release_all()
{
  if(gCL_b2) { clReleaseMemObject(gCL_b2); gCL_b2 = 0; }
  if(gCL_W2) { clReleaseMemObject(gCL_W2); gCL_W2 = 0; }
  if(gCL_b1) { clReleaseMemObject(gCL_b1); gCL_b1 = 0; }
  if(gCL_W1) { clReleaseMemObject(gCL_W1); gCL_W1 = 0; }

  if(gCL_PBO)    { clReleaseMemObject(gCL_PBO);    gCL_PBO = 0; }
  if(gCL_K_NN)   { clReleaseKernel(gCL_K_NN);      gCL_K_NN = 0; }
  if(gCL_Program){ clReleaseProgram(gCL_Program);  gCL_Program = 0; }
  if(gCL_Queue)  { clReleaseCommandQueue(gCL_Queue); gCL_Queue = 0; }
  if(gCL_Context){ clReleaseContext(gCL_Context);  gCL_Context = 0; }

  gCL_Device = 0;
  gCL_Platform = 0;
  gCL_Ready = 0;
}

static int cl_pick_device_with_glshare(cl_platform_id* outP, cl_device_id* outD)
{
  cl_uint nPlatforms = 0;
  if(clGetPlatformIDs(0, 0, &nPlatforms) != CL_SUCCESS || nPlatforms == 0)
    return 0;

  cl_platform_id platforms[8];
  if(nPlatforms > 8) nPlatforms = 8;
  if(clGetPlatformIDs(nPlatforms, platforms, &nPlatforms) != CL_SUCCESS)
    return 0;

  for(cl_uint p=0; p<nPlatforms; p++)
  {
    cl_uint nDev = 0;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 0, 0, &nDev) != CL_SUCCESS || nDev == 0)
      continue;

    cl_device_id devs[8];
    if(nDev > 8) nDev = 8;
    if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, nDev, devs, &nDev) != CL_SUCCESS)
      continue;

    for(cl_uint d=0; d<nDev; d++)
    {
      char ext[8192];
      size_t sz = 0;
      if(clGetDeviceInfo(devs[d], CL_DEVICE_EXTENSIONS, sizeof(ext), ext, &sz) != CL_SUCCESS)
        continue;

      if(strstr(ext, "cl_khr_gl_sharing"))
      {
        *outP = platforms[p];
        *outD = devs[d];
        return 1;
      }
    }
  }

  return 0;
}

static int cl_init_glshare()
{
  cl_int err = CL_SUCCESS;

  cl_platform_id P = 0;
  cl_device_id   D = 0;

  if(!cl_pick_device_with_glshare(&P, &D)) {
    printf("\nOpenCL: no GPU device with cl_khr_gl_sharing found.");
    return 0;
  }

  gCL_Platform = P;
  gCL_Device   = D;

  cl_context_properties props[] = {
    CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
    CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
    CL_CONTEXT_PLATFORM, (cl_context_properties)gCL_Platform,
    0
  };

  gCL_Context = clCreateContext(props, 1, &gCL_Device, 0, 0, &err);
  if(err != CL_SUCCESS || !gCL_Context) { cl_release_all(); return 0; }

  gCL_Queue = clCreateCommandQueue(gCL_Context, gCL_Device, 0, &err);
  if(err != CL_SUCCESS || !gCL_Queue) { cl_release_all(); return 0; }

  gCL_Program = clCreateProgramWithSource(gCL_Context, 1, &gCL_Source, 0, &err);
  if(err != CL_SUCCESS || !gCL_Program) { cl_release_all(); return 0; }

  err = clBuildProgram(gCL_Program, 1, &gCL_Device, 0, 0, 0);
  if(err != CL_SUCCESS)
  {
    char logbuf[8192];
    size_t logsz = 0;
    clGetProgramBuildInfo(gCL_Program, gCL_Device, CL_PROGRAM_BUILD_LOG, sizeof(logbuf), logbuf, &logsz);
    printf("\nOpenCL build failed:\n%s", logbuf);
    cl_release_all();
    return 0;
  }

  gCL_K_NN = clCreateKernel(gCL_Program, "nn_render", &err);
  if(err != CL_SUCCESS || !gCL_K_NN) { cl_release_all(); return 0; }

  gCL_PBO = clCreateFromGLBuffer(gCL_Context, CL_MEM_WRITE_ONLY, gPBO, &err);
  if(err != CL_SUCCESS || !gCL_PBO) { cl_release_all(); return 0; }

  size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
  size_t bytesb1 = sizeof(float)*(size_t)NN_H;
  size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
  size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;

  gCL_W1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW1, 0, &err);
  gCL_b1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb1, 0, &err);
  gCL_W2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW2, 0, &err);
  gCL_b2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb2, 0, &err);
  if(err != CL_SUCCESS || !gCL_W1 || !gCL_b1 || !gCL_W2 || !gCL_b2) { cl_release_all(); return 0; }

  if(!build_weights_from_libtorch(gHost_W1, gHost_b1, gHost_W2, gHost_b2)) {
    printf("\n[LibTorch] Failed to build weights.");
    cl_release_all();
    return 0;
  }

  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_TRUE, 0, bytesW1, gHost_W1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_TRUE, 0, bytesb1, gHost_b1, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_TRUE, 0, bytesW2, gHost_W2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_TRUE, 0, bytesb2, gHost_b2, 0, 0, 0);
  if(err != CL_SUCCESS) { cl_release_all(); return 0; }

  gCL_Ready = 1;
  printf("\nOpenCL: GL-sharing enabled. NN kernel ready.");
  return 1;
}

// ===========================================================
//                      Render (CL -> GL)
// ===========================================================

static void RenderFrame()
{
  if(!gCL_Ready) return;

  size_t global[2] = { (size_t)gW, (size_t)gH };
  size_t local[2]  = { 16, 16 };

  cl_int err = CL_SUCCESS;

  err = clEnqueueAcquireGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  if(err != CL_SUCCESS) return;

  LARGE_INTEGER qpc;
  QueryPerformanceCounter(&qpc);
  gNoiseSeed = (unsigned int)(qpc.QuadPart ^ (qpc.QuadPart >> 32) ^ (LONGLONG)GetTickCount64());

  evolve_params_accumulated(gPhase, gNoiseSeed);

  size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
  size_t bytesb1 = sizeof(float)*(size_t)NN_H;
  size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
  size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;

  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_FALSE, 0, bytesW1, gHost_W1, 0, 0, 0);
  if(err != CL_SUCCESS) { clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0); clFinish(gCL_Queue); return; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_FALSE, 0, bytesb1, gHost_b1, 0, 0, 0);
  if(err != CL_SUCCESS) { clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0); clFinish(gCL_Queue); return; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_FALSE, 0, bytesW2, gHost_W2, 0, 0, 0);
  if(err != CL_SUCCESS) { clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0); clFinish(gCL_Queue); return; }
  err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_FALSE, 0, bytesb2, gHost_b2, 0, 0, 0);
  if(err != CL_SUCCESS) { clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0); clFinish(gCL_Queue); return; }

  int arg = 0;
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_PBO);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gW);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(int),    &gH);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b1);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b2);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(float),  &gPhase);
  clSetKernelArg(gCL_K_NN, arg++, sizeof(unsigned int), &gNoiseSeed);

  err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, local, 0, 0, 0);
  if(err != CL_SUCCESS) {
    err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, 0, 0, 0, 0);
  }

  clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
  clFinish(gCL_Queue);

  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
  glBindTexture(GL_TEXTURE_2D, gTex);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, gW, gH, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  glClear(GL_COLOR_BUFFER_BIT);
  glEnable(GL_TEXTURE_2D);
  glBindTexture(GL_TEXTURE_2D, gTex);

  glBegin(GL_QUADS);
    glTexCoord2f(0,0); glVertex2f(-1,-1);
    glTexCoord2f(1,0); glVertex2f( 1,-1);
    glTexCoord2f(1,1); glVertex2f( 1, 1);
    glTexCoord2f(0,1); glVertex2f(-1, 1);
  glEnd();

  glBindTexture(GL_TEXTURE_2D, 0);
  SwapBuffers(gHdc);

  gPhase += 0.03f;
}

static void pack_params(float* theta)
{
  int p = 0;
  for(int i=0;i<NN_H*NN_IN;i++) theta[p++] = gHost_W1[i];
  for(int i=0;i<NN_H;i++)       theta[p++] = gHost_b1[i];
  for(int i=0;i<NN_OUT*NN_H;i++)theta[p++] = gHost_W2[i];
  for(int i=0;i<NN_OUT;i++)     theta[p++] = gHost_b2[i];
}

static void unpack_params(const float* theta)
{
  int p = 0;
  for(int i=0;i<NN_H*NN_IN;i++) gHost_W1[i] = theta[p++];
  for(int i=0;i<NN_H;i++)       gHost_b1[i] = theta[p++];
  for(int i=0;i<NN_OUT*NN_H;i++)gHost_W2[i] = theta[p++];
  for(int i=0;i<NN_OUT;i++)     gHost_b2[i] = theta[p++];
}

static unsigned int mix_u32(unsigned int x)
{
  x ^= x >> 16;
  x *= 2246822519u;
  x ^= x >> 13;
  x *= 3266489917u;
  x ^= x >> 16;
  return x;
}

static void evolve_params_accumulated(float phase, unsigned int seed)
{
  float theta[NN_PARAM_COUNT];
  float nextv[NN_PARAM_COUNT];
  pack_params(theta);

  for(int i=0;i<NN_PARAM_COUNT;i++) {
    int l = (i == 0) ? (NN_PARAM_COUNT - 1) : (i - 1);
    int r = (i + 1) % NN_PARAM_COUNT;
    float coupled = 0.55f*theta[l] + 0.45f*theta[r];
    float drift = 0.015f*sinf(0.8f*phase + 0.17f*(float)i);

    unsigned int h = mix_u32(seed ^ (unsigned int)(i*747796405u + 2891336453u));
    float noise = (((float)(h & 0xFFFFu) / 65535.0f) * 2.0f - 1.0f) * 0.010f;

    float v = 0.982f*theta[i] + 0.022f*coupled + drift + noise;
    if(v > 3.0f) v = 3.0f;
    if(v < -3.0f) v = -3.0f;
    nextv[i] = v;
  }

  unpack_params(nextv);

  float mW1 = 0.0f, mb1 = 0.0f, mW2 = 0.0f, mb2 = 0.0f;
  for(int i=0;i<NN_H*NN_IN;i++) mW1 += gHost_W1[i];
  for(int i=0;i<NN_H;i++) mb1 += gHost_b1[i];
  for(int i=0;i<NN_OUT*NN_H;i++) mW2 += gHost_W2[i];
  for(int i=0;i<NN_OUT;i++) mb2 += gHost_b2[i];
  mW1 /= (float)(NN_H*NN_IN);
  mb1 /= (float)NN_H;
  mW2 /= (float)(NN_OUT*NN_H);
  mb2 /= (float)NN_OUT;

  for(int i=0;i<NN_H*NN_IN;i++) gHost_W1[i] += 0.003f*(mb1 - mW1);
  for(int i=0;i<NN_H;i++) gHost_b1[i] += 0.004f*(mW2 - mb1);
  for(int i=0;i<NN_OUT*NN_H;i++) gHost_W2[i] += 0.003f*(mb2 - mW2);
  for(int i=0;i<NN_OUT;i++) gHost_b2[i] += 0.004f*(mW1 - mb2);
}

// ===========================================================
//                         WinMain
// ===========================================================

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
  // 0 means no auto-close; window stays until user closes it.
  const int maxSeconds = read_env_int("MENDB02_MAX_SECONDS", 0);
  ULONGLONG startTick = GetTickCount64();

  const char* szClass = "Mendb02NNCLGLClass";
  UnregisterClassA(szClass, hInst);

  WNDCLASSEXA wc;
  ZeroMemory(&wc, sizeof(wc));
  wc.cbSize = sizeof(wc);
  wc.style = CS_HREDRAW | CS_VREDRAW;
  wc.lpfnWndProc = WndProc;
  wc.hInstance = hInst;
  wc.hCursor = LoadCursor(NULL, IDC_ARROW);
  wc.lpszClassName = szClass;
  RegisterClassExA(&wc);

  RECT r;
  r.left=0; r.top=0; r.right=gW; r.bottom=gH;
  AdjustWindowRect(&r, WS_OVERLAPPEDWINDOW, FALSE);

  HWND hwnd = CreateWindowExA(
    0, szClass, "NN Render (LibTorch weights + OpenCL + OpenGL)",
    WS_OVERLAPPEDWINDOW,
    100, 100, (r.right-r.left), (r.bottom-r.top),
    0, 0, hInst, 0);

  if(!hwnd) return 0;

  ShowWindow(hwnd, SW_SHOW);
  UpdateWindow(hwnd);

  if(!gl_init_wgl(hwnd))
  {
    MessageBoxA(hwnd, "OpenGL init failed", "Error", MB_OK);
    gl_release_all();
    return 0;
  }

  if(!cl_init_glshare())
  {
    MessageBoxA(hwnd, "OpenCL GL-sharing init failed", "Error", MB_OK);
    cl_release_all();
    gl_release_all();
    return 0;
  }

  MSG msg;
  ZeroMemory(&msg, sizeof(msg));

  while(msg.message != WM_QUIT)
  {
    while(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
    }

    // Allow Zorro STOP to close this Win32 loop cleanly, but ignore
    // the sticky FIRSTINITRUN+EXITRUN combo seen at startup.
    if(is(EXITRUN) && !is(FIRSTINITRUN)) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    if(!IsWindow(hwnd))
      break;

    if(maxSeconds > 0 && (GetTickCount64() - startTick) >= (ULONGLONG)maxSeconds * 1000ULL) {
      PostMessage(hwnd, WM_CLOSE, 0, 0);
    }

    RenderFrame();
  }

  cl_release_all();
  gl_release_all();
  gHwnd = 0;
  return 0;
}

// ===========================================================
//                         Input
// ===========================================================

LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
  switch(msg)
  {
    case WM_CLOSE:
      DestroyWindow(hWnd);
      return 0;

    case WM_KEYDOWN:
      if(wParam == VK_ESCAPE || wParam == VK_F12) {
        PostMessage(hWnd, WM_CLOSE, 0, 0);
        return 0;
      }
      return 0;

    case WM_DESTROY:
      PostQuitMessage(0);
      return 0;
  }
  return DefWindowProc(hWnd, msg, wParam, lParam);
}

// ===========================================================
//                         Zorro DLL entry
// ===========================================================

DLLFUNC int main()
{
  // Force single-cycle execution in Zorro to avoid automatic relaunches.
  NumTotalCycles = 1;
  NumWFOCycles = 1;
  NumSampleCycles = 1;
  set(TESTNOW|OFF,ALLCYCLES|OFF,PARAMETERS|OFF,FACTORS|OFF,RULES|OFF);

  static int done = 0;
  if(is(FIRSTINITRUN))
    done = 0;

  if(done)
    return 0;

  (void)WinMain(GetModuleHandleA(NULL), NULL, GetCommandLineA(), SW_SHOWDEFAULT);
  done = 1;
  return quit("!Mendb02 finished");
}

Last edited by TipmyPip; 5 hours ago.