OP
Member
Joined: Sep 2017
Posts: 276
|
NeuroWeave Render Bridge is a single-file demonstration that stitches together three normally separate domains into one continuous runtime loop: a neural modeling domain provided by LibTorch, a massively parallel compute domain provided by OpenCL, and a real-time display domain provided by OpenGL through the Win32 windowing system. The program’s purpose is not traditional training, and it is not a trading strategy in the usual sense. Instead, it is a proof-of-integration pattern: it shows how to safely combine a machine learning library with a compute kernel and a graphics pipeline inside the same binary, while also being compatible with Zorro’s DLL lifecycle. The story begins with defensive integration. The file uses a strict include order: LibTorch comes first, Zorro comes after, and then macro cleanup happens before OpenCL and OpenGL headers are introduced. This ordering is a practical requirement because both LibTorch and Zorro bring global identifiers and macros that can collide. The code explicitly renames one of Zorro’s short identifiers before including the Zorro header, then restores it afterward. Immediately after that, it removes common macro definitions such as min, max, abs, and other short names that can silently rewrite later code. This part is not glamorous, but it is crucial: it ensures that when the program says “tanh” or “abs” or “min,” it gets the intended function and not an accidental macro substitution. In a hybrid system like this, “mathematical correctness” starts with compile-time hygiene. Once the compilation environment is stabilized, the program constructs the display side using Win32 and OpenGL. It creates a window class, spawns a window, and then establishes a WGL context, which is the Windows pathway for binding OpenGL rendering to that window. The OpenGL configuration is intentionally minimal: no depth test, a fixed viewport, and a simple texture-based draw. Instead of drawing complex geometry, it draws a single textured quad that covers the screen. This keeps the display pipeline simple and reliable. The key OpenGL objects are a pixel buffer object and a texture. The pixel buffer object is a GPU-resident memory region sized to hold one frame of pixels in four channels. The texture is allocated to match the window size, and it is configured with nearest-neighbor filtering so the program’s pixel output appears crisp without interpolation artifacts. In this architecture, the texture is the final display surface, but the pixel buffer object is the intermediate staging region that can be shared with OpenCL. The compute side is built around OpenCL with OpenGL sharing enabled. This is where the most important relationship between OpenCL and OpenGL appears. OpenCL and OpenGL can both operate on GPU memory, but they usually do so in separate ecosystems. Sharing is the mechanism that allows a buffer created in OpenGL to be directly visible to OpenCL, without copying data through the CPU. The program searches for a GPU device that advertises the OpenCL extension required for OpenGL interoperability. Once it finds a suitable device, it creates an OpenCL context that is explicitly linked to the active OpenGL context and the current device context. That linkage is established through context properties that pass the current OpenGL context and the window device context into OpenCL. Symbolically, this step is an agreement: OpenCL is allowed to work on objects that OpenGL created, but only under the rules of this shared context. After the shared context is created, the program compiles an OpenCL kernel from source embedded as a string. The kernel is a per-pixel renderer that writes RGBA color values into an output buffer. That output buffer is not an ordinary OpenCL buffer in this design; it is a handle created by wrapping the OpenGL pixel buffer object as an OpenCL memory object. This is the heart of the bridge: the same physical memory region is treated as an OpenCL output surface during computation and as an OpenGL pixel source during rendering. Next comes the learning side. LibTorch is used to define and initialize a tiny multilayer perceptron. The network is deliberately small: it accepts two inputs, produces a hidden representation of moderate size, and outputs three channels that will later be interpreted as color components. The model uses a smooth nonlinearity in each layer to produce continuous output. The important conceptual relationship between LibTorch and OpenCL is representation. LibTorch stores parameters as tensors with metadata and potential device placement. OpenCL wants raw arrays in contiguous memory blocks. The code therefore builds the model, switches it into evaluation mode, extracts the weight matrices and bias vectors, forces them into CPU memory and contiguous layout, and copies them into plain float arrays. Those arrays become the canonical parameter representation for the rest of the system. The program then uploads those parameters into OpenCL buffers. Each parameter block is stored in its own OpenCL buffer and marked read-only, because the kernel treats them as constants during inference. This stage establishes the first half of the mathematical relationship between LibTorch and OpenCL: LibTorch authors a function by defining parameter values, and OpenCL consumes those values to evaluate the function at a much larger scale than a CPU loop could easily manage. In other words, LibTorch supplies the “shape” of the neural mapping through weights, while OpenCL supplies the “reach” by running the same mapping across a full two-dimensional grid of pixels. This version extends the bridge by adding parameter evolution on the host. After the initial weights are produced by LibTorch and uploaded to OpenCL, the program continues to modify the parameters over time. It maintains host-side copies of all parameters in arrays and, on each frame, applies a small update step that nudges parameters based on neighboring parameter values, a slow oscillatory drift tied to the phase, and a small random disturbance derived from a per-frame seed. This evolution is not training in the machine learning sense; it is a procedural mutation rule that makes the network’s behavior shift gradually as the animation runs. The code packs all parameters into a single linear list, computes a new list by blending each parameter with its neighbors and adding controlled drift and noise, clamps the resulting values to keep them within a reasonable bound, and then writes them back into the structured parameter arrays. It then applies a secondary balancing step that pulls the means of different parameter groups toward each other, which prevents one part of the network from drifting too far away in magnitude compared to the others. This creates a self-stabilizing parameter motion that is visually interesting while remaining bounded. The relationship between this evolving parameter process and OpenCL is straightforward: each frame, after host-side evolution runs, the updated parameter arrays are written into the OpenCL buffers again. This means the OpenCL kernel always sees a fresh set of weights and biases, which makes each frame’s neural inference slightly different. The writes are performed without blocking wherever possible, and they are synchronized before rendering completes through command queue finishing. This is a classic producer-consumer rhythm: the CPU produces new parameters, OpenCL consumes them to generate pixels, and OpenGL consumes those pixels to display the frame. Inside the OpenCL kernel, the mapping from pixel location to neural inputs is done in a coordinate space normalized to a convenient range. The kernel derives two input values from the spatial coordinates, the phase, and a per-pixel jitter term. The jitter term comes from a deterministic hash-style mixing function seeded with a per-frame noise seed and pixel coordinates. That means the jitter is consistent for a given frame but changes across frames because the seed changes. The kernel evaluates the hidden layer by multiplying inputs by weights, adding biases, and applying the nonlinearity. It then evaluates the output layer similarly and produces three bounded output values. Those outputs are then mixed with simple procedural effects like stripes and a radial vignette to create a visually structured image. Finally, the kernel writes RGBA bytes into the shared output buffer. The OpenCL and OpenGL relationship is protected by explicit ownership transfers. Before the kernel runs, the program acquires the shared OpenGL buffer for OpenCL use. After the kernel finishes, it releases the buffer back to OpenGL. This acquire and release sequence is the synchronization contract that prevents OpenGL from reading pixels while OpenCL is still writing them. After release and a final finish call, the OpenGL side updates the texture from the pixel buffer object and draws it to the screen. No CPU readback is needed. The GPU-to-GPU pathway remains intact throughout the loop, which is the principal performance benefit of the CL and GL sharing mechanism. The program is also wrapped in a Zorro-friendly entry point. It forces Zorro to run only a single cycle to avoid repeated launches, and it watches for Zorro exit flags so that a stop request can cleanly close the window and release resources. This allows the demo to be launched from within a Zorro environment while still behaving like a normal Win32 graphical program. In symbolic terms, NeuroWeave Render Bridge is a moving tapestry built from three looms. LibTorch defines the weave pattern by providing neural parameters. The host evolution logic slowly changes that pattern over time, like a hand adjusting threads. OpenCL performs the weaving at scale by evaluating the network for every pixel in parallel and writing the resulting colors into a shared canvas. OpenGL then displays the canvas with minimal overhead, completing the loop. The significance of the code lies in the disciplined interfaces between systems: clear naming hygiene, careful memory representation changes, explicit synchronization, and a predictable frame pipeline that can be controlled under a host application’s lifecycle. // Mendb02.cpp
// Win32 + WGL(OpenGL) display + OpenCL compute (CL/GL sharing)
// + Tiny Neural Net inference per pixel (OpenCL kernel) using weights from LibTorch.
#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif
#define _CRT_SECURE_NO_WARNINGS
// ============================================================
// 1) Include LibTorch FIRST (like your working file)
// Public/shareable variant: no machine-specific include paths.
// ============================================================
#if defined(__has_include)
#if __has_include(<torch/torch.h>) && __has_include(<torch/script.h>)
#include <torch/torch.h>
#include <torch/script.h>
#else
#error "LibTorch headers not found. Add LibTorch include paths to your build configuration."
#endif
#else
#include <torch/torch.h>
#include <torch/script.h>
#endif
// (Optional) CUDA headers (safe pattern used by your working file)
// Keep them conditional so CPU-only LibTorch setups still compile.
#if defined(__has_include)
#if __has_include(<torch/cuda.h>)
#include <torch/cuda.h>
#define HAVE_TORCH_CUDA_HEADER 1
#else
#define HAVE_TORCH_CUDA_HEADER 0
#endif
#if __has_include(<cuda_runtime_api.h>)
#include <cuda_runtime_api.h>
#define HAVE_CUDA_RUNTIME_API_HEADER 1
#else
#define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif
#else
#define HAVE_TORCH_CUDA_HEADER 0
#define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif
#if defined(__has_include)
#if __has_include(<c10/cuda/CUDAGuard.h>) && __has_include(<c10/cuda/impl/cuda_cmake_macros.h>)
#include <c10/cuda/CUDAGuard.h>
#define HAVE_C10_CUDAGUARD 1
#else
#define HAVE_C10_CUDAGUARD 0
#endif
#else
#define HAVE_C10_CUDAGUARD 0
#endif
// ============================================================
// 2) Standard headers
// ============================================================
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
// ============================================================
// 3) Include Zorro AFTER torch, rename Zorro's 'at' to avoid conflict
// (exact pattern from your working file)
// ============================================================
#define at zorro_at
#ifdef LOG
#undef LOG
#endif
#include <zorro.h>
#undef at
// ============================================================
// 4) Cleanup macro landmines (exact style from your working file)
// ============================================================
#ifdef min
#undef min
#endif
#ifdef max
#undef max
#endif
#ifdef ref
#undef ref
#endif
#ifdef swap
#undef swap
#endif
#ifdef abs
#undef abs
#endif
#ifdef NTF
#undef NTF
#endif
#ifdef LOOKBACK
#undef LOOKBACK
#endif
#ifdef BINS
#undef BINS
#endif
// ============================================================
// OpenCL + OpenGL includes (after the macro cleanup is safest)
// ============================================================
#include <CL/cl.h>
#include <CL/cl_gl.h> // cl_khr_gl_sharing
#include <CL/cl_gl_ext.h> // CL_GL_CONTEXT_KHR / CL_WGL_HDC_KHR
#include <GL/gl.h>
#ifndef GL_RGBA8
#define GL_RGBA8 0x8058
#endif
// ------------------------- Globals -------------------------
static HWND gHwnd = 0;
static HDC gHdc = 0;
static HGLRC gHgl = 0;
static int gW = 640;
static int gH = 480;
static float gPhase = 0.0f;
static unsigned int gNoiseSeed = 1u;
static int read_env_int(const char* key, int fallback)
{
const char* s = getenv(key);
if(!s || !*s) return fallback;
int v = atoi(s);
return (v > 0) ? v : fallback;
}
// ------------------------- WinProc forward -------------------------
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);
// ===========================================================
// Minimal OpenGL function loading
// ===========================================================
#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#endif
#ifndef GL_PIXEL_UNPACK_BUFFER
#define GL_PIXEL_UNPACK_BUFFER 0x88EC
#endif
#ifndef GL_DYNAMIC_DRAW
#define GL_DYNAMIC_DRAW 0x88E8
#endif
#ifndef APIENTRY
#define APIENTRY __stdcall
#endif
#ifndef APIENTRYP
#define APIENTRYP APIENTRY *
#endif
typedef void (APIENTRYP PFNGLGENBUFFERSPROC)(GLsizei, GLuint*);
typedef void (APIENTRYP PFNGLBINDBUFFERPROC)(GLenum, GLuint);
typedef void (APIENTRYP PFNGLBUFFERDATAPROC)(GLenum, ptrdiff_t, const void*, GLenum);
typedef void (APIENTRYP PFNGLDELETEBUFFERSPROC)(GLsizei, const GLuint*);
static PFNGLGENBUFFERSPROC p_glGenBuffers = 0;
static PFNGLBINDBUFFERPROC p_glBindBuffer = 0;
static PFNGLBUFFERDATAPROC p_glBufferData = 0;
static PFNGLDELETEBUFFERSPROC p_glDeleteBuffers = 0;
static void* gl_get_proc(const char* name)
{
void* p = (void*)wglGetProcAddress(name);
if(!p) {
HMODULE ogl = GetModuleHandleA("opengl32.dll");
if(ogl) p = (void*)GetProcAddress(ogl, name);
}
return p;
}
static int gl_load_ext()
{
p_glGenBuffers = (PFNGLGENBUFFERSPROC)gl_get_proc("glGenBuffers");
p_glBindBuffer = (PFNGLBINDBUFFERPROC)gl_get_proc("glBindBuffer");
p_glBufferData = (PFNGLBUFFERDATAPROC)gl_get_proc("glBufferData");
p_glDeleteBuffers = (PFNGLDELETEBUFFERSPROC)gl_get_proc("glDeleteBuffers");
if(!p_glGenBuffers || !p_glBindBuffer || !p_glBufferData || !p_glDeleteBuffers)
return 0;
return 1;
}
// ===========================================================
// OpenGL objects
// ===========================================================
static GLuint gPBO = 0;
static GLuint gTex = 0;
static void gl_release_all()
{
if(gTex) {
glDeleteTextures(1, &gTex);
gTex = 0;
}
if(gPBO) {
if(p_glDeleteBuffers) p_glDeleteBuffers(1, &gPBO);
gPBO = 0;
}
if(gHgl) { wglMakeCurrent(NULL, NULL); wglDeleteContext(gHgl); gHgl = 0; }
if(gHdc && gHwnd) { ReleaseDC(gHwnd, gHdc); gHdc = 0; }
}
static int gl_init_wgl(HWND hwnd)
{
gHwnd = hwnd;
gHdc = GetDC(hwnd);
if(!gHdc) return 0;
PIXELFORMATDESCRIPTOR pfd;
ZeroMemory(&pfd, sizeof(pfd));
pfd.nSize = sizeof(pfd);
pfd.nVersion = 1;
pfd.dwFlags = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
pfd.iPixelType = PFD_TYPE_RGBA;
pfd.cColorBits = 32;
pfd.cDepthBits = 16;
pfd.iLayerType = PFD_MAIN_PLANE;
int pf = ChoosePixelFormat(gHdc, &pfd);
if(pf == 0) return 0;
if(!SetPixelFormat(gHdc, pf, &pfd)) return 0;
gHgl = wglCreateContext(gHdc);
if(!gHgl) return 0;
if(!wglMakeCurrent(gHdc, gHgl)) return 0;
if(!gl_load_ext()) {
printf("\nOpenGL buffer functions not available (need VBO/PBO support).");
return 0;
}
glDisable(GL_DEPTH_TEST);
glViewport(0, 0, gW, gH);
// Create PBO for RGBA pixels
p_glGenBuffers(1, &gPBO);
p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
p_glBufferData(GL_PIXEL_UNPACK_BUFFER, (ptrdiff_t)(gW * gH * 4), 0, GL_DYNAMIC_DRAW);
p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
// Create texture
glGenTextures(1, &gTex);
glBindTexture(GL_TEXTURE_2D, gTex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, gW, gH, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindTexture(GL_TEXTURE_2D, 0);
return 1;
}
// ===========================================================
// Tiny NN (LibTorch -> weights)
// ===========================================================
#define NN_IN 2
#define NN_H 16
#define NN_OUT 3
#define NN_PARAM_COUNT (NN_H*NN_IN + NN_H + NN_OUT*NN_H + NN_OUT)
static float gHost_W1[NN_H*NN_IN];
static float gHost_b1[NN_H];
static float gHost_W2[NN_OUT*NN_H];
static float gHost_b2[NN_OUT];
struct TinyMLPImpl : torch::nn::Module {
torch::nn::Linear fc1{nullptr}, fc2{nullptr};
TinyMLPImpl() {
fc1 = register_module("fc1", torch::nn::Linear(NN_IN, NN_H));
fc2 = register_module("fc2", torch::nn::Linear(NN_H, NN_OUT));
}
torch::Tensor forward(torch::Tensor x) {
x = torch::tanh(fc1->forward(x));
x = torch::tanh(fc2->forward(x));
return x;
}
};
TORCH_MODULE(TinyMLP);
static int build_weights_from_libtorch(float* W1, float* b1, float* W2, float* b2)
{
if(!W1 || !b1 || !W2 || !b2) return 0;
try {
torch::NoGradGuard ng;
torch::manual_seed((uint64_t)time(NULL) ^ (uint64_t)GetTickCount64());
TinyMLP m;
m->eval();
auto w1 = m->fc1->weight.detach().contiguous().to(torch::kCPU);
auto bb1 = m->fc1->bias.detach().contiguous().to(torch::kCPU);
auto w2 = m->fc2->weight.detach().contiguous().to(torch::kCPU);
auto bb2 = m->fc2->bias.detach().contiguous().to(torch::kCPU);
memcpy(W1, w1.data_ptr<float>(), sizeof(float)*NN_H*NN_IN);
memcpy(b1, bb1.data_ptr<float>(), sizeof(float)*NN_H);
memcpy(W2, w2.data_ptr<float>(), sizeof(float)*NN_OUT*NN_H);
memcpy(b2, bb2.data_ptr<float>(), sizeof(float)*NN_OUT);
return 1;
}
catch(const c10::Error& e) {
printf("\n[LibTorch] Error: %s", e.what());
return 0;
}
catch(...) {
printf("\n[LibTorch] Unknown error.");
return 0;
}
}
// ===========================================================
// OpenCL (GL sharing)
// ===========================================================
static int gCL_Ready = 0;
static cl_platform_id gCL_Platform = 0;
static cl_device_id gCL_Device = 0;
static cl_context gCL_Context = 0;
static cl_command_queue gCL_Queue = 0;
static cl_program gCL_Program = 0;
static cl_kernel gCL_K_NN = 0;
static cl_mem gCL_PBO = 0; // CL view of GL PBO
static cl_mem gCL_W1 = 0;
static cl_mem gCL_b1 = 0;
static cl_mem gCL_W2 = 0;
static cl_mem gCL_b2 = 0;
static void pack_params(float* theta);
static void unpack_params(const float* theta);
static void evolve_params_accumulated(float phase, unsigned int seed);
#define STR2(x) #x
#define XSTR(x) STR2(x)
static const char* gCL_Source =
"__kernel void nn_render(__global uchar4* out, int width, int height, \n"
" __global const float* W1, __global const float* b1, \n"
" __global const float* W2, __global const float* b2, float phase, uint seed) \n"
"{ \n"
" int xpix = (int)get_global_id(0); \n"
" int ypix = (int)get_global_id(1); \n"
" if(xpix >= width || ypix >= height) return; \n"
" \n"
" float x = ((float)xpix / (float)(width - 1)) * 2.0f - 1.0f; \n"
" float y = ((float)ypix / (float)(height - 1)) * 2.0f - 1.0f; \n"
" uint n = (uint)(xpix*1973u) ^ (uint)(ypix*9277u) ^ (seed*26699u + 911u); \n"
" n = (n << 13) ^ n; \n"
" uint m = (n * (n*n*15731u + 789221u) + 1376312589u); \n"
" float jitter = ((float)(m & 0x00ffffffu) / 16777215.0f) * 2.0f - 1.0f; \n"
" float in0 = 2.8f*x + 0.7f*sin(3.0f*y + phase) + 0.35f*jitter; \n"
" float in1 = -2.8f*y + 0.7f*cos(3.0f*x - 1.3f*phase) - 0.35f*jitter; \n"
" \n"
" float h[" XSTR(NN_H) "]; \n"
" for(int j=0;j<" XSTR(NN_H) ";j++){ \n"
" float acc = b1[j]; \n"
" acc += in0 * W1[j*" XSTR(NN_IN) " + 0]; \n"
" acc += in1 * W1[j*" XSTR(NN_IN) " + 1]; \n"
" h[j] = tanh(acc); \n"
" } \n"
" \n"
" float o[" XSTR(NN_OUT) "]; \n"
" for(int k=0;k<" XSTR(NN_OUT) ";k++){ \n"
" float acc = b2[k]; \n"
" for(int j=0;j<" XSTR(NN_H) ";j++){ \n"
" acc += h[j] * W2[k*" XSTR(NN_H) " + j]; \n"
" } \n"
" float s = 0.5f + 0.5f*tanh(acc); \n"
" if(s<0) s=0; if(s>1) s=1; \n"
" o[k] = s; \n"
" } \n"
" \n"
" float radial = sqrt(x*x + y*y); \n"
" float vignette = clamp(1.15f - radial, 0.0f, 1.0f); \n"
" float stripe = 0.5f + 0.5f*sin(10.0f*(x + y) + phase + 2.0f*jitter); \n"
" float rcol = clamp(0.70f*o[0] + 0.30f*stripe, 0.0f, 1.0f) * vignette; \n"
" float gcol = clamp(0.85f*o[1] + 0.15f*(1.0f - stripe), 0.0f, 1.0f) * vignette; \n"
" float bcol = clamp(0.75f*o[2] + 0.25f*(0.5f + 0.5f*cos(8.0f*x - phase)),0.0f,1.0f);\n"
" uchar r = (uchar)(255.0f*rcol); \n"
" uchar g = (uchar)(255.0f*gcol); \n"
" uchar b = (uchar)(255.0f*bcol); \n"
" out[ypix*width + xpix] = (uchar4)(r,g,b,255); \n"
"} \n";
static void cl_release_all()
{
if(gCL_b2) { clReleaseMemObject(gCL_b2); gCL_b2 = 0; }
if(gCL_W2) { clReleaseMemObject(gCL_W2); gCL_W2 = 0; }
if(gCL_b1) { clReleaseMemObject(gCL_b1); gCL_b1 = 0; }
if(gCL_W1) { clReleaseMemObject(gCL_W1); gCL_W1 = 0; }
if(gCL_PBO) { clReleaseMemObject(gCL_PBO); gCL_PBO = 0; }
if(gCL_K_NN) { clReleaseKernel(gCL_K_NN); gCL_K_NN = 0; }
if(gCL_Program){ clReleaseProgram(gCL_Program); gCL_Program = 0; }
if(gCL_Queue) { clReleaseCommandQueue(gCL_Queue); gCL_Queue = 0; }
if(gCL_Context){ clReleaseContext(gCL_Context); gCL_Context = 0; }
gCL_Device = 0;
gCL_Platform = 0;
gCL_Ready = 0;
}
static int cl_pick_device_with_glshare(cl_platform_id* outP, cl_device_id* outD)
{
cl_uint nPlatforms = 0;
if(clGetPlatformIDs(0, 0, &nPlatforms) != CL_SUCCESS || nPlatforms == 0)
return 0;
cl_platform_id platforms[8];
if(nPlatforms > 8) nPlatforms = 8;
if(clGetPlatformIDs(nPlatforms, platforms, &nPlatforms) != CL_SUCCESS)
return 0;
for(cl_uint p=0; p<nPlatforms; p++)
{
cl_uint nDev = 0;
if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 0, 0, &nDev) != CL_SUCCESS || nDev == 0)
continue;
cl_device_id devs[8];
if(nDev > 8) nDev = 8;
if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, nDev, devs, &nDev) != CL_SUCCESS)
continue;
for(cl_uint d=0; d<nDev; d++)
{
char ext[8192];
size_t sz = 0;
if(clGetDeviceInfo(devs[d], CL_DEVICE_EXTENSIONS, sizeof(ext), ext, &sz) != CL_SUCCESS)
continue;
if(strstr(ext, "cl_khr_gl_sharing"))
{
*outP = platforms[p];
*outD = devs[d];
return 1;
}
}
}
return 0;
}
static int cl_init_glshare()
{
cl_int err = CL_SUCCESS;
cl_platform_id P = 0;
cl_device_id D = 0;
if(!cl_pick_device_with_glshare(&P, &D)) {
printf("\nOpenCL: no GPU device with cl_khr_gl_sharing found.");
return 0;
}
gCL_Platform = P;
gCL_Device = D;
cl_context_properties props[] = {
CL_GL_CONTEXT_KHR, (cl_context_properties)wglGetCurrentContext(),
CL_WGL_HDC_KHR, (cl_context_properties)wglGetCurrentDC(),
CL_CONTEXT_PLATFORM, (cl_context_properties)gCL_Platform,
0
};
gCL_Context = clCreateContext(props, 1, &gCL_Device, 0, 0, &err);
if(err != CL_SUCCESS || !gCL_Context) { cl_release_all(); return 0; }
gCL_Queue = clCreateCommandQueue(gCL_Context, gCL_Device, 0, &err);
if(err != CL_SUCCESS || !gCL_Queue) { cl_release_all(); return 0; }
gCL_Program = clCreateProgramWithSource(gCL_Context, 1, &gCL_Source, 0, &err);
if(err != CL_SUCCESS || !gCL_Program) { cl_release_all(); return 0; }
err = clBuildProgram(gCL_Program, 1, &gCL_Device, 0, 0, 0);
if(err != CL_SUCCESS)
{
char logbuf[8192];
size_t logsz = 0;
clGetProgramBuildInfo(gCL_Program, gCL_Device, CL_PROGRAM_BUILD_LOG, sizeof(logbuf), logbuf, &logsz);
printf("\nOpenCL build failed:\n%s", logbuf);
cl_release_all();
return 0;
}
gCL_K_NN = clCreateKernel(gCL_Program, "nn_render", &err);
if(err != CL_SUCCESS || !gCL_K_NN) { cl_release_all(); return 0; }
gCL_PBO = clCreateFromGLBuffer(gCL_Context, CL_MEM_WRITE_ONLY, gPBO, &err);
if(err != CL_SUCCESS || !gCL_PBO) { cl_release_all(); return 0; }
size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
size_t bytesb1 = sizeof(float)*(size_t)NN_H;
size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;
gCL_W1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW1, 0, &err);
gCL_b1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb1, 0, &err);
gCL_W2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW2, 0, &err);
gCL_b2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb2, 0, &err);
if(err != CL_SUCCESS || !gCL_W1 || !gCL_b1 || !gCL_W2 || !gCL_b2) { cl_release_all(); return 0; }
if(!build_weights_from_libtorch(gHost_W1, gHost_b1, gHost_W2, gHost_b2)) {
printf("\n[LibTorch] Failed to build weights.");
cl_release_all();
return 0;
}
err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_TRUE, 0, bytesW1, gHost_W1, 0, 0, 0);
if(err != CL_SUCCESS) { cl_release_all(); return 0; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_TRUE, 0, bytesb1, gHost_b1, 0, 0, 0);
if(err != CL_SUCCESS) { cl_release_all(); return 0; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_TRUE, 0, bytesW2, gHost_W2, 0, 0, 0);
if(err != CL_SUCCESS) { cl_release_all(); return 0; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_TRUE, 0, bytesb2, gHost_b2, 0, 0, 0);
if(err != CL_SUCCESS) { cl_release_all(); return 0; }
gCL_Ready = 1;
printf("\nOpenCL: GL-sharing enabled. NN kernel ready.");
return 1;
}
// ===========================================================
// Render (CL -> GL)
// ===========================================================
static void RenderFrame()
{
if(!gCL_Ready) return;
size_t global[2] = { (size_t)gW, (size_t)gH };
size_t local[2] = { 16, 16 };
cl_int err = CL_SUCCESS;
err = clEnqueueAcquireGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
if(err != CL_SUCCESS) return;
LARGE_INTEGER qpc;
QueryPerformanceCounter(&qpc);
gNoiseSeed = (unsigned int)(qpc.QuadPart ^ (qpc.QuadPart >> 32) ^ (LONGLONG)GetTickCount64());
evolve_params_accumulated(gPhase, gNoiseSeed);
size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
size_t bytesb1 = sizeof(float)*(size_t)NN_H;
size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;
err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_FALSE, 0, bytesW1, gHost_W1, 0, 0, 0);
if(err != CL_SUCCESS) { clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0); clFinish(gCL_Queue); return; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_FALSE, 0, bytesb1, gHost_b1, 0, 0, 0);
if(err != CL_SUCCESS) { clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0); clFinish(gCL_Queue); return; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_FALSE, 0, bytesW2, gHost_W2, 0, 0, 0);
if(err != CL_SUCCESS) { clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0); clFinish(gCL_Queue); return; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_FALSE, 0, bytesb2, gHost_b2, 0, 0, 0);
if(err != CL_SUCCESS) { clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0); clFinish(gCL_Queue); return; }
int arg = 0;
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_PBO);
clSetKernelArg(gCL_K_NN, arg++, sizeof(int), &gW);
clSetKernelArg(gCL_K_NN, arg++, sizeof(int), &gH);
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W1);
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b1);
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W2);
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b2);
clSetKernelArg(gCL_K_NN, arg++, sizeof(float), &gPhase);
clSetKernelArg(gCL_K_NN, arg++, sizeof(unsigned int), &gNoiseSeed);
err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, local, 0, 0, 0);
if(err != CL_SUCCESS) {
err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, 0, 0, 0, 0);
}
clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
clFinish(gCL_Queue);
p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
glBindTexture(GL_TEXTURE_2D, gTex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, gW, gH, GL_RGBA, GL_UNSIGNED_BYTE, 0);
p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
glClear(GL_COLOR_BUFFER_BIT);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, gTex);
glBegin(GL_QUADS);
glTexCoord2f(0,0); glVertex2f(-1,-1);
glTexCoord2f(1,0); glVertex2f( 1,-1);
glTexCoord2f(1,1); glVertex2f( 1, 1);
glTexCoord2f(0,1); glVertex2f(-1, 1);
glEnd();
glBindTexture(GL_TEXTURE_2D, 0);
SwapBuffers(gHdc);
gPhase += 0.03f;
}
static void pack_params(float* theta)
{
int p = 0;
for(int i=0;i<NN_H*NN_IN;i++) theta[p++] = gHost_W1[i];
for(int i=0;i<NN_H;i++) theta[p++] = gHost_b1[i];
for(int i=0;i<NN_OUT*NN_H;i++)theta[p++] = gHost_W2[i];
for(int i=0;i<NN_OUT;i++) theta[p++] = gHost_b2[i];
}
static void unpack_params(const float* theta)
{
int p = 0;
for(int i=0;i<NN_H*NN_IN;i++) gHost_W1[i] = theta[p++];
for(int i=0;i<NN_H;i++) gHost_b1[i] = theta[p++];
for(int i=0;i<NN_OUT*NN_H;i++)gHost_W2[i] = theta[p++];
for(int i=0;i<NN_OUT;i++) gHost_b2[i] = theta[p++];
}
static unsigned int mix_u32(unsigned int x)
{
x ^= x >> 16;
x *= 2246822519u;
x ^= x >> 13;
x *= 3266489917u;
x ^= x >> 16;
return x;
}
static void evolve_params_accumulated(float phase, unsigned int seed)
{
float theta[NN_PARAM_COUNT];
float nextv[NN_PARAM_COUNT];
pack_params(theta);
for(int i=0;i<NN_PARAM_COUNT;i++) {
int l = (i == 0) ? (NN_PARAM_COUNT - 1) : (i - 1);
int r = (i + 1) % NN_PARAM_COUNT;
float coupled = 0.55f*theta[l] + 0.45f*theta[r];
float drift = 0.015f*sinf(0.8f*phase + 0.17f*(float)i);
unsigned int h = mix_u32(seed ^ (unsigned int)(i*747796405u + 2891336453u));
float noise = (((float)(h & 0xFFFFu) / 65535.0f) * 2.0f - 1.0f) * 0.010f;
float v = 0.982f*theta[i] + 0.022f*coupled + drift + noise;
if(v > 3.0f) v = 3.0f;
if(v < -3.0f) v = -3.0f;
nextv[i] = v;
}
unpack_params(nextv);
float mW1 = 0.0f, mb1 = 0.0f, mW2 = 0.0f, mb2 = 0.0f;
for(int i=0;i<NN_H*NN_IN;i++) mW1 += gHost_W1[i];
for(int i=0;i<NN_H;i++) mb1 += gHost_b1[i];
for(int i=0;i<NN_OUT*NN_H;i++) mW2 += gHost_W2[i];
for(int i=0;i<NN_OUT;i++) mb2 += gHost_b2[i];
mW1 /= (float)(NN_H*NN_IN);
mb1 /= (float)NN_H;
mW2 /= (float)(NN_OUT*NN_H);
mb2 /= (float)NN_OUT;
for(int i=0;i<NN_H*NN_IN;i++) gHost_W1[i] += 0.003f*(mb1 - mW1);
for(int i=0;i<NN_H;i++) gHost_b1[i] += 0.004f*(mW2 - mb1);
for(int i=0;i<NN_OUT*NN_H;i++) gHost_W2[i] += 0.003f*(mb2 - mW2);
for(int i=0;i<NN_OUT;i++) gHost_b2[i] += 0.004f*(mW1 - mb2);
}
// ===========================================================
// WinMain
// ===========================================================
int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
// 0 means no auto-close; window stays until user closes it.
const int maxSeconds = read_env_int("MENDB02_MAX_SECONDS", 0);
ULONGLONG startTick = GetTickCount64();
const char* szClass = "Mendb02NNCLGLClass";
UnregisterClassA(szClass, hInst);
WNDCLASSEXA wc;
ZeroMemory(&wc, sizeof(wc));
wc.cbSize = sizeof(wc);
wc.style = CS_HREDRAW | CS_VREDRAW;
wc.lpfnWndProc = WndProc;
wc.hInstance = hInst;
wc.hCursor = LoadCursor(NULL, IDC_ARROW);
wc.lpszClassName = szClass;
RegisterClassExA(&wc);
RECT r;
r.left=0; r.top=0; r.right=gW; r.bottom=gH;
AdjustWindowRect(&r, WS_OVERLAPPEDWINDOW, FALSE);
HWND hwnd = CreateWindowExA(
0, szClass, "NN Render (LibTorch weights + OpenCL + OpenGL)",
WS_OVERLAPPEDWINDOW,
100, 100, (r.right-r.left), (r.bottom-r.top),
0, 0, hInst, 0);
if(!hwnd) return 0;
ShowWindow(hwnd, SW_SHOW);
UpdateWindow(hwnd);
if(!gl_init_wgl(hwnd))
{
MessageBoxA(hwnd, "OpenGL init failed", "Error", MB_OK);
gl_release_all();
return 0;
}
if(!cl_init_glshare())
{
MessageBoxA(hwnd, "OpenCL GL-sharing init failed", "Error", MB_OK);
cl_release_all();
gl_release_all();
return 0;
}
MSG msg;
ZeroMemory(&msg, sizeof(msg));
while(msg.message != WM_QUIT)
{
while(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
{
TranslateMessage(&msg);
DispatchMessage(&msg);
}
// Allow Zorro STOP to close this Win32 loop cleanly, but ignore
// the sticky FIRSTINITRUN+EXITRUN combo seen at startup.
if(is(EXITRUN) && !is(FIRSTINITRUN)) {
PostMessage(hwnd, WM_CLOSE, 0, 0);
}
if(!IsWindow(hwnd))
break;
if(maxSeconds > 0 && (GetTickCount64() - startTick) >= (ULONGLONG)maxSeconds * 1000ULL) {
PostMessage(hwnd, WM_CLOSE, 0, 0);
}
RenderFrame();
}
cl_release_all();
gl_release_all();
gHwnd = 0;
return 0;
}
// ===========================================================
// Input
// ===========================================================
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
switch(msg)
{
case WM_CLOSE:
DestroyWindow(hWnd);
return 0;
case WM_KEYDOWN:
if(wParam == VK_ESCAPE || wParam == VK_F12) {
PostMessage(hWnd, WM_CLOSE, 0, 0);
return 0;
}
return 0;
case WM_DESTROY:
PostQuitMessage(0);
return 0;
}
return DefWindowProc(hWnd, msg, wParam, lParam);
}
// ===========================================================
// Zorro DLL entry
// ===========================================================
DLLFUNC int main()
{
// Force single-cycle execution in Zorro to avoid automatic relaunches.
NumTotalCycles = 1;
NumWFOCycles = 1;
NumSampleCycles = 1;
set(TESTNOW|OFF,ALLCYCLES|OFF,PARAMETERS|OFF,FACTORS|OFF,RULES|OFF);
static int done = 0;
if(is(FIRSTINITRUN))
done = 0;
if(done)
return 0;
(void)WinMain(GetModuleHandleA(NULL), NULL, GetCommandLineA(), SW_SHOWDEFAULT);
done = 1;
return quit("!Mendb02 finished");
}
Last edited by TipmyPip; Yesterday at 18:13.
|