OP
Member
Joined: Sep 2017
Posts: 276
|
Stochastic TorchCanvas Bridge is a hybrid rendering and compute demonstration that connects three different ecosystems into one coherent loop: a learning library that produces neural parameters, a parallel compute engine that applies those parameters across a two dimensional field, and a graphics engine that displays the computed field in real time. The code is structured as a practical integration template rather than a pure algorithmic showcase. Its central purpose is to prove that LibTorch, OpenCL, and OpenGL can share a consistent numerical story while operating under different rules for memory ownership, device access, and synchronization. The program creates a window on Windows using a classic Win32 message loop, builds an OpenGL context for rendering, builds an OpenCL context that is explicitly linked to that OpenGL context through resource sharing, and then runs a tight frame loop where OpenCL computes pixel values directly into a GPU buffer that OpenGL consumes without a CPU copy. The result is a continuously animated image whose visual structure comes from a tiny neural network forward pass combined with procedural shaping and noise. The file begins with defensive engineering. It defines Windows and runtime macros to reduce header bloat and avoid common collisions. It then includes LibTorch before anything else. This is deliberate because LibTorch headers bring large template machinery and their own symbol expectations, and the project also includes Zorro headers which define short identifiers and macros that are known to conflict with the C plus plus tensor ecosystem. The strategy avoids those collisions by including LibTorch first, then including Zorro only after renaming an especially problematic identifier and clearing macro hazards. After Zorro is included, the code performs a cleanup pass that undefines several short macros such as min and max and abs that could otherwise rewrite expressions silently. This stage is less about performance and more about preserving semantic correctness. If the compiler sees the wrong macro expansions, numeric functions can behave differently, and in a system that mixes multiple libraries that all use generic names, correctness depends on controlling the preprocessor environment. After the integration hygiene, the program sets up OpenGL. A window handle and device context are obtained from the Win32 system, a pixel format is selected that supports double buffering and RGBA output, and an OpenGL rendering context is created with WGL calls. Once the context is active, the program loads a small set of OpenGL buffer functions using wglGetProcAddress and a fallback to opengl32 exports. This is done to support a pixel buffer object and its data upload pathway. The OpenGL side creates a pixel buffer object sized for an image with four channels per pixel, and then creates a texture with matching dimensions. The pixel buffer object acts as a GPU resident staging area for pixels, while the texture is the actual object used for drawing. The draw step is intentionally simple. Each frame, the program updates the texture from the pixel buffer object and draws a single textured quad that covers the whole viewport. This keeps the rendering pipeline predictable and ensures that any complexity observed in the output is coming from compute rather than from rendering tricks. The compute layer is OpenCL, and the code’s key technical move is enabling OpenCL and OpenGL resource sharing. The program scans available OpenCL platforms and looks for a GPU device that advertises the extension that allows sharing objects with OpenGL. Once a suitable device is found, the OpenCL context is created with context properties that reference the current OpenGL context and the current Windows device context. This binds OpenCL to the same GPU context that OpenGL is using. In practical terms, it allows OpenCL to treat the OpenGL pixel buffer object as an OpenCL memory object. This avoids copying pixels through host memory. It also introduces a synchronization contract: OpenCL must formally acquire the shared object before writing, and must release it after writing so OpenGL can read. That contract is enforced by explicit acquire and release calls on the command queue and completed with a finish call to guarantee that all compute work is done before OpenGL uploads and draws. The neural component begins with a tiny multilayer perceptron defined using LibTorch modules. It has a small input dimension, a modest hidden layer, and three outputs. The network uses a smooth activation at each layer so that its output changes gradually rather than snapping. The network is not trained here. Instead it is initialized using LibTorch’s default parameter initialization routines. The program then extracts the layer weight tensors and bias tensors, detaches them from gradient tracking, ensures they are contiguous, ensures they reside on the CPU, and copies them into plain floating point arrays. This is the key representation conversion between LibTorch and OpenCL. LibTorch stores parameters as tensors with metadata and potential device placement. OpenCL kernels expect flat buffers. The code translates from the tensor representation into raw arrays that are then uploaded into OpenCL device buffers. Those buffers are created once during OpenCL initialization and are marked read only because the kernel will only read them. The program’s non deterministic behavior is deliberate. The LibTorch seed is set using a combination of wall clock time and tick count, which causes the initialized network weights to differ between runs. This means the overall mapping from input coordinates to output colors changes when the program is restarted. In addition, the OpenCL kernel receives a seed value every frame that is derived from a high resolution performance counter and the system tick count. The kernel uses this seed along with pixel coordinates to generate a deterministic per pixel jitter value for that frame, but the seed changes across frames, so the jitter pattern evolves continuously. The combination of a time varying noise seed and a moving phase parameter produces animated textures that feel alive and slightly chaotic, even if the rest of the pipeline is stable. The neural network is therefore not acting as a static function; it is a parameterized transformer whose effective inputs are modulated by phase and by jitter. The OpenCL kernel is written as a per pixel renderer. Each work item corresponds to a single pixel coordinate in a two dimensional grid. The kernel maps pixel coordinates into a normalized coordinate space and constructs two input values from those coordinates using trigonometric modulation and the phase parameter. It then injects jitter based on the per frame seed, producing a small randomized offset that perturbs the inputs. The kernel runs the neural network forward pass in plain OpenCL code. It computes the hidden activations by multiplying the inputs with the first layer weights, adding biases, and applying a tanh activation. It then computes the output activations by combining the hidden values with the second layer weights and biases and applying tanh again. The three output values are then post processed into color channels by blending with procedural patterns like stripes and a vignette effect. The final color is clamped to a valid range and written as an RGBA byte quadruple into the output buffer, which is the shared OpenGL pixel buffer object. After the kernel finishes, the program releases the shared object and completes the queue. Then OpenGL binds the pixel buffer object as a pixel unpack buffer, binds the texture, and updates the texture content from the buffer. Because the buffer is GPU resident, the update does not require a host side pixel array. The program then clears the screen and draws the textured quad. The swap buffers call presents the image. The phase variable is advanced slightly each frame, which provides a smooth time axis that makes the animation continuous. The message loop continues until the user closes the window, presses an exit key, the environment timer triggers an auto close, or the host environment signals an exit. The program also includes a Zorro oriented DLL entry point. That entry is used mainly to control lifecycle and to prevent automatic relaunch behavior typical in iterative backtest environments. It forces a single cycle configuration and disables standard test features so that the Win32 loop is not restarted by the host. A simple done flag prevents repeated execution in the same host session. The entry then calls WinMain to run the window loop. This makes the project usable as a visual diagnostic or demonstration component in a larger system that already uses Zorro as a host, but the rendering and compute logic is independent of trading logic. In summary, Stochastic TorchCanvas Bridge is a three layer pipeline where LibTorch supplies neural parameters, OpenCL evaluates a neural plus procedural function per pixel in parallel, and OpenGL displays the computed result with shared GPU memory and explicit synchronization. The most important mathematical relationship is not a single formula but a consistent mapping of numeric representations across subsystems: neural layer parameters become device buffers, device buffers become kernel inputs, kernel outputs become pixels, and pixels become a displayed texture without leaving GPU memory. The stability of this mapping depends on careful include ordering, macro hygiene, device selection based on sharing support, and strict acquire and release synchronization to preserve correctness between compute and graphics. // Mendb02.cpp
// Win32 + WGL(OpenGL) display + OpenCL compute (CL/GL sharing)
// + Tiny Neural Net inference per pixel (OpenCL kernel) using weights from LibTorch.
#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif
#define _CRT_SECURE_NO_WARNINGS
// ============================================================
// 1) Include LibTorch FIRST (like your working file)
// Public/shareable variant: no machine-specific include paths.
// ============================================================
#if defined(__has_include)
#if __has_include(<torch/torch.h>) && __has_include(<torch/script.h>)
#include <torch/torch.h>
#include <torch/script.h>
#else
#error "LibTorch headers not found. Add LibTorch include paths to your build configuration."
#endif
#else
#include <torch/torch.h>
#include <torch/script.h>
#endif
// (Optional) CUDA headers (safe pattern used by your working file)
// Keep them conditional so CPU-only LibTorch setups still compile.
#if defined(__has_include)
#if __has_include(<torch/cuda.h>)
#include <torch/cuda.h>
#define HAVE_TORCH_CUDA_HEADER 1
#else
#define HAVE_TORCH_CUDA_HEADER 0
#endif
#if __has_include(<cuda_runtime_api.h>)
#include <cuda_runtime_api.h>
#define HAVE_CUDA_RUNTIME_API_HEADER 1
#else
#define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif
#else
#define HAVE_TORCH_CUDA_HEADER 0
#define HAVE_CUDA_RUNTIME_API_HEADER 0
#endif
#if defined(__has_include)
#if __has_include(<c10/cuda/CUDAGuard.h>) && __has_include(<c10/cuda/impl/cuda_cmake_macros.h>)
#include <c10/cuda/CUDAGuard.h>
#define HAVE_C10_CUDAGUARD 1
#else
#define HAVE_C10_CUDAGUARD 0
#endif
#else
#define HAVE_C10_CUDAGUARD 0
#endif
// ============================================================
// 2) Standard headers
// ============================================================
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
// ============================================================
// 3) Include Zorro AFTER torch, rename Zorro's 'at' to avoid conflict
// (exact pattern from your working file)
// ============================================================
#define at zorro_at
#ifdef LOG
#undef LOG
#endif
#include <zorro.h>
#undef at
// ============================================================
// 4) Cleanup macro landmines (exact style from your working file)
// ============================================================
#ifdef min
#undef min
#endif
#ifdef max
#undef max
#endif
#ifdef ref
#undef ref
#endif
#ifdef swap
#undef swap
#endif
#ifdef abs
#undef abs
#endif
#ifdef NTF
#undef NTF
#endif
#ifdef LOOKBACK
#undef LOOKBACK
#endif
#ifdef BINS
#undef BINS
#endif
// ============================================================
// OpenCL + OpenGL includes (after the macro cleanup is safest)
// ============================================================
#include <CL/cl.h>
#include <CL/cl_gl.h> // cl_khr_gl_sharing
#include <CL/cl_gl_ext.h> // CL_GL_CONTEXT_KHR / CL_WGL_HDC_KHR
#include <GL/gl.h>
#ifndef GL_RGBA8
#define GL_RGBA8 0x8058
#endif
// ------------------------- Globals -------------------------
static HWND gHwnd = 0;
static HDC gHdc = 0;
static HGLRC gHgl = 0;
static int gW = 640;
static int gH = 480;
static float gPhase = 0.0f;
static unsigned int gNoiseSeed = 1u;
static int read_env_int(const char* key, int fallback)
{
const char* s = getenv(key);
if(!s || !*s) return fallback;
int v = atoi(s);
return (v > 0) ? v : fallback;
}
// ------------------------- WinProc forward -------------------------
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);
// ===========================================================
// Minimal OpenGL function loading
// ===========================================================
#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#endif
#ifndef GL_PIXEL_UNPACK_BUFFER
#define GL_PIXEL_UNPACK_BUFFER 0x88EC
#endif
#ifndef GL_DYNAMIC_DRAW
#define GL_DYNAMIC_DRAW 0x88E8
#endif
#ifndef APIENTRY
#define APIENTRY __stdcall
#endif
#ifndef APIENTRYP
#define APIENTRYP APIENTRY *
#endif
typedef void (APIENTRYP PFNGLGENBUFFERSPROC)(GLsizei, GLuint*);
typedef void (APIENTRYP PFNGLBINDBUFFERPROC)(GLenum, GLuint);
typedef void (APIENTRYP PFNGLBUFFERDATAPROC)(GLenum, ptrdiff_t, const void*, GLenum);
typedef void (APIENTRYP PFNGLDELETEBUFFERSPROC)(GLsizei, const GLuint*);
static PFNGLGENBUFFERSPROC p_glGenBuffers = 0;
static PFNGLBINDBUFFERPROC p_glBindBuffer = 0;
static PFNGLBUFFERDATAPROC p_glBufferData = 0;
static PFNGLDELETEBUFFERSPROC p_glDeleteBuffers = 0;
static void* gl_get_proc(const char* name)
{
void* p = (void*)wglGetProcAddress(name);
if(!p) {
HMODULE ogl = GetModuleHandleA("opengl32.dll");
if(ogl) p = (void*)GetProcAddress(ogl, name);
}
return p;
}
static int gl_load_ext()
{
p_glGenBuffers = (PFNGLGENBUFFERSPROC)gl_get_proc("glGenBuffers");
p_glBindBuffer = (PFNGLBINDBUFFERPROC)gl_get_proc("glBindBuffer");
p_glBufferData = (PFNGLBUFFERDATAPROC)gl_get_proc("glBufferData");
p_glDeleteBuffers = (PFNGLDELETEBUFFERSPROC)gl_get_proc("glDeleteBuffers");
if(!p_glGenBuffers || !p_glBindBuffer || !p_glBufferData || !p_glDeleteBuffers)
return 0;
return 1;
}
// ===========================================================
// OpenGL objects
// ===========================================================
static GLuint gPBO = 0;
static GLuint gTex = 0;
static void gl_release_all()
{
if(gTex) {
glDeleteTextures(1, &gTex);
gTex = 0;
}
if(gPBO) {
if(p_glDeleteBuffers) p_glDeleteBuffers(1, &gPBO);
gPBO = 0;
}
if(gHgl) { wglMakeCurrent(NULL, NULL); wglDeleteContext(gHgl); gHgl = 0; }
if(gHdc && gHwnd) { ReleaseDC(gHwnd, gHdc); gHdc = 0; }
}
static int gl_init_wgl(HWND hwnd)
{
gHwnd = hwnd;
gHdc = GetDC(hwnd);
if(!gHdc) return 0;
PIXELFORMATDESCRIPTOR pfd;
ZeroMemory(&pfd, sizeof(pfd));
pfd.nSize = sizeof(pfd);
pfd.nVersion = 1;
pfd.dwFlags = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
pfd.iPixelType = PFD_TYPE_RGBA;
pfd.cColorBits = 32;
pfd.cDepthBits = 16;
pfd.iLayerType = PFD_MAIN_PLANE;
int pf = ChoosePixelFormat(gHdc, &pfd);
if(pf == 0) return 0;
if(!SetPixelFormat(gHdc, pf, &pfd)) return 0;
gHgl = wglCreateContext(gHdc);
if(!gHgl) return 0;
if(!wglMakeCurrent(gHdc, gHgl)) return 0;
if(!gl_load_ext()) {
printf("\nOpenGL buffer functions not available (need VBO/PBO support).");
return 0;
}
glDisable(GL_DEPTH_TEST);
glViewport(0, 0, gW, gH);
// Create PBO for RGBA pixels
p_glGenBuffers(1, &gPBO);
p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
p_glBufferData(GL_PIXEL_UNPACK_BUFFER, (ptrdiff_t)(gW * gH * 4), 0, GL_DYNAMIC_DRAW);
p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
// Create texture
glGenTextures(1, &gTex);
glBindTexture(GL_TEXTURE_2D, gTex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, gW, gH, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindTexture(GL_TEXTURE_2D, 0);
return 1;
}
// ===========================================================
// Tiny NN (LibTorch -> weights)
// ===========================================================
#define NN_IN 2
#define NN_H 16
#define NN_OUT 3
struct TinyMLPImpl : torch::nn::Module {
torch::nn::Linear fc1{nullptr}, fc2{nullptr};
TinyMLPImpl() {
fc1 = register_module("fc1", torch::nn::Linear(NN_IN, NN_H));
fc2 = register_module("fc2", torch::nn::Linear(NN_H, NN_OUT));
}
torch::Tensor forward(torch::Tensor x) {
x = torch::tanh(fc1->forward(x));
x = torch::tanh(fc2->forward(x));
return x;
}
};
TORCH_MODULE(TinyMLP);
static int build_weights_from_libtorch(float* W1, float* b1, float* W2, float* b2)
{
if(!W1 || !b1 || !W2 || !b2) return 0;
try {
torch::NoGradGuard ng;
torch::manual_seed((uint64_t)time(NULL) ^ (uint64_t)GetTickCount64());
TinyMLP m;
m->eval();
auto w1 = m->fc1->weight.detach().contiguous().to(torch::kCPU);
auto bb1 = m->fc1->bias.detach().contiguous().to(torch::kCPU);
auto w2 = m->fc2->weight.detach().contiguous().to(torch::kCPU);
auto bb2 = m->fc2->bias.detach().contiguous().to(torch::kCPU);
memcpy(W1, w1.data_ptr<float>(), sizeof(float)*NN_H*NN_IN);
memcpy(b1, bb1.data_ptr<float>(), sizeof(float)*NN_H);
memcpy(W2, w2.data_ptr<float>(), sizeof(float)*NN_OUT*NN_H);
memcpy(b2, bb2.data_ptr<float>(), sizeof(float)*NN_OUT);
return 1;
}
catch(const c10::Error& e) {
printf("\n[LibTorch] Error: %s", e.what());
return 0;
}
catch(...) {
printf("\n[LibTorch] Unknown error.");
return 0;
}
}
// ===========================================================
// OpenCL (GL sharing)
// ===========================================================
static int gCL_Ready = 0;
static cl_platform_id gCL_Platform = 0;
static cl_device_id gCL_Device = 0;
static cl_context gCL_Context = 0;
static cl_command_queue gCL_Queue = 0;
static cl_program gCL_Program = 0;
static cl_kernel gCL_K_NN = 0;
static cl_mem gCL_PBO = 0; // CL view of GL PBO
static cl_mem gCL_W1 = 0;
static cl_mem gCL_b1 = 0;
static cl_mem gCL_W2 = 0;
static cl_mem gCL_b2 = 0;
#define STR2(x) #x
#define XSTR(x) STR2(x)
static const char* gCL_Source =
"__kernel void nn_render(__global uchar4* out, int width, int height, \n"
" __global const float* W1, __global const float* b1, \n"
" __global const float* W2, __global const float* b2, float phase, uint seed) \n"
"{ \n"
" int xpix = (int)get_global_id(0); \n"
" int ypix = (int)get_global_id(1); \n"
" if(xpix >= width || ypix >= height) return; \n"
" \n"
" float x = ((float)xpix / (float)(width - 1)) * 2.0f - 1.0f; \n"
" float y = ((float)ypix / (float)(height - 1)) * 2.0f - 1.0f; \n"
" uint n = (uint)(xpix*1973u) ^ (uint)(ypix*9277u) ^ (seed*26699u + 911u); \n"
" n = (n << 13) ^ n; \n"
" uint m = (n * (n*n*15731u + 789221u) + 1376312589u); \n"
" float jitter = ((float)(m & 0x00ffffffu) / 16777215.0f) * 2.0f - 1.0f; \n"
" float in0 = 2.8f*x + 0.7f*sin(3.0f*y + phase) + 0.35f*jitter; \n"
" float in1 = -2.8f*y + 0.7f*cos(3.0f*x - 1.3f*phase) - 0.35f*jitter; \n"
" \n"
" float h[" XSTR(NN_H) "]; \n"
" for(int j=0;j<" XSTR(NN_H) ";j++){ \n"
" float acc = b1[j]; \n"
" acc += in0 * W1[j*" XSTR(NN_IN) " + 0]; \n"
" acc += in1 * W1[j*" XSTR(NN_IN) " + 1]; \n"
" h[j] = tanh(acc); \n"
" } \n"
" \n"
" float o[" XSTR(NN_OUT) "]; \n"
" for(int k=0;k<" XSTR(NN_OUT) ";k++){ \n"
" float acc = b2[k]; \n"
" for(int j=0;j<" XSTR(NN_H) ";j++){ \n"
" acc += h[j] * W2[k*" XSTR(NN_H) " + j]; \n"
" } \n"
" float s = 0.5f + 0.5f*tanh(acc); \n"
" if(s<0) s=0; if(s>1) s=1; \n"
" o[k] = s; \n"
" } \n"
" \n"
" float radial = sqrt(x*x + y*y); \n"
" float vignette = clamp(1.15f - radial, 0.0f, 1.0f); \n"
" float stripe = 0.5f + 0.5f*sin(10.0f*(x + y) + phase + 2.0f*jitter); \n"
" float rcol = clamp(0.70f*o[0] + 0.30f*stripe, 0.0f, 1.0f) * vignette; \n"
" float gcol = clamp(0.85f*o[1] + 0.15f*(1.0f - stripe), 0.0f, 1.0f) * vignette; \n"
" float bcol = clamp(0.75f*o[2] + 0.25f*(0.5f + 0.5f*cos(8.0f*x - phase)),0.0f,1.0f);\n"
" uchar r = (uchar)(255.0f*rcol); \n"
" uchar g = (uchar)(255.0f*gcol); \n"
" uchar b = (uchar)(255.0f*bcol); \n"
" out[ypix*width + xpix] = (uchar4)(r,g,b,255); \n"
"} \n";
static void cl_release_all()
{
if(gCL_b2) { clReleaseMemObject(gCL_b2); gCL_b2 = 0; }
if(gCL_W2) { clReleaseMemObject(gCL_W2); gCL_W2 = 0; }
if(gCL_b1) { clReleaseMemObject(gCL_b1); gCL_b1 = 0; }
if(gCL_W1) { clReleaseMemObject(gCL_W1); gCL_W1 = 0; }
if(gCL_PBO) { clReleaseMemObject(gCL_PBO); gCL_PBO = 0; }
if(gCL_K_NN) { clReleaseKernel(gCL_K_NN); gCL_K_NN = 0; }
if(gCL_Program){ clReleaseProgram(gCL_Program); gCL_Program = 0; }
if(gCL_Queue) { clReleaseCommandQueue(gCL_Queue); gCL_Queue = 0; }
if(gCL_Context){ clReleaseContext(gCL_Context); gCL_Context = 0; }
gCL_Device = 0;
gCL_Platform = 0;
gCL_Ready = 0;
}
static int cl_pick_device_with_glshare(cl_platform_id* outP, cl_device_id* outD)
{
cl_uint nPlatforms = 0;
if(clGetPlatformIDs(0, 0, &nPlatforms) != CL_SUCCESS || nPlatforms == 0)
return 0;
cl_platform_id platforms[8];
if(nPlatforms > 8) nPlatforms = 8;
if(clGetPlatformIDs(nPlatforms, platforms, &nPlatforms) != CL_SUCCESS)
return 0;
for(cl_uint p=0; p<nPlatforms; p++)
{
cl_uint nDev = 0;
if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 0, 0, &nDev) != CL_SUCCESS || nDev == 0)
continue;
cl_device_id devs[8];
if(nDev > 8) nDev = 8;
if(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, nDev, devs, &nDev) != CL_SUCCESS)
continue;
for(cl_uint d=0; d<nDev; d++)
{
char ext[8192];
size_t sz = 0;
if(clGetDeviceInfo(devs[d], CL_DEVICE_EXTENSIONS, sizeof(ext), ext, &sz) != CL_SUCCESS)
continue;
if(strstr(ext, "cl_khr_gl_sharing"))
{
*outP = platforms[p];
*outD = devs[d];
return 1;
}
}
}
return 0;
}
static int cl_init_glshare()
{
cl_int err = CL_SUCCESS;
cl_platform_id P = 0;
cl_device_id D = 0;
if(!cl_pick_device_with_glshare(&P, &D)) {
printf("\nOpenCL: no GPU device with cl_khr_gl_sharing found.");
return 0;
}
gCL_Platform = P;
gCL_Device = D;
cl_context_properties props[] = {
CL_GL_CONTEXT_KHR, (cl_context_properties)wglGetCurrentContext(),
CL_WGL_HDC_KHR, (cl_context_properties)wglGetCurrentDC(),
CL_CONTEXT_PLATFORM, (cl_context_properties)gCL_Platform,
0
};
gCL_Context = clCreateContext(props, 1, &gCL_Device, 0, 0, &err);
if(err != CL_SUCCESS || !gCL_Context) { cl_release_all(); return 0; }
gCL_Queue = clCreateCommandQueue(gCL_Context, gCL_Device, 0, &err);
if(err != CL_SUCCESS || !gCL_Queue) { cl_release_all(); return 0; }
gCL_Program = clCreateProgramWithSource(gCL_Context, 1, &gCL_Source, 0, &err);
if(err != CL_SUCCESS || !gCL_Program) { cl_release_all(); return 0; }
err = clBuildProgram(gCL_Program, 1, &gCL_Device, 0, 0, 0);
if(err != CL_SUCCESS)
{
char logbuf[8192];
size_t logsz = 0;
clGetProgramBuildInfo(gCL_Program, gCL_Device, CL_PROGRAM_BUILD_LOG, sizeof(logbuf), logbuf, &logsz);
printf("\nOpenCL build failed:\n%s", logbuf);
cl_release_all();
return 0;
}
gCL_K_NN = clCreateKernel(gCL_Program, "nn_render", &err);
if(err != CL_SUCCESS || !gCL_K_NN) { cl_release_all(); return 0; }
gCL_PBO = clCreateFromGLBuffer(gCL_Context, CL_MEM_WRITE_ONLY, gPBO, &err);
if(err != CL_SUCCESS || !gCL_PBO) { cl_release_all(); return 0; }
size_t bytesW1 = sizeof(float)*(size_t)NN_H*(size_t)NN_IN;
size_t bytesb1 = sizeof(float)*(size_t)NN_H;
size_t bytesW2 = sizeof(float)*(size_t)NN_OUT*(size_t)NN_H;
size_t bytesb2 = sizeof(float)*(size_t)NN_OUT;
gCL_W1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW1, 0, &err);
gCL_b1 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb1, 0, &err);
gCL_W2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesW2, 0, &err);
gCL_b2 = clCreateBuffer(gCL_Context, CL_MEM_READ_ONLY, bytesb2, 0, &err);
if(err != CL_SUCCESS || !gCL_W1 || !gCL_b1 || !gCL_W2 || !gCL_b2) { cl_release_all(); return 0; }
float hW1[NN_H*NN_IN];
float hb1[NN_H];
float hW2[NN_OUT*NN_H];
float hb2[NN_OUT];
if(!build_weights_from_libtorch(hW1, hb1, hW2, hb2)) {
printf("\n[LibTorch] Failed to build weights.");
cl_release_all();
return 0;
}
err = clEnqueueWriteBuffer(gCL_Queue, gCL_W1, CL_TRUE, 0, bytesW1, hW1, 0, 0, 0);
if(err != CL_SUCCESS) { cl_release_all(); return 0; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_b1, CL_TRUE, 0, bytesb1, hb1, 0, 0, 0);
if(err != CL_SUCCESS) { cl_release_all(); return 0; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_W2, CL_TRUE, 0, bytesW2, hW2, 0, 0, 0);
if(err != CL_SUCCESS) { cl_release_all(); return 0; }
err = clEnqueueWriteBuffer(gCL_Queue, gCL_b2, CL_TRUE, 0, bytesb2, hb2, 0, 0, 0);
if(err != CL_SUCCESS) { cl_release_all(); return 0; }
gCL_Ready = 1;
printf("\nOpenCL: GL-sharing enabled. NN kernel ready.");
return 1;
}
// ===========================================================
// Render (CL -> GL)
// ===========================================================
static void RenderFrame()
{
if(!gCL_Ready) return;
size_t global[2] = { (size_t)gW, (size_t)gH };
size_t local[2] = { 16, 16 };
cl_int err = CL_SUCCESS;
err = clEnqueueAcquireGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
if(err != CL_SUCCESS) return;
LARGE_INTEGER qpc;
QueryPerformanceCounter(&qpc);
gNoiseSeed = (unsigned int)(qpc.QuadPart ^ (qpc.QuadPart >> 32) ^ (LONGLONG)GetTickCount64());
int arg = 0;
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_PBO);
clSetKernelArg(gCL_K_NN, arg++, sizeof(int), &gW);
clSetKernelArg(gCL_K_NN, arg++, sizeof(int), &gH);
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W1);
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b1);
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_W2);
clSetKernelArg(gCL_K_NN, arg++, sizeof(cl_mem), &gCL_b2);
clSetKernelArg(gCL_K_NN, arg++, sizeof(float), &gPhase);
clSetKernelArg(gCL_K_NN, arg++, sizeof(unsigned int), &gNoiseSeed);
err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, local, 0, 0, 0);
if(err != CL_SUCCESS) {
err = clEnqueueNDRangeKernel(gCL_Queue, gCL_K_NN, 2, 0, global, 0, 0, 0, 0);
}
clEnqueueReleaseGLObjects(gCL_Queue, 1, &gCL_PBO, 0, 0, 0);
clFinish(gCL_Queue);
p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gPBO);
glBindTexture(GL_TEXTURE_2D, gTex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, gW, gH, GL_RGBA, GL_UNSIGNED_BYTE, 0);
p_glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
glClear(GL_COLOR_BUFFER_BIT);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, gTex);
glBegin(GL_QUADS);
glTexCoord2f(0,0); glVertex2f(-1,-1);
glTexCoord2f(1,0); glVertex2f( 1,-1);
glTexCoord2f(1,1); glVertex2f( 1, 1);
glTexCoord2f(0,1); glVertex2f(-1, 1);
glEnd();
glBindTexture(GL_TEXTURE_2D, 0);
SwapBuffers(gHdc);
gPhase += 0.03f;
}
// ===========================================================
// WinMain
// ===========================================================
int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
// 0 means no auto-close; window stays until user closes it.
const int maxSeconds = read_env_int("MENDB02_MAX_SECONDS", 0);
ULONGLONG startTick = GetTickCount64();
const char* szClass = "Mendb02NNCLGLClass";
UnregisterClassA(szClass, hInst);
WNDCLASSEXA wc;
ZeroMemory(&wc, sizeof(wc));
wc.cbSize = sizeof(wc);
wc.style = CS_HREDRAW | CS_VREDRAW;
wc.lpfnWndProc = WndProc;
wc.hInstance = hInst;
wc.hCursor = LoadCursor(NULL, IDC_ARROW);
wc.lpszClassName = szClass;
RegisterClassExA(&wc);
RECT r;
r.left=0; r.top=0; r.right=gW; r.bottom=gH;
AdjustWindowRect(&r, WS_OVERLAPPEDWINDOW, FALSE);
HWND hwnd = CreateWindowExA(
0, szClass, "NN Render (LibTorch weights + OpenCL + OpenGL)",
WS_OVERLAPPEDWINDOW,
100, 100, (r.right-r.left), (r.bottom-r.top),
0, 0, hInst, 0);
if(!hwnd) return 0;
ShowWindow(hwnd, SW_SHOW);
UpdateWindow(hwnd);
if(!gl_init_wgl(hwnd))
{
MessageBoxA(hwnd, "OpenGL init failed", "Error", MB_OK);
gl_release_all();
return 0;
}
if(!cl_init_glshare())
{
MessageBoxA(hwnd, "OpenCL GL-sharing init failed", "Error", MB_OK);
cl_release_all();
gl_release_all();
return 0;
}
MSG msg;
ZeroMemory(&msg, sizeof(msg));
while(msg.message != WM_QUIT)
{
while(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
{
TranslateMessage(&msg);
DispatchMessage(&msg);
}
// Allow Zorro STOP to close this Win32 loop cleanly, but ignore
// the sticky FIRSTINITRUN+EXITRUN combo seen at startup.
if(is(EXITRUN) && !is(FIRSTINITRUN)) {
PostMessage(hwnd, WM_CLOSE, 0, 0);
}
if(!IsWindow(hwnd))
break;
if(maxSeconds > 0 && (GetTickCount64() - startTick) >= (ULONGLONG)maxSeconds * 1000ULL) {
PostMessage(hwnd, WM_CLOSE, 0, 0);
}
RenderFrame();
}
cl_release_all();
gl_release_all();
gHwnd = 0;
return 0;
}
// ===========================================================
// Input
// ===========================================================
LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
switch(msg)
{
case WM_CLOSE:
DestroyWindow(hWnd);
return 0;
case WM_KEYDOWN:
if(wParam == VK_ESCAPE || wParam == VK_F12) {
PostMessage(hWnd, WM_CLOSE, 0, 0);
return 0;
}
return 0;
case WM_DESTROY:
PostQuitMessage(0);
return 0;
}
return DefWindowProc(hWnd, msg, wParam, lParam);
}
// ===========================================================
// Zorro DLL entry
// ===========================================================
DLLFUNC int main()
{
// Force single-cycle execution in Zorro to avoid automatic relaunches.
NumTotalCycles = 1;
NumWFOCycles = 1;
NumSampleCycles = 1;
set(TESTNOW|OFF,ALLCYCLES|OFF,PARAMETERS|OFF,FACTORS|OFF,RULES|OFF);
static int done = 0;
if(is(FIRSTINITRUN))
done = 0;
if(done)
return 0;
(void)WinMain(GetModuleHandleA(NULL), NULL, GetCommandLineA(), SW_SHOWDEFAULT);
done = 1;
return quit("!Mendb02 finished");
}
|