Have you done a performance test with bmap_blit? I think HeelX's code should do the job perfectly.
I designed a stress test to measure the time consumed by both approaches, bmap_blit (CPU) and bmap_process (GPU). I create the same number of bitmap copies with both methods and measure the elapsed time with dtimer(). For each iteration I create a new source bitmap to rule out engine optimizations, such as the engine not updating the source image pointer for the shader because the pointer has not changed since the last call of bmap_process.
The following test calculates the mean processing time of both methods over 100 iterations for a 512x512x24 image:
#include <acknex.h>
#include <stdio.h>
BMAP* bmap_clone (BMAP* src);
BMAP* bmap_cloneGPU (BMAP* src);
int main()
{
    wait(1);

    int iterations = 100;
    int width = 512, height = 512, format = 24;
    int i;

    // CPU path: accumulate the time of each bmap_blit copy
    double t_cpu = 0;
    for (i = 0; i < iterations; i++)
    {
        BMAP* b = bmap_createblack(width,height,format); // src
        dtimer(); // reset the timer
        BMAP* c = bmap_clone(b);
        t_cpu += dtimer(); // time since the reset
        ptr_remove(b);
        ptr_remove(c);
    }
    t_cpu /= (double)iterations;

    // GPU path: accumulate the time of each bmap_process copy
    double t_gpu = 0;
    for (i = 0; i < iterations; i++)
    {
        BMAP* b = bmap_createblack(width,height,format); // src
        dtimer(); // reset the timer
        BMAP* c = bmap_cloneGPU(b);
        t_gpu += dtimer(); // time since the reset
        ptr_remove(b);
        ptr_remove(c);
    }
    t_gpu /= (double)iterations;

    char log [512];
    sprintf(log, "bitmap = %dx%dx%d\niterations = %d\nbmap_clone (CPU) ~ %f microsecs\nbmap_clone (GPU) ~ %f microsecs", width, height, format, iterations, t_cpu, t_gpu);
    error(log);
}
BMAP* bmap_clone (BMAP* src)
{
    // no source?
    if (src == NULL)
        return(NULL);

    var format = bmap_lock(src, 0);

    // invalid bitmap or unknown format?
    if (format == 0)
        return(NULL);
    else
        bmap_unlock(src);

    BMAP* r = bmap_createblack(src->width, src->height, format);
    if (r != NULL)
        bmap_blit(r, src, NULL, NULL);

    return(r);
}
MATERIAL* mtlCopy =
{
    effect = "
        Texture TargetMap; // src

        sampler2D smpSrc = sampler_state
        {
            texture = <TargetMap>;
        };

        float4 PS (float2 Tex: TEXCOORD0): COLOR
        {
            return(tex2D(smpSrc, Tex.xy));
        }

        technique t
        {
            pass p
            {
                AlphaBlendEnable = false;
                PixelShader = compile ps_2_0 PS();
            }
        }
    ";
}
BMAP* bmap_cloneGPU (BMAP* src)
{
    // no source?
    if (src == NULL)
        return(NULL);

    var format = bmap_lock(src, 0);

    // invalid bitmap or unknown format?
    if (format == 0)
        return(NULL);
    else
        bmap_unlock(src);

    BMAP* r = bmap_createblack(src->width, src->height, format);
    if (r != NULL)
    {
        bmap_process(r, src, mtlCopy);
    }

    return(r);
}
Here are my results for a 512x512x24 bitmap on my system:
iterations = 1
bmap_clone (CPU) ~ 14110.593750 microsecs
bmap_clone (GPU) ~ 19934.037109 microsecs
iterations = 5
bmap_clone (CPU) ~ 14125.698242 microsecs
bmap_clone (GPU) ~ 8372.354492 microsecs
iterations = 100
bmap_clone (CPU) ~ 14442.785156 microsecs
bmap_clone (GPU) ~ 4864.554688 microsecs
iterations = 1000
bmap_clone (CPU) ~ 14664.015625 microsecs
bmap_clone (GPU) ~ 4767.796875 microsecs
As you can see, copying with bmap_blit consistently takes around 14k microsecs. bmap_process obviously has an initial overhead: on its first use it takes even longer than bmap_blit (about 20k microsecs). After the very first iterations, however, its mean time drops dramatically and settles at around 5k microsecs or less. In the long run, bmap_process needs roughly 1/3 of the time that copying with bmap_blit takes.
There are still open questions: how much time bmap_lock consumes when it is called to get the format for creating a new bitmap that matches the source image, and whether the mean time of bmap_process increases if another shader is used in between (possible overhead for switching shader code).
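To make the arithmetic behind these figures explicit, here is a small plain C sketch (not engine code) that works only on the numbers posted above. The helper names speedup and mean_without_first are my own, hypothetical ones. The second helper assumes that the first call in every run costs about as much as the single-iteration run, which is only an estimate because those were separate runs.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of CPU time to GPU time; values above 1 mean the GPU path is faster. */
double speedup(double t_cpu, double t_gpu)
{
    return t_cpu / t_gpu;
}

/* Mean time per copy once the first call is taken out of a mean over n calls.
   Assumption: the first call cost first_call microsecs (taken from the
   single-iteration run, so this is an estimate, not a measurement). */
double mean_without_first(double mean_n, int n, double first_call)
{
    assert(n > 1); /* needs at least one call besides the first */
    return (mean_n * n - first_call) / (n - 1);
}

/* Hypothetical helper: prints both figures for the values posted above. */
void print_summary(void)
{
    printf("speedup at 1000 iterations: %f\n",
           speedup(14664.015625, 4767.796875));
    printf("GPU mean without first call (5 iterations): %f microsecs\n",
           mean_without_first(8372.354492, 5, 19934.037109));
}
```

With the posted values this gives a ratio of roughly 3.1 at 1000 iterations, and about 5480 microsecs per copy for the four non-initial calls of the 5-iteration run, which fits the picture of a one-time setup cost on the first bmap_process call.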
Conclusion: If you need performance and do lots of bitmap operations which can be done in a fragment shader, go with bmap_process!