Originally Posted By: Schubido
Have you done a performance test with bmap_blit? I think HeelX code should do the job perfectly.
I designed a stress test to measure the time consumed in both cases: bmap_blit (CPU) and bmap_process (GPU). Both methods create the same number of bitmap copies, and I measure the elapsed time with dtimer(). Each iteration creates a new source bitmap to rule out engine optimizations, such as the engine not updating the source image pointer for the shader because it has not changed since the last call of bmap_process.

The following test calculates the mean processing time of both methods over 100 iterations on a 512x512x24 image:

Code:
#include <acknex.h>
#include <stdio.h>

BMAP* bmap_clone (BMAP* src);
BMAP* bmap_cloneGPU (BMAP* src);

int main()
{
   wait(1);
   
   int iterations = 100;
   int width = 512, height = 512, format = 24;
   
   double t_cpu = 0;
   
      int i;
      for (i = 0; i < iterations; i++)
      {
         BMAP* b = bmap_createblack(width,height,format); // src
         
         dtimer();
         BMAP* c = bmap_clone(b);
         t_cpu += dtimer();
         
         ptr_remove(b);
         ptr_remove(c);       
      }

   t_cpu /= (double)iterations;
   
   double t_gpu = 0;
   
      for (i = 0; i < iterations; i++) // i is already declared above
      {
         BMAP* b = bmap_createblack(width,height,format); // src
         
         dtimer();
         BMAP* c = bmap_cloneGPU(b);
         t_gpu += dtimer();
         
         ptr_remove(b);
         ptr_remove(c);
      }
   
   t_gpu /= (double)iterations;

   char log [512];
   sprintf(log, "bitmap = %dx%dx%d\niterations = %d\nbmap_clone (CPU) ~ %f microsecs\nbmap_clone (GPU) ~ %f microsecs", width, height, format, iterations, t_cpu, t_gpu);
   error(log);
}

BMAP* bmap_clone (BMAP* src)
{
   // no source?
   if (src == NULL)
      return(NULL);
      
   var format = bmap_lock(src, 0);
   
   // invalid bitmap or unknown format?
   if (format == 0)
      return(NULL);
   else  
      bmap_unlock(src);
   
   BMAP* r = bmap_createblack(src->width, src->height, format);
   
   if (r != NULL)
      bmap_blit(r, src, NULL, NULL);
   
   return(r);
}

MATERIAL* mtlCopy =
{
   effect = "
      
      Texture TargetMap; // src
      sampler2D smpSrc = sampler_state
      {
         texture = <TargetMap>;
      };

      float4 PS (float2 Tex: TEXCOORD0): COLOR
      {
         return(tex2D(smpSrc, Tex.xy));
      }

      technique t
      {
         pass p
         {
            AlphaBlendEnable = false;
            PixelShader = compile ps_2_0 PS();
         }
      }
   ";
}

BMAP* bmap_cloneGPU (BMAP* src)
{
   // no source?
   if (src == NULL)
      return(NULL);
      
   var format = bmap_lock(src, 0);
   
   // invalid bitmap or unknown format?
   if (format == 0)
      return(NULL);
   else  
      bmap_unlock(src);
   
   BMAP* r = bmap_createblack(src->width, src->height, format);
   
   if (r != NULL)
   {
      bmap_process(r, src, mtlCopy);
   }
   
   return(r);
}



Here are my results for a 512x512x24 bitmap on my system:

iterations = 1
bmap_clone (CPU) ~ 14110.593750 microsecs
bmap_clone (GPU) ~ 19934.037109 microsecs

iterations = 5
bmap_clone (CPU) ~ 14125.698242 microsecs
bmap_clone (GPU) ~ 8372.354492 microsecs

iterations = 100
bmap_clone (CPU) ~ 14442.785156 microsecs
bmap_clone (GPU) ~ 4864.554688 microsecs

iterations = 1000
bmap_clone (CPU) ~ 14664.015625 microsecs
bmap_clone (GPU) ~ 4767.796875 microsecs

As you can see, copying with bmap_blit consistently takes around 14k microsecs. bmap_process clearly has an initial overhead: on its very first use it takes even longer than bmap_blit (about 20k microsecs). After the first few iterations, however, the mean time for copying with bmap_process drops dramatically and settles at around 5k microsecs or less. In the long run, bmap_process needs roughly 1/3 of the time that bmap_blit needs for the same copy.
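To make the initial overhead a bit more tangible, the reported means can be decomposed with simple arithmetic. The following is only a sketch in plain C: the helper names mean_of_remaining and print_derived are made up for illustration, and the calculation assumes that the first call in the 5-iteration run cost about the same as the single call in the 1-iteration run, which is not guaranteed because these were separate runs.

```c
#include <assert.h>
#include <stdio.h>

/* Mean of runs (k+1)..n, given the mean over the first k runs
   and the mean over all n runs. Hypothetical helper. */
double mean_of_remaining(double mean_first_k, int k, double mean_all_n, int n)
{
    assert(k >= 0 && n > k);
    return (mean_all_n * n - mean_first_k * k) / (double)(n - k);
}

/* Prints figures derived from the results posted above. */
void print_derived(void)
{
    double gpu_1 = 19934.037109;    /* GPU mean, iterations = 1    */
    double gpu_5 = 8372.354492;     /* GPU mean, iterations = 5    */
    double cpu_1000 = 14664.015625; /* CPU mean, iterations = 1000 */
    double gpu_1000 = 4767.796875;  /* GPU mean, iterations = 1000 */

    printf("GPU mean, iterations 2..5 ~ %f microsecs\n",
           mean_of_remaining(gpu_1, 1, gpu_5, 5));
    printf("GPU/CPU ratio at 1000 iterations ~ %f\n",
           gpu_1000 / cpu_1000);
}
```

Under that assumption, iterations 2 to 5 average roughly 5.5k microsecs each, so most of the extra cost sits in the very first call, and the ratio at 1000 iterations comes out at about 0.33.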

Some questions remain open: how much time bmap_lock consumes when it is only called to get the format for creating a new bitmap that matches the source image, and whether the mean time of bmap_process increases if another shader is used in between (possible overhead for switching shader code).
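One possible way to approach the first question is to time a single step in isolation. The sketch below is plain C, not lite-C: mean_step_time_us, copy_step and step_calls are hypothetical names, the memcpy over a 512x512x3 byte buffer is only a stand-in for the step under test, and clock() measures CPU time. In the engine, the stand-in would be replaced by the bmap_lock/bmap_unlock pair and the clock() calls by dtimer().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

typedef void (*step_fn)(void);

/* Stand-in buffers with the size of a 512x512x24 image. */
unsigned char src_buf[512 * 512 * 3];
unsigned char dst_buf[512 * 512 * 3];
int step_calls = 0;

/* Hypothetical stand-in for the step to be measured. */
void copy_step(void)
{
    memcpy(dst_buf, src_buf, sizeof dst_buf);
    step_calls++;
}

/* Runs the step 'iterations' times and returns the mean CPU time
   per call in microseconds. */
double mean_step_time_us(step_fn step, int iterations)
{
    assert(step != NULL && iterations > 0);

    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        step();
    clock_t end = clock();

    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC / iterations;
}
```

Calling mean_step_time_us(copy_step, 100) returns the mean for the stand-in. Subtracting the mean of the isolated step from the mean of the full clone function would give a rough idea of how much that step contributes.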

Conclusion: If you need performance and do lots of bitmap operations that can be done in a fragment shader, go with bmap_process!