Shaders aren't normally compiled until the first time they're used, and I think that's likely the biggest part of the initial overhead. My test results varied a lot, although even at low iteration counts the GPU beat the CPU (that's just because my GPU's a beast). But if I called the bmap_cloneGPU function once early on, so that the shader was already compiled, even a single iteration took much less time.
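To show the idea in isolation, here's a minimal sketch in plain C of lazy compilation plus a warm-up call. Everything in it (compile_shader, run_effect, warm_up) is a hypothetical stand-in I made up for illustration; it is not bmap_cloneGPU or any real engine API, and the "compile" is just a flag and a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only; not a real engine API. */
static bool shader_compiled = false;
static int compile_count = 0;

/* Simulates the expensive one-time shader compile. */
static void compile_shader(void)
{
    compile_count++;
    shader_compiled = true;
}

/* Simulates a GPU effect call that compiles its shader lazily.
   Returns true if this particular call had to pay the compile cost. */
static bool run_effect(void)
{
    bool paid_compile_cost = !shader_compiled;
    if (paid_compile_cost)
        compile_shader();
    /* The actual per-call GPU work would happen here. */
    return paid_compile_cost;
}

/* Warm-up: call the effect once during loading so that later,
   timed calls no longer include the compile. */
static void warm_up(void)
{
    (void)run_effect();
}
```

If you call warm_up() while the level loads and only start timing afterwards, the first measured run_effect() call shouldn't include the compile any more, which is the same thing I saw when calling bmap_cloneGPU early.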

Jibb


Formerly known as JulzMighty.
I made KarBOOM!