0.001 seconds seems very slow even for copying a huge memory area, regardless of the method.

Anyway, the reason for the performance problem is the CPU cache. When you access a memory location, the whole cache line containing it is loaded. The next time you touch that area, the CPU reads from the cache instead of main memory.

The more distinct memory areas you access, the more cache lines are needed. When the cache is full, old lines are evicted and new ones are filled, which causes a severe performance hit. That's why in time-critical programs, especially 3D engines, programmers avoid reading from many separately allocated memory areas at the same time. The effect gets worse the bigger the memory areas are and the more fragmented the memory is.
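To make the difference concrete, here is a rough sketch in C. It assumes the method in question is a table of pointers with one allocation per row, which is my reading of the question rather than something stated there. All function names are made up for illustration. The first layout scatters rows across the heap; the second keeps everything in one block, so a linear walk touches consecutive cache lines.

```c
#include <assert.h>
#include <stdlib.h>

/* Layout A (assumed to be "that method"): one allocation per row.
 * The rows can end up anywhere on the heap. */
int **alloc_rows(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    int **m = malloc(rows * sizeof *m);
    if (m == NULL)
        return NULL;
    for (size_t r = 0; r < rows; r++) {
        m[r] = calloc(cols, sizeof **m);
        if (m[r] == NULL) {          /* undo the rows allocated so far */
            while (r > 0)
                free(m[--r]);
            free(m);
            return NULL;
        }
    }
    return m;
}

void free_rows(int **m, size_t rows)
{
    if (m == NULL)
        return;
    for (size_t r = 0; r < rows; r++)
        free(m[r]);
    free(m);
}

/* Layout B: a single contiguous block; element (r, c) lives at r * cols + c. */
int *alloc_flat(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    return calloc(rows * cols, sizeof(int));
}

/* Every row access goes through a pointer to a separate heap block. */
long sum_rows(int **m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r][c];
    return total;
}

/* One linear pass over consecutive addresses. */
long sum_flat(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t i = 0; i < rows * cols; i++)
        total += m[i];
    return total;
}
```

Both sum functions return the same result for the same data. How much faster the flat version runs depends on the allocator, the matrix size, and the cache sizes of the machine, so measure it on your own data before drawing conclusions.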

There are more problems with that method. For instance, you risk a crash when you initialize that construct with a single memset call, as you would with an array. I would not use that method for non-time-critical applications either.
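The memset pitfall looks roughly like this, again assuming the construct is a table of row pointers (my assumption, and `clear_rows` is a hypothetical helper). The dangerous call is shown only in a comment; the function below is the row-by-row alternative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* WRONG for a pointer table (shown as a comment on purpose):
 *
 *     memset(m, 0, rows * cols * sizeof(int));
 *
 * m points at `rows` pointers, not at rows * cols ints. The call
 * overwrites the row pointers and will usually run past the end of
 * the pointer table, so a later m[r][c] is likely to crash.
 */

/* Clear each row through its own pointer instead. */
void clear_rows(int **m, size_t rows, size_t cols)
{
    assert(m != NULL);
    for (size_t r = 0; r < rows; r++)
        memset(m[r], 0, cols * sizeof **m);
}
```

With a real array such as `int a[4][8]`, a single `memset(a, 0, sizeof a)` is fine, because the whole array is one contiguous object.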