No, the genetic algorithm is an order of magnitude slower, especially the Multicharts variant, simply because it needs more test cycles than the current method. The advantage is not speed; it is that it finds more local maxima.

Distributing the training process across several cores would require that no algo or asset uses variables from other algos or assets. That would not work for many systems, for instance not for Z12. And it has no real advantage anyway, as you normally have fewer CPU cores than WFO cycles.

CUDA is a C dialect for the Open64 compiler and has nothing to do with Python. You cannot just "run something in CUDA"; you must specifically program it in CUDA to take advantage of the GPU.
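For a sense of what "specifically program it in CUDA" means, here is a minimal sketch (a generic vector addition, nothing Zorro-specific): you write a `__global__` kernel, launch it with an explicit grid/block configuration, and synchronize with the GPU yourself. None of this happens automatically for ordinary C or Python code:

```cuda
#include <cstdio>

// The kernel runs on the GPU, one thread per array element.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // unified memory keeps explicit host/device copies out of the sketch
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // explicit kernel launch
    cudaDeviceSynchronize();                    // wait for the GPU

    printf("c[10] = %f\n", c[10]);              // 10 + 20 = 30.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

This is compiled with nvcc, not an ordinary C compiler, and the payoff only comes when the problem is massively parallel; a plain backtest loop gains nothing from being moved to the GPU.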