most of the time there's one bottleneck, be it bandwidth or vertex processing power. sometimes it's the scripts, but then you've either got an incredibly time-consuming simulation or you're simply not good at scripting. there are hundreds of articles out there describing how to determine the bottleneck, F11 being one way to check. there has also been some program (by nvidia?) that could measure the performance of single draw calls; we once asked jcl to enable the engine for this tool (you have to compile the application with a specific driver), but i'm not sure whether that has been done.
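
for what it's worth, the usual approach from those articles can be sketched in a few lines: measure the frame time normally, then again with one kind of load reduced at a time (lower resolution, simpler meshes, scripts switched off), and see which change saves the most. this is only a rough sketch under my own assumptions; `classify_bottleneck`, its parameters and the 15% threshold are made up for illustration and are not part of the engine or of any tool mentioned above.

```python
def classify_bottleneck(baseline_ms, low_res_ms, low_poly_ms, no_scripts_ms,
                        threshold=0.15):
    """Guess the limiting stage from frame times (in milliseconds).

    Hypothetical helper: each argument after baseline_ms is the frame time
    measured with exactly one kind of load reduced. The 15% threshold is an
    arbitrary assumption, not a measured value.
    """
    # Frame time saved by each reduction; the biggest saving points at
    # the stage that was limiting the frame rate.
    savings = {
        "bandwidth/fill rate": baseline_ms - low_res_ms,
        "vertex processing": baseline_ms - low_poly_ms,
        "scripts": baseline_ms - no_scripts_ms,
    }
    stage, saved = max(savings.items(), key=lambda item: item[1])

    # If nothing helped noticeably, don't pretend to know the answer.
    if saved < threshold * baseline_ms:
        return "unclear"
    return stage


if __name__ == "__main__":
    # Example numbers are invented: lowering the resolution helped a lot,
    # the other two changes barely mattered.
    print(classify_bottleneck(20.0, 12.0, 19.5, 19.8))
```

the frame times themselves you'd have to read off by hand (e.g. from F11); the sketch only does the comparison.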