Alright, so far, so good. We are able to pump a lot of instructions through the CPU in very few clock cycles. However, there are two big issues. The first one is that RAM is slow. Like, really slow: a RAM access can take upwards of 100 clock cycles to complete, which is obviously not ideal, because you will probably end up stalling your neat little pipeline. Luckily, just like there are branch predictors, there are prefetchers that try to figure out what memory will be needed next and make it available ahead of time. The CPU also has various caches that are very fast, and the results of RAM fetches are stored there temporarily so the next time they are accessed it won’t take long to retrieve them. Of course, the space in these caches is limited, so the CPU has to be smart about what it keeps in there and what it discards, so that in an ideal world it never has to stall because of memory access.

On top of that, a memory access is usually followed by another memory access close to the one you just did. For example, when reading values from a struct, the fields are all close to each other and most likely you will read several of them at once. This is why the prefetcher always fetches a whole cache line, which is usually 64 bytes in size. Even if the memory access is only 1 byte, more memory is always fetched “just in case”.
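If you want to see the cache line effect for yourself, here is a minimal C++ sketch. The 64 byte line size and the exact timings are assumptions about your machine, but the shape of the result should hold on pretty much any modern x86 box:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    // 256 MiB of ints, way bigger than any cache, so every line comes from RAM
    std::vector<int> data(64 * 1024 * 1024, 1);

    auto measure = [&](size_t stride, const char *label) {
        auto start = std::chrono::steady_clock::now();
        for(size_t i = 0; i < data.size(); i += stride)
            data[i] *= 3;
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld ms\n", label, static_cast<long long>(ms));
    };

    measure(1,  "every int     "); // touches every element
    measure(16, "every 16th int"); // touches 1/16th of the elements, but still
                                   // hits every 64 byte cache line (16 * 4 bytes)
}
```

On most machines both loops finish in roughly the same time, even though the second one only does a sixteenth of the multiplications: both of them end up pulling every single cache line from RAM, so the memory traffic dominates and the arithmetic is basically free.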

The second problem is that 32-bit x86 only has 8 general purpose registers. Most instructions read from or write to one of those registers, so there is quite a bit of contention there. That’s why modern CPUs have hundreds of physical registers and a register renamer! Instead of EAX being tied to one physical register, the CPU just associates it with whatever physical register is convenient and can freely change which one it means when talking about EAX. Of course, since all calculations still have to appear as if EAX were a single register, it can’t switch willy-nilly, but it can break up dependency chains very easily that way. For example, imagine a chain of instructions doing some calculation that uses EAX as temporary storage, followed by another chain which first zeroes EAX out and then does some calculations of its own. The CPU can change which physical register the second chain refers to, and both chains can run in parallel. Even better, the renamer can handle the zeroing itself, so zeroing EAX effectively becomes a no-op.
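To make the dependency chain example a bit more concrete, here is a small C++ sketch. Whether your compiler actually maps both chains onto EAX depends on the compiler and optimization flags, so treat this as an illustration of the situation, not a guarantee:

```cpp
#include <cstdint>

uint32_t two_chains(uint32_t a, uint32_t b)
{
    // Chain 1: uses a temporary (think: EAX) for a running calculation.
    uint32_t tmp = a;
    tmp = tmp * 7 + 1;
    tmp = tmp ^ (tmp >> 3);
    uint32_t first = tmp;

    // Chain 2: starts by overwriting the temporary completely (in compiled
    // code this often shows up as "xor eax, eax"), so it does not actually
    // depend on anything chain 1 computed.
    tmp = 0;
    tmp += b;
    tmp = tmp * 13 + 5;

    // Because chain 2 begins with a full overwrite, the renamer can hand it a
    // different physical register and let both chains run in parallel, and
    // the xor-zeroing itself can be resolved at rename time without ever
    // occupying an execution unit.
    return first + tmp;
}
```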

Wrap up for now

So, let’s see... A modern CPU pipeline looks something like: Branch predictor -> Instruction fetcher -> Instruction decoder -> Re-order buffer -> Register renamer -> Execution engine -> Retirement station. That’s 7 stages. Not quite the 14 we were told the CPU would have, and in fact does have, but it ought to be enough for a very basic overview, just to get a rough idea of what the CPU does. Anyone still with me so far? Questions? Anything unclear? I’ll write a follow-up that moves away from the generic “let’s look at how it’s done in general” overview and actually takes a look at a real microarchitecture. Interest?

Edit: Wrong forum, eh? Can a mod move this somewhere more appropriate?

