Alright, this is something that I’ve wanted to write for quite some time already, mostly to establish nerd cred. It’s about the inner workings of a modern CPU, which I find to be an incredibly interesting topic, and it helps with writing performant code. I’ll be focusing on Intel and the x86 architecture, because that is what I mainly know about. CPUs are described by their architecture, in this case the x86 architecture (i.e. each CPU can execute the x86 instruction set), and their microarchitecture, which is the actual implementation detail of how the CPU goes about running that architecture. Intel follows the Tick-Tock model, where one generation they come out with a new microarchitecture which greatly changes how the CPU performs, and the next generation is practically the same microarchitecture but on a much smaller process (from 45nm to 32nm, for example). Of course, I don’t want to spark a race of premature optimisations, but it helps to design code from the get-go to be at least not horribly inefficient. Anyway, with that being said, modern CPUs are massive beasts of performance and you can do things inefficiently and still get away with it.

General overview

The reason you can get away with it is because CPUs are complex as hell. Supposedly there used to be a time where a CPU would read an instruction and then execute it, all in the same clock cycle. I.e. the throughput of your instructions was directly dependent on the clock frequency of the CPU. This hasn’t been the case in a very, very long time. Nowadays CPUs are pipelined, meaning that the whole business of reading and then doing something with an instruction is broken up into small stages, and each individual part of the pipeline runs on every clock cycle. A simple pipeline would be to fetch an instruction, decode the instruction and then execute the instruction. Three stages in total, so an instruction will take at least three clock cycles to complete: first clock cycle it’s fetched, second clock cycle it’s decoded and third clock cycle it’s actually executed. Of course, now the CPU can work on three instructions at once.

The reason for this is that it keeps the circuitry simpler and less space is needed. It also allows running the CPU at higher clock rates, because a new clock cycle can only ever begin once the signals of the previous cycle have made it through the circuit. The longer the whole circuit, the longer the signal time. If you can keep the whole thing simple and divided into small steps, a clock cycle can be a lot shorter and a new cycle can begin much sooner.

Now, a modern CPU has way more than just three pipeline stages. The Core 2 microarchitecture has a 14-stage pipeline. Long pipelines are undesirable because there is a huge penalty in case the pipeline has to be rolled back. The Pentium 4 (NetBurst microarchitecture) had a pipeline of up to 31 stages, which allowed for insanely high clock frequencies at the time, but at the cost of having to toss up to 31 instructions away in case the pipeline had to be rolled back (why that might have to be done I’ll get to in a minute). Long story short, NetBurst was a flop in terms of achieving high performance; however, it turned out to be great at generating a ton of heat.

Another thing is that modern CPUs no longer execute the instructions directly but instead translate them first. x86 is the big CISC architecture out there, CISC standing for Complex Instruction Set Computer, meaning that the instruction set doesn’t just have a set of basic instructions on top of which complex behaviour has to be composed (by the programmer or compiler); instead, x86 has instructions for virtually everything. In fact, some of the instructions have so many nuances and so much “power” that they are Turing complete by themselves, i.e. a compiler could compile code using only one instruction! Of course, that would be quite bad for performance, but hey, it gives you a nice idea of just how complex the instruction set really is. Anyway, as it turns out, that’s a rather stupid idea if you want to keep your circuitry simple. So a modern x86 CPU will translate instructions into so-called micro-ops. For example, in x86 you can add a value from memory to a register. Executing this directly would be rather complex, because now the execution engine for integer math suddenly also has to be able to load from memory. So instead, the CPU will go ahead and translate it into two micro-ops: one to fetch the value from memory and store it somewhere temporarily, and a second that adds that value to the register.

So, the CPU will fetch an instruction and then decode it into micro-ops, which are then executed. Great, except the CPU needs to know what to even fetch in the first place! Imagine a 14-stage pipeline, and now the CPU decodes a conditional jump that depends on the value computed by an instruction that is currently at stage 9 of the pipeline. This could be as simple as an if based on a computed value: the CPU doesn’t know yet if the jump is taken or not, because the instruction still has 5 stages left in the pipeline before it completes and the value becomes available. Of course, you could just halt everything and wait for the instruction to complete, but that is exactly the opposite of what you want! CPUs need high throughput, very, very high throughput. There is a physical limit to clock frequency, not to mention an economical one, and the way to overcome this is to keep the same frequency but increase the throughput of the CPU. So, stalling the pipeline is not an option. Instead, the CPU has a so-called branch predictor, whose sole job it is to predict whether a branch is taken or not. The branch predictor will tell the CPU which instruction to fetch next, based on voodoo and tarot cards. Not really, but predicting the future is hard. If the branch is taken randomly, the CPU will have a very, very hard time predicting it; however, programs usually have certain patterns to them, which the branch predictor can detect and then use to make a guesstimate. For example, you may have a loop with a condition that makes the loop repeat 100 times, maybe because you are iterating over an array. The branch in this case is at the end of the loop, where it’s checked whether the loop is complete or should be repeated, and the branch predictor will soon detect the pattern that the loop is repeated over and over again, so it will predict the branch to be taken.
Of course, this will fail once the condition stops being true, but in the best case you end up with the branch predictor being right 99 times and wrong once! This is great because it can now keep the pipeline highly saturated. So what happens when the branch predictor is wrong? Well, the CPU went ahead and began executing code that turned out to be wrong, which means the pipeline has to be rolled back and the CPU has to start over. This is expensive, because in the worst case you just emptied out 13 stages of your pipeline and it will take another 14 clock cycles until the next instruction completes. It’s a huge penalty to pay, which is why it is always good to play nice with the CPU and help it out by creating predictable branches.

Now, what is a partially executed instruction anyway? It’s nothing; the CPU doesn’t have to perform any work to actually roll back the instructions themselves, because they are not committed yet. Sure, the computation was done for nothing and the pipeline is empty, but the instructions didn’t have any side effects yet. The last stage in the pipeline is the retirement station, where completed instructions are committed. Only there are their values actually written and their side effects, e.g. changes to the flags register, performed. This is important because otherwise the CPU would have to do a huge amount of work when rolling back the pipeline to reverse the work already done, which sometimes isn’t even possible! Zeroing out a register, for example, is not reversible unless you keep the previous value somewhere just in case.

Another thing the retirement station does is retire instructions in order. Imagine the following: you have an instruction that performs some calculation with the EAX and EDI registers, followed by one that simply zeroes out the ECX register. Let’s say the first one does a multiplication, which will actually take 4 clock cycles to complete in the ALU (arithmetic logic unit, the part of the chip that actually does the calculations). Great, now what? You just started one instruction, do you wait for it to complete before you can start the next one? 4 clock cycles is a long time to wait, especially because the CPU could technically zero out the register already, since it has a different execution unit available for that. So that’s what the CPU does: it already starts the next instruction that zeroes out the register, which might complete in a single clock cycle. So now the second instruction makes it to the retirement station before the first one, because the CPU started to perform work out of order. The retirement station will put the second instruction on the bench until it’s ready to retire it, which is when all previous instructions have retired. And because the engineers were smart, they made the retirement station able to retire multiple instructions at once, so once the time has come, you will end up with both instructions retiring at the same time.

Obviously that was a rather simple example. In reality you have a lot of different instructions taking different amounts of time to actually execute, and the CPU will start to perform those instructions out of order to keep the throughput high. As long as instructions don’t depend on each other, the CPU is able to perform them out of order. Of course, if you end up with a long chain of instructions that all depend on the previous one, the CPU won’t be able to pull this magic off.

Also, the CPU has multiple execution ports! The execution ports are where the instructions are actually executed, and the CPU has ports for working with memory, integer arithmetic, floating point arithmetic, SIMD instructions etc. Some of them are even duplicated, or at least partially duplicated. You will usually find multiple ALUs, although not all of them are able to perform all instructions. The CPU will try to feed as many instructions to as many ports as it can, so it can do as much work as possible at once. This also means that the previous pipeline stages that prepare the instructions for execution work with multiple instructions at once. The CPU always fetches and decodes multiple instructions at once, and if it can, it will also feed multiple instructions at once to as many execution ports as possible. And afterwards, it will retire as many instructions as it can, leading to well-optimised code being able to achieve a throughput of up to 4 instructions per clock cycle on modern Intel CPUs.


Shitlord by trade and passion. Graphics programmer at Laminar Research.
I write blog posts at feresignum.com