Kent's Blog

Ordering

The CPU pipeline design I just laid out in the previous post was the pinnacle of 1989 RISC design. Mostly.

The simple pipeline has a very nice feature: instructions appear to execute and complete in the order they pass through the EX stage. Instructions ahead of the one currently in EX are complete, and instructions that haven’t reached EX yet have done no visible work. The CPU is pipelined, so many instructions are in various stages of execution. But to software, the single EX pipeline stage makes it appear as if the instructions are executing in order. And a much stronger order than that: data accesses are strongly ordered.

Most programmers have a simplistic model for a CPU (if they have a model at all), and most expect it to execute instructions one at a time in the order of the assembly language instructions. This is a pretty reasonable mental model, and most CPU architectures go to great lengths to try to achieve it.

Every architecture (excluding Alpha, which fortunately is dead now) makes it appear that the instructions executed on a single CPU core fully execute in order. Storing to address A and immediately loading from address A in the next instruction always gets the right data. Loading from address B and then storing different data to address B never causes the earlier load to see the later store data. With no funny synchronization instructions needed.

So basically, all architectures have some EX pipeline stage which they order all instructions through, from a single core’s perspective. So why aren’t all CPU architectures strongly-ordered, and what are the other ordering models?

Unfortunately, there’s terrible nomenclature around CPU ordering, and even worse, technical descriptions tend to get long and very abstract. To put it simply, there are roughly 3 levels of ordering: Strongly-Ordered, Store-Ordered (SPARC called this Total-Store-Ordering, and I like the acronym TSO), and Weakly-Ordered. What’s happening is that architectures break some of the rules just laid down for the EX stage to try to get higher performance. So I think it’s best to think about which rule is being broken to understand what’s going on.

Let’s look at ways to move our simple CPU into the 1990s. One step was “dual-issue”, where multiple instructions could execute at once. This generally doesn’t affect ordering, so I’ll ignore it. Another step, which is an ordering issue, was called “hit-under-miss”.

Previously, if a Load instruction was a cache miss, we’d stall the pipeline (in the MEM or EX stage) and wait for data to return. Once data returned, the pipeline would restart, and subsequent instructions would then make it to the EX stage.

A very very good way to think about a CPU’s performance is to look at stalls. Stalls are any pipeline stalls where no useful work is done. With this CPU design, each load and store miss stalls the pipeline for the latency to main memory, which can be a fairly long time. The idea of Hit-Under-Miss is to allow one miss to be pending without stalling the pipeline. So, if there are no misses currently pending, when a Load instruction misses the cache, let it go through EX and MEM without a pipeline stall. Instead, mark its target register as “busy”, and stall any instruction in EX if it tries to read the busy register. Hardware on the side (not in the main pipeline) waits for the data from main memory to return. But instructions after the Load can execute and complete, as long as they don’t touch the “busy” register. For simplicity, if another Load or Store misses the cache, we’ll then stall at EX/MEM.
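To make that a little more concrete, here is a minimal C sketch of the stall decision just described. It is purely illustrative (the structure and field names are my own invention, not any real core’s logic): one load miss may be outstanding, its destination register is marked busy, and the instruction in EX stalls only if it reads that busy register or is itself a second miss.

#include <stdbool.h>

/* Hypothetical bookkeeping for one outstanding load miss. */
typedef struct {
    bool     miss_pending;   /* is a load miss already outstanding? */
    unsigned busy_reg;       /* destination register of that pending load */
} MissState;

/* Hypothetical view of the instruction currently in EX. */
typedef struct {
    bool     is_mem;         /* load or store? */
    bool     cache_hit;      /* result of the tag lookup */
    unsigned src1, src2;     /* source registers read in EX */
} Instr;

/* Returns true if the instruction in EX must stall this cycle. */
bool must_stall(const MissState *s, const Instr *in)
{
    if (!s->miss_pending)
        return false;        /* nothing outstanding: no hit-under-miss stall */
    if (in->src1 == s->busy_reg || in->src2 == s->busy_reg)
        return true;         /* reads the "busy" register: stall */
    if (in->is_mem && !in->cache_hit)
        return true;         /* second miss while one is pending: stall */
    return false;            /* otherwise execute under the miss */
}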

This is a nice speed boost. Let’s assume a code sequence which causes a Cache Miss every 6 cycles (which we’ll assume is 6 instructions), and main memory latency is 20 cycles. And let’s assume we can, on average, execute 3 instructions (3 cycles) after a Cache Miss before causing a stall (either because of using the Loaded value, or causing another miss).

Without Hit-Under-Miss, executing 6 instructions will take 6 cycles plus 1 miss of 20 clocks, for a total of 26 cycles. With Hit-Under-Miss, after 6 instructions, there will be a miss. But we can execute 3 instructions in the shadow of the miss, then stall, waiting for the data to come back. Then, restart and execute 3 more instructions, then miss again. Then execute 3 more instructions during the memory access, then stall waiting for the memory to come back. Repeat this pattern, and you can see a miss is started every 23 cycles. Effectively, the 3 instructions done while waiting for main memory are “free”, so 6 instructions take just 23 cycles. Even with high miss traffic, and only able to execute a few instructions before stalling again, Hit-Under-Miss helps noticeably. (In CPU design, a 10% improvement is pretty big).
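If you want to check my arithmetic, here is the same calculation written out as a tiny C program, using only the assumptions above (6 instructions per miss, 20-cycle memory latency, 3 instructions executed in the shadow of each miss):

#include <stdio.h>

int main(void)
{
    int instrs_per_miss = 6;    /* one cache miss every 6 instructions */
    int mem_latency     = 20;   /* cycles to main memory */
    int shadow_instrs   = 3;    /* instructions executed under the miss */

    /* No hit-under-miss: every miss stalls for the full memory latency. */
    int without = instrs_per_miss + mem_latency;

    /* Hit-under-miss: 3 instructions overlap the miss, then a stall for the
     * remaining latency, then the other 3 instructions before the next miss. */
    int with = shadow_instrs + (mem_latency - shadow_instrs)
             + (instrs_per_miss - shadow_instrs);

    printf("without hit-under-miss: %d cycles per %d instructions\n", without, instrs_per_miss);
    printf("with hit-under-miss:    %d cycles per %d instructions\n", with, instrs_per_miss);
    return 0;   /* prints 26 and 23 */
}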

Hit-Under-Miss doesn’t affect a single-CPU core’s view of the ordering of its instructions, but it does change the multiprocessor view of main memory.

CPU pipelines in 4 easy stages

I’m not a CPU designer, but I know how CPUs work. And since I have opinions on everything, I have opinions on CPUs. I want to get to an opinion on system architecture involving CPUs, but before I can get to that, it’s going to take several posts of background first.

Imagine every possible CPU design. I’ll wait. Got it yet?

Forget that. I’m going to simplify every possible CPU design to a simple in-order RISC pipeline. I know, your favorite CPU is much more complex. But for the point I want to make, all CPUs effectively have an EX pipeline stage, where all the magic happens.

The classic RISC pipeline consists of the stages IF, ID, EX, MEM, WB. IF is Instruction Fetch, ID is Instruction Decode, EX is Execute, MEM is Memory access, and WB is WriteBack. The way I like to view it is EX is the main stage, and the other stages are preparation or cleanup for EX. ID is the important preparation stage, getting the registers ready and handling bypass results from EX. And MEM and WB exist so EX can be chock full of work: writing results to the register file is pushed off to later stages so the clock frequency can be as high as possible.
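For reference, here is how a few consecutive instructions overlap in those five stages, with one new instruction entering the pipeline each cycle:

cycle:      1    2    3    4    5    6    7
instr 1:   IF   ID   EX   MEM  WB
instr 2:        IF   ID   EX   MEM  WB
instr 3:             IF   ID   EX   MEM  WB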

I should make a drawing here, but that’s too much work for me. Since RISC CPU instructions generally have 2 input registers and one output register, picture ID ending with flip-flops holding the two input values. ID is the stage where the register file is read to get the data ready. Then in EX, an ADD instruction (for example) will take the two input values, add them together, and flop the result at the end of EX. It also will feed that result back to ID’s flip-flop inputs (the bypass path), in case the very next instruction wants to use the result value.
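As a substitute for that drawing, here is a rough C model of the ID-to-EX handoff with the bypass path. The types and function names are mine, purely for illustration: each EX operand comes either from the register-file value read in ID, or from the result flopped at the end of the previous EX if that instruction wrote the register we need.

#include <stdint.h>
#include <stdbool.h>

/* Result flopped at the end of the previous EX (hypothetical naming). */
typedef struct {
    uint32_t value;      /* the computed result */
    unsigned dest_reg;   /* register it will eventually be written to */
    bool     valid;      /* did the previous instruction produce a result? */
} ExResult;

/* The bypass mux: a forwarded EX result wins over the (older) register-file read. */
static uint32_t pick_operand(unsigned src_reg, uint32_t regfile_value,
                             const ExResult *bypass)
{
    if (bypass->valid && bypass->dest_reg == src_reg)
        return bypass->value;   /* forwarded straight from the previous EX */
    return regfile_value;       /* value the register file provided in ID */
}

/* EX for an ADD: select both operands (possibly bypassed) and add them.
 * The returned value is what gets flopped at the end of EX. */
uint32_t ex_add(unsigned rs1, uint32_t rf_val1,
                unsigned rs2, uint32_t rf_val2,
                const ExResult *bypass)
{
    return pick_operand(rs1, rf_val1, bypass) + pick_operand(rs2, rf_val2, bypass);
}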

The thing to note is EX is where the instructions are effectively ordered. Instructions in IF or ID haven’t occurred yet, and if the instruction in EX needs to take a trap or exception, those instructions currently in IF or ID will be tossed. And the instruction in WB is committed--nothing the instruction in EX can do will make that instruction disappear. It’s already effectively occurred (even though it’s still technically in the pipeline). (Yes, I realize a more complex pipeline could have many more stages, might even chase down and kill instructions very late in the pipeline for certain reasons, and might execute instructions out-of-order...but that’s just complexity. All pipelines eventually have a commit point, so let’s just call it EX.)

But ADD instructions are not that interesting: loads and stores are interesting. Let’s assume loads and stores are done in two parts across the EX and MEM stages. Following the rule that before an instruction can get to the WB stage, it has to be committed, we’ll force the rule that any possible exception must be handled in EX. So TLB misses and TLB protection faults must be resolved in the EX stage.

But how are cache misses handled?

Let’s look at how cache accesses work. A load or store has two separate yet closely related lookups to do: one is to access the tag array to see if the data is valid in the data cache; the second is to actually access that data. At the level-1 data cache, the load data access can generally begin without waiting for the tag lookup to complete (basically, address bits not in the tag are used to index into the data array). If the cache is associative, the tag results, which arrive around the same time as the data results, then select which of the associative ways’ data to use. So, do the TLB lookup in EX, start the tag read and the data array read in EX, but let them complete in MEM. For stores, let the data write occur in MEM, after the tag read results are known. So exceptions are handled in EX, but actually handling the returned data is done in EX and MEM. And this is why the original RISC architectures had a load-to-use delay of one clock cycle: a LOAD followed by an ADD using the loaded data would have a 1-cycle stall.
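Here is a rough C sketch of that parallel tag/data lookup, for a made-up 2-way set-associative cache with 32-byte lines and 128 sets (all of the sizes and names are assumptions for illustration, not from any particular design). The set index comes from address bits below the tag, so in hardware the data read can start before the tag compare finishes; the tag compare then picks which way’s data to use.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define WAYS      2
#define SETS      128
#define LINE_SIZE 32

typedef struct {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
    uint8_t  data[WAYS][LINE_SIZE];
} CacheSet;

/* Look up a 4-byte load. Returns true on a hit and fills *out. */
bool cache_load(CacheSet sets[SETS], uint32_t paddr, uint32_t *out)
{
    uint32_t offset = paddr % LINE_SIZE;
    uint32_t index  = (paddr / LINE_SIZE) % SETS;      /* address bits below the tag */
    uint32_t tag    = paddr / (LINE_SIZE * SETS);
    CacheSet *set   = &sets[index];

    /* Hardware reads the tags and every way's data in parallel; in software we
     * just compare tags and let the match select which way's data to return. */
    for (int way = 0; way < WAYS; way++) {
        if (set->valid[way] && set->tag[way] == tag) {
            memcpy(out, &set->data[way][offset & ~3u], sizeof *out);
            return true;     /* hit: this way's data is selected */
        }
    }
    return false;            /* miss: the MEM-stage miss handling takes over */
}

That final way-select is why the loaded value isn’t available until the end of MEM, which is where the 1-cycle load-to-use delay comes from.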

For a single-CPU simple design, cache misses would probably stall in MEM for loads and stores. If a load or store missed the cache (let’s assume write-back write-allocate caching only), the CPU would fetch the cacheline, then do the access again. Note that the stall is past the commit point--it will be done, it just has to wait on memory. This keeps the design simple, and achieves another important effect: the CPU appears to be strongly-ordered.