Kent's Blog

Ordering

The CPU pipeline design I just laid out in the previous post was the pinnacle of 1989 RISC design. Mostly.

The simple pipeline has a very nice feature: instructions appear to execute and complete in the order in which they pass through the EX stage. Instructions before the current one are complete, and instructions that haven’t reached EX yet have done no work at all. The CPU is pipelined, so many instructions are in various stages of execution, but to software, the single EX pipeline stage makes it appear as if the instructions are executing in order. And a much stronger order than that: data accesses are strongly ordered.

Most programmers have a simplistic model for a CPU (if they have a model at all), and most expect it to execute instructions one at a time in the order of the assembly language instructions. This is a pretty reasonable mental model, and most CPU architectures go to great lengths to try to achieve it.

Every architecture (excluding Alpha, which fortunately is dead now) makes it appear that the instructions executed on a single CPU core fully execute in order. Storing to address A and immediately loading from address A in the next instruction always gets the right data. Loading from address B and then storing different data to address B never causes the earlier load to see the later store data. With no funny synchronization instructions needed.
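To make those two guarantees concrete, here they are as a minimal sketch in Python (the addresses and values are purely illustrative):

    mem = {"A": 0, "B": 0}

    # Store to A, then immediately load A: the load sees the stored data.
    mem["A"] = 42
    r1 = mem["A"]
    assert r1 == 42

    # Load B, then store different data to B: the earlier load never sees
    # the later store's data.
    r2 = mem["B"]
    mem["B"] = 99
    assert r2 == 0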

So basically, all architectures have some EX pipeline stage through which, from a single core’s perspective, all instructions appear to execute in order. So why aren’t all CPU architectures strongly ordered, and what are the other ordering models?

Unfortunately, there’s terrible nomenclature around CPU ordering, and even worse, technical descriptions tend to get long and very abstract. To put it simply, there are roughly three levels of ordering: Strongly-Ordered, Store-Ordered (SPARC calls this Total Store Order, and I like the acronym TSO), and Weakly-Ordered. What’s happening is that architectures break certain of the rules just laid down for the EX stage in order to get higher performance. So I think the best way to understand what’s going on is to look at which rule is being broken.

Let’s look at a way to move our simple CPU into the 1990s. One step was “dual-issue”, where multiple instructions could execute at once. This generally doesn’t affect ordering, so I’ll ignore it. Another step, which does raise an ordering issue, was called “hit-under-miss”.

Previously, if a Load instruction was a cache miss, we’d stall the pipeline (in the MEM or EX stage) and wait for data to return. Once data returned, the pipeline would restart, and subsequent instructions would then make it to the EX stage.

A very, very good way to think about a CPU’s performance is to look at stalls: cycles where the pipeline does no useful work. With this CPU design, each Load or Store miss stalls the pipeline for the full latency to main memory, which can be a fairly long time. The idea of Hit-Under-Miss is to allow one miss to be outstanding without stalling the pipeline. So, if there are no misses currently pending when a Load instruction misses the cache, let it go through EX and MEM without a pipeline stall. Instead, mark its target register as “busy”, and stall any instruction in EX that tries to read the busy register. Hardware on the side (not in the main pipeline) waits for the data to return from main memory, but instructions after the Load can execute and complete, as long as they don’t touch the “busy” register. For simplicity, if another Load or Store misses the cache, we’ll then stall at EX/MEM.

This is a nice speed boost. Let’s assume a code sequence that causes a Cache Miss every 6 instructions (at 1 instruction per cycle, that’s every 6 cycles), and a main memory latency of 20 cycles. And let’s assume we can, on average, execute 3 instructions (3 cycles) after a Cache Miss before causing a stall (either by using the Loaded value, or by causing another miss).

Without Hit-Under-Miss, executing 6 instructions takes 6 cycles plus one miss of 20 cycles, for a total of 26 cycles. With Hit-Under-Miss, after 6 instructions there will be a miss, but we can execute 3 instructions in the shadow of the miss, then stall for the remaining 17 cycles waiting for the data to come back. Then restart, execute 3 more instructions, and miss again; execute 3 more instructions during that memory access, then stall again. Repeat this pattern, and you can see a miss starts every 3 + 17 + 3 = 23 cycles. Effectively, the 3 instructions done while waiting for main memory are “free”, so 6 instructions take just 23 cycles. Even with high miss traffic, and only a few instructions executable before stalling again, Hit-Under-Miss helps noticeably. (In CPU design, a 10% improvement is pretty big.)
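As a sanity check, here is the same arithmetic in a few lines of Python (using the illustrative numbers from above):

    miss_every  = 6    # one cache miss per 6 instructions, at 1 instruction/cycle
    mem_latency = 20   # main-memory latency in cycles
    shadow      = 3    # instructions executed in the shadow of each miss

    blocking       = miss_every + mem_latency  # 6 + 20 = 26 cycles per 6 instructions
    hit_under_miss = blocking - shadow         # the 3 shadow instructions are free: 23
    print(blocking, hit_under_miss)            # 26 23
    print(f"{blocking / hit_under_miss - 1:.0%} faster")  # 13% faster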

Hit-Under-Miss doesn’t affect a single-CPU core’s view of the ordering of its instructions, but it does change the multiprocessor view of main memory.