Kent's Blog

Weak Ordering

Sorry for the 2.5 year hiatus on this blog.

I was working up to Weakly Ordered with Hit-Under-Miss. By letting a CPU complete instructions while a load miss is pending, we’re now weakly ordered.

Here’s why: Assume CPU #0 does a Load to address A, which misses the cache. CPU #0 continues executing, and does a Load to address B, which hits the cache, and so that instruction retires. Then let’s assume there’s another miss, and now CPU #0 is stalled. Meanwhile, CPU #1 does a Store to address B, which misses in its cache, then a Store to address A, which is a hit. We’ll assume A starts in CPU #1’s cache, and B starts in CPU #0’s cache.

Effectively, the above all happens at once, in parallel. CPUs are allowed to operate mostly independently, with delayed resolution of coherency. A simple way to view coherency is that it’s resolved at the DDR memory controller in an arbitrary order (generally, coherency is managed elsewhere, but there’s always an ordering point somewhere, so let’s just imagine it’s at the DDR controller). So at our DDR memory controller, we’re going to get a Load to A and a Store to B. Let’s just do them in that order. To give CPU #0 the A data, DDR needs to “snoop” CPU #1, and make it give up that cacheline first. So DDR asks CPU #1 for the line, and CPU #1 gives DDR back the new modified data (its Store to A has committed). DDR then gives this data to CPU #0, which then allows CPU #0 to complete its Load to A.

DDR then processes the Store to B miss from CPU #1. DDR needs to snoop CPU #0, and CPU #0 just gives up the line (or provides modified data if it was previously modified). DDR can then return B to CPU #1.
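Here’s a toy C sketch of that ordering point. Everything in it is invented for illustration (a per-line owner table is just one of the tracking options mentioned in the parenthetical below), and it serves both load and store misses since this toy hands out exclusive ownership either way:

    #include <stdint.h>

    #define NUM_LINES 1024
    #define NUM_CPUS  2

    /* owner[line]: -1 means DDR holds the valid copy, else a CPU number. */
    static int      owner[NUM_LINES];
    static uint32_t ddr[NUM_LINES];
    static uint32_t cpu_line[NUM_CPUS][NUM_LINES];  /* each CPU's cached copy */

    static void init_ownership(void) {
        for (int i = 0; i < NUM_LINES; i++)
            owner[i] = -1;          /* everything starts in DDR          */
    }

    /* "Snoop": make a CPU give up a line, returning its possibly
     * modified data.                                                    */
    static uint32_t snoop_and_invalidate(int cpu, int line) {
        return cpu_line[cpu][line];
    }

    /* The ordering point: resolve one miss at a time, in arrival order. */
    static uint32_t resolve_miss(int requester, int line) {
        if (owner[line] >= 0 && owner[line] != requester)
            ddr[line] = snoop_and_invalidate(owner[line], line); /* pull data back */
        owner[line] = requester;    /* requester now holds the line      */
        return ddr[line];           /* hand the data to the requester    */
    }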

(How does DDR know where the valid data is? It either keeps track of it in some sort of structure; or it just asks all CPUs all the time; or some combination of the two. This is an interesting problem.)

Coherency is maintained, and our instructions are complete.

In a Strongly-ordered system, all instructions across all CPUs would have a logical ordering. There are only so many ways CPU #0’s Load A; Load B can be ordered with respect to CPU #1’s Store B; Store A. It’s tedious to list them all, but here are some examples: 0: Load A; 0: Load B; 1: Store B; 1: Store A (just assume CPU #0’s instructions all happened first); or: 1: Store B; 1: Store A; 0: Load A; 0: Load B (just assume CPU #1’s instructions all happened first); or perfectly interleaved: 0: Load A; 1: Store B; 0: Load B; 1: Store A.

What’s interesting is to describe an impossible Strongly-Ordered case: 0: Load B; 1: Store A; 0: Load A; 1: Store B. CPU #1’s instructions must be ordered Store B then Store A, so a global order can never have Store A before Store B (and CPU #0’s loads are reversed in that order, too).
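To make the impossibility concrete, here’s a brute-force sketch in C (all of it invented for illustration): it enumerates the six interleavings that respect each CPU’s program order, and confirms that none of them lets CPU #0 see the new value of A while still seeing the old value of B, which, as we’ll see next, is exactly what our hardware does.

    #include <stdio.h>

    int main(void) {
        /* The 6 interleavings of 2+2 instructions that preserve each
         * CPU's program order; 0 = CPU #0's next instr, 1 = CPU #1's. */
        static const int schedules[6][4] = {
            {0,0,1,1}, {0,1,0,1}, {0,1,1,0},
            {1,0,0,1}, {1,0,1,0}, {1,1,0,0},
        };
        int found = 0;
        for (int s = 0; s < 6; s++) {
            int A = 0, B = 0, r1 = 0, r2 = 0, pc0 = 0, pc1 = 0;
            for (int step = 0; step < 4; step++) {
                if (schedules[s][step] == 0) {
                    if (pc0++ == 0) r1 = A;   /* CPU #0: Load A  */
                    else            r2 = B;   /* CPU #0: Load B  */
                } else {
                    if (pc1++ == 0) B = 1;    /* CPU #1: Store B */
                    else            A = 1;    /* CPU #1: Store A */
                }
            }
            if (r1 == 1 && r2 == 0)           /* new A, old B?   */
                found = 1;
        }
        puts(found ? "reachable" : "new A with old B: unreachable");
        return 0;
    }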

But I’ve just given the results our DDR controller and current pipeline would give. CPU #0 sees an old value for its Load B since it hits in the cache. But CPU #0’s Load A does get the new value of A from CPU #1, since the Store A executed first on CPU #1 (it hit in the cache). If CPU #1 was writing to B to publish some info, and then wrote to A to say, “Info is valid”, and if CPU #0 was reading A to see if “Info is valid”, and reading B to get the info, then we’ve just seriously broken the software’s ordering assumptions: CPU #0 thinks “Info is valid”, but it read an old, stale value of the info.
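In code, the broken pattern looks something like this minimal C11 sketch. The variable names and values are made up, and relaxed atomics stand in for ordinary loads and stores on a weakly ordered machine:

    #include <stdatomic.h>

    atomic_int B = 0;   /* the info                  */
    atomic_int A = 0;   /* the "info is valid" flag  */

    void cpu1_producer(void) {
        atomic_store_explicit(&B, 42, memory_order_relaxed);  /* Store B: write the info */
        atomic_store_explicit(&A, 1,  memory_order_relaxed);  /* Store A: say it's valid */
    }

    int cpu0_consumer(void) {
        if (atomic_load_explicit(&A, memory_order_relaxed)) { /* Load A: is it valid?    */
            /* Load B: on a weakly ordered machine this can still be 0! */
            return atomic_load_explicit(&B, memory_order_relaxed);
        }
        return -1;  /* hypothetical "not ready yet" value */
    }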

This is exactly what Weakly Ordered systems do. They do not give a guarantee of global ordering semantics UNLESS you put in a barrier instruction to indicate that you care about ordering of instructions. Let’s call this MBAR, for Memory BARrier.

In this case, both CPU #0 and CPU #1 need MBAR. CPU #0’s sequence would be Load A; MBAR; Load B, and CPU #1’s sequence would be Store B; MBAR; Store A. What MBAR does is stop the execution of later instructions until all previous instructions have committed and their memory effects are visible in the global order. This fixes this case.
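Here’s the same sketch with the barriers in place, using C11’s atomic_thread_fence as a stand-in for MBAR (the names are still invented for illustration):

    #include <stdatomic.h>

    atomic_int B = 0;   /* the info                  */
    atomic_int A = 0;   /* the "info is valid" flag  */

    void cpu1_producer(void) {
        atomic_store_explicit(&B, 42, memory_order_relaxed);  /* Store B */
        atomic_thread_fence(memory_order_seq_cst);            /* MBAR    */
        atomic_store_explicit(&A, 1,  memory_order_relaxed);  /* Store A */
    }

    int cpu0_consumer(void) {
        if (atomic_load_explicit(&A, memory_order_relaxed)) { /* Load A  */
            atomic_thread_fence(memory_order_seq_cst);        /* MBAR    */
            /* Load B: now guaranteed to see the info written before Store A */
            return atomic_load_explicit(&B, memory_order_relaxed);
        }
        return -1;
    }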

But how does MBAR do that? It basically stops hit-under-miss. So, if you care about ordering, you have to sprinkle in MBARs to stop hit-under-miss. But if you don’t care about ordering, you don’t need MBAR and you can take advantage of hit-under-miss.

You’re probably thinking the compiler inserts these MBARs when necessary and programmers can just ignore this stuff. Unfortunately, that’s generally not true.


The CPU pipeline design I just laid out in the previous post was the pinnacle of 1989 RISC design. Mostly.

The simple pipeline has a very nice feature: each instruction appears to execute and complete in the order it passes through the EX stage. Instructions before the current one are complete, and instructions not yet in EX have not done any work yet. The CPU is pipelined, so many instructions are in various stages of execution. But to software, the single EX pipeline stage makes it appear as if the instructions are executing in order. And a much stronger order than that: data accesses are strongly ordered.

Most programmers have a simplistic model for a CPU (if they have a model at all), and most expect it to execute instructions one at a time in the order of the assembly language instructions. This is a pretty reasonable mental model, and most CPU architectures go to great lengths to try to achieve it.

Every architecture (excluding Alpha, which fortunately is dead now) makes it appear that the instructions executed on a single CPU core fully execute in order. Storing to address A and immediately loading from address A in the next instruction always gets the right data. Loading from address B and then storing different data to address B never causes the earlier load to see the later store data. With no funny synchronization instructions needed.

So basically, all architectures have some EX pipeline stage which they order all instructions through, from a single core’s perspective. So why aren’t all CPU architectures strongly-ordered, and what are the other ordering models?

Unfortunately, there’s terrible nomenclature around CPU ordering, and even worse, technical descriptions tend to get long and very abstract. To put it simply, there are roughly 3 levels of ordering: Strongly Ordered, Store-Ordered (SPARC called this Total-Store-Ordering, and I like the acronym TSO), and Weakly-Ordered. What’s happening is architectures are breaking certain rules we just laid down for the EX stage to try to get higher performance. So I think it’s best to think about what rule is being broken to understand what’s going on.

Let’s look at a way to move our simple CPU into the 1990’s. One step was “dual-issue”, where multiple instructions could execute at once. This generally doesn’t affect ordering, so I’ll ignore it. Another step, which is an ordering issue, was called “hit-under-miss”.

Previously, if a Load instruction was a cache miss, we’d stall the pipeline (in the MEM or EX stage) and wait for data to return. Once data returned, the pipeline would restart, and subsequent instructions would then make it to the EX stage.

A very very good way to think about a CPU’s performance is to look at stalls. Stalls are any pipeline stalls where no useful work is done. With this CPU design, each load and store miss stalls the pipeline for the latency to main memory, which can be a fairly long time. The idea of Hit-Under-Miss is to allow one miss to be pending without stalling the pipeline. So, if there are no misses currently pending, when a Load instruction misses the cache, let it go through EX and MEM without a pipeline stall. Instead, mark its target register as “busy”, and stall any instruction in EX if it tries to read the busy register. Hardware on the side (not in the main pipeline) waits for the data from main memory to return. But instructions after the Load can execute and complete, as long as they don’t touch the “busy” register. For simplicity, if another Load or Store misses the cache, we’ll then stall at EX/MEM.
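Here’s a rough C sketch of that busy-bit bookkeeping, assuming a single outstanding miss. All the names and structure are invented for illustration; real hardware would be comparators and flops, not function calls:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_REGS 32

    static uint32_t regfile[NUM_REGS];
    static bool     reg_busy[NUM_REGS];  /* set while a load miss is pending  */
    static bool     miss_pending;        /* only one outstanding miss allowed */

    /* A load reaches EX: returns true if the pipeline must stall at EX/MEM. */
    bool load_in_ex(int dest_reg, bool cache_hit) {
        if (cache_hit)
            return false;                /* hit: proceed normally             */
        if (miss_pending)
            return true;                 /* a second miss: stall the pipeline */
        miss_pending = true;             /* first miss: let it pass through,  */
        reg_busy[dest_reg] = true;       /* but mark its target register busy */
        return false;
    }

    /* Any instruction in EX stalls if it reads a busy register. */
    bool must_stall_on_operands(int src1, int src2) {
        return reg_busy[src1] || reg_busy[src2];
    }

    /* The hardware on the side: data came back from main memory. */
    void miss_data_returned(int dest_reg, uint32_t data) {
        regfile[dest_reg] = data;        /* write the register file           */
        reg_busy[dest_reg] = false;      /* clear busy: dependents may go     */
        miss_pending = false;
    }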

This is a nice speed boost. Let’s assume a code sequence which causes a Cache Miss every 6 cycles (which we’ll assume is 6 instructions), and main memory latency is 20 cycles. And let’s assume we can, on average, execute 3 instructions (3 cycles) after a Cache Miss before causing a stall (either because of using the Loaded value, or causing another miss).

Without Hit-Under-Miss, executing 6 instructions will take 6 cycles plus 1 miss of 20 clocks, for a total of 26 cycles. With Hit-Under-Miss, after 6 instructions, there will be a miss. But we can execute 3 instructions in the shadow of the miss, then stall, waiting for the data to come back. Then, restart and execute 3 more instructions, then miss again. Then execute 3 more instructions during the memory access, then stall waiting for the memory to come back. Repeat this pattern, and you can see a miss is started every 23 cycles. Effectively, the 3 instructions done while waiting for main memory are “free”, so 6 instructions take just 23 cycles. Even with high miss traffic, and only able to execute a few instructions before stalling again, Hit-Under-Miss helps noticeably. (In CPU design, a 10% improvement is pretty big).
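Written out as a quick computation, with the same assumed numbers:

    #include <stdio.h>

    int main(void) {
        int insts_per_miss = 6;   /* one cache miss every 6 instructions (6 cycles) */
        int mem_latency    = 20;  /* cycles to main memory                          */
        int shadow_insts   = 3;   /* instructions executed in the miss shadow       */

        /* Without hit-under-miss: 6 cycles of work plus the full 20-cycle stall. */
        int without = insts_per_miss + mem_latency;

        /* With hit-under-miss: 3 of those cycles hide under the miss, so only
         * (20 - 3) stall cycles remain exposed.                                  */
        int with_hum = insts_per_miss + (mem_latency - shadow_insts);

        printf("without: %d cycles, with: %d cycles\n", without, with_hum); /* 26 vs 23 */
        return 0;
    }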

Hit-Under-Miss doesn’t affect a single-CPU core’s view of the ordering of its instructions, but it does change the multiprocessor view of main memory.

CPU pipelines in 4 easy stages

I’m not a CPU designer, but I know how CPUs work. And since I have opinions on everything, I have opinions on CPUs. I want to get to an opinion on system architecture involving CPUs, but before I can get to that, it’s going to take several posts of background first.

Imagine every possible CPU design. I’ll wait. Got it yet?

Forget that. I’m going to simplify every possible CPU design to a simple in-order RISC pipeline. I know, your favorite CPU is much more complex. But for the point I want to make, all CPUs effectively have an EX pipeline stage, where all the magic happens.

The classic RISC pipeline consists of the stages IF, ID, EX, MEM, WB. IF is Instruction Fetch, ID is Instruction Decode, EX is Execute, MEM is memory access, and WB is WriteBack. The way I like to view it is EX is the main stage, and the other stages are preparation or cleanup for EX. ID is the important preparation stage, getting the registers ready and handling bypass results from EX. And MEM and WB exist so EX can be chock full of work: writing results to the register file is pushed off so the clock frequency can be as high as possible.

I should make a drawing here, but that’s too much work for me. Since RISC CPU instructions generally have 2 input registers and one output register, picture ID ending with flip-flops holding the two input values. ID is the stage where the register file is read to get data ready. Then in EX, an ADD instruction (for example) will take the two input values, add them together, and flop the result at the end of EX. It also will feed that result back to ID’s flip-flop inputs (the bypass path), in case the very next instruction wants to use the result value.
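In lieu of a drawing, here’s a toy C sketch of the operand-read and bypass decision. Everything here is invented for illustration; a real design is flops and muxes, not function calls:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_REGS 32

    static uint32_t regfile[NUM_REGS];

    static uint32_t ex_result;  /* EX's output flop                   */
    static int      ex_dest;    /* which register EX is producing     */
    static bool     ex_valid;   /* is there a result worth bypassing? */

    /* ID stage: read the register file, but take the bypass path if the
     * instruction currently in EX is producing a register we need.      */
    static uint32_t read_operand(int reg) {
        if (ex_valid && reg == ex_dest)
            return ex_result;   /* bypass: use EX's result directly   */
        return regfile[reg];    /* normal register-file read          */
    }

    /* EX stage for an ADD: add the two operand values (really latched
     * into flops at the end of ID) and flop the result; WB writes it
     * into the register file later.                                     */
    static void ex_add(int src1, int src2, int dest) {
        uint32_t a = read_operand(src1);
        uint32_t b = read_operand(src2);
        ex_result = a + b;
        ex_dest   = dest;
        ex_valid  = true;
    }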

The thing to note is EX is where the instructions are effectively ordered. Instructions in IF or ID haven’t occurred yet, and if the instruction in EX needs to take a trap or exception, those instructions currently in IF or ID will be tossed. And the instruction in WB is committed--nothing the instruction in EX can do will make that instruction disappear. It’s already effectively occurred (even though it’s still technically in the pipeline). (Yes, I realize a more complex pipeline could have many more stages, chase down and kill instructions very late in the pipeline for various reasons, and execute instructions out-of-order...but that’s just complexity; all pipelines eventually have a commit point, so let’s just call it EX).

But ADD instructions are not that interesting: loads and stores are interesting. Let’s assume loads and stores are done in two parts across the EX and MEM stage. Following the rule that before an instruction can get to the WB stage, it has to be committed, we’ll force the rule that any possible exception must be handled in EX. So TLB misses and TLB protection issues must be resolved in the EX stage.

But how are cache misses handled?

Let’s look at how cache accesses work. A load or store has two separate yet closely related lookups to do: one is to access the tag array to see if the data is valid in the data cache; the second is to actually access that data. At the level-1 data cache, generally the load data access can begin without needing the tag lookup to complete (basically, address bits not in the tag are used to index into the data array). If the cache is associative, the tag results, which arrive around the same time as the data results, then select which of the associative data entries to use. So, do the TLB lookup in EX, start the tag read and the data array read in EX, but let them complete in MEM. For stores, let the data write occur in MEM, after the tag read results are known. So exceptions are handled in EX, but actually handling the returned data is done in EX and MEM. And this is why the original RISC architectures had a load-to-use delay of one clock cycle: a LOAD followed by an ADD using the loaded data would have a 1-cycle stall.
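Here’s a rough C sketch of that parallel tag/data lookup, for a made-up 2-way set-associative cache with 32-byte lines (all sizes and names are invented for illustration):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define WAYS       2
    #define SETS       256
    #define LINE_BYTES 32

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[LINE_BYTES];
    } cache_line;

    static cache_line cache[SETS][WAYS];

    /* The set index comes from address bits below the tag, so the data
     * array read can start before the tag compare finishes; the tag
     * match then just selects which way's data to use.                  */
    bool cache_load32(uint32_t addr, uint32_t *out) {
        uint32_t offset = addr % LINE_BYTES;
        uint32_t set    = (addr / LINE_BYTES) % SETS;
        uint32_t tag    = addr / (LINE_BYTES * SETS);

        for (int way = 0; way < WAYS; way++) {        /* tag compare     */
            cache_line *line = &cache[set][way];
            if (line->valid && line->tag == tag) {    /* hit: way select */
                memcpy(out, &line->data[offset & ~3u], sizeof *out);
                return true;
            }
        }
        return false;                                 /* miss            */
    }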

For a single-CPU simple design, cache misses would probably stall in MEM for loads and stores. If a load or store missed the cache (let’s assume write-back write-allocate caching only), the CPU would fetch the cacheline, then do the access again. Note that the stall is past the commit point--it will be done, it just has to wait on memory. This keeps the design simple, and achieves another important effect: the CPU appears to be strongly-ordered.

Spending $300 to save $10

I live in New England, where frugality is prized. Well, to some extent, at least. How about: I don’t like to waste money needlessly, although I do value my own time.

When my HP CP2025DN color Laserjet printer ran out of its initial black toner cartridge, I knew how to get more toner out of it. Lots of years with HP black&white laser printers taught me there’s a LOT of toner still in there when the printer thinks it’s “out”, and the failure mechanism of actually running out of toner is the text on the printouts just gets lighter and lighter. When that happens, just shake the toner cartridge, and get another hundred pages or so.

The printer has 4 cartridges, the usual black, magenta, yellow, and cyan. Each cost about $100 from HP, a little less if you want refilled no-name cartridges. I bought 4 new cartridges since I’ll need them all eventually, but didn’t put the black one in until I noticed the output fading. The color cartridges were more than 2/3rds full when the black toner initially ran “out”. So after a few hundred pages, I thought the printouts were looking a little less black, and put in a new black cartridge.

And...the printer complained it was still out of toner. What it had done was to drain the color cartridges, probably using the color toner to approximate black. And so all the color toner cartridges were empty.

So to use the last $10 or so of one black cartridge, I’d drained $300 of color cartridges.

But I still won’t replace the color cartridges--I’m using the new black cartridge, but not the new color cartridges. I still get color that looks fine to me even though the printer says all of the color cartridges are completely empty. And what do I have to lose now? I’m running a new experiment, to see how long the printer will print color even with “empty” color cartridges. And to see how HP will try to rip me off this time. I guess it could needlessly drain the black toner, but where would the black toner go?

Moral of the story: Don’t run with scissors? Never start a land war in Asia? Well, I’m not happy HP feels the need to do something underhanded (who would really want to use expensive color toner to approximate black?) to make me buy more toner. I didn’t mention the fact that the printer by default stops and refuses to print when it thinks it still has 6% toner left, although that can be overridden in a menu setting (Toner Low Override: On).

I had previously bought a Minolta MC2550 color laser printer, and it did colors great (better than HP), but its small black text was blurry, and since 90% of what I want is to print really small text (2-up duplex text), that was just unacceptable. Then the Minolta began jamming. I never did buy new toner cartridges for it; I just traded it in towards the HP when Staples ran a deal on trade-ins, after I’d run the toner down to nearly empty. Which was a bad deal for Staples, since I would have been happy to pay someone to take the Minolta away.

But now I no longer trust HP printers, which is probably going to cost HP more than $300 in the long run.