Kent's Blog

Weak Ordering

Sorry for the 2.5-year hiatus on this blog.

I was working up to Weakly Ordered with Hit-Under-Miss. By letting a CPU complete instructions while a load miss is pending, we’re now weakly ordered.

Here’s why: Assume CPU #0 does a Load to address A, which misses the cache. CPU #0 continues executing, and does a Load to address B, which hits the cache, and so that instruction retires. Then let’s assume there’s another miss, and now CPU #0 is stalled. Meanwhile, CPU #1 does a Store to address B, which misses in its cache, then a Store to address A which is a hit. We’ll assume A starts in CPU #1’s cache, and B starts in CPU #0’s cache.

Effectively, the above all happens in parallel at once. CPUs are allowed to operate mostly independently, with delayed resolution of coherency. A simple way to view coherency is that it’s resolved at the DDR memory controller in an arbitrary order (generally, coherency is managed elsewhere, but there’s always an ordering point somewhere, so let’s just imagine it’s at the DDR controller). So at our DDR memory controller, we’re going to get a Load to A and a Store to B. Let’s just do them in that order. To give CPU #0 the A data, DDR needs to “snoop” CPU #1, and make it give up that cacheline first. So DDR asks CPU #1 for the line, and CPU #1 gives DDR back the new modified data (its Store to A has committed). DDR then gives this data to CPU #0, which will then allow CPU #0 to complete its Load to A.

DDR then processes the Store to B miss from CPU #1. DDR needs to snoop CPU #0, and CPU #0 just gives up the line (or provides modified data if it was previously modified). DDR can then return B to CPU #1.

(How does DDR know where the valid data is? It either keeps track of it in some sort of structure; or it just asks all CPUs all the time; or some combination of the two. This is an interesting problem).

Coherency is maintained, our instructions are complete.

In a Strongly-ordered system, all instructions across all CPUs would have a logical ordering. There are only so many ways (six, in fact) that CPU #0’s Load A; Load B can be ordered with respect to CPU #1’s Store B; Store A. It’s tedious to list them all, but here are some examples: 0: Load A; 0: Load B; 1: Store B; 1: Store A (just assume CPU #0’s instructions all happened first); or: 1: Store B; 1: Store A; 0: Load A; 0: Load B (just assume CPU #1’s instructions all happened first); or perfectly interleaved: 0: Load A; 1: Store B; 0: Load B; 1: Store A.

What’s interesting is to describe an impossible Strongly-Ordered case: 0: Load B; 1: Store A; 0: Load A; 1: Store B. CPU #1’s instructions must be ordered Store B then Store A, so a global order of CPU #1’s stores cannot be Store A then Store B.
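To make the "only so many combinations" claim concrete, here is a small Python sketch that enumerates every global order preserving each CPU's program order and records what CPU #0 would load. The names (interleavings, run, strongly_ordered_outcomes) and the "old"/"new" values are my own illustration, and it assumes a single shared memory with no caches at all, which is what strong ordering looks like from software.

```python
from itertools import combinations

OLD, NEW = "old", "new"

# Program order for each CPU, as in the example above.
CPU0 = [("load", "A"), ("load", "B")]
CPU1 = [("store", "B"), ("store", "A")]


def interleavings(p0, p1):
    """Yield every merge of p0 and p1 that keeps each CPU's program order."""
    total = len(p0) + len(p1)
    for slots in combinations(range(total), len(p0)):
        it0, it1 = iter(p0), iter(p1)
        # Positions in `slots` take CPU #0's next instruction, the rest CPU #1's.
        yield [(0, next(it0)) if i in slots else (1, next(it1))
               for i in range(total)]


def run(order):
    """Execute one global order on a single shared memory.

    Returns the (A, B) values seen by CPU #0's two loads.
    """
    memory = {"A": OLD, "B": OLD}
    seen = {}
    for _cpu, (op, addr) in order:
        if op == "store":
            memory[addr] = NEW
        else:
            seen[addr] = memory[addr]
    return seen["A"], seen["B"]


def strongly_ordered_outcomes():
    """Collect the distinct (A, B) results over all legal global orders."""
    return {run(order) for order in interleavings(CPU0, CPU1)}
```

Calling strongly_ordered_outcomes() gives three distinct results across the six orders: ("old", "old"), ("old", "new") and ("new", "new"). The pair ("new", "old"), where CPU #0 sees the new A but the old B, never shows up.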

But I’ve just given the results our DDR controller and current pipeline would give. CPU #0 sees an old value for its Load B since it hits in the cache. But CPU #0’s Load A does get the new value of A from CPU #1, since the Store A executed first on CPU #1 (it hit in the cache). If CPU #1 was writing some info to B, then writing to A to say, “Info is valid”, and if CPU #0 was reading A to see if “Info is valid”, then reading B to get the info, then we’ve just seriously broken the software ordering: CPU #0 thinks “Info is valid”, but it read an old stale value of the info.
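Here is a toy Python replay of that sequence, as a sketch only. It assumes two phases (each CPU runs ahead, completing hits and queuing misses, then the "DDR controller" resolves the misses in arrival order), and it assumes every line lives in exactly one cache at a time. The function name hit_under_miss_run and the "old"/"new" values are made up for illustration; real hardware is far messier than this.

```python
OLD, NEW = "old", "new"


def hit_under_miss_run():
    """Toy replay of the scenario: hits complete at once, misses wait for DDR."""
    # Initial line ownership, as assumed in the text:
    # B starts in CPU #0's cache, A starts in CPU #1's cache.
    caches = {0: {"B": OLD}, 1: {"A": OLD}}
    programs = {
        0: [("load", "A"), ("load", "B")],
        1: [("store", "B"), ("store", "A")],
    }
    loaded = {}    # what CPU #0's loads returned
    pending = []   # misses queued at the ordering point (the "DDR controller")

    # Phase 1: each CPU runs ahead. Hits complete immediately, misses are queued.
    for cpu, program in programs.items():
        for op, addr in program:
            if addr not in caches[cpu]:
                pending.append((cpu, op, addr))
            elif op == "store":
                caches[cpu][addr] = NEW
            else:
                loaded[addr] = caches[cpu][addr]

    # Phase 2: DDR resolves the misses in arrival order, snooping the other CPU.
    for cpu, op, addr in pending:
        line = caches[1 - cpu].pop(addr)   # the owner gives up the line
        caches[cpu][addr] = line
        if op == "store":
            caches[cpu][addr] = NEW
        else:
            loaded[addr] = line

    return loaded
```

hit_under_miss_run() returns A as "new" and B as "old": the flag looks set, but the info is stale.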

This is exactly what Weakly Ordered systems do. They do not give a guarantee of global ordering semantics UNLESS you put in a barrier instruction to indicate that you care about ordering of instructions. Let’s call this MBAR, for Memory BARrier.

In this case, both CPU #0 and CPU #1 need MBAR. CPU #0’s sequence would be Load A; MBAR; Load B, and CPU #1’s sequence would be Store B; MBAR; Store A. What MBAR does is stop the execution of later instructions until all previous instructions are committed, that is, guaranteed to be ordered as if they had completed. This fixes this case.

But what does MBAR do? It basically stops hit-under-miss. So, if you care about ordering, you have to sprinkle in MBARs to stop hit-under-miss. But if you don’t care about ordering, you don’t need MBAR and you can take advantage of hit-under-miss.
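As a rough illustration of "MBAR stops hit-under-miss", here is a toy Python model in which a CPU runs ahead past its misses but refuses to pass an MBAR while any of its own misses is still outstanding. Everything here is an assumption for the sketch: the simulate function, the "mbar" opcode, the "old"/"new" values, the one-miss-per-round DDR, and the rule that each line lives in exactly one cache (so a snoop always finds it in the other CPU).

```python
OLD, NEW = "old", "new"


def simulate(programs):
    """Run two toy CPUs with hit-under-miss; return what CPU #0 loaded."""
    # B starts in CPU #0's cache, A starts in CPU #1's cache.
    caches = {0: {"B": OLD}, 1: {"A": OLD}}
    pc = {0: 0, 1: 0}            # next instruction index per CPU
    outstanding = {0: 0, 1: 0}   # unresolved misses per CPU
    pending = []                 # misses queued at the ordering point ("DDR")
    loaded = {}

    def run_ahead(cpu):
        program = programs[cpu]
        while pc[cpu] < len(program):
            op, addr = program[pc[cpu]]
            if op == "mbar":
                if outstanding[cpu]:
                    return       # barrier: wait until earlier misses resolve
            elif addr not in caches[cpu]:
                pending.append((cpu, op, addr))   # miss: queue it, keep going
                outstanding[cpu] += 1
            elif op == "store":
                caches[cpu][addr] = NEW           # store hit
            else:
                loaded[addr] = caches[cpu][addr]  # load hit
            pc[cpu] += 1

    while True:
        for cpu in (0, 1):
            run_ahead(cpu)
        if not pending:
            return loaded
        # DDR resolves one miss per round, snooping the line out of the other CPU.
        cpu, op, addr = pending.pop(0)
        line = caches[1 - cpu].pop(addr)
        caches[cpu][addr] = NEW if op == "store" else line
        if op == "load":
            loaded[addr] = line
        outstanding[cpu] -= 1


WITHOUT_MBAR = {
    0: [("load", "A"), ("load", "B")],
    1: [("store", "B"), ("store", "A")],
}
WITH_MBAR = {
    0: [("load", "A"), ("mbar", None), ("load", "B")],
    1: [("store", "B"), ("mbar", None), ("store", "A")],
}
```

In this model, simulate(WITHOUT_MBAR) reproduces the broken result (A is "new", B is "old"), while simulate(WITH_MBAR) returns "old" for both, which is one of the results a strongly-ordered system could legally produce. The price is visible too: with the barriers in place, neither CPU gets to complete its cache hit while its miss is pending.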

You’re probably thinking the compiler inserts these MBARs when necessary and programmers can just ignore this stuff. Unfortunately, that’s generally not true.