Weak Ordering is a bad idea

03/07/20 22:31

Sorry for the 6 year hiatus. Oh boy.

I'm going to jump to the conclusion I wanted to give 8 years ago: Weakly ordered systems are bad. They make writing high-performance multithreaded software much more difficult. And hardware doesn't gain much performance (or simplification) by being weakly ordered.

The first problem with multithreaded code is it's hard to write. C is notoriously difficult to get the semantics right, but most languages have more problems than they would like to admit.

Even if the user writes up an algorithm which is correct, and codes it in C correctly, then they have one more problem to solve: handling weak ordering. This can actually be very difficult to handle with "lockless" algorithms--the barriers have to be placed in many places, and it's hard to get this right.

Most systems fit into one of three categories: Strongly Ordered; Total-Store-Ordered (TSO, a name Sun gave to their ordering, and it's a good name, and applies to many more systems); and Weakly Ordered.

Basically, Strongly Ordered and Total-Store-Ordered mean (almost) all algorithms, including lockless ones, just work without needing any barrier instructions. But Weakly Ordered needs barriers.

To define the ordering briefly: Strongly Ordered CPU cores hold each load and store which misses the cache until it receives ownership for that line, then the load or store completes. Later instructions do not appear to have completed. So a Store to A which misses the cache, followed by a Store to B which hits, stalls the store to B until the A cacheline returns (or is otherwise completed first). All other CPU agents see the stores done by each CPU in the order they are done. A load to A which misses the cache, followed by a load to B which hits, stalls the load to B until the load to A gets its correct data. If B gets snooped out in the meantime, then the CPU will then refetch the B data to get the correct new version.

Total-Store-Ordered means the loads still stall as I indicated above for Strongly Ordered, but stores are handled slightly differently. All stores get placed into a FIFO, and drain to the cache in the order they were placed in the FIFO. So a Store to A doesn't even have to check the cache--it just goes right to the FIFO and the instruction retires. A following store to B goes right to the FIFO, and that instruction retires. Later loads from this CPU check the entire store buffer, and can hit any older store (even if the cacheline is not present). This maintains the illusion that this one thread of instructions executed in order. But as long as the stores drain from the store buffer in FIFO order, this maintains almost all of the properties of Strongly Ordered. Basic Producer-Consumer coding models just work with no barriers, and almost any locking algorithm works with no barriers (although some will not). The only issues arise due to the fact that stores which are not yet visible to other CPUs are visible to later loads on this same CPU can cause some really tricky algorithms to break. However, these cases are generally easy to avoid.

x86 is TSO (they don't call it that, however). Any architecture which has weaker semantics than x86 has a software problem--x86 algorithms don't just translate over, barriers have to be inserted. This is hard work. This is also very difficult work.

Weakly ordered means loads and stores can execute and complete out of order in appearance to other CPUs, unless barriers are used.

I look at it this way: the hardware, goes to great lengths to make it appear that single-process code runs in linear order, with lots of hardware applied to the problem of making it go fast but still appear to be done in instruction order. The fact that this isn't done for multithreaded code shows a lack of concern for that type of code.

But the real problem is Weakly Ordered systems don't gain much. As I was trying to get at with my low-level pipeline discussions, a simple model can divide CPU complexity into 3 tiers: Basic in-order pipeline; Complex multi-issue sort-of-inorder pipeline; Great-Big-Out-of-Order (GBOoO) pipelines. All "high-end" machines (including cell phones) are GBOoO.

For Basic In-Order, the pipeline is so simple it's just Strongly Ordered. No barriers needed.

For GBOoO, there's so much hardware working so hard to make the single thread code appear to be done in order that implementing full Strongly Ordered or TSO costs basically nothing, either in performance or in area/power. It takes some additional care to get TSO/Strongly Ordered right, but the hardware resources are already there for other reasons,

The problem is the mid-tier Complex multi-issue cores. As I showed with Hit-Under-Miss, this can give a small performance boost (2-5%) at the cost of breaking Strongly Ordered or TSO ordering rules. And this is why Weakly Ordered architectures persist.

TSO/Strongly Ordered CPUs have hardware to enforce ordering. If there are no snoops breaking up the code sequence, then the ordering generally has no performance penalty at all. If there are snoops which need special handling, then hardware generally resolves ordering about as fast as possible, and then goes back to full speed.

But, here's the kicker--it's almost certain that code with explicit software barriers runs slower than code without the barrier, even if there are no ordering issues to resolve. The cores which need the barriers lack the hardware to properly order instructions (otherwise, it would just be TSO/Strongly Ordered). So the barrier has to do something stronger, like stall the pipeline until earlier operations complete. And it has to do this even when there is no ordering issue to resolve, because it doesn't have the hardware to "see" the ordering issue. So the barriers actually DO something, and so cost performance. If they cost 2%, then Weakly Ordering doesn't even pay off on the mid-Tier CPU design which wanted it in the first place!

It gets worse. Unless the architects of the GBOoO decide to enforce full Strongly Ordered or TSO semantics and treat barriers as No-Ops, then even though they could see ordering issues and resolve them quickly in hardware, chances are the barriers do some types of stalls even on the GBOoO cores. Now this is bad--any barriers which slow down GBOoO costs high-end performance. This is a big mistake.

So, the best answer is TSO (or even a little stronger), and just eat the slight performance loss on the mid-tier range. It will be made up for by easier software porting, and better high-end performance.

And writing software for weakly ordered systems is extremely difficult. I suspect only a very small minority of programmers even fully comprehend the problem, let alone are able to code it correctly.