How Is a Memory Barrier Used in the Linux Kernel


The key missing point is the mistaken assumption that for the sequence:

LOAD C (gets &B)
LOAD *C (reads B)

the first load has to precede the second load. A weakly ordered architecture can act "as if" the following happened:

LOAD B (speculatively reads B)
LOAD C (reads &B)
if (C != &B)
    LOAD *C
else
    congratulate self on having already loaded *C

The speculative "LOAD B" can happen, for example, because B was on the same cache line as some other variable of earlier interest or hardware prefetching grabbed it.
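
To make this concrete, here is the publish/subscribe pattern where this ordering matters, as a minimal sketch using the kernel's macros (the variables b, c, p and v are illustrative, not from the original question):

    /* Writer: initialize the data, then publish a pointer to it. */
    WRITE_ONCE(b, 42);
    smp_wmb();              /* order the store to b before the store to c */
    WRITE_ONCE(c, &b);

    /* Reader: the load of *p depends on the load of c (an address
     * dependency), so READ_ONCE() is enough to order the two loads;
     * on Alpha, the only ISA that reorders dependent loads, READ_ONCE()
     * includes the barrier this needs. */
    p = READ_ONCE(c);
    if (p)
        v = *p;             /* guaranteed to observe the 42 stored above */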

Why do we need both read and write barriers?

The real answer is: because x86's memory model is already strong enough that blocking compile-time reordering is sufficient for load or store ordering; runtime reordering is already blocked by hardware.

Those are just generic compile-time barriers, implemented with a piece of inline assembly that prevents GCC from reordering instructions across it. What can be achieved using this "trick" is usually also possible using the C volatile qualifier.

Note that the Linux kernel does not use those specific macros anywhere in its code; they are just two macros defined for the io_uring userspace test tools. It definitely uses asm volatile ("" ::: "memory") where needed, but under different names (e.g. smp_rmb(), smp_wmb()).
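
A minimal sketch of what such a macro looks like (simplified from the real headers; the x86 definitions shown are the effective result, not the literal kernel source):

    /* Compile-time-only barrier: the empty asm with a "memory" clobber
     * tells GCC that memory may have changed, so it can't keep values
     * cached in registers across this point or move loads/stores past
     * it.  No machine instruction is emitted. */
    #define barrier() asm volatile("" ::: "memory")

    /* On x86 the SMP read/write barriers can reduce to exactly this,
     * since the hardware never reorders loads with loads or stores
     * with stores: */
    #define smp_rmb() barrier()
    #define smp_wmb() barrier()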

x86's memory model makes sfence and lfence entirely useless for communication between CPUs; blocking compile-time reordering is sufficient: see Does the Intel Memory Model make SFENCE and LFENCE redundant?

smp_mb() is a full barrier and does need an actual asm instruction, as well as blocking compile-time reordering.
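
On x86-64 the kernel implements it roughly like this (a simplified sketch of the real definition):

    /* A dummy locked RMW on a stack location is a full barrier on x86
     * and is typically cheaper than mfence; the "memory" clobber also
     * blocks compile-time reordering. */
    #define smp_mb() \
        asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")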


x86 does have some memory barrier asm instructions for read-only and write-only "real" (runtime) memory barriers. Those are sfence (store fence), lfence (load fence) and mfence (memory fence = full barrier).

mfence serializes both reads and writes (full barrier) while the others only serialize one of the two (reads OR writes, a.k.a. loads OR stores). The Wikipedia page on memory ordering does a decent job of explaining the meaning of those. lfence actually blocks LoadStore reordering, not just LoadLoad, for weakly-ordered movntdqa loads from WC memory. Reordering of other kinds of loads from other memory types is already disallowed, so there's almost never any reason to actually use lfence for memory ordering, as opposed to its other effect of blocking out-of-order exec.

The kernel uses those actual asm instructions for memory barriers in I/O code, for example mb(), rmb() and wmb(), which expand to mfence, lfence, and sfence respectively.

sfence and lfence are probably overkill in most cases, for example around MMIO to strongly-ordered UC memory. Writing to WC memory could actually need an sfence. But they're not too slow compared to I/O, and there might be some cases that would be a problem otherwise, so Linux takes the safe approach.
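
Here is the one case where sfence genuinely matters even in user space: publishing data written with weakly-ordered non-temporal stores (a hedged sketch; the function and variable names are made up for illustration):

    #include <immintrin.h>
    #include <stdatomic.h>

    /* NT stores bypass the cache and are weakly ordered even on x86,
     * so a real sfence is required before telling another core the
     * data is ready.  dst must be 16-byte aligned for _mm_stream_ps. */
    void publish_nt(float *dst, __m128 v, _Atomic int *ready)
    {
        _mm_stream_ps(dst, v);  /* weakly-ordered non-temporal store */
        _mm_sfence();           /* order the NT store before the flag */
        atomic_store_explicit(ready, 1, memory_order_release);
    }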

In addition to this, x86 has different kinds of read/write barriers which may also be faster (such as the one I linked above). See the following answers for more about full barriers (what C11 calls sequential consistency) with either mfence or a dummy locked instruction:

  • Which is a better write barrier on x86: lock+addl or xchgl?
  • Does lock xchg have the same behavior as mfence?
  • What does the "lock" instruction mean in x86 assembly?
  • How does Linux kernel flush_write_buffers() work on x86?

Why do we not use barriers in user space?

Short answer

Memory barriers are used less frequently in user mode code than kernel mode code because user mode code tends to use higher level abstractions (for example pthread synchronization operations).
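
For example, a pthread mutex already contains whatever barriers the platform needs, so portable code rarely issues one by hand (a minimal sketch):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_data;

    void writer(void)
    {
        pthread_mutex_lock(&lock);    /* implies acquire ordering */
        shared_data = 42;
        pthread_mutex_unlock(&lock);  /* implies release ordering */
    }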

Additional details

There are two things to consider when analyzing the possible ordering of operations:

  1. What order the thread that is executing the code will see the operations in
  2. What order other threads will see the operations in

In your example the compiler cannot reorder b=20 to occur after c=add(a,b) because the c=add(a,b) operation uses the results of b=20. However, it may be possible for the compiler to reorder these operations so that other threads see the memory location associated with c change before the memory location associated with b changes.

Whether this would actually happen or not depends on the memory consistency model that is implemented by the hardware.

As for when the compiler might do reordering you could imagine adding another variable as follows:

b = 0;
main(){

a = 10;
b = 20;
d = 30;
c = add(a,b);

}

In this case the compiler would be free to move the d=30 assignment to occur after c=add(a,b).

However, this entire example is too simplistic. The program doesn't do anything with the results, so the compiler can eliminate all the operations and need not write anything to memory at all.

Addendum: Memory reordering example

In a multiprocessor environment multiple threads can see memory operations occur in different orders. The Intel Software Developer's Manual has some examples in Volume 3 section 8.2.3. I've copied a screenshot below that shows an example where loads and stores can be reordered.
There is also a good blog post that provides some more detail about this example.

[Figure: loads reordered with earlier stores to different locations (Intel SDM Vol. 3, §8.2.3)]
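
That example is the classic store-buffer litmus test. Expressed as a runnable sketch: both r1 and r2 may end up 0, because each thread's load can be reordered with its own earlier store. (A single run will rarely show it; you would loop many times with the threads synchronized to observe it.)

    #include <pthread.h>
    #include <stdio.h>

    /* volatile keeps the compiler honest in this sketch; real code
     * should use C11 atomics. */
    volatile int x = 0, y = 0;
    volatile int r1, r2;

    void *t0(void *arg) { x = 1; r1 = y; return NULL; }  /* store x, load y */
    void *t1(void *arg) { y = 1; r2 = x; return NULL; }  /* store y, load x */

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* r1 == 0 && r2 == 0 is permitted by the x86 memory model. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }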

What is a memory fence?

For performance gains, modern CPUs often execute instructions out of order to make maximum use of the available silicon (including memory reads/writes). Because the hardware enforces the appearance of in-order execution within a single thread, you never notice this in single-threaded code. However, for multiple threads or environments with volatile memory (memory-mapped I/O, for example) this can lead to unpredictable behavior.

A memory fence/barrier is a class of instructions that make memory reads/writes occur in the order you expect. For example, a 'full fence' means all reads/writes before the fence are committed before those after the fence.
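
In C11 terms, a full fence looks like this (a sketch; the variable names are illustrative):

    #include <stdatomic.h>

    int data;
    _Atomic int flag;

    void producer(void)
    {
        data = 42;
        /* Full fence: all reads/writes above are committed before any
         * below. */
        atomic_thread_fence(memory_order_seq_cst);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
    }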

Note that memory fences are a hardware concept. In higher-level languages we are used to dealing with mutexes and semaphores; these may well be implemented using memory fences at the low level, so explicit use of memory barriers is not necessary. Use of memory barriers requires a careful study of the hardware architecture and is more commonly found in device drivers than application code.

The CPU reordering is different from compiler optimisations - although the artefacts can be similar. You need to take separate measures to stop the compiler reordering your instructions if that may cause undesirable behaviour (e.g. use of the volatile keyword in C).

How are barriers/fences and acquire/release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later C++ How is release-and-acquire achieved on x86 only using MOV?), but I'll give a summary here. Still, good question, it's useful to collect this all in one place.


On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW speculatively loads earlier than allowed and then checks that speculation. (Potentially resulting in a memory-order mis-speculation pipeline nuke.) To track this, Intel calls the combination of load and store buffers the "Memory Order Buffer".

Weakly-ordered ISAs don't have to speculate, they can just load in any order.


x86 store ordering is maintained by only letting stores commit from the store buffer to L1d in program order.

On Intel CPUs at least, a store-buffer entry is allocated for a store when it issues (from the front-end into the ROB + RS). All uops need to have a ROB entry allocated for them, but some uops also need to have other resources allocated, like load or store buffer entries, RAT entries for registers they read/write, and so on.

So I think the store buffer itself is ordered. When a store-address or store-data uop executes, it merely writes an address or data into its already-allocated store-buffer entry. Since commit (freeing SB entries) and allocate are both in program order, I assume it's physically a circular buffer with a head and tail, like the ROB. (And unlike the RS).


Avoiding LoadStore reordering is basically free: a load can't retire until it's executed (taken data from the cache). A store can't commit until after it retires. In-order retirement automatically means that all previous loads are done before a store is "graduated" and ready for commit.

A weakly-ordered uarch that can in practice do load-store reordering might scoreboard loads as well as tracking them in the ROB: let them retire once they're known to be non-faulting, even if the data hasn't arrived.

This seems more likely on an in-order core, but IDK. So you could have a load that's retired but the register destination will still stall if anything tries to read it before the data actually arrives. We know that in-order cores do in practice work this way, not requiring loads to complete before later instructions can execute. (That's why software-pipelining using lots of registers is so valuable on such cores, e.g. to implement a memcpy. Reading a load result right away on an in-order core destroys memory parallelism.)

How is load->store reordering possible with in-order commit? goes into this more deeply, for in-order vs. out-of-order.



Barrier instructions

The only barrier instruction that does anything for regular stores is mfence which in practice stalls memory ops (or the whole pipeline) until the store buffer is drained. Are loads and stores the only instructions that gets reordered? covers the Skylake-with-updated-microcode behaviour of acting like lfence as well.

lfence mostly exists for the microarchitectural effect of blocking later instructions from even issuing until all previous instructions have left the out-of-order back-end (retired). The use-cases for lfence as a memory-ordering instruction are nearly non-existent.

Related:

  • C++ How is release-and-acquire achieved on x86 only using MOV?
  • How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?
  • How many memory barriers instructions does an x86 CPU have?
  • How can I experience "LFENCE or SFENCE can not pass earlier read/write"
  • Does lock xchg have the same behavior as mfence?
  • Does the Intel Memory Model make SFENCE and LFENCE redundant?
  • Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths goes into a lot of detail about how LFENCE stops execution of later instructions, and what that means for performance.
  • When should I use _mm_sfence, _mm_lfence and _mm_mfence — high-level languages have weaker memory models than x86, so you sometimes only need a barrier that compiles to no asm instructions. Using _mm_sfence() when you haven't used any NT stores just makes your code slower for no reason, compared to atomic_thread_fence(memory_order_release); see the sketch after this list.
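
As a concrete illustration of that last point (a sketch; on x86 the release fence costs no instructions at all, only a compile-time barrier):

    #include <stdatomic.h>

    int payload;
    _Atomic int ready;

    void publish(void)
    {
        payload = 123;
        /* Compiles to zero instructions on x86: plain stores already
         * have release ordering there, so only the compiler must be
         * restrained. */
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }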

Why is there barrier() in the KCOV code in the Linux kernel?

Without barrier(), the compiler would be free to access t->kcov_area before t->kcov_mode. It's unlikely to want to do that in practice, but that's not the point. Without some kind of barrier, C rules allow the compiler to create asm that doesn't do what we want. (The C11 memory model has no ordering guarantees beyond what you impose explicitly: in C11 via stdatomic, or in Linux / GNU C via barriers like barrier() or smp_rmb().)


As described in the comment, barrier() is creating an acquire-load wrt. code running on the same core, which is all you need for interrupts.

    mode = READ_ONCE(t->kcov_mode);
    if (mode == KCOV_MODE_TRACE) {
            ...
            barrier();
            area = t->kcov_area;
            ...
    }

I'm not familiar with kcov in general, but it looks like seeing a certain value in t->kcov_mode with an acquire load makes it safe to read t->kcov_area. (Because whatever code writes that object writes kcov_area first, then does a release-store to kcov_mode.)
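
The writer side would then look something like this (a guess at the shape, not the actual kcov code; smp_store_release() is the kernel's release-store helper):

    /* Hypothetical writer (sketch): publish the buffer first, then do
     * a release-store of the mode that readers test for. */
    t->kcov_area = area;
    /* ... other initialization ... */
    smp_store_release(&t->kcov_mode, KCOV_MODE_TRACE);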

https://preshing.com/20120913/acquire-and-release-semantics/ explains acq / rel synchronization in general.


Why isn't smp_rmb() required? (Even on weakly-ordered ISAs where acquire ordering would need a fence instruction to guarantee seeing other stores done by another core.)

An interrupt handler runs on the same core that was doing the other operations, just like a signal handler interrupts a thread and runs in its context. struct task_struct *t = current means that the data we're looking at is local to a single task. This is equivalent to something within a single thread in user-space. (Kernel pre-emption leading to re-scheduling on a different core will use whatever memory barriers are necessary to preserve correct execution of a single thread when that other core accesses the memory this task had been using).

The user-space C11 stdatomic equivalent of this barrier is atomic_signal_fence(memory_order_acquire). Signal fences only have to block compile-time reordering (like Linux barrier()), unlike atomic_thread_fence that has to emit a memory barrier asm instruction.
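
A user-space sketch of the same pattern (names are illustrative), where a thread coordinates with its own signal handler:

    #include <stdatomic.h>

    _Atomic int mode;   /* set by the thread, checked by its handler */
    void *area;

    void handler(int sig)
    {
        (void)sig;
        if (atomic_load_explicit(&mode, memory_order_relaxed) == 1) {
            /* Compile-time-only acquire: sufficient because the
             * handler runs on the same core, in this thread's context,
             * just like a kernel interrupt handler. */
            atomic_signal_fence(memory_order_acquire);
            void *a = area;   /* safe to read now */
            (void)a;
        }
    }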

Out-of-order CPUs do reorder things internally, but the cardinal rule of OoO exec is to preserve the illusion of instructions running one at a time, in program order, for the core running them. This is why you don't need a memory barrier for the asm equivalent of a = 1; b = a; to correctly load the 1 you just stored; hardware preserves the illusion of serial execution¹ in program order. (Typically by having loads snoop the store buffer and store-forward from stores that haven't committed to L1d cache yet.)

Instructions in an interrupt handler logically run after the point where the interrupt happened (as per the interrupt-return address). Therefore we just need the asm instructions in the right order (barrier()), and hardware will make everything work.

Footnote 1: There are some explicitly-parallel ISAs like IA-64 and the Mill, but they provide rules that asm can follow to be sure that one instruction sees the effect of another earlier one. Same for classic MIPS I load delay slots and stuff like that. Compilers take care of this for compiled C.


