Can Modern X86 Hardware Not Store a Single Byte to Memory

Can modern x86 hardware not store a single byte to memory?

TL:DR: On every modern ISA that has byte-store instructions (including x86), they're atomic and don't disturb surrounding bytes. (I'm not aware of any older ISAs where byte-store instructions could "invent writes" to neighbouring bytes either.)

The actual implementation mechanism (in non-x86 CPUs) is sometimes an internal RMW cycle to modify a whole word in a cache line, but that's done "invisibly" inside a core while it has exclusive ownership of the cache line so it's only ever a performance problem, not correctness. (And merging in the store buffer can sometimes turn byte-store instructions into an efficient full-word commit to L1d cache.)





About Stroustrup's phrasing

I don't think it's a very accurate, clear or useful statement. It would be more accurate to say that modern CPUs can't load or store anything smaller than a cache line. (Although that's not true for uncacheable memory regions, e.g. for MMIO.)

It probably would have been better just to make a hypothetical example to talk about memory models, rather than implying that real hardware is like this. But if we try, we can maybe find an interpretation that isn't as obviously or totally wrong, which might have been what Stroustrup was thinking when he wrote this to introduce the topic of memory models. (Sorry this answer is so long; I ended up writing a lot while guessing what he might have meant and about related topics...)

Or maybe this is another case of high-level language designers not being hardware experts, or at least occasionally making mis-statements.


I think Stroustrup is talking about how CPUs work internally to implement byte-store instructions. He's suggesting that a CPU without a well-defined and reasonable memory model might implement a byte-store with a non-atomic RMW of the containing word in a cache line, or in memory for a CPU without cache.

Even this weaker claim about internal (not externally visible) behaviour is not true for high-performance x86 CPUs. Modern Intel CPUs have no throughput penalty for byte stores, or even unaligned word or vector stores that don't cross a cache-line boundary. AMD is similar.

If byte or unaligned stores had to do a RMW cycle as the store committed to L1D cache, it would interfere with store and/or load instruction/uop throughput in a way we could measure with performance counters. (In a carefully designed experiment that avoids store coalescing in the store buffer before commit to L1d cache hiding the cost, since the store execution unit(s) can only run 1 store per clock on current CPUs.)


However, some high performance designs for non-x86 ISAs do use an atomic RMW cycle to internally commit stores to L1d cache. Are there any modern CPUs where a cached byte store is actually slower than a word store? The cache line stays in MESI Exclusive/Modified state the whole time, so it can't introduce any correctness problems, only a small performance hit. This is very different from doing something that could step on stores from other CPUs. (The arguments below about that not happening still apply, but my update may have missed some stuff that still argues that atomic cache-RMW is unlikely.)

(On many non-x86 ISAs, unaligned stores are not supported at all, or are used more rarely than in x86 software. And weakly-ordered ISAs allow more coalescing in store buffers, so not as many byte store instructions actually result in single-byte commit to L1d. Without these motivations for fancy (power hungry) cache-access hardware, word RMW for scattered byte stores is an acceptable tradeoff in some designs.)


Alpha AXP, a high-performance RISC design from 1992, famously (and uniquely among modern non-DSP ISAs) omitted byte load/store instructions until Alpha 21164A (EV56) in 1996. Apparently they didn't consider word-RMW a viable option for implementing byte stores, because one of the cited advantages for implementing only 32-bit and 64-bit aligned stores was more efficient ECC for the L1D cache. "Traditional SECDED ECC would require 7 extra bits over 32-bit granules (22% overhead) versus 4 extra bits over 8-bit granules (50% overhead)." (@Paul A. Clayton's answer about word vs. byte addressing has some other interesting computer-architecture stuff.) If byte stores were implemented with word-RMW, you could still do error detection/correction with word-granularity.

Current Intel CPUs only use parity (not ECC) in L1D for this reason. (At least some older Xeons could run with L1d in ECC mode at half capacity instead of the normal 32KiB, as discussed on RWT. It's not clear if anything's changed, e.g. in terms of Intel now using ECC for L1d). See also this Q&A about hardware (not) eliminating "silent stores": checking the old contents of cache before the write to avoid marking the line dirty if it matched would require a RMW instead of just a store, and that's a major obstacle.

It turns out some high-perf pipelined designs do use atomic word-RMW to commit to L1d, despite it stalling the memory pipeline, but (as I argue below) it's much less likely that any do an externally-visible RMW to RAM.

Word-RMW isn't a useful option for MMIO byte stores either, so unless you have an architecture that doesn't need sub-word stores for IO, you'd need some kind of special handling for IO (like Alpha's sparse I/O space where word load/stores were mapped to byte load/stores so it could use commodity PCI cards instead of needing special hardware with no byte IO registers).

As @Margaret points out, DDR3 memory controllers can do byte stores by setting control signals that mask out other bytes of a burst. The same mechanisms that get this information to the memory controller (for uncached stores) could also get that information passed along with a load or store to MMIO space. So there are hardware mechanisms for really doing a byte store even on burst-oriented memory systems, and it's highly likely that modern CPUs will use that instead of implementing an RMW, because it's probably simpler and is much better for MMIO correctness.

How many and what size cycles will be needed to perform longword transferred to the CPU shows how a ColdFire microcontroller signals the transfer size (byte/word/longword/16-byte line) with external signal lines, letting it do byte loads/stores even if 32-bit-wide memory was hooked up to its 32-bit data bus. Something like this is presumably typical for most memory bus setups (but I don't know). The ColdFire example is complicated by also being configurable to use 16 or 8-bit memory, taking extra cycles for wider transfers. But never mind that; the important point is that it has external signaling for the transfer size, to tell the memory HW which bytes it's actually writing.


Stroustrup's next paragraph is

"The C++ memory model guarantees that two threads of execution can update and access separate memory locations without interfering with each other. This is exactly what we would naively expect. It is the compiler’s job to protect us from the sometimes very strange and subtle behaviors of modern hardware. How a compiler and hardware combination achieves that is up to the compiler. ..."

So apparently he thinks that real modern hardware may not provide "safe" byte load/store. The people who design hardware memory models agree with the C/C++ people, and realize that byte store instructions would not be very useful to programmers / compilers if they could step on neighbouring bytes.

All modern (non-DSP) architectures except early Alpha AXP have byte store and load instructions, and AFAIK these are all architecturally defined to not affect neighbouring bytes. However they accomplish that in hardware, software doesn't need to care about correctness. Even the very first version of MIPS (in 1983) had byte and half-word loads/stores, and it's a very word-oriented ISA.

However, he doesn't actually claim that most modern hardware needs any special compiler support to implement this part of the C++ memory model, just that some might. Maybe he really is only talking about word-addressable DSPs in that 2nd paragraph (where C and C++ implementations often use 16 or 32-bit char as exactly the kind of compiler workaround Stroustrup was talking about.)


Most "modern" CPUs (including all x86) have an L1D cache. They will fetch whole cache lines (typically 64 bytes) and track dirty / not-dirty on a per-cache-line basis. So two adjacent bytes are pretty much exactly the same as two adjacent words, if they're both in the same cache line. Writing one byte or word will result in a fetch of the whole line, and eventually a write-back of the whole line. See Ulrich Drepper's What Every Programmer Should Know About Memory. You're correct that MESI (or a derivative like MESIF/MOESI) makes sure this isn't a problem. (But again, this is because hardware implements a sane memory model.)

A store can only commit to L1D cache while the line is in the Modified state (of MESI). So even if the internal hardware implementation is slow for bytes and takes extra time to merge the byte into the containing word in the cache line, it's effectively an atomic read modify write as long as it doesn't allow the line to be invalidated and re-acquired between the read and the write. (While this cache has the line in Modified state, no other cache can have a valid copy). See @old_timer's comment making the same point (but also for RMW in a memory controller).

This is easier than e.g. an atomic xchg or add from a register that also needs an ALU and register access, since all the HW involved is in the same pipeline stage, which can simply stall for an extra cycle or two. That's obviously bad for performance and takes extra hardware to allow that pipeline stage to signal that it's stalling. This doesn't necessarily conflict with Stroustrup's first claim, because he was talking about a hypothetical ISA without a memory model, but it's still a stretch.

On a single-core microcontroller, internal word-RMW for cached byte stores would be more plausible, since there won't be Invalidate requests coming in from other cores that they'd have to delay responding to during an atomic RMW cache-word update. But that doesn't help for I/O to uncacheable regions. I say microcontroller because other single-core CPU designs typically support some kind of multi-socket SMP.


Many RISC ISAs don't support unaligned-word loads/stores with a single instruction, but that's a separate issue (the difficulty is handling the case when a load spans two cache lines or even pages, which can't happen with bytes or aligned half-words). More and more ISAs are adding guaranteed support for unaligned load/store in recent versions, though. (e.g. MIPS32/64 Release 6 in 2014, and I think AArch64 and recent 32-bit ARM).


The 4th edition of Stroustrup's book was published in 2013, when Alpha had been dead for years. The first edition was published in 1985, when RISC was the new big idea (e.g. Stanford MIPS in 1983, according to Wikipedia's timeline of computing HW), but "modern" CPUs at that time were byte-addressable with byte stores. The word-addressable CDC 6600 (and its Cyber descendants) was probably still around, but couldn't be called modern.

Even very word-oriented RISC machines like MIPS and SPARC have byte store and byte load (with sign or zero extension) instructions. They don't support unaligned word loads, simplifying the cache (or memory access if there is no cache) and load ports, but you can load any single byte with one instruction, and more importantly store a byte without any architecturally-visible non-atomic rewrite of the surrounding bytes. (Although cached stores can be committed internally with a word-RMW cycle, as discussed above, that's invisible to software.)

I suppose C++11 (which introduces a thread-aware memory model to the language) on Alpha would need to use 32-bit char if targeting a version of the Alpha ISA without byte stores. Or it would have to use software atomic-RMW with LL/SC when it couldn't prove that no other threads could have a pointer that would let them write neighbouring bytes.


IDK how slow byte load/store instructions are in any CPUs where they're implemented in hardware but not as cheap as word loads/stores. Byte loads are cheap on x86 as long as you use movzx/movsx to avoid partial-register false dependencies or merging stalls. On AMD pre-Ryzen, movsx/movzx needs an extra ALU uop, but otherwise zero/sign extension is handled right in the load port on Intel and AMD CPUs. The main x86 downside is that you need a separate load instruction instead of using a memory operand as a source for an ALU instruction (if you're adding a zero-extended byte to a 32-bit integer), costing front-end uop throughput and code-size. Or if you're just adding a byte to a byte register, there's basically no downside on x86. RISC load-store ISAs always need separate load and store instructions anyway. x86 byte stores are no more expensive than 32-bit stores.

As a performance issue, a good C++ implementation for hardware with slow byte stores might put each char in its own word and use word loads/stores whenever possible (e.g. for globals outside structs, and for locals on the stack). IDK if any real implementations of MIPS / ARM / whatever have slow byte load/store, but if so maybe gcc has -mtune= options to control it.

That doesn't help for char[], or dereferencing a char * when you don't know where it might be pointing. (This includes volatile char* which you'd use for MMIO.) So having the compiler+linker put char variables in separate words isn't a complete solution, just a performance hack if true byte stores are slow.


PS: More about Alpha:

Alpha is interesting for a lot of reasons: one of the few clean-slate 64-bit ISAs, not an extension to an existing 32-bit ISA. And one of the more recent clean-slate ISAs, Itanium being another from several years later which attempted some neat CPU-architecture ideas.

From the Linux Alpha HOWTO.

When the Alpha architecture was introduced, it was unique amongst RISC architectures for eschewing 8-bit and 16-bit loads and stores. It supported 32-bit and 64-bit loads and stores (longword and quadword, in Digital's nomenclature). The co-architects (Dick Sites, Rich Witek) justified this decision by citing the advantages:

  1. Byte support in the cache and memory sub-system tends to slow down accesses for 32-bit and 64-bit quantities.
  2. Byte support makes it hard to build high-speed error-correction circuitry into the cache/memory sub-system.

Alpha compensates by providing powerful instructions for manipulating bytes and byte groups within 64-bit registers. Standard benchmarks for string operations (e.g., some of the Byte benchmarks) show that Alpha performs very well on byte manipulation.

How does a 64-bit computer change one byte in memory?

On x86-64, the hardware will read one cache line, modify the byte in cache, and eventually that cache line will be written back to memory.

The main reason for the write-back to happen is that the CPU needs the cache line for other data. There are explicit instructions to force the write-back, but a C compiler would be unlikely to use those. It slows down the CPU to force an unnecessary write.

Are there any modern CPUs where a cached byte store is actually slower than a word store?

My guess was wrong. Modern x86 microarchitectures really are different in this way from some (most?) other ISAs.

There can be a penalty for cached narrow stores even on high-performance non-x86 CPUs. The reduction in cache footprint can still make int8_t arrays worth using, though. (And on some ISAs like MIPS, not needing to scale an index for an addressing mode helps).

Merging / coalescing in the store buffer between byte store instructions to the same word before actual commit to L1d can also reduce or remove the penalty. (x86 sometimes can't do as much of this because its strong memory model requires all stores to commit in program order.)


ARM's documentation for Cortex-A15 MPCore (from ~2012) says it uses 32-bit ECC granularity in L1d, and does in fact do a word-RMW for narrow stores to update the data.

The L1 data cache supports optional single bit correct and double bit detect error correction logic in both the tag and data arrays. The ECC granularity for the tag array is the tag for a single cache line and the ECC granularity for the data array is a 32-bit word.

Because of the ECC granularity in the data array, a write to the array cannot update a portion of a 4-byte aligned memory location because there is not enough information to calculate the new ECC value. This is the case for any store instruction that does not write one or more aligned 4-byte regions of memory. In this case, the L1 data memory system reads the existing data in the cache, merges in the modified bytes, and calculates the ECC from the merged value. The L1 memory system attempts to merge multiple stores together to meet the aligned 4-byte ECC granularity and to avoid the read-modify-write requirement.

(When they say "the L1 memory system", I think they mean the store buffer, if you have contiguous byte stores that haven't yet committed to L1d.)

Note that the RMW is atomic, and only involves the exclusively-owned cache line being modified. This is an implementation detail that doesn't affect the memory model. So my conclusion on Can modern x86 hardware not store a single byte to memory? is still (probably) correct that x86 can, and so can every other ISA that provides byte store instructions.


Cortex-A15 MPCore is a 3-way out-of-order execution CPU, so it's not a minimal power / simple ARM design, yet they chose to spend transistors on OoO exec but not efficient byte stores.

Presumably without the need to support efficient unaligned stores (which x86 software is more likely to assume / take advantage of), having slower byte stores was deemed worth it for the higher reliability of ECC for L1d without excessive overhead.

Cortex-A15 is probably not the only, and not the most recent, ARM core to work this way.


Other examples (found by @HadiBrais in comments):

  1. Alpha 21264 (see Table 8-1 of Chapter 8 of this doc) has 8-byte ECC granularity for its L1d cache. Narrower stores (including 32-bit) result in a RMW when they commit to L1d, if they aren't merged in the store buffer first. The doc explains full details of what L1d can do per clock. And specifically documents that the store buffer does coalesce stores.

  2. PowerPC RS64-II and RS64-III (see the section on errors in this doc). According to this abstract, L1 of the RS/6000 processor has 7 bits of ECC for each 32-bits of data.

Alpha was aggressively 64-bit from the ground up, so 8-byte granularity makes some sense, especially if the RMW cost can mostly be hidden / absorbed by the store buffer. (e.g. maybe the normal bottlenecks were elsewhere for most code on that CPU; its multi-ported cache could normally handle 2 operations per clock.)

POWER / PowerPC64 grew out of 32-bit PowerPC and probably cares about running 32-bit code with 32-bit integers and pointers. (So more likely to do non-contiguous 32-bit stores to data structures that couldn't be coalesced.) So 32-bit ECC granularity makes a lot of sense there.

Is it worse in any aspect to use the CMPXCHG instruction on an 8-bit field than on a 32-bit field?

No, there's no penalty for lock cmpxchg [mem], reg 8 vs. 32-bit. Modern x86 CPUs can load and store to their L1d cache with no penalty for a single byte vs. an aligned dword or qword. Can modern x86 hardware not store a single byte to memory? answer: it can with zero penalty1 because they spend the transistors to make even unaligned loads/stores fast.

The surrounding asm instructions dealing with a narrow integer in a register should also have negligible if any extra cost vs. [u]int32_t. See Why doesn't GCC use partial registers? - most compilers know how to be careful with partial registers, and modern CPUs (Haswell and later, and all non-Intel) don't rename the low 8 separately from the rest of the register so the only danger is false dependencies. Depending on exactly what you're doing, it might be best to use unsigned local temporaries with an _Atomic uint8_t, or it might be best to make your locals also uint8_t.

Footnote 1: Unlike on some non-x86 CPUs where a byte store actually is implemented with a cache RMW cycle (Are there any modern CPUs where a cached byte store is actually slower than a word store?). On those CPUs you'd hope that atomic xchg would be just as cheap for word vs. byte, but that's too much to hope for with cmpxchg. But almost all non-x86 ISAs have LL/SC instead of xchg / cmpxchg anyway, so even an atomic exchange is separate LL and SC instructions, and the SC would take an RMW cycle to commit to cache.

Confused about data alignment

Word-aligned memory access is much faster than byte-aligned access. That makes it much faster to transfer large blocks of data. You can address a single byte, but likely a whole word will be read from memory and internally reduced to a byte. That makes the access slower.

Why can't we move directly 1 byte from stack's frame to register?

TL:DR: You can, GCC just chooses not to, saving 1 byte of code-size vs. a normal movzbl byte load and avoiding any partial-register penalties from a movb load+merge. But for obscure reasons, this won't cause a store-forwarding stall when loading a function arg.

(This code is exactly what we get from GCC4.8 and later with gcc -O1 with those C statements and integer types of those widths. See it and clang on the Godbolt compiler explorer. GCC -O3 schedules the movl one instruction earlier.)


There's no correctness reason for doing it this way, only possible performance. You're correct that a byte load would work just as well. (I've omitted redundant operand-size suffixes because they're implied by the register operands).

    mov     8(%rsp), %dl        # byte load, merging into RDX
    add     %dl, (%rax)

What you're likely to get from a C compiler is a byte load with zero-extension. (e.g. GCC4.7 and earlier does this)

    movzbl  8(%rsp), %edx       # byte load zero-extended into RDX
    add     %dl, (%rax)

movzbl (aka MOVZX in Intel syntax) is your go-to instruction for loading bytes / words, not movb or movw. It's always safe, and on modern CPUs MOVZX loads are literally as fast as dword mov loads, with no extra latency or extra uops; zero-extension is handled right in the load execution unit. (Intel since Core 2 or earlier, AMD since at least Ryzen. https://agner.org/optimize/.)
The only cost is 1 extra byte of code size (larger opcode). movsbl or movsbq (aka MOVSX) sign-extension is equally efficient on more recent CPUs, but on some AMD (like some Bulldozer-family) it's 1 cycle higher latency than MOVZX loads. So prefer MOVZX if all you care about is avoiding partial-register shenanigans when loading a byte.

Usually only use movb or movw (with register destinations) if you specifically want to merge into the low byte or word of the existing 64-bit register. Byte / word stores are perfectly fine on x86; I'm only talking about mov mem-to-reg or reg-to-reg. There are exceptions to this rule; sometimes you can safely use byte operand-size without problems if you're careful and understand the microarchitecture(s) you care about running your code efficiently on. And beware that intentionally merging by writing a byte reg then reading a larger reg can cause partial-register merging stalls on some CPUs.

Writing to %dl would have a false dependency on the instructions (in your caller) that wrote EDX on some CPUs, including current Intel and all AMD. (Why doesn't GCC use partial registers?). Clang and ICC don't care and do it anyway, implementing the function the way you expected.

movl writes the full 64-bit register (by implicit zero-extension when writing a 32-bit register) avoiding that problem.

But reading a dword from 8(%rsp) could introduce a store-forwarding stall, if the caller only used a byte store. If the caller wrote that memory with a push, you're fine. But if the caller only used movb $123, (%rsp) before the call into already-reserved stack space, now your function is reading a dword from a location where the last store was a byte. Unless there was some kind of other stall (e.g. in code fetch after calling your function), the byte is probably in the store buffer when the load uop executes, but the load needs that plus 3 bytes from cache. Or from some earlier store that's also still in the store buffer, so it also has to scan the store buffer for all potential matches before merging the byte from the store buffer with the other bytes from cache. The fast path for store-forwarding only works when all the data you're loading comes from exactly one store. (Can modern x86 implementations store-forward from more than one prior store?)

But wait, an unwritten "extension" of the x86-64 System V calling convention means no risk of store-forwarding stalls

clang/gcc sign- or zero-extend narrow args to 32-bit, even though the System V ABI as written doesn't (yet?) require it. Clang-generated code also depends on it. This apparently includes args passed in memory, as we can see from looking at the caller on Godbolt. (I used __attribute__((noinline)) so I could compile with optimization enabled but still not have the call inline and optimize away. Otherwise I could have just commented out the body and looked at a caller that could only see a prototype.)

This is not part of C's "default argument promotions" for calling unprototyped functions. The C types of the narrow args are still short or char. This is only a calling-convention feature that lets the callee make assumptions about bits in registers (or memory) outside of the object-representation of the C object. It would be more useful if the upper 32 bits were required to be zero, though, because you still can't use them as array indices for 64-bit addressing modes. But you can do int_arg += char_arg without a MOVSX first. So it can make code more efficient when you use narrow args and they get implicitly promoted to int by C rules for binary operators like +.

By compiling the caller with gcc -O3 -maccumulate-outgoing-args (or -O0 or -O1), I got GCC to reserve stack space with sub and then use movl $4, (%rsp) before the call, in a function that calls yours. It would have been more efficient (smaller code-size) for gcc to use movb, but it chose to use a movl with a 32-bit immediate. I think this is because it's implementing that unwritten rule in the calling convention, rather than some other reason.

More usually (without -maccumulate-outgoing-args) the caller will use push $4 or push %rdi to do a qword store before the load, which can also store-forward efficiently to a dword (or byte) load. So either way, the arg will have been written with at least a dword store, making a dword reload safe for performance.

A dword mov load has 1 byte smaller code-size than a movzbl load, and avoids the possible extra cost of a MOVSX or MOVZX (on old AMD CPUs and extremely old Intel CPUs (P5)). So I think it's optimal.

GCC4.7 and earlier do use a movzbl (MOVZX) load for the char a4 arg like I recommended as the generally-safe option, but GCC4.8 and later use a movl.

Is it easier to fetch a 4 byte word from a word addressable memory compared to byte addressable?

if i'm not mistaking CPU will work with words right?

It depends on the Instruction Set Architecture (ISA) implemented by the CPU. For example, x86 supports operands of sizes ranging from a single 8-bit byte to as much as 64 bytes (in the most recent CPUs), although the word size in modern x86 CPUs is only 8 (or 4) bytes. The word size is generally defined as equal to the size of a general-purpose register. However, the granularity of accessing memory or registers is not necessarily restricted to the word size. This is very convenient from a programmer's perspective and from the CPU implementation perspective, as I'll discuss next.

so when the cpu tries to get a word from the memory what's the
difference between getting a 4 byte word from a byte addressable
memory vs getting a word from word addressable memory?

While an ISA may support byte addressability, a CPU that implements the ISA may not necessarily fetch data from memory one byte at a time. Spatial locality of reference is a memory access pattern very common in most real programs. If the CPU was to issue single-byte requests along the memory hierarchy, it would unnecessarily consume a lot of energy and significantly hurt performance to handle single-byte requests and move one-byte data across the hierarchy. Therefore, typically, when the CPU issues a memory request for data of some size at some address, a whole block of memory (known as a cache line, which is usually 64-byte in size and 64-byte aligned) is brought to the L1 cache. All requests to the same cache line can be effectively combined into a single request. Therefore, the address bus between different levels of the memory hierarchy does not have to include wires for the bits that constitute an offset within the cache line. In that case, the implementation would be really addressing memory at the 64-byte granularity.

It can be useful, however, to support byte addressability in the implementation. For example, if only one byte of a cache line has changed and the cache line has to be written back to main memory, instead of sending all the 64 bytes to memory, it would take less energy, bandwidth, and time to send only the byte (or few bytes) that changed. Another situation where byte addressability is useful is when providing support for the idea of critical-word-first. There is much more to it, but to keep the answer simple, I'll stop here.

DDR SDRAM is a prevalent class of main memory interfaces used in most computer systems today. The data bus width is 8 bytes in size and the protocol supports only transferring aligned 8-byte chunks with byte enable signals (called data masks) to select which bytes to write.


