Movdqu Instruction + Page Boundary

Assembly movdqa access violation

Most of this has been said in the comments already, but let me summarise. There are three problems raised by your code/question:

1) MOVDQA requires the addresses it deals with ([rdx] in your case) to be aligned to a 16-byte boundary and will trigger an access violation otherwise. This is what you are seeing. Alignment to a 16-byte (DQWORD) boundary means that, using your example, you should read from e.g. 0xFFFFFFFFFFFFFFF0 rather than 0xFFFFFFFFFFFFFFFF, because the latter number is not divisible by 16.

2) The address you use, 0xFFFFFFFFFFFFFFFF, is almost certainly invalid.

3) Provided you use MOVDQA to read from a valid 16-byte-aligned memory location, the results (in xmm1 in your case) will be IDENTICAL to when you use MOVDQU. The only relevant difference between the two here is that movdqU allows you to read from Unaligned (hence the U) memory whereas movdqA requires a (16-byte) Aligned memory location. (The latter case will often be faster, but I don't think you need to worry about that at this stage.)
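
To make points 1) and 3) concrete, here's a minimal NASM-style sketch (the buffer name and contents are made up):

    section .data
    align 16
    buf:    times 32 db 0       ; 32 bytes of storage, 16-byte aligned

    section .text
            lea     rdx, [rel buf]
            movdqa  xmm1, [rdx]     ; OK: rdx is 16-byte aligned and points at mapped memory
            movdqu  xmm1, [rdx+1]   ; OK only with movdqu; movdqa would fault on the unaligned address

With an aligned address like [rdx], movdqu and movdqa load identical bits into xmm1; only the unaligned case differs.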

Accessing a variable that crosses an MMU page boundary

It's safe to read anywhere in a page that's known to contain any valid bytes, e.g. in static storage with an unaligned foo: dq 1. If you have that, it's always safe to mov rax, [foo].

Your assembler + linker will make sure that all storage in .data, .rdata, and .bss is actually backed by valid pages the OS will let you touch.
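
For example, a minimal NASM-style sketch of that situation (label and values made up):

    section .data
            db   0x00        ; one byte of padding, so foo is deliberately misaligned
    foo:    dq   1

    section .text
            mov  rax, [foo]  ; fine: every byte of foo is in a mapped .data page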


The point your book is making is that you might have an array of 3-byte structs like RGB pixels, for example. x86 doesn't have a 3-byte load, so loading a whole pixel struct with mov eax, [rcx] would actually load 4 bytes, including 1 byte you don't care about.

Normally that's fine, unless [rcx+3] is in an unmapped page. (E.g. the last pixel of a buffer, ending at the end of a page, where the next page is unmapped.) Crossing into another cache line you don't need data from is also not great for performance, so it's a tradeoff vs. 2 or 3 separate narrow loads like movzx eax, word ptr [rcx] / movzx edx, byte ptr [rcx+2].
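
For example, a safe way to load one 3-byte pixel (a sketch; the register choices are arbitrary):

    movzx   eax, word ptr [rcx]      ; bytes 0 and 1
    movzx   edx, byte ptr [rcx+2]    ; byte 2
    shl     edx, 16
    or      eax, edx                 ; eax = the whole pixel, zero-extended to 32 bits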

This is more common with SIMD, where you can make more use of multiple elements at once in a register after loading them. Like movdqu xmm0, [rcx] to load 16 bytes, including 5 full pixels and 1 byte of another pixel we're not going to deal with in this vector.

(You don't have this problem with planar RGB, where all the R components are contiguous. More generally this is the AoS vs. SoA question, with SoA = Struct of Arrays being the layout that's good for SIMD. You also don't have this problem if you unroll your loop by 3 or so: 3x 16-byte vectors = 48 bytes, covering exactly 16x 3-byte pixels, maybe doing some shuffling if necessary, or having 3 different vector constants if you need different constants to line up with different components of your struct or pixel or whatever.)

If looping over an array, you have the same problem on the final iteration. If the array is larger than 1 SIMD vector (XMM or YMM), instead of scalar for the last n % 4 elements, you can sometimes arrange to do a SIMD load that ends at the end of the array, so it partially overlaps with a previous full vector. (To reduce branching, leave 1..4 elements of cleanup instead of 0..3, so if n is a multiple of the vector width then the "cleanup" is another full vector.) This works great for something like making a lower-case copy of an ASCII string: it's fine to redo the work on any given byte, and you're not storing in-place so you don't even have store-forwarding stalls since you won't have a load overlapping a previous store. It's less easy for summing an array (where you need to avoid double-counting), or working in-place.
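
As a sketch of that overlapping-final-vector pattern for a copy-and-transform loop (assumptions: rsi = src, rdi = dst, rdx = n with n >= 16, and the buffers don't overlap):

            lea     rcx, [rsi + rdx - 16]   ; source address of the last 16 bytes
            lea     r8,  [rdi + rdx - 16]   ; destination address of the last 16 bytes
            jmp     check
    vecloop:
            movdqu  xmm0, [rsi]
            ; ... transform xmm0 here (e.g. ASCII tolower) ...
            movdqu  [rdi], xmm0
            add     rsi, 16
            add     rdi, 16
    check:
            cmp     rsi, rcx
            jb      vecloop                 ; full vectors until 1..16 bytes remain
            movdqu  xmm0, [rcx]             ; final vector ends exactly at the end of the array
            ; ... transform xmm0 ...
            movdqu  [r8], xmm0              ; may partially overlap the previous store

If n is an exact multiple of 16, the "cleanup" vector is simply another full, non-overlapping vector.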


See also Is it safe to read past the end of a buffer within the same page on x86 and x64?

That's a challenge for strlen where you don't know whether the data you're allowed to read extends into the next page or not. (Unless you only read 1 byte at a time, which is 16x slower than you can go with SSE2.)
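
A sketch of the usual workaround (assumptions: rdi points at the string; SSE2 available): round the pointer down to a 16-byte boundary so the load can't cross into another page, then discard match bits for bytes before the real start.

            mov     rax, rdi
            and     rax, -16            ; aligned load address, same page as rdi
            pxor    xmm0, xmm0
            pcmpeqb xmm0, [rax]         ; compare 16 bytes against '\0'
            pmovmskb edx, xmm0          ; bit i set <=> byte [rax+i] was zero
            mov     ecx, edi
            and     ecx, 15             ; misalignment of the real start
            shr     edx, cl             ; throw away matches for bytes before rdi
            bsf     eax, edx            ; index of the first terminator relative to rdi, if any
            jnz     found               ; ZF=0 from bsf: a terminator was in this first chunk
            ; ... otherwise keep scanning 16 aligned bytes at a time from [rax+16] ...
    found:
            ; here eax = the string length (when reached via the jnz)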


AVX-512 has masked load/store with fault suppression, so a vmovdqu8 xmm0{k1}{z}, [rcx] with k1=0x7FFF will effectively be a 15-byte load, not faulting even if the 16th byte (where the mask bit is zero) extends into an unmapped page. Same for AVX vmaskmovps and so on. But the store version of that is slow on AMD.
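
A minimal sketch of such a fault-suppressed load (assuming AVX-512BW + AVX-512VL; rcx points at the 15 valid bytes):

    mov      eax, 0x7FFF
    kmovw    k1, eax
    vmovdqu8 xmm0{k1}{z}, [rcx]   ; byte 15 (mask bit clear) cannot fault; it's zeroed in xmm0 because of {z}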

See also Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all



Attempting to do so will generate an x86-64 general protection (segmentation) fault

Actually a #PF page fault for an access that touches an unmapped or permission-denied page. But yes, same difference.

page faulting maskmovdqu / _mm_maskmoveu_si128 - how to avoid?

maskmovdqu doesn't do fault-suppression, unlike AVX vmaskmovps or AVX512 masked stores. Those would solve your problem, although still maybe not the most efficient way.

As documented in Intel's ISA ref manual, with an all-zero mask (so nothing is stored to memory), exceptions associated with addressing memory and page faults may still be signaled (implementation dependent).

With a non-zero mask, I assume it's guaranteed to page fault if the 16 bytes include any non-writeable pages. Or maybe some implementations do let the mask suppress faults even when some storing does happen (mask zeros over the unmapped page, but non-zero mask bits elsewhere).


It's not a fast instruction anyway on real CPUs.

maskmovdqu might have been good sometimes on single-core Pentium 4 (or not IDK), and/or its MMX predecessor was maybe useful on in-order Pentium. Masked cache-bypassing stores are much less useful on modern CPUs where L3 is the normal backstop, and caches are large. Perhaps more importantly, there's more machinery between a single core and the memory controller(s) because everything has to work correctly even if another core did reload this memory at some point, so a partial-line write is maybe even less efficient.

It's generally a terrible choice if you really are only storing 8 or 12 bytes total. (Basically the same as an NT store that doesn't write a full line). Especially if you're using multiple narrow stores to grab pieces of data and put them into one contiguous stream. I would not assume that multiple overlapping maskmovdqu stores will result in a single efficient store of a whole cache line once you eventually finish one, even if the masks mean no byte is actually written twice.

L1d cache is excellent for buffering multiple small writes to a cache line before it's eventually done; use normal stores unless you can do a few NT stores nearly back-to-back.

To store the top 8 bytes of an XMM register, use movhps.

Writing into cache also makes it fine to do overlapping stores, like movdqu. So you can concatenate a few 12-byte objects by shuffling them each to the bottom of an XMM register (or loading them that way in the first place), then use movdqu stores to [rdi], [rdi+12], [rdi+24], etc. The 4-byte overlap is totally fine; coalescing in the store buffer may absorb it before it even commits to L1d cache, or if not then L1d cache is still pretty fast.
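
A sketch of that concatenation (hypothetical setup: the three 12-byte objects already sit in the low 12 bytes of xmm0..xmm2, and the buffer at rdi has at least 40 bytes of room):

    movdqu  [rdi],      xmm0   ; valid bytes 0..11, plus 4 don't-care bytes
    movdqu  [rdi + 12], xmm1   ; overwrites those 4 bytes; valid bytes 12..23
    movdqu  [rdi + 24], xmm2   ; valid bytes 24..35; 4 don't-care bytes land at 36..39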


At the start of writing a large array, if you don't know the alignment you can do an unaligned movdqu of the first 16 bytes of your output. Then do the first 16-byte aligned store possibly overlapping with that. If your total output size is always >= 16 bytes, this strategy doesn't need a lot of branching to let you do aligned stores for most of it. At the end you can do the same thing with a final potentially-unaligned vector that might partially overlap the last aligned vector. (Or if the array is aligned, then there's no overlap and it's aligned too. movdqu is just as fast as movdqa if the address is aligned, on modern CPUs.)
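
A sketch of that strategy for a memset-style fill (assumptions: rdi = dst, rdx = size in bytes with rdx >= 16, and xmm0 holds 16 copies of the fill byte, so overlapping stores at any offset are harmless):

            lea     r8, [rdi + rdx - 16]   ; address of the last 16 bytes
            movdqu  [rdi], xmm0            ; possibly-unaligned store of the first 16 bytes
            lea     rax, [rdi + 16]
            and     rax, -16               ; first aligned address past rdi (rdi+16 if rdi was aligned)
            jmp     fillcheck
    fillloop:
            movdqa  [rax], xmm0            ; aligned stores for the bulk
            add     rax, 16
    fillcheck:
            cmp     rax, r8
            jb      fillloop
            movdqu  [r8], xmm0             ; final store ends exactly at the end, possibly overlapping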

Difference between MOVDQA and MOVAPS x86 instructions?

In functionality, they are identical.

On some (but not all) micro-architectures, there are timing differences due to "domain crossing penalties". For this reason, one should generally use movdqa when the data is being used with integer SSE instructions, and movaps when the data is being used with floating-point instructions. For more information on this subject, consult the Intel Optimization Manual, or Agner Fog's excellent microarchitecture guide. Note that these delays are most often associated with register-register moves instead of loads or stores.
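
A small sketch of that guideline (assuming you know which instructions will consume the data):

    movaps  xmm0, xmm1        ; xmm1 holds FP data headed for float instructions
    addps   xmm0, xmm2

    movdqa  xmm3, xmm4        ; xmm4 holds integer data headed for integer SIMD
    paddd   xmm3, xmm5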

what's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256

There's no reason to ever use _mm256_lddqu_si256; consider it a synonym for _mm256_loadu_si256. lddqu only exists for historical reasons, as x86 evolved towards having better unaligned vector load support, and CPUs that support the AVX version run them identically. There's no AVX512 version.

Compilers do still respect the lddqu intrinsic and emit that instruction, so you could use it if you want your code to run identically but have a different checksum or machine code bytes.


No x86 microarchitectures run vlddqu any differently from vmovdqu. I.e. the two opcodes probably decode to the same internal uop on all AVX CPUs. They probably always will, unless some very-low-power or specialized microarchitecture comes along without efficient unaligned vector loads (which have been a thing since Nehalem). Compilers never use vlddqu when auto-vectorizing.

lddqu was different from movdqu on Pentium 4. See History of … one CPU instructions: Part 1. LDDQU/movdqu explained.

lddqu is allowed to (and on P4 does) do two aligned 16B loads and take a window of that data. movdqu architecturally only ever loads from the expected 16 bytes. This has implications for store-forwarding: if you're loading data that was just stored with an unaligned store, use movdqu, because store-forwarding only works for loads that are fully contained within a previous store. But otherwise you generally always wanted to use lddqu. (This is why they didn't just make movdqu always use "the good way", and instead introduced a new instruction for programmers to worry about. But luckily for us, they changed the design so we don't have to worry about which unaligned load instruction to use anymore.)

It also has implications for correctness of observable behaviour on UnCacheable (UC) or Uncacheable Speculative Write-combining (USWC, aka WC) memory types (which might have MMIO registers behind them).


There's no code-size difference in the two asm instructions:

  # SSE packed-single instructions are shorter than SSE2 integer / packed-double
  4000e3:  0f 10 07        movups  xmm0, [rdi]

  4000e6:  f2 0f f0 07     lddqu   xmm0, [rdi]
  4000ea:  f3 0f 6f 07     movdqu  xmm0, [rdi]

  4000ee:  c5 fb f0 07     vlddqu  xmm0, [rdi]
  4000f2:  c5 fa 6f 07     vmovdqu xmm0, [rdi]
  # AVX-256 is the same as AVX-128, but with one more bit set in the VEX prefix

On Core2 and later, there's no reason to use lddqu, but also no downside vs. movdqu. Intel dropped the special lddqu stuff for Core2, so both options suck equally.

On Core2 specifically, avoiding cache-line splits in software with two aligned loads and SSSE3 palignr is sometimes a win vs. movdqu, especially on 2nd-gen Core2 (Penryn) where palignr is only one shuffle uop instead of 2 on Merom/Conroe. (Penryn widened the shuffle execution unit to 128b).
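
A sketch of that two-aligned-loads + palignr technique (assumptions: SSSE3 is available and the misalignment rsi & 15 is known at assembly time, here 4, since palignr needs an immediate):

    mov     rax, rsi
    and     rax, -16               ; the aligned 16-byte chunk containing rsi
    movdqa  xmm1, [rax]            ; low half of the 32-byte window
    movdqa  xmm2, [rax + 16]       ; high half
    palignr xmm2, xmm1, 4          ; xmm2 = the 16 bytes starting at rax+4 == rsi

Neither aligned load can split across a cache line or page boundary, which is the whole point.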

See Dark Shikari's 2009 Diary Of An x264 Developer blog post: Cacheline splits, take two, for more about unaligned-load strategies back in the bad old days.

The generation after Core2 is Nehalem, where movdqu is a single uop instruction with dedicated hardware support in the load ports. It's still useful to tell compilers when pointers are aligned (especially for auto-vectorization, and especially without AVX), but it's not a performance disaster for them to just use movdqu everywhere, especially if the data is in fact aligned at run-time.


I don't know why Intel even made an AVX version of lddqu at all. I guess it's simpler for the decoders to just treat that opcode as an alias for movdqu / vmovdqu in all modes (with legacy SSE prefixes, or with AVX128 / AVX256), instead of having that opcode decode to something else with VEX prefixes.

All current AVX-supporting CPUs have efficient hardware unaligned-load / store support that handles it as optimally as possible. E.g. when the data is aligned at runtime, there's exactly zero performance difference vs. vmovdqa.

This was not the case before Nehalem; movdqu and lddqu used to decode to multiple uops to handle potentially-misaligned addresses, instead of putting hardware support for that right in the load ports where a single uop can activate it instead of faulting on unaligned addresses.

However, Intel's ISA ref manual entry for lddqu says the 256b version can load up to 64 bytes (implementation dependent):

This instruction may improve performance relative to (V)MOVDQU if the source operand crosses a cache line boundary. In situations that require the data loaded by (V)LDDQU be modified and stored to the same location, use (V)MOVDQU or (V)MOVDQA instead of (V)LDDQU. To move a double quadword to or from memory locations that are known to be aligned on 16-byte boundaries, use the (V)MOVDQA instruction.

IDK how much of that was written deliberately, and how much of that just came from prepending (V) when updating the entry for AVX. I don't think Intel's optimization manual recommends really using vlddqu anywhere, but I didn't check.

There is no AVX512 version of vlddqu, so I think that means Intel has decided that an alternate-strategy unaligned load instruction is no longer useful, and isn't even worth keeping their options open.

How does strncmp using SSE 4.2 avoid reading beyond the page boundaries when loading 16 bytes?

Is this correct? Does strncmp_sse4_2 read more than n bytes?

Yes.

Even if it does: Doing 16 bytes at a time should stop at 0x7ffeff58. Why does it read till 0x7ffeff60?

You are assuming that it started using movdqu from the address you passed in. It likely didn't. It probably aligned the pointers to cache line first.

If so, how does this not potentially cause a page fault?

If you have a 16-byte aligned pointer p, then p+15 points into the same page as p (pages are 4096 bytes, a multiple of 16, so an aligned 16-byte chunk can never straddle a page boundary), and you can read 16 bytes from p with impunity.

If so, how do we distinguish an acceptable read of uninitialized data from cases that indicate bugs? E.g. how would Valgrind avoid reporting this as an uninitialized read?

Valgrind does this by interposing its own copy of strcmp (for dynamically linked binaries). Without such interposition, Valgrind produces false positives (or rather, true positives that nobody cares about or could do anything about).

Migrating from XMM to YMM

Is rcx aligned to 32 bytes? movdqa xmm, m128 requires 16-byte alignment, but vmovdqa ymm, m256 requires 32-byte alignment, so if you just port the code to AVX2 without increasing the alignment, it won't work.

Either increase the alignment to 32 bytes, or use vmovdqu instead to sidestep all alignment issues. Unlike SSE instructions, memory operands of AVX instructions generally have no alignment requirements (vmovdqa is one of the few exceptions). It's still a good idea to align your input data if possible, since memory accesses that cross cache lines incur extra penalties.
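
A minimal sketch of both options (NASM-style; buffer name made up, AVX assumed):

    section .data
    align 32
    buf:    times 64 db 0               ; 32-byte-aligned storage

    section .text
            vmovdqa ymm0, [rel buf]     ; OK: the address is 32-byte aligned
            vmovdqu ymm1, [rcx]         ; OK for any rcx; no alignment requirement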


