The online Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) entry for VPMASKMOV says that "mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated." But the documentation in the Intel Instruction Set Reference Guide does not mention an alignment requirement, and seems to imply that none exists: "Faults occur only due to mask-bit required memory accesses that caused the faults."
Agner Fog's documentation (which I presume is the primary source Intel uses when writing their manuals?), says "The instruction VPMASKMOVD (AVX2) and VMASKMOVPS (AVX), etc. can be used for writing to the first part of an unaligned array up to the first 16-bytes boundary as well as the last part after the last 16-bytes boundary."
What's the actual requirement for alignment for these instructions?
Yes, I agree it is quite confusing when the official documentation gives conflicting accounts of that intrinsic's requirements.
Searching further for information, the evidence seems to be leaning toward "yes, VPMASKMOVD and VPMASKMOVQ support unaligned access".
The AMD manual also makes no mention of an alignment requirement, but it does say that an exception will occur if "Alignment checking enabled and: 256-bit memory operand not 32-byte aligned or 128-bit memory operand not 16-byte aligned."
AIDA64 explicitly lists timings on Haswell for what seem to be unaligned accesses:
2193 AVX2 :VPMASKMOVD xmm,xmm,[m128+4] L: [memory dep.] T: 0.59ns= 2.00c
2194 AVX2 :VPMASKMOVD [m128+4],xmm,xmm L: [memory dep.] T: 0.29ns= 1.00c
2195 AVX2 :VPMASKMOVD unaligned LSpair L: 0.88ns= 3.0c T: 3.85ns= 13.08c
2196 AVX2 :VPMASKMOVQ xmm,xmm,[m128+4] L: [memory dep.] T: 0.59ns= 2.00c
2197 AVX2 :VPMASKMOVQ [m128+4],xmm,xmm L: [memory dep.] T: 0.29ns= 1.00c
2198 AVX2 :VPMASKMOVQ unaligned LSpair L: 0.88ns= 3.0c T: 3.85ns= 13.08c
2199 AVX2 :VPMASKMOVD ymm,ymm,[m256+4] L: [memory dep.] T: 0.59ns= 2.00c
2200 AVX2 :VPMASKMOVD [m256+4],ymm,ymm L: [memory dep.] T: 0.29ns= 1.00c
2201 AVX2 :VPMASKMOVD unaligned LSpair L: 5.00ns= 17.0c T: 5.00ns= 17.00c
2202 AVX2 :VPMASKMOVQ ymm,ymm,[m256+4] L: [memory dep.] T: 0.59ns= 2.00c
2203 AVX2 :VPMASKMOVQ [m256+4],ymm,ymm L: [memory dep.] T: 0.29ns= 1.00c
2204 AVX2 :VPMASKMOVQ unaligned LSpair L: 5.00ns= 17.0c T: 5.00ns= 17.00c
At least, that's what I think their notation must mean. I guess I should just start testing it to see.
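A test along those lines might look like the following minimal sketch. It assumes GCC or Clang on x86 (for the target attribute and __builtin_cpu_supports), and the function names are made up for illustration. It does a masked store and a masked load through a pointer 4 bytes past a 32-byte boundary, mirroring the [m256+4] operands in the AIDA64 listing, and returns -1 when AVX2 is not available. It says nothing about behavior when alignment checking is enabled.

```c
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>

/* Hypothetical helper: masked store, then masked load, at a misaligned address. */
__attribute__((target("avx2")))
static int maskmov_roundtrip(void)
{
    __attribute__((aligned(32))) int buf[16];
    int out[8];
    int *p = buf + 1;               /* 4 bytes past a 32-byte boundary */
    size_t i;

    for (i = 0; i < 16; i++)
        buf[i] = -1;                /* sentinel value */

    /* The sign bit of each 32-bit lane selects it: even lanes on, odd lanes off. */
    const __m256i mask = _mm256_setr_epi32(-1, 0, -1, 0, -1, 0, -1, 0);
    const __m256i data = _mm256_setr_epi32(10, 11, 12, 13, 14, 15, 16, 17);

    _mm256_maskstore_epi32(p, mask, data);          /* VPMASKMOVD, store form */
    for (i = 0; i < 8; i++)
        if (p[i] != ((i % 2 == 0) ? 10 + (int)i : -1))
            return 0;               /* masked-off lanes must keep the sentinel */

    /* VPMASKMOVD, load form, copied out to an ordinary array for checking. */
    _mm256_storeu_si256((__m256i *)out, _mm256_maskload_epi32(p, mask));
    for (i = 0; i < 8; i++)
        if (out[i] != ((i % 2 == 0) ? 10 + (int)i : 0))
            return 0;               /* masked-off lanes must read as zero */
    return 1;
}

/* Returns 1 if the round trip behaved as expected, 0 if not, -1 if AVX2 is absent. */
int try_unaligned_maskmov(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    return maskmov_roundtrip();
}
#else
/* Not an x86 GCC/Clang build: nothing to test. */
int try_unaligned_maskmov(void) { return -1; }
#endif
```

If the instruction really did require 32-byte alignment, the call would be expected to fault rather than return 1.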
Upon further searching, it also becomes clear that very few people seem to be using these instructions. They are mentioned just about nowhere other than the sources I've already mentioned. But if anyone has experience with them in practice, I'd be eager to hear about it.
 http://www.agner.org/optimize/optimizing_assembly.pdf (I forgot to put in the footnote on the lead post for my joke about Agner's Optimization Manual)
This brings to mind the option -qopt-assume-safe-padding (apparently not implemented for AVX) which encourages the compiler to implement masked stores for remainder loops, taking the small risk of out-of-bounds access.
The mask, as I understand it, doesn't stop the unused data elements from being accessed, but ensures they aren't modified.
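A masked remainder loop of that kind could be sketched as follows. This is only an illustration, assuming GCC or Clang on x86; add_one, add_one_avx2, MASKED_TAIL_X86 and the window table are hypothetical names. Full vectors use ordinary unaligned loads and stores, and the tail uses the masked load/store intrinsics with a mask that enables only the remaining lanes. A scalar loop is used when AVX2 is not available.

```c
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define MASKED_TAIL_X86 1

__attribute__((target("avx2")))
static void add_one_avx2(int *a, size_t n)
{
    /* Sliding window: loading 8 ints at (window + 8 - rem) yields
       rem lanes of all-ones followed by zero lanes. */
    static const int window[16] = { -1, -1, -1, -1, -1, -1, -1, -1,
                                     0,  0,  0,  0,  0,  0,  0,  0 };
    const __m256i one = _mm256_set1_epi32(1);
    size_t i = 0;

    /* Main loop: whole 8-lane vectors. */
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(a + i));
        _mm256_storeu_si256((__m256i *)(a + i), _mm256_add_epi32(v, one));
    }

    /* Remainder: 1 to 7 elements handled by one masked load and store. */
    size_t rem = n - i;
    if (rem != 0) {
        __m256i mask = _mm256_loadu_si256((const __m256i *)(window + 8 - rem));
        __m256i v = _mm256_maskload_epi32(a + i, mask);
        _mm256_maskstore_epi32(a + i, mask, _mm256_add_epi32(v, one));
    }
}
#endif

/* Adds 1 to each of the n elements of a. */
void add_one(int *a, size_t n)
{
#ifdef MASKED_TAIL_X86
    if (__builtin_cpu_supports("avx2")) {
        add_one_avx2(a, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)   /* portable fallback */
        a[i] += 1;
}
```

The element just past the end of the array is never written, which is the property the remainder loop depends on.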
I don't recall any explanation of what happens if one thread has a mask set which prevents data modification while another thread modifies the data, in an apparent unintended data race condition.
>>I don't recall any explanation of what happens if one thread has a mask set which prevents data modification while another thread modifies the data, in an apparent unintended data race condition.
With AVX you would be safe with any byte combinations in QWORD or DQWORD depending on instruction. The below is for QWORD, the DQWORD is extended to 128-bits/16-bytes
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B and 3C, Order Number: 325462-051US
Stores selected bytes from the source operand (first operand) into a 64-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are MMX technology registers. The memory location specified by the effective address in the DI/EDI/RDI register (the default segment register is DS, but this may be overridden with a segment-override prefix). The memory location does not need to be aligned on a natural boundary. (The size of the store address depends on the address-size attribute.)
The most significant bit in each byte of the mask operand determines whether the corresponding byte in the source operand is written to the corresponding byte location in memory: 0 indicates no write and 1 indicates write.
The MASKMOVQ instruction generates a non-temporal hint to the processor to minimize cache pollution. The non-temporal hint is implemented by using a write combining (WC) memory type protocol (see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MASKMOVQ instructions if multiple processors might use different memory types to read/write the destination memory locations.
This instruction causes a transition from x87 FPU to MMX technology state (that is, the x87 FPU top-of-stack pointer is set to 0 and the x87 FPU tag word is set to all 0s [valid]).
The behavior of the MASKMOVQ instruction with a mask of all 0s is as follows:
• No data will be written to memory.
• Transition from x87 FPU to MMX technology state will occur.
• Exceptions associated with addressing memory and page faults may still be signaled (implementation dependent).
• Signaling of breakpoints (code or data) is not guaranteed (implementation dependent).
• If the destination memory region is mapped as UC or WP, enforcement of associated semantics for these memory types is not guaranteed (that is, is reserved) and is implementation-specific.
The MASKMOVQ instruction can be used to improve performance for algorithms that need to merge data on a byte-by-byte basis. It should not cause a read for ownership; doing so generates unnecessary bandwidth since data is to be written directly using the byte-mask without allocating old data prior to the store.
In 64-bit mode, the memory address is specified by DS:RDI.
With AVX2 this is extended to 256 bits (32 bytes).
I am not sure about AVX-512, or whatever it will be called.
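The byte-selection rule in the MASKMOVQ description quoted above can be written out as a plain C model. This is a sketch of the selection logic only; maskmovq_model is a hypothetical name, and the model does not reproduce the non-temporal hint, the x87/MMX state transition, or any fault behavior.

```c
#include <stdint.h>

/* Scalar model of the MASKMOVQ byte select: a byte of src is written to dst
   only when the most significant bit of the matching mask byte is 1. */
void maskmovq_model(uint8_t *dst, const uint8_t src[8], const uint8_t mask[8])
{
    for (int i = 0; i < 8; i++) {
        if (mask[i] & 0x80)     /* only bit 7 of each mask byte matters */
            dst[i] = src[i];    /* otherwise dst[i] is left untouched */
    }
}
```

For example, a mask byte of 0x7f leaves the destination byte alone, while 0x80 or 0xc0 writes it.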
>>With AVX you would be safe with any byte combinations in QWORD or DQWORD depending on instruction. The below is for QWORD, the DQWORD is extended to 128-bits/16-bytes
This doesn't necessarily contradict your conclusion, but readers should note that you quoted the manual for the older SSE instruction MASKMOV rather than the AVX2 instruction VPMASKMOV. One crucial difference is that the older MASKMOV is a non-temporal streaming store that bypasses the cache, while the new VPMASKMOV is a normal store. There are likely other significant differences. Here's the description for VPMASKMOV from the reference manual for comparison:
Conditionally moves packed data elements from the second source operand into the corresponding data element of the destination operand, depending on the mask bits associated with each data element.
The mask bits are specified in the first source operand. The mask bit for each data element is the most significant bit of that element in the first source operand. If a mask is 1, the corresponding data element is copied from the second source operand to the destination operand. If the mask is 0, the corresponding data element is set to zero in the load form of these instructions, and unmodified in the store form.
The second source operand is a memory address for the load form of these instructions. The destination operand is a memory address for the store form of these instructions. The other operands are either XMM registers (for VEX.128 version) or YMM registers (for VEX.256 version).
Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero.
Unlike previous MASKMOV instructions (MASKMOVQ and MASKMOVDQU), a nontemporal hint is not applied to these instructions. Instruction behavior on alignment check reporting with mask bits of less than all 1s are the same as with mask bits of all 1s.
VMASKMOV should not be used to access memory mapped I/O as the ordering of the individual loads or stores it does is implementation specific.
In cases where mask bits indicate data should not be loaded or stored paging A and D bits will be set in an implementation dependent way. However, A and D bits are always set for pages where data is actually loaded/stored.
Note: for load forms, the first source (the mask) is encoded in VEX.vvvv; the second source is encoded in rm_field, and the destination register is encoded in reg_field.
Note: for store forms, the first source (the mask) is encoded in VEX.vvvv; the second source register is encoded in reg_field, and the destination memory location is encoded in rm_field.
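The load and store forms described in the quoted text differ in what happens to masked-off elements, which a small scalar model makes concrete. This is a sketch of the data movement for the 32-bit element case only; the function names are hypothetical, lanes would be 4 for the VEX.128 form and 8 for the VEX.256 form, and fault, ordering and A/D-bit behavior are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Load form: elements whose mask sign bit is set are copied from memory,
   all other destination elements are set to zero. */
void vpmaskmovd_load_model(int32_t *dst, const int32_t *mask,
                           const int32_t *mem, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        dst[i] = (mask[i] < 0) ? mem[i] : 0;   /* mem[i] is not read when masked off */
}

/* Store form: elements whose mask sign bit is set are written to memory,
   all other memory elements are left unmodified. */
void vpmaskmovd_store_model(int32_t *mem, const int32_t *mask,
                            const int32_t *src, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        if (mask[i] < 0)                       /* sign bit is the mask bit */
            mem[i] = src[i];
}
```

Only the most significant bit of each mask element matters, so a mask element of 1 is treated the same as 0.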
Is there a better place to report documentation bugs than here on this forum? I just noticed another. Given the easily confused names of these two instructions, the typo "VMASKMOV" (a non-existent instruction half way between the two) in the documentation I just quoted should be fixed.
From my understanding of non-temporal store, from reading the manuals and from comments from knowledgeable people on this forum, the non-temporal store will perform the same cache coherency behavior with respect to the other sockets/cores/threads, but with the distinction that for the same core, the store will not perform indirectly a load of the cache line into L1/L2.
Could one of the Intel engineers comment on this with respect to AVX2 and AVX-512, using VMASKMOV as well as scatter (aligned and unaligned for both instructions)?
Hmmm.... I had seen the description for the SSE MASKMOV instructions but I had not realized before that they could be used to substitute for the absence of the MOVNTSD instruction on Intel processors. (MOVNTSD is a "scalar" streaming store instruction introduced by AMD in the Family10h processors as part of a feature set called SSE4a.) On AMD systems this made a big difference in STREAM benchmark performance when the compiler could not guarantee alignment (and did not want to generate multiple paths with a run-time select). I also have an application code with stores to four different arrays in the inner loop and the Intel compiler can only generate non-temporal stores for two of the four arrays because they are aligned differently -- the MOVNTSD instruction could be used for the unaligned arrays on AMD processors, but I am a bit surprised that the Intel compiler does not use the MASKMOVDQU instruction in this case. I guess it is time to try the _mm_maskmoveu_si128() intrinsic!
Concerning non-temporal stores -- the coherence transactions are usually not the same as for ordinary stores. Ordinary stores that miss in the cache generate a RWITM (Read With Intent to Modify) transaction to:
Non-temporal stores typically do not perform the RWITM transaction. Instead, they collect the data in a write-combining buffer and then write the data directly to memory. This is typically treated in the same way as a DMA store from an IO device -- the memory accepts the data and the coherence controller broadcasts an invalidate transaction for that cache line.
The main difference (other than the issue of reading the data into the processor caches) is one of ordering. In the streaming store case the invalidation is broadcast at the end of the transaction, not at the beginning. This is one of the reasons that streaming stores are not strongly ordered with respect to ordinary stores, and why you need to put an SFENCE instruction after the streaming stores if you want subsequent ordinary stores to be ordered with respect to those streaming stores. E.g., in an OpenMP code that uses streaming stores in a parallel region, the SFENCE is required to ensure that the data is globally visible before the ordinary store used to enter the barrier at the end of the parallel region -- otherwise another processor could see "stale" data (either from memory or from their own cache) whose new value is "stuck" in a write-combining buffer that has not been flushed.
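The ordering requirement described above can be sketched in C as follows. This is a minimal illustration, not a complete synchronization scheme: publish, payload and ready are hypothetical names, and the streaming path is compiled only when the compiler defines __SSE2__ (ordinary stores are used otherwise).

```c
#include <stdatomic.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

/* Fill payload[0..n) with value, then raise the ready flag for other threads. */
void publish(int *payload, size_t n, int value, atomic_int *ready)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        _mm_stream_si32(payload + i, value);   /* non-temporal store (MOVNTI) */
#else
        payload[i] = value;                    /* ordinary store fallback */
#endif
    }

#if defined(__SSE2__)
    /* Make the streaming stores globally visible before the flag store.
       Without this fence the data could still be sitting in a
       write-combining buffer when another thread sees the flag. */
    _mm_sfence();
#endif

    /* Ordinary store that other threads wait on, like a barrier entry. */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

A consumer would wait for the flag to become 1 before reading the payload.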
Caveat: Some implementations may ignore the non-temporal feature in non-temporal stores to system memory and treat them as ordinary stores. This should not change the functionality of a program, but will typically change the performance.
Caveat: Some implementations may implement non-temporal stores using some features of ordinary stores and some features of non-temporal stores. For example an implementation might issue (a variant of) the RWITM transaction to notify cache coherence directories that a non-temporal store is "in progress" for a particular line without reading the line into the cache. Or an implementation might issue a RWITM and read the corresponding cache line into the write combining buffer, rather than into the cache. Such hybrid approaches may be required to support certain (generally obscure) cases of the Intel memory ordering model on systems with cache coherence directories or filters.
I have not experimented with the VPMASKMOV instruction, but as a rule I would consider the description in Volume 2 of the Software Developer's Manual as authoritative. The wording of the instruction description is certainly strange, but each of the statements there appears to be consistent with the hypothesis that alignment check exceptions are never thrown, no matter what address or mask bits are used, and no matter whether alignment checking is enabled or not. Furthermore, the statement about faults not being generated for addresses with mask bits set to zero only makes sense if the base address is allowed to be unaligned. A naturally aligned operand could never cross a cache line boundary, so there would be no need to distinguish between memory locations with mask bits set and those with mask bits unset.
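The boundary argument can be checked with a little address arithmetic. The sketch below assumes 64-byte cache lines and 4096-byte pages, which are typical for x86 but are assumptions here, and crosses_boundary is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the access [addr, addr + size) touches more than one
   block of the given size (size must be >= 1, boundary must be > 0). */
int crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    uintptr_t first = addr / boundary;               /* block of the first byte */
    uintptr_t last  = (addr + size - 1) / boundary;  /* block of the last byte */
    return first != last;
}
```

A 32-byte access at address 4064 (32-byte aligned) ends at byte 4095 and stays within one line and one page, while the same access at 4068 ends at byte 4099 and crosses both a 64-byte line and the 4096-byte page boundary. Only in the second case can some lanes land on a page that the other lanes do not touch.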
Most memory chips are byte wide. As such, the memory controller should be fully capable of enabling writes on individual bytes while performing reads on the other bytes. This then would permit:
IOW, place the combining on the other side of the memory controller. The write combining could further have two flavors:
Truly non-temporal where the CPU (program) is not interested in having the write combined data.
Quasi non-temporal where you would be interested in obtaining the combined data into a register without polluting the cache (think along the lines of a fictional VPMASKSWAB instruction).
(Argh! Just lost two hours of work on my attempt to reply. I hate computers!)
1. My first comments were about the ISA for streaming stores. AMD implemented the MOVNTSS and MOVNTSD instructions to store "scalar" data in SSE registers, while Intel did not implement these instructions. Intel has the similar MOVNTQ instruction (introduced with SSE, operating on MMX registers) and MOVNTDQ instruction (SSE2, operating on XMM registers), but it is not clear whether these can be used in the same way -- I have certainly never seen a compiler generate them.
The MASKMOVDQU instruction could implement the same functionality, but the AIDA64 instruction latency tables suggest that it is much slower, which may explain why I have never seen it generated either.
Because Intel lacks the fast MOVNTSS & MOVNTSD instructions, disabling vectorization also disables non-temporal stores. This was confusing to me, and I imagine that it has confused others as well.
2. My second set of comments were in response to the statement: "[...] the non-temporal store will perform the same cache coherency behavior [...]", which is generally incorrect. The standard implementations of non-temporal stores use different low-level coherence transactions (and issue them at different times) than the transactions used by standard (allocating) stores that miss in the caches. I probably added lots of confusion with my caveats about systems that "de-optimize" streaming stores (treating them more like ordinary stores) in order to support the Intel memory ordering model in the presence of more advanced cache coherence features such as directories or snoop filters.
3. Every system that supports streaming stores already has to support a read/modify/write functionality for partial cache line updates at the memory controller. (The functionality was probably already there to support IO, but streaming stores require it as well and there is no way to prevent partial cache line updates from occurring, even if the code only writes full lines.)
Every PC or server memory controller that I know of supports both x8 and x4 DRAMs. Since byte-level write-masking is supported only for x8 DRAM chips (and x16 starting with DDR4), the memory controller must support a read/modify/write mode when using x4 DRAM chips. Because of the need to support read/modify/write for x4 DRAMs, I don't know of any DDR2/3/4 memory controllers that bother to support the byte-masked writing mode -- but I would be happy to learn about any out there! (Xeon Phi reportedly uses GDDR5 byte-masking on writes to update the ECC bits in memory, but that is too long a story to go into now.)
Read/modify/write at the memory controller is also required for ECC support of partial-cache-line stores that update a set of bytes that only partly update an ECC-protected block. Of course the general case for streaming stores is to deliver a full line, so the ECC can be computed directly and the contents of memory simply overwritten without being read first.