I'm wondering what the guidelines are for using LDDQU vs. MOVDQU on the latest and future Intel CPUs.
I know that during the NetBurst era LDDQU was supposed to be a more efficient way of loading unaligned data when the data was not going to be modified soon. Later, in the Core architectures, MOVDQU was updated to become equivalent to LDDQU. The general guideline was therefore to use LDDQU: it would be no worse than MOVDQU on newer CPUs and faster on older ones.
However, in Agner Fog's latest instruction tables for Skylake I can see that LDDQU has one cycle more latency than MOVDQU, which leads to the following questions:
1. Does this mean that LDDQU is no longer equivalent to MOVDQU? If so, what is the difference?
2. Is this discrepancy an unfortunate (mis-)feature of the Skylake architecture that is intended to be "fixed" in future architectures, or is the change permanent?
3. What are the guidelines for choosing one instruction over the other? I'm interested in both modern architectures (say, Haswell and later) and future CPU architectures.
I thought the main benefit of LDDQU was for data that could potentially span a cache line boundary. In that case the penalty wasn't as severe with LDDQU. That's what the Intel Intrinsics Guide still says.
I guess on Skylake you have to weigh up whether the extra cycle of latency is worth it for the faster cache-line-spanning case.
Richard Nutman wrote:
I thought the main benefit of LDDQU was in data that could potentially span a cache line boundary. In this case the penalty wasn't as severe with LDDQU.
Yes, exactly. LDDQU would load two cache lines and then combine the parts that are requested by the load. My understanding is that MOVDQU does the same on modern CPUs (except Skylake, apparently) where memory type permits, so the two instructions become equivalent.
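To make the equivalence concrete, here is a minimal C sketch that checks the two loads return the same bytes at every offset of a two-line buffer, including the offsets where the 16-byte load spans both lines. The names demo_buf, loads_agree and count_agreeing_offsets are made up for illustration, a 64-byte cache line is assumed, and GCC or Clang on x86 is assumed for the attributes. It only shows the architectural result is identical; it says nothing about timing.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 (MOVDQU), _mm_storeu_si128 */
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 (LDDQU) */
#include <string.h>

/* Hypothetical demo buffer: 128 bytes aligned to 64, i.e. two cache
 * lines under the assumption of 64-byte lines. */
static unsigned char demo_buf[128] __attribute__((aligned(64)));

/* Returns 1 if both unaligned loads from p yield the 16 bytes at p. */
__attribute__((target("sse3")))
int loads_agree(const unsigned char *p)
{
    unsigned char a[16], b[16];
    __m128i via_movdqu = _mm_loadu_si128((const __m128i *)p);
    __m128i via_lddqu  = _mm_lddqu_si128((const __m128i *)p);
    _mm_storeu_si128((__m128i *)a, via_movdqu);
    _mm_storeu_si128((__m128i *)b, via_lddqu);
    return memcmp(a, b, 16) == 0 && memcmp(a, p, 16) == 0;
}

/* Tries every offset at which a 16-byte load still fits in the buffer.
 * Offsets 49..63 make the load span both assumed cache lines. */
int count_agreeing_offsets(void)
{
    int ok = 0;
    for (int i = 0; i < 128; i++)
        demo_buf[i] = (unsigned char)(i * 7 + 1);
    for (int off = 0; off <= 128 - 16; off++)
        ok += loads_agree(demo_buf + off);
    return ok;
}
```

Calling count_agreeing_offsets() should report agreement for all 113 offsets on any SSE3-capable CPU, since both instructions are defined to load the same 16 bytes.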
There is no reason to use LDDQU in new code targeting modern chips. It was faster on a small slice of ancient P4 CPUs, but since the Core 2 architecture (released about a decade ago) the instructions have done the same thing and have had the same performance. I don't think it's actually slower (and I checked Agner's guide and it shows the same latency) - so there isn't much downside either. Just use MOVDQU for consistency and because going forward (e.g., in AVX512) LDDQU is being dropped. Here are all the details you'd probably ever want:
Keep in mind that Agner doesn't really report load-to-use latency for loads. Instead he times a load-store loop and arbitrarily divides the latency between the load and the store, so the latency figures aren't that meaningful. I've timed LDDQU and it has the same timing as MOVDQU.
Thank you for the response, Travis D. One minor correction: Agner's tables (the version from 2018-04-27 that I have) do list MOVDQA/U and LDDQU with arguments "x, m128" as having different latencies for Skylake and SkylakeX; I've just double-checked. As you say, this may be an artifact of his method of measurement, but it leaves me wondering why it is there, given that presumably the two instructions were measured under the same conditions.
You are right, I scanned the table too quickly (and was looking at the store MOVDQU not load). Agner does report MOVDQU as 2 cycles and LDDQU as 3.
Anyway, I went ahead and tested it, and on my Skylake hardware MOVDQU and LDDQU perform identically in my load-load latency test for both aligned and misaligned (cache-line crossing) loads, at 6 cycles and 12 cycles respectively (misaligned loads that don't cross a cache line are just as fast as aligned loads). I think Agner's doc is just wrong in this case. It is almost certainly the case that both instructions in fact decode to the same internal uop, so the performance will be identical.
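Since only loads that actually cross a cache line pay the penalty, it can help to check which addresses do. A small sketch, assuming 64-byte cache lines; CACHE_LINE_BYTES and crosses_cache_line are illustrative names, not part of any library:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on the Intel CPUs discussed here. */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if a load of `width` bytes (width >= 1) starting at `addr`
 * touches two different cache lines, 0 otherwise. */
int crosses_cache_line(uintptr_t addr, size_t width)
{
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line  = (addr + width - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

For a 16-byte load, offsets 0 through 48 within a line stay inside it (48 covers bytes 48..63), while offsets 49 through 63 spill into the next line.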
You can see the test here: https://github.com/travisdowns/uarch-bench/commit/2fda5cc6028631e5b7a6e008e2e09c840de86d06 and run it on your hardware if you want. You should subtract 2 cycles from the displayed results (I get 8 and 14 displayed cycles for the aligned and misaligned cases, respectively) because the movq rax, xmm0 which brings the loaded value back into the GP domain takes 2 cycles.
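For readers who don't want to dig through the repository, the idea behind a load-load latency test can be sketched in C roughly as follows. This is not the uarch-bench code (that is written in assembly and handles timing properly); it is only an illustration of the dependency chain, assuming x86-64 with GCC or Clang. The names chase_buf, make_self_pointer, chase_lddqu and chase_movdqu are invented here. Each 16-byte load reads a slot that contains its own address, and the low 64 bits are moved back to a general-purpose register to form the next load address, so each load has to wait for the previous one.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_cvtsi128_si64 */
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 */
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer: two assumed 64-byte cache lines. */
static unsigned char chase_buf[128] __attribute__((aligned(64)));

/* Store the slot's own address in its low 8 bytes so that loading the
 * slot yields the address of the next load (the same slot again).
 * offset 0 gives an aligned load; offset 56 makes the 16-byte load
 * span both lines. offset must be <= 112. */
unsigned char *make_self_pointer(size_t offset)
{
    unsigned char *slot = chase_buf + offset;
    uint64_t self = (uint64_t)(uintptr_t)slot;
    memset(slot, 0, 16);
    memcpy(slot, &self, sizeof self);
    return slot;
}

/* Dependent chain through LDDQU: load, move to GP register, repeat. */
__attribute__((target("sse3")))
unsigned char *chase_lddqu(unsigned char *p, long iters)
{
    for (long i = 0; i < iters; i++) {
        __m128i v = _mm_lddqu_si128((const __m128i *)p);
        p = (unsigned char *)(uintptr_t)_mm_cvtsi128_si64(v);
    }
    return p;
}

/* Same chain through MOVDQU. */
unsigned char *chase_movdqu(unsigned char *p, long iters)
{
    for (long i = 0; i < iters; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)p);
        p = (unsigned char *)(uintptr_t)_mm_cvtsi128_si64(v);
    }
    return p;
}
```

Timing each function over a large iteration count with a cycle counter and dividing by the count would give the per-iteration latency, which includes the move back to the GP register; that is why its cost has to be subtracted from the displayed figures. Check the generated assembly before trusting any numbers, since the compiler is free to pick its own instruction encodings.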
Thanks again. Unfortunately, I don't have Skylake hardware, so I can't test.
Well, you don't need Skylake hardware. Any hardware will do, and indeed testing on your own hardware is likely to be what you are most interested in.
In fact, a non-Skylake result (particularly on older hardware, say Sandy Bridge) would be more interesting since it would give a data point on a new architecture, rather than simply confirming what I've already tested on Skylake.