The Optimization Reference Manual, page 2-16, Table 2-6 (Lookup Order and Load Latency) has a footnote that says "Subject to execution core bypass restriction shown in Table 2-4".
1. Does the execution core bypass restriction that causes the load latency to vary by 3 cycles (7-4=3) depending on the data type apply to L2 and LLC or only to L1? The location of the footnote in Table 2-6 suggests it applies only to L1 but someone told me it does apply to L2 and LLC.
2. Does the addressing mode (base+offset with offset<2048 or not) affect the load latency from L2 and LLC?
3. Table 2-8 shows the same load latency for X87 and 256-bit AVX (7 cycles) when offset>2048 but different load latencies (6 or 7 cycles) when offset<2048. Is one of these numbers a typo? If having a large offset with base+offset addressing increases the load latency by one cycle for integer, MMX, SSE, 128-bit AVX and X87, why doesn't it increase the load latency by one cycle for 256-bit AVX?
4. Do two 128-bit AVX loads dispatched on the same cycle have less latency than one 256-bit AVX load? Please explain.
By the way, the second column in Table 2-8 should probably say ">=2048" instead of ">2048". (That's my obsessive-compulsive disorder at work!)
1. Yes, it applies to MLC (L2) and LLC latency as well. An x87 load that hits the MLC or LLC will have a latency 2 cycles longer than an integer load that hits the MLC or LLC.
2. Yes, a load that uses base+offset addressing with offset<2048 will have an MLC or LLC latency 1 cycle shorter than a load that uses any other addressing mode.
3. Checking...does seem like a typo.
4. Yes, this is a design limitation that comes from moving twice as much data between the memory and AVX stacks.
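For anyone who wants to check answers 1 and 2 on their own hardware, here is a minimal pointer-chasing sketch in C. It is only an illustration under several assumptions: the names `node_t`, `make_ring`, `chase_small`, `chase_large` and `ns_per_load` are made up for this example, the offsets 64 and 4096 are arbitrary values on either side of the 2048 boundary, and the compiler is merely expected (not guaranteed) to emit a base+offset load for each field access, so the disassembly should be inspected before trusting any numbers. Each load depends on the previous one, so the time per step approximates the load-to-use latency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical node layout: one "next" pointer at a small displacement
   (64 < 2048) and one at a large displacement (4096 >= 2048). */
typedef struct node {
    char pad_small[64];
    struct node *volatile next_small;                  /* byte offset 64   */
    char pad_large[4096 - 64 - sizeof(struct node *)];
    struct node *volatile next_large;                  /* byte offset 4096 */
} node_t;

static_assert(offsetof(node_t, next_small) == 64, "small offset must be 64");
static_assert(offsetof(node_t, next_large) == 4096, "large offset must be 4096");

/* Keeps the final pointer observable so the chase is not optimized away. */
node_t *volatile chase_sink;

/* Build a ring of `count` nodes; both pointers in node i lead to node i+1. */
node_t *make_ring(size_t count) {
    node_t *nodes = calloc(count, sizeof *nodes);
    if (nodes == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        node_t *next = &nodes[(i + 1) % count];
        nodes[i].next_small = next;
        nodes[i].next_large = next;
    }
    return nodes;
}

/* Dependent loads through the pointer at the small displacement. */
node_t *chase_small(node_t *p, long steps) {
    while (steps-- > 0)
        p = p->next_small;
    return p;
}

/* Dependent loads through the pointer at the large displacement. */
node_t *chase_large(node_t *p, long steps) {
    while (steps-- > 0)
        p = p->next_large;
    return p;
}

/* Average CPU time per dependent load in nanoseconds, measured with clock().
   Converting to cycles requires knowing the actual core frequency. */
double ns_per_load(node_t *(*fn)(node_t *, long), node_t *start, long steps) {
    clock_t t0 = clock();
    chase_sink = fn(start, steps);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

To use it, build a small ring (for example `make_ring(4)`, which should stay in the L1 data cache), call `ns_per_load(chase_small, ring, steps)` and `ns_per_load(chase_large, ring, steps)` with a large `steps` value, and compare the two results. Making the ring large enough to spill out of L1 would be one way to probe the MLC and LLC cases, although hardware prefetching and a sequential ring order can distort those numbers.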
> 4. Do two 128-bit AVX loads dispatched on the same cycle have less latency than one 256-bit AVX load?
> 4. Yes, this is a design limitation that comes from moving twice as much data between the memory and AVX stacks.
Two 128-bit loads move the same amount of data as one 256-bit load, so I don't understand this answer. If there really is a limitation related to the amount of data being moved, then it seems that the data for one of the two 128-bit loads would get delayed by one or two cycles, depending on where the typo is in Table 2-8. Column 2 of Table 2-8 says a 256-bit AVX load is one cycle slower than a 128-bit AVX load, while column 3 says it is two cycles slower. When two 128-bit AVX loads are dispatched on the same cycle, does the data for one of them get delayed by one or two cycles compared to the data for the other 128-bit AVX load?