Cache/page lines and LDDQU

dark_shikari
Beginner
I submitted this question to email support, but apparently it was cut off by the character limit, so I figure I have no choice but to post it here.

I'm a developer for the x264 video encoder. When Loren Merritt, another developer, was testing the performance of our SAD assembly routines, he noticed that the number of clocks required varied wildly. In particular, when the unaligned data to be loaded crossed a cache line boundary, which occurred 25% of the time for blocks 16 bytes wide, up to eight times as many clock cycles were required for the entire operation, depending on the processor. This issue was found to exist not only for SSE2 load operations like MOVDQU, but also for MMX, and on processors going all the way back to the Pentium 3 (though obviously the cache line intervals varied); we did not test anything older than the Pentium 3. Page boundaries were also encountered every 64 cache lines, i.e. 1/256th of the time. Data crossing a page boundary resulted in up to 100 times slower performance: thousands of clock cycles for a single 16x16 SAD operation that normally took 48. However, when testing the same operations on AMD processors, we noticed there was no measurable penalty at all for unaligned loads across cache lines, and furthermore, the page-boundary penalty was a mere 5-10%.

We created a workaround using PALIGNR for SSSE3 (i.e. for Core 2s) that eliminated most of the performance hit. We would have preferred to use LDDQU, but our testing showed that LDDQU gives no benefit on Core 2s; it appears to behave like a plain MOVDQU, since the performance doesn't improve. However, LDDQU works as intended on Prescotts and other NetBurst-based processors, along with the Core 1, allowing us to implement this workaround successfully on SSE3 chips. Further code involving 16 separate hardcoded data loads was implemented for SSE2 and MMX processors.
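To make the load flavors concrete, here is a minimal intrinsics sketch of the options discussed above (our real code is handwritten asm; the names here are just for illustration, and the PALIGNR offset must be a compile-time constant):

#include <emmintrin.h>   /* SSE2: _mm_loadu_si128 */
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 */
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 */
#include <stdint.h>

/* Plain unaligned load (MOVDQU): may incur a cacheline-split penalty. */
static inline __m128i load_movdqu(const uint8_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

/* LDDQU: architecturally the same result; faster across cachelines on
   Prescott-class chips, but no faster than MOVDQU on Core 2. */
static inline __m128i load_lddqu(const uint8_t *p)
{
    return _mm_lddqu_si128((const __m128i *)p);
}

/* SSSE3 workaround: two aligned loads that straddle (base + OFF), spliced
   with PALIGNR, so the hardware never issues a split load. base must be
   16-byte aligned and OFF a compile-time constant in 1..15. */
#define LOAD_SPLICED(base, OFF) \
    _mm_alignr_epi8(_mm_load_si128((const __m128i *)(base) + 1), \
                    _mm_load_si128((const __m128i *)(base)), (OFF))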

The result is a near-50% decrease in the average time for a 16x16 SAD operation on Intel CPUs, and given how heavily the encoder relies on this and similar operations, this could easily speed up x264 on Intel CPUs by 10%, 15%, or even more. Without these changes, Core 2s actually take longer (on average) to do a 16x16 SAD than Athlon 64s, which is absurd given the faster SSE2 processing the Core 2 is supposed to have.

Note that we cannot use aligned loads due to the nature of video encoding; such a scheme would require 16 copies of the frame, each shifted by one byte, which is a ridiculous overhead in terms of memory, cache pollution, and memcpy time.

Now that I have explained the background, here are my specific questions:
1. What causes the huge decrease in performance when loading across cache lines and page lines? Does Intel recommend any better way to mitigate the effect other than what we have already done?
2. Is this performance decrease mitigated at all in the new Penryn core?
3. Why does LDDQU not work correctly on the Core 2 line--is this an intentional regression from the Core 1? Does it work correctly on the Penryn?
4. Is there any detailed documentation on these sorts of issues that could be helpful in improving x264's performance on Intel processors?
TimP
Honored Contributor III
gcc avoids entirely the use of instructions which would incur cache line splits.
In some cases, where icc can determine where the cache boundaries occur, it will use full cache line unrolling, so that unaligned memory references can be restricted to within the cache line.
The cache line split penalty is much more severe for unaligned stores. Compilers should align loops (using remainder loops at both ends) so that as many store instructions as possible are aligned.
Penryn processors are designed to mitigate the cache line split penalty.
It looks like you have as much experience as anyone with the workarounds for these issues on various processors.
Some forms of DTLB miss were much more severe on NetBurst. In some cases, on Core 2, it is possible to use software prefetch to hide page miss latency. This probably will not work effectively when there is high bus utilization.
When it is appropriate to store to a cache line and simultaneously evict it from cache, the movnt instructions are useful.
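For example (a generic sketch, not anything from x264), a non-temporal copy with intrinsics looks roughly like this:

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Copy n 16-byte blocks between 16-byte-aligned buffers using streaming
   stores, so the written lines are not kept resident in the cache. */
static void copy_streaming(__m128i *dst, const __m128i *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si128(&dst[i], _mm_load_si128(&src[i]));
    _mm_sfence();   /* order the weakly-ordered streaming stores */
}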

dark_shikari
Beginner
tim18:
gcc avoids entirely the use of instructions which would incur cache line splits.
This is pure assembly; using GCC-generated code for the motion comparison metrics would give us the world's slowest encoder.

We aren't doing any store operations at all that cross cache lines, as far as I know; the only major ASM store operations are likely in the form of pavgb, for qpel motion searching, and motion compensation in general, both of which output to aligned memory if I recall correctly. DCT/iDCT also stores, but again, to an aligned memory chunk.

I was hoping to get a bit more information, since, for example, the page boundary issue is completely undocumented as far as I've seen, and so is the LDDQU regression. It would be nice to know why all these workarounds are necessary on Intel CPUs.
TimP
Honored Contributor III
Opteron CPUs resolve TLB misses more quickly than Intel CPUs, but the same may not be true of Barcelona. If you are writing code so as to accentuate the advantage of Opteron over more recent CPUs, your market may dwindle.
If you are so sure that you don't want to use a compiler, you should have found your way to more documentation than appears to be the case. It should be well known that hardware prefetch stops at page boundaries on many CPUs, and that Intel Core CPUs don't suffer (or, in specific cases, benefit) from combinations of software and hardware prefetch to the extent that early Pentium 4 did.
dark_shikari
Beginner
tim18:
If you are writing so as to accentuate the advantage of Opteron over more recent CPUs, your market may dwindle.
None of the AMD CPUs we have tested have any issue with cache line splits. The "market will dwindle" statement is meaningless, though; the program automatically picks the best asm function for each operation at startup, based on the CPU type. My entire post was about optimizations for Intel processors, which would increase speed on Intel processors, not AMD processors. The Core 2 already wipes the floor with the Athlon 64 in x264 benchmarks; this will only increase that lead.

tim18:
If you are so sure that you don't want to use a compiler
Well, the entire program is of course compiled with gcc, but the non-assembly version is an order of magnitude slower. Just try the --no-asm command-line option on the encoder for loads of fun.

tim18:
Intel Core CPUs don't suffer (or in specific cases, benefit) from combinations of software and hardware prefetch to the extent that early Pentium 4 did.
Where can I find specific documentation on this change?
TimP
Honored Contributor III
I don't think the prefetch change was documented clearly. Some people found they could optimize their applications for the original P4 by issuing 3 software prefetches ahead of a loop in order to start up the hardware prefetcher, then letting hardware prefetch take over. One of the measures taken before Prescott was to decouple software and hardware prefetch, so that scheme no longer works. In general, it was necessary to avoid software prefetch, such as was often used for the P-III, in order to get satisfactory performance on P4.

levicki
Valued Contributor I

I see some questions here are still unanswered, for example: why does LDDQU seem not to work as intended on the Core 2 Duo?

The page boundary issue is more likely caused by a page fault. Try touching the next page in advance if you are going through a large dataset in memory; if you use a small buffer, lock it so it cannot be swapped out and touch all of its pages before you start processing.
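Something along these lines (a hypothetical sketch for Linux; adjust for your platform) would do it:

#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

/* Pin a buffer in RAM and touch every page once, so the timed encoding
   path does not take page faults on first access. */
static int pin_and_touch(void *buf, size_t len)
{
    if (mlock(buf, len) != 0)            /* keep the pages from being swapped out */
        return -1;
    long page = sysconf(_SC_PAGESIZE);
    volatile uint8_t *p = (volatile uint8_t *)buf;
    for (size_t off = 0; off < len; off += (size_t)page)
        (void)p[off];                    /* fault each page in ahead of time */
    return 0;
}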

SHIH_K_Intel
Employee

Hi, thank you for your question. It covers many areas dealing with performance, from individual instructions to algorithmic issues, and any one of these areas may be worth a long discussion, so my reply may ramble a bit. I'll start with the instruction-level issues dealing with microarchitectures.

Latency and throughput of MOVDQU. Basically, different microarchitectures handle unaligned loads with different performance characteristics. Here's what I've found on several recent microarchitectures. Please note that the result for this instruction alone can be complicated by factors such as address alignment and locality. To make it a bit simpler, I'll just compare two address alignments (aligned uses an offset of 0 bytes; unaligned uses an offset of 57 bytes, so a cacheline split must be handled by the microarchitecture) with L1 data cache locality. I'll show the best estimates I have, based on software measurements of 5 machines, in the order Northwood (2001), Prescott (2004), Merom (2006), Penryn (2007), and Opteron (I only have access to a 2005 Opteron chip).

MOVDQA throughput (no cacheline split): Nwd: 2, Psc: 1, Merom: 1, Penryn: 1, Opteron: 2
MOVDQU throughput (no cacheline split): Nwd: 4, Psc: 2, Merom: 2, Penryn: 2, Opteron: 2.5~3
MOVDQU latency (no cacheline split): Nwd: ~mid-20s, Psc: ~mid-20s, Merom: ~10, Penryn: ~10, Opteron: low teens
MOVDQU throughput (with cacheline split): Nwd: ~30, Psc: ~mid-20s, Merom: ~20, Penryn: ~17, Opteron: ~3.3
MOVDQU latency (with cacheline split): Nwd: 50+, Psc: ~40, Merom: ~21, Penryn: ~18, Opteron: ~20+
(all figures in cycles)

Handling cacheline splits has always been a challenge for microarchitectures. Deeper-pipelined machines with aggressive speculative execution tend to face more challenges.

The LDDQU instruction has always been defined to have the same architectural behavior of input and output state as the load flavor of MOVDQU. Different microarchitectures may implement hardware to optimize the performance of cacheline-split loads. PSC did put in extra hardware so that the throughput of cacheline-split loads is ~2 cycles, to mitigate the long-pipe issues that manifest in handling cacheline splits. Merom and Penryn have shorter pipelines, and the tradeoff there was made so that LDDQU does the same thing as MOVDQU.

These are just the relatively simple cases of getting data from L1 and interacting with the out-of-order engine. L2 accesses or cache misses get into other interesting interactions with other subsystems in the microarchitecture. These would be interesting to some folks at some level, but the overarching question is: how would software benefit from finer details that can change from one stepping to another? The fundamental heuristic remains that the fewer the cacheline splits, the more likely the application's performance will improve.

This naturally leads to the central question: how feasible is that heuristic in real-world problems, such as the one that prompted your question?

I don't know enough details of the PSADBW implementation in your block search, but I'll use similar code that one of our colleagues published on motion estimation as a proxy, since it faces the same unaligned memory access challenges. (See http://software.intel.com/en-us/articles/motion-estimation-with-intel-streaming-simd-extensions-4-intel-sse4)

At the center of such a block-search problem, we have two 16-byte loads, a PSADBW and a PADDUSW, where one of the loads is from a stationary block, and the other is from a roving block that must cover the entire target frame.

One of the easiest and most intuitive approaches is simply to issue MOVDQU for both the stationary block (which can start anywhere in the parent frame) and the roving block. But this is very sub-optimal.

Since the stationary block is read outside the loops that rove around the target frame, a good compiler would easily take the approach of making a copy of the unaligned stationary block at a 16-byte-aligned address. This immediately turns half of the 16-byte memory references into aligned loads.

Your interest in using LDDQU is obviously a step in the right direction of trying to reduce the cacheline splits that a significant portion of the other half of the 16-byte memory references would experience.

IMHO, the other approach you alluded to, using PALIGNR, is actually better than LDDQU (even if LDDQU hadn't become a plain MOVDQU in our recent processors).

Using Kiefer's SSE2 intrinsic code (see the link above) as an example, the inner loop for one pair of 16x16 blocks involves 16 PSADBW, 16 PADDUSW, 16 loads done as 16-byte-aligned loads, and 16 loads done with MOVDQU.

Instead of shifting the target block one pixel at a time to the right (which is why MOVDQU is needed), the 16-byte loads in the target frame could be done at intervals of 16 bytes, and PALIGNR would be used to splice together the target block between the 16-byte mileposts. With the PALIGNR approach, the target 16x16 block in between the mileposts can be synthesized from registers in flight. The savings: the number of memory accesses is reduced in the first place, and the number of cacheline splits is naturally reduced as well.
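As a rough sketch of the idea (my own illustration with made-up names, not Kiefer's code; it assumes the reference frame's stride is a multiple of 16 so every row shares the same milepost offset, and the offset must be an immediate):

#include <emmintrin.h>   /* SSE2 */
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 */
#include <stdint.h>

/* SAD of one 16x16 block pair where the target rows start OFF bytes past a
   16-byte-aligned milepost. cur and ref_milepost are 16-byte aligned, and
   both strides are multiples of 16. One instance is needed per OFF (1..15)
   because PALIGNR takes an immediate. */
#define DEFINE_SAD16X16_OFF(OFF) \
static int sad16x16_off##OFF(const uint8_t *cur, int cur_stride, \
                             const uint8_t *ref_milepost, int ref_stride) \
{ \
    __m128i sum = _mm_setzero_si128(); \
    for (int y = 0; y < 16; y++) { \
        const __m128i *r = (const __m128i *)(ref_milepost + y * ref_stride); \
        __m128i lo = _mm_load_si128(r);      /* milepost chunk      */ \
        __m128i hi = _mm_load_si128(r + 1);  /* next aligned chunk  */ \
        __m128i ref_row = _mm_alignr_epi8(hi, lo, OFF);  /* spliced row */ \
        __m128i cur_row = _mm_load_si128((const __m128i *)(cur + y * cur_stride)); \
        sum = _mm_add_epi32(sum, _mm_sad_epu8(cur_row, ref_row)); \
    } \
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8)); /* fold the two halves */ \
    return _mm_cvtsi128_si32(sum); \
}

DEFINE_SAD16X16_OFF(5)   /* e.g. one instance for a 5-byte misalignment */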

Additionally, it seems plausible that encoder software could have some control over the placement of each frame. If the beginning of each frame can be placed on a 64-byte boundary, there may also be an opportunity to control the address alignment of each scanline in some cases (if padding is allowed). These may be additional opportunities to reduce the number of unaligned memory loads and improve performance.
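For instance, something like this (a hypothetical sketch; the names are made up) gives a 64-byte-aligned frame base and a stride padded to a multiple of 64, so every scanline starts at the same cacheline offset:

#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

/* Allocate a frame whose base is on a 64-byte boundary and whose stride is
   rounded up to a multiple of 64 bytes. */
static uint8_t *alloc_frame(int width, int height, size_t *stride_out)
{
    size_t stride = ((size_t)width + 63) & ~(size_t)63;
    void *buf = NULL;
    if (posix_memalign(&buf, 64, stride * (size_t)height) != 0)
        return NULL;
    *stride_out = stride;
    return (uint8_t *)buf;
}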

Intel_Software_Netw1

Dark_Shikari:
I submitted this question to the email support but it was cut off by the character limit, apparently, so I figure I have no choice but to post it here.

It's good that this topic was opened up for public viewing and discussion on the forums, but developers can always contact us directly, too (isn dot support at intel dot com).

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Contact us

dark_shikari
Beginner
Thank you for the detailed response--I was just directed by someone to look at this thread again, and lo and behold, there was an answer.

If you want to look at our cacheline split implementation, see this file in our git repository.

You mentioned SSE4 in regard to motion estimation, and linked to that article (which I have seen before)--while we're on that topic, what exactly is the intended purpose of MPSADBW? Sequential elimination is faster than an MPSADBW-based exhaustive search by a considerable margin (though I have a few ideas for slight accelerations of SEA using MPSADBW).
pengvado
Beginner
tim18:
Some forms of DTLB misses were much more severe on NetBurst. In some cases, on Core 2, it is possible to use software prefetch to hide page miss latency.
Igor Levicki:
Page boundary issue is more likely because of a page fault.

The page split penalty can't be attributed to page faults. x264 uses only a few MB of memory, and reuses the same memory throughout the entire encode, so page faults can't account for a measurable fraction of an hours-long run. TLB misses don't sound plausible either: while a real program will of course involve some misses, the unit test with which I found the page split issue fit in L1 cache and looped millions of times.

So the question remains: why do page boundaries matter at all, given that the data is already in L1 cache? This is also the undocumented part. Documents such as the Intel 64 and IA-32 Architectures Optimization Reference Manual or Agner's Optimizing subroutines in assembly mention the cacheline split penalty, but I still haven't found anything that mentions the page split penalty, unless it's talking about x264.

I suppose one could say that it's not important, because if you avoid cacheline splits you've also avoided page splits. But it made benchmarking hard until I figured out what was going on. Note that page splits are so expensive that even though they're 64 times rarer, they account for about 22% of the total cycles spent on splits (assuming random alignment; roughly, 224 cycles per page split versus 13 cycles for each of the other 63 cacheline splits out of every 64 works out to about 22%).

wbc8:
If the beginning of each frame can be placed on 64-byte boundary, there may also be opportunity to control the address alignment of each scanline in some cases (if padding is allowed).

I tried that. The problem is that motion estimation accesses every possible offset in the reference frame (or for non-exhaustive searches, a subset of the offsets that's still too large), so the only way to avoid cacheline splits by controlling alignment is to have two copies of the reference frame, offset by half a cacheline. It turned out that in x264 the extra memory cost almost as much time in cache thrashing as it saved in splits, and was far slower than the collection of asm functions we ended up using.


Another thing Dark Shikari didn't mention: Why does PALIGNR take an immediate and not a register? I had to instantiate 15 copies of the SSSE3 SAD function with different immediate values, whereas the MMX and SSE2 versions used one instance with a variable shift (though the shifts are of course slower). Is it just because prior to SSE4 there weren't any x86 instructions taking 3 registers?

TimP
Honored Contributor III
As the 16 DTLB entries for L1 cache on Core 2 cover only 64KB of data, I'm not convinced you have eliminated TLB misses as a significant possibility. There are counters for this in VTune and PTU.
I'm still confused about which architecture you mean to discuss.
dark_shikari
Beginner
Ah yes, pengvado, I realize that in my original post (well before I delved deep into the cacheline code to learn how it really worked) I confused the MMX/SSE version, which requires multiple instances of the function, with the PALIGNR version, which uses a computed jump to handle the immediate value.
pengvado
Beginner
Here's a trivial program to benchmark the cacheline and page split costs: pagesplit.c. As you can see, it measures the loading of just one memory address over and over.
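(For the impatient, the gist of it is something like the following; the real pagesplit.c is linked above, and this is just a simplified sketch of the same measurement.)

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc */

#define ITERS 10000000

/* Time repeated 16-byte unaligned loads from one fixed address that either
   stays within a cacheline, crosses a cacheline, or crosses a page. */
static uint64_t time_loads(const uint8_t *p, unsigned *sink)
{
    __m128i acc = _mm_setzero_si128();
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < ITERS; i++)
        acc = _mm_add_epi8(acc, _mm_loadu_si128((const __m128i *)p));
    uint64_t cycles = (__rdtsc() - t0) / ITERS;
    *sink += (unsigned)_mm_cvtsi128_si32(acc);   /* keep the loads live */
    return cycles;
}

int main(void)
{
    unsigned sink = 0;
    uint8_t *buf = aligned_alloc(4096, 8192);    /* two pages, page-aligned */
    if (!buf) return 1;
    for (int i = 0; i < 8192; i++) buf[i] = (uint8_t)i;  /* fault both pages in */

    printf("within line:     %llu cycles/load\n", (unsigned long long)time_loads(buf + 16, &sink));
    printf("cacheline split: %llu cycles/load\n", (unsigned long long)time_loads(buf + 4024, &sink));
    printf("page split:      %llu cycles/load\n", (unsigned long long)time_loads(buf + 4088, &sink));
    free(buf);
    return (int)(sink & 1);
}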
Results on my Core 2 E6600: addresses within a cacheline take 2 cycles per load, a cacheline split takes 13 cycles, and a page split takes 224 cycles. oprofile says it incurs about 500 DTLB misses per second, and that every cacheline split is an L1d miss, despite the data definitely already being in cache. There is no significant number of L2 misses (even though a page split takes about the same time as an L2 miss; oh well, one guess down).
(I failed to install VTune on Gentoo, so I'm using oprofile instead. I think it measures the same things.)
tim18:
I'm still confused about which architecture you mean to discuss.
No specific architecture. Our issue applies to all the Intel CPUs we could get our hands on, including the Pentium 3/4/4D/4E/M, Core, and Core 2.
TimP
Honored Contributor III
Surely there are significant differences among these architectures.
One of my colleagues rates 1 DTLB miss per 1000 instructions retired as "bad." You appear to have confirmed a significant rate of DTLB misses (probably L1 DTLB, although L2 DTLB misses are possible without cache misses). I guess you can infer for your CPU that cache line splits involve refreshing (2 cache lines of?) L1, as a data modification would.
Yes, it's true that only the latest CPU models have begun to work on the cache line split performance issue.
levicki
Valued Contributor I

Have you tried locking the memory to prevent paging?
