Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Multidimensional Transpose -- Prefetching

Paul_S_
Beginner

Hello,

I have been investigating the performance of two multi-dimensional transpositions. Among
other things I have noticed that some transpositions take more time than others, despite
the fact that they move the same amount of data.

I ended up writing a small code-generator which generates vectorized (with AVX intrinsics)
code for a given transposition and all its possible loop orders (I made sure that icc is
not reordering the loops).

The two transpositions that I want to talk about are the following:
1) B_[1, 4, 2, 0, 3] <- A_[0, 1, 2, 3, 4] 
2) B_[4, 2, 0, 3, 1] <- A_[0, 1, 2, 3, 4]

Meaning that index 0 in the input of transposition (1) goes to index 3 in the output, and
so on. Moreover, the unit-stride index is the left-most index (e.g., index 0 in A).
The sizes of the indices are as follows: (32,256,16,16,16) for (1) and (32,16,16,16,256)
for (2).
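
To make the notation concrete, a scalar reference version of transposition (1) would look
like the loop nest below. This is only for illustration -- the generated kernels use AVX
intrinsics and all possible loop orders, the array/size names are placeholders, and I am
assuming the same left-most-unit-stride layout for B as for A:

   /* scalar reference for (1): B_[1,4,2,0,3] <- A_[0,1,2,3,4],
      sizes of A's indices: (32,256,16,16,16), left-most index has unit stride */
   enum { N0 = 32, N1 = 256, N2 = 16, N3 = 16, N4 = 16 };

   void transpose1_ref(const float *A, float *B)
   {
      for (int i4 = 0; i4 < N4; ++i4)
       for (int i3 = 0; i3 < N3; ++i3)
        for (int i2 = 0; i2 < N2; ++i2)
         for (int i1 = 0; i1 < N1; ++i1)
          for (int i0 = 0; i0 < N0; ++i0)
             /* B(i1,i4,i2,i0,i3) = A(i0,i1,i2,i3,i4) */
             B[i1 + N1*(i4 + N4*(i2 + N2*(i0 + N0*i3)))] =
                A[i0 + N0*(i1 + N1*(i2 + N2*(i3 + N3*i4)))];
   }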

On a side note: I will only show a subset of the performance results to make this story much shorter.

All performance results were taken on an Intel Xeon E5-2670 v2 CPU with SpeedStep and
Turbo Boost disabled. Moreover, the code was compiled with icc 14.0.2 20140120 using -O3
and -mavx.

The reported bandwidth is calculated as follows: 3 * sizeof(float) * size(A) / 1e9 / time.
I use the factor of 3 because B will be read into cache (write allocate), since I'm using
_mm256_storeu_ps instead of _mm256_stream_ps.

When comparing the fastest solution of each transposition (i.e., the fastest out of 120 = 5!
loop-orderings) we get the following numbers for a serial run:

Transpose (1) achieves 11.7 GB/s while transpose (2) only achieves 6.2 GB/s! At some point
I decided to turn the hardware prefetchers off (both MLC as well as adjacent-line
prefetching); the numbers changed significantly for transpose (1) while transpose (2) was
unaffected. The timings _without_ HW prefetching are: transpose (1) goes down to
6.2 GB/s as well, and transpose (2) stays at 6.2 GB/s. Everything else was kept the same,
and these timings are consistent over several runs.

Hence, I would conclude that HW prefetching made all the difference!

Please note that these are still the best results out of all the 120 loop-orderings. 

I also tried the same setup using multi-threading (arrays are initialized in parallel),
running with 40 threads and KMP_AFFINITY=compact (i.e., using SMT). As expected, the
additional threads help to cover the memory latency a little, but the overall effect is
still visible:
Transpose (1) achieves 62 GB/s while transpose (2) only achieves 51 GB/s (both with HW
prefetching). Disabling prefetching gives the same story: transpose (1) goes down
to 51 GB/s and transpose (2) stays the same.

My questions are the following:
1) Why does the prefetching work for one of the transpositions but not for the other?
2) Is it possible that the performance difference comes from something other than
prefetching?
3) Can I use SW prefetching to improve the performance of transposition (2) to the same
level as (1)? If so, what would this look like?
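
For reference, regarding (3) I imagine something along the following lines -- an untested
sketch, not code my generator currently emits; PF_DIST, the prefetch hint, and the stride
argument are all placeholder guesses:

   #include <immintrin.h>

   /* Hypothetical inner kernel: contiguous 8-float reads from A, large-stride
      writes to B (as in transpose (2)), with a SW prefetch of the store target
      a few iterations ahead.  PF_DIST would need tuning. */
   static void strided_store_with_sw_prefetch(const float *A, float *B,
                                              long n_vec, long strideB)
   {
      const long PF_DIST = 4;                      /* iterations ahead (guess) */
      for (long i = 0; i < n_vec; ++i) {
         if (i + PF_DIST < n_vec)
            _mm_prefetch((const char *)(B + (i + PF_DIST) * strideB),
                         _MM_HINT_T0);
         __m256 v = _mm256_loadu_ps(A + 8 * i);    /* contiguous read */
         _mm256_storeu_ps(B + i * strideB, v);     /* strided write   */
      }
   }

Would _MM_HINT_T0 be the right hint here, or rather _MM_HINT_NTA?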

Please feel free to ask me for additional data (I probably have it already).
I've also attached the code which was used to measure these results. You can compile it
with "make" and then run ./transpose1.exe or ./transpose2.exe.

Please also note that these results are independent of the compiler (e.g., gcc gives
comparable results). Moreover, the results on a Xeon E5-2650 are similar.

Any help is much appreciated.

Thanks a lot,
Paul

 transpose.tar

TimP
Honored Contributor III

I've been trying to build this for a while; you have some gratuitous system dependencies. For example, all Intel and gnu compilers accept __restrict, but not necessarily __restrict__.  I have only one AVX(2) box. I got through compilation with only a few "used before set" warnings, then failed to link, apparently due to a missing external library.

My expectation would be that you would want to arrange for the stores to be as nearly sequential as possible.  Assuming that makes the reads occur with large strides, software prefetch might be worthwhile there.   Hardware prefetch probably requires several reads in sequence per 4KB page to be successful, thus leaving an opening for software prefetch.  Transparent huge pages might kick in on your box (not mine); that could eliminate the need for very long-distance prefetches or improve h/w prefetch.

Intel compilers may determine automatically whether there is scope for streaming stores and report it, assuming you left it at -opt-streaming-stores=auto.  That would confirm there is no need for prefetch on those stores.  It's often worthwhile to test the opposite setting as well.

 

McCalpinJohn
Honored Contributor III

This is a problem that has been looked at many times over the years.   A quick look at a paper I did on the subject in 1995 suggests that it is still applicable (http://dl.acm.org/citation.cfm?id=1122054).

Important things to know:

  • The hardware prefetchers only work for streams within 4KiB pages.
    • (There is a page-crossing L1 prefetcher in Ivy Bridge, but it does not seem to make a lot of difference in my tests.)
  • The L2 hardware prefetchers can only track a limited number of pages (16 to 32, depending on the processor model).  
  • Hardware prefetchers tend to be much more aggressive at fetching read streams than at fetching write streams.
  • Transpose operations have to deal with non-contiguous accesses over some large area -- it helps if that area is small enough to be mapped by the TLBs.
    • For Ivy Bridge, this is 4KiB * 512 entries = 2 MiB when using the default page size.
  • Strided accesses in transpositions can very easily cause pathological cache conflicts.
    • If the stride is a multiple of 4KiB, the L1 Data Cache will only hold 8 entries before beginning to overflow.
  • Memory bandwidth performance in Intel processors is strongly degraded if a store maps to the same congruence class in the L1 cache as a preceding load.
    • This will happen whenever a store maps to the same location in its 4KiB page as a recently preceding load.

From these principles, it is pretty easy to come up with tuning guidelines that work well for transpositions.

  • Pad arrays to avoid power-of-two strides (or offsets) whenever possible.
  • Unrolling loops by 8 is usually a very good strategy.
    • This minimizes cache conflicts, so you (typically) get to read all the elements in a cache line before it gets evicted.
    • The number is small enough that the compiler does not typically get terribly confused about register scheduling.
  • When working in higher dimensions, make sure that the sub-blocks resulting from the unrolling don't cover more space than the TLB can cover.  
    • This may mean unrolling by less than 8 for some or all of the loops. 
    • Analysis from first principles is challenging for 2D transpositions, and is probably not practical for higher than 3D.
  • Favor contiguous reads over contiguous stores, but remember that there is minimal benefit to contiguous reads longer than 4 KiB.
  • If using streaming stores, remember to construct the code to store 64 contiguous bytes all at once, then store to the next block. 
    • This will maximize the number of full write-combining buffers that get used and will minimize the read/modify/write cycles at the DRAM (which are required if a partially full write-combining buffer gets flushed).

I have only looked at the 2D and 3D cases in detail, but unrolling and blocking are required to get good performance -- especially with arrays that have large power-of-two dimensions.
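
To make the unrolling/blocking guideline concrete for the simple 2D case, the structure is
roughly the following (a sketch only -- the block size, dimensions, and names are
placeholders, and the inner pair of loops is what the compiler is expected to
vectorize/unroll):

   /* Process the matrix in 8x8 tiles so that every cache line of the source and
      destination is used completely before it can be evicted.                  */
   void transpose_blocked(const float * restrict A, float * restrict B,
                          long rows, long cols)     /* A is rows x cols, row-major */
   {
      const long BS = 8;                            /* tile size: tune vs. TLB/cache */
      for (long ib = 0; ib < rows; ib += BS)
         for (long jb = 0; jb < cols; jb += BS) {
            long imax = (ib + BS < rows) ? ib + BS : rows;
            long jmax = (jb + BS < cols) ? jb + BS : cols;
            for (long i = ib; i < imax; ++i)
               for (long j = jb; j < jmax; ++j)
                  B[j * rows + i] = A[i * cols + j]; /* B is cols x rows */
         }
   }

The same idea carries over to higher dimensions, except that the footprint of the blocks
has to stay within what the TLB can map, as noted above.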

McCalpinJohn
Honored Contributor III

Some more thoughts...

(1) Parallelization of transposition on NUMA systems introduces some additional opportunities for contention.

Assuming a 1D data decomposition, the straightforward parallelization of a nested set of loops will result in all of the cores starting by reading from node 0, with 1/2 writing back to node 0 and 1/2 writing to node 1.  (Or the converse.)   Once they have finished that step, they all read from node 1, with 1/2 writing back to node 1 and 1/2 writing to node 0.   This is not a disaster on a 2-socket system, but it is sub-optimal.

For 2 sockets it only takes a little bit of swizzling to get rid of the uneven memory loading.   For larger NUMA systems it helps to build the transpose in phases (a code sketch follows the list below). 

  • In Phase 0 each node reads the block of the source that is local and transposes it into the block of the destination that is local. 
  • In Phase 1, each node "N" reads the block of the source from node "N+1" (modulo the number of nodes) and transposes it into the block of the destination that is local. 
  • In Phase 2, each node "N" reads the block of the source from node "N+2" (modulo the number of nodes) and transposes it into the block of the destination that is local. 
  • etc
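
In code, the phase schedule amounts to something like the sketch below (2D case, square
N x N row-major matrix, 1D row decomposition; transpose_block() stands in for whatever
blocked kernel is actually used, and my_node/nnodes would come from the affinity layer):

   /* Transpose the rows of A owned by 'src' into the locally-owned rows of B. */
   static void transpose_block(const double *A, double *B, long N,
                               long i0, long i1, long j0, long j1)
   {
      for (long i = i0; i < i1; ++i)
         for (long j = j0; j < j1; ++j)
            B[j * N + i] = A[i * N + j];
   }

   void numa_phased_transpose(const double *A, double *B, long N,
                              int my_node, int nnodes)
   {
      long rpn = N / nnodes;                 /* rows per node (assume it divides) */
      long j0  = (long)my_node * rpn;        /* destination rows that are local   */
      for (int phase = 0; phase < nnodes; ++phase) {
         int  src = (my_node + phase) % nnodes;
         long i0  = (long)src * rpn;         /* source rows read in this phase    */
         /* reads rotate across nodes while writes always stay local; a barrier
            between phases keeps all nodes from piling onto the same source      */
         transpose_block(A, B, N, i0, i0 + rpn, j0, j0 + rpn);
      }
   }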

(2) In performance tuning it is always a good idea to know when to quit.  For this case, I ran the STREAM benchmark on a very similar system (2-socket Xeon E5-2680 v2 with 256 GiB DDR3/1866 per node), but with memory placement interleaved across the two nodes.   The STREAM Copy kernel should give an approximation to the upper bound of transpose performance (assuming the data is spread across the memory of both sockets).  Using icc 14.0 and "-O3 -xAVX -ffreestanding -openmp -opt-streaming-stores [always|never]":

  • With streaming stores, I get 66.8 GB/s using 20 threads.
  • Without streaming stores, I get 59.4 GB/s using 20 threads.  
    • STREAM does not count the "write allocate" traffic, so the raw traffic is 50% higher than this (89.1 GB/s). 
    • Execution time is inversely proportional to the STREAM bandwidth, so removing the streaming stores increases the execution time by a factor of (66.8/59.4)=1.125 -- about a 12.5% performance penalty.

Based on past experience on many systems, I usually quit performance tuning on transposes when I reach 60% of the STREAM Copy value.  Beyond (roughly) this point, the performance bottlenecks increase in number and complexity, so the return on investment for additional performance tuning decreases considerably.

The reported (raw) value of 62 GB/s corresponds to about 70% of the 89 GB/s (raw) value from STREAM Copy without streaming stores, so my experience suggests that the remaining performance upside is limited.  The STREAM results suggest a possible upside from using streaming stores, but these are tricky to apply in the transpose case, so there is no guarantee that you will be able to find a code organization that will be able to exploit this potential.

Paul_S_
Beginner

Thank you for your answers.

To be honest, the only things that I was planning to tune for, w.r.t. transpositions, were
vectorization (i.e., spatial locality) and the number of TLB entries. Looking at the
performance results and your suggestions, I might add a heuristic which measures the
"linearity of memory accesses" and the size of the stride, as well as the trip-count of the
inner-most loop (to help the prefetcher learn).

Regarding "linearity of memory accesses", does the HW prefetcher look at every load
instruction individually and tries to predict its next address or does it look at the
memory accesses of all memory instructions and tries to determine a pattern?

I have modified the code to use streaming stores and I noticed that the performance goes
down significantly (14 GB/s). Everything is well aligned and the results are
correct. Can you think of an easy explanation? I have also checked the assembly code and
I can confirm that the compiler indeed used 'vmovntps'.

John, apparently I ran into a case for which the rule of thumb "prefer contiguous reads
over writes" does not apply. In this case the inner-most loop prefers contiguous writes,
but everything comes back to the HW prefetcher: without HW prefetching both loop orderings
are identical. I guess this is because the size of the stride-1 index of the output
is larger than that of the input, which allows the HW prefetcher to detect
the pattern more easily. All that to say: looking at the huge variety of different
transpositions (and their implementations, sizes, ...), I agree that an analysis from first
principles seems impractical.

Moreover, I've measured the 'copy' bandwidth of the STREAM benchmark using 40 threads and it
reports 62.2 GB/s. However, 'triad' reports 75.8 GB/s. Is that a typical scenario?

McCalpinJohn
Honored Contributor III

The Xeon E5-2670 v2 has at least four hardware prefetchers (maybe five, depending on whether the new "Next Page Prefetcher" is a separate unit or a function of one of the existing prefetchers).   As far as I know the only documentation on these prefetchers is in Section 2.2.5.4 of the Intel Optimization Reference Manual (document 248966), with two sentences describing the Ivy Bridge "Next Page Prefetcher" in Section 2.2.7.

These descriptions are helpful, but certainly not complete enough to really understand what is going on. 

The difference between STREAM Copy and Triad results can be due to two factors:

  1. If compiled without streaming stores (e.g., with gcc), the bandwidth is typically limited at the DRAMs, so the difference in the number of write allocates for the two kernels leads to a difference in reported results -- even if the raw DRAM activity is the same.  STREAM Copy reports 16 Bytes of traffic per iteration, which is 2/3 of the raw traffic with write allocates.  STREAM Triad reports 24 Bytes of traffic per iteration, which is 3/4 of the raw traffic with write allocates.  So the ratio of the two would be (3/4)/(2/3)=9/8=1.125.  Your ratio is about 1.22:1, which seems a bit too large for this to be the only explanation.
  2. If compiled at high optimization levels, both the Intel compiler and the gcc compiler will replace the Copy kernel with a call to an "optimized" memory copy routine.  I call it "optimized" because the optimization typically favors single-thread performance over multi-thread performance.  With either compiler the "-ffreestanding" flag prevents this replacement.    If there is a significant difference between the Copy and Scale performance (which have the same memory reference pattern), it is probably a good idea to try again with the "-ffreestanding" flag.

Depending on the memory configuration, you may find slightly better performance for STREAM if you don't use all the threads.  In general you want to use no more than one core per 2 DRAM banks.   If you have one dual-rank DIMM per channel, that provides 2x8=16 banks and you are OK up to 8 threads/socket.  If you have two dual-rank DIMMs per channel, that provides 2x2x8=32 banks and you are OK up to about 16 threads/socket.  My Xeon E5-2680 v2 systems have 2 dual-rank DIMMs per channel and run best with 10 cores.  We don't have HyperThreading enabled on those nodes, so I can't test it, but lots of experience suggests that using more than 16 threads/chip would degrade performance by a few percent.  Systems with one single-rank DIMM per channel can experience much larger performance reductions -- I have seen ~15% performance loss on STREAM and up to 30% with applications that have more memory access streams per thread.

 

The low performance with streaming stores suggests that the code is generating too many streaming store streams.  The hardware only supports a small number of streaming store buffers, and if you interleave writes across too many streams, the store buffers will flush prematurely.  This will result in shorter, less efficient data transfers across the ring, and may force the memory controller to perform read/modify/write cycles on the cache line.  (The memory controller can buffer these partial cache line stores in the hope of combining them into full cache line stores, but whether it will succeed or not depends on lots of implementation details that are not visible to us.)   

There are other possibilities that I can think of, but I am not sure that I can remember them well enough to avoid accidentally disclosing confidential information.

Paul_S_
Beginner

Thanks for all the very helpful insights.

Looking at the limited information in the Intel documents, it seems that the IP prefetcher
is more important for my purposes than the other prefetchers. Unfortunately the document
doesn't specify how many slots it has (i.e., for how many loads), nor does it say how long
it takes to pick up a pattern.

Furthermore, I can only see two prefetchers in the BIOS, the MLC prefetcher and
the adjacent-cache-line prefetcher. However, I just noticed that I seem to be able to
enable/disable each of the four prefetchers separately, so that I can test which
prefetcher is making all the difference: 
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

Can you point me to some good document/book that sheds some light on HW prefetchers in
general?

As you guessed correctly, 'scale' indeed had a higher BW than 'copy'. You were also
right about '-ffreestanding': adding it to the compile line improved the performance to 73
GB/s.

I also paid attention to thread placement and reduced the number of threads, with the
following results (all with '-ffreestanding'):
KMP_AFFINITY=compact OMP_NUM_THREADS=16 taskset -c 0-3,5-8,10-13,15-18 ./stream_c.exe: copy / triad: 71 / 83 GB/s
KMP_AFFINITY=compact OMP_NUM_THREADS=20 taskset -c 0-19                ./stream_c.exe: copy / triad: 75 / 83 GB/s
KMP_AFFINITY=compact OMP_NUM_THREADS=40                                ./stream_c.exe: copy / triad: 72 / 75 GB/s
KMP_AFFINITY=scatter OMP_NUM_THREADS=20 taskset -c 0-19                ./stream_c.exe: copy / triad: 63 / 69 GB/s
KMP_AFFINITY=scatter OMP_NUM_THREADS=40                                ./stream_c.exe: copy / triad: 54 / 55 GB/s

Hence, hyper-threading indeed reduces the performance by ~10% in the triad case. It seems
to be optimal to use 10 threads/socket. 

What do you mean by "streaming store streams"? I have never heard this term before. Does it
mean that the CPU tries to buffer streaming stores into X streams to reduce the activations
of different memory banks? If so, I guess that the non-linear memory accesses of the
six-dimensional transpose are creating too many streams, is that correct?

You also mentioned that "The memory controller can buffer these partial cache line stores
in the hope of combining them into full cache line stores, ...". Is the following scenario
a possible explanation of the performance problem that I'm seeing?

   We have X buffers, but Y > X consecutive streaming stores write to different,
   non-consecutive locations (i.e., none will fill up a whole cache line). This would
   require the memory controller to fetch each cache line separately, then modify it and
   finally write it back. This seems inefficient not only in terms of BW but also in terms
   of latency. On the other hand, without streaming stores, the modified cache lines are
   kept in cache and may be combined at a later stage (given that the cache has a larger
   capacity than the stream buffers).

Thanks,
Paul

McCalpinJohn
Honored Contributor III

I don't know of any reliable references for recent prefetcher designs.  I have been on the processor design teams at IBM and AMD and what I remember is that there are a lot of picky details.  Many of the details depend on confidential idiosyncrasies of the implementation of the load/store unit, the cache, or the coherence protocol.   Some of my speculations on the algorithms used by the Intel L2 streaming prefetchers are at https://software.intel.com/en-us/forums/topic/569935

Sorry for the ambiguous nomenclature on "streams".   It looks like you figured out what I meant.   On AMD Family10h processors, for example, there were 4 "write-combining buffers" used to collect the data from streaming stores.  So you could interleave stores to four different cache lines, but if you tried to interleave stores across 5 (or more) cache lines, every new store would cause one of the "write-combining buffers" to send its payload to the memory controller prematurely.   The processor would send a "partial cache line write" transaction to the memory controller, and (if the rest of the line does not show up during the memory controller's buffering window) the memory controller has to perform a read/modify/write cycle on the cache line (rather than just writing the data, which it would do if a full cache line of data arrived).   The read/modify/write cycle is required because the memory controller has to re-write the ECC bits, and those have to be computed based on the merged data.

It appears that Intel processors use the "Line Fill Buffers" to hold the data for streaming stores.  Most recent Intel processors have 10 of these buffers per core, but the documentation warns that you should not assume that interleaving streaming stores across multiple cache lines will be fast.   This is another reason to unroll loops to operate on full cache lines -- then you can rearrange the stores so that all the stores to a particular cache line are performed in contiguous instructions. 
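
With 32-byte AVX stores that means issuing the two halves of each 64-byte line back to
back, roughly like this (sketch only -- the destination is assumed 32-byte aligned, and a
real transpose kernel would produce the two vectors from its in-register shuffle network):

   #include <immintrin.h>

   /* Write one full 64-byte cache line with streaming stores before moving on
      to the next line, so the fill buffer is handed off completely full.      */
   static inline void stream_store_line(float *dst, __m256 lo, __m256 hi)
   {
      _mm256_stream_ps(dst,     lo);   /* bytes  0..31 of the line */
      _mm256_stream_ps(dst + 8, hi);   /* bytes 32..63 of the line */
   }

   /* ... and issue an _mm_sfence() after the store loop, before the data is
      read by another thread. */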

TimP
Honored Contributor III

In some applications I supported, usage of line fill buffers was optimal when 6 to 9 separate streams were written per core.  One would expect that re-use of one or more line buffers will stall while the completed cache line they contain is being written out.  As John hinted, if you make a point of writing full cache lines, you will not need to be concerned about the number of buffers.  AVX-512 appears to be oriented toward full cache-line-aligned writes, which are quite important on the current Intel(r) Xeon Phi(tm).

Prior to Woodcrest,  Intel NetBurst CPUs used a write-combining scheme such as John discussed, with at most 6 wc buffers.

I don't recall an explanation why the write-combining buffering terminology was replaced by line fill buffers, but the cache updating scheme appeared to have been improved.

As John said, alternating stores to too many cache lines will thrash the buffers, requiring partial cache line writes and read-back.  The situation is not so bad with AVX-256 parallel store, as the worst case then would be 2 writes per cache line rather than 1.

Intel compilers appear to consider opportunities to split (distribute) and fuse loops at -O3 so as to optimize the number of line fill buffer streams automatically.  I haven't seen loop unrolling (other than vectorization for 512-bit registers) cause compilers to sort stores and align loops to handle full cache lines.

The buffers are shared between hyperthreads, so the threads on a given core compete for access to line fill buffers.  This may be one reason why hyperthreads don't help with memory bandwidth.
