Hi,
I'm currently trying to figure out how Xeon Phi performance depends on memory alignment. It is of course beneficial to use at least 64-byte alignment for SIMD instructions. Beyond that, however, I'm observing another large performance dependency when I choose a coarser memory alignment. For my specific application I get the best performance if I allocate memory inside the offloaded region using _mm_malloc, aligned to 32KB boundaries. Allocating outside the parallel region and/or choosing a smaller or larger alignment gives up to 20% worse performance. I'd like to understand what other alignment issues have to be taken into consideration, and how to get the same performance when the memory is allocated on the host.
I'm also observing a ~4% deviation over different runs of the same binary. This seems to be caused by the memory allocations: reusing a once-allocated block of memory gives very little deviation, while freeing and reallocating memory gives the same deviation as separate runs of the executable. This is strange since the reallocation returns the same addresses for my memory blocks.
My application uses four memory blocks of 64MB each, organized as 256^3 cubes of 32-bit integers. If it's not possible to answer my question in general I might post my full code later.
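For reference, a minimal sketch of the allocation pattern described above (the size and the 32KB alignment are taken from this post; the variable names and everything else are placeholders):

#include <stdint.h>
#include <mm_malloc.h>

int main() {
    const size_t n = 256;                                 // cube edge length
    const size_t bytes = n * n * n * sizeof(uint32_t);    // 64 MB per block

    #pragma offload target(mic:0)
    {
        // allocate on the coprocessor, aligned to a 32 KB boundary
        uint32_t* spins = (uint32_t*)_mm_malloc(bytes, 32 * 1024);
        // ... simulation loop runs here ...
        _mm_free(spins);
    }
    return 0;
}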
Without your code it is hard to say what is going on here, but aligning on 32KB boundaries is definitely strange. I'm not even sure what _mm_malloc thinks of such a request. So I don't have anything specific to say, but here are some generalities:
If you haven't already done so, you probably want to have the compiler generate a vectorization report. Try '-vec-report=5' (the old way of generating the report) or '-qopt-report=5 -qopt-report-phase=vec'.
You used _mm_malloc to allocate the arrays, but did you also use either '#pragma vector aligned' or '__assume_aligned(<array_name>,<align_boundary>)' where the arrays are used? (The '#pragma vector aligned' needs to go right before the loop that is being vectorized.) When it vectorizes a loop, the compiler can't always determine that the array is aligned unless you tell it so at the point where you use the array. (I wonder if this is why moving the _mm_malloc outside the parallel region had such a big effect.)
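As an illustration (not taken from the attached code), the two hints would be applied roughly like this; the array names and trip count are placeholders:

#include <stdint.h>

void add_arrays_pragma(uint32_t* a, const uint32_t* b, int n) {
    // Option 1: promise that every array accessed in the next loop is aligned
    #pragma vector aligned
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}

void add_arrays_assume(uint32_t* a, const uint32_t* b, int n) {
    // Option 2: assert the alignment of individual pointers before the loop
    __assume_aligned(a, 64);
    __assume_aligned(b, 64);
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}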
Is the data alignment maintained inside each thread in your parallel region? If you are parallelizing on an outer loop, so that an individual thread only ever works on whole rows (since you are in C) in the vectorized loop, you are ok. If you have collapsed everything and the loop that vectorizes is also the loop that is being parallelized, an individual thread might be given a part of the array to work on that isn't aligned.
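For example (a sketch assuming a 256^3 cube of uint32_t as in the original post; the array name is a placeholder and the loop body is a dummy), parallelizing the outermost loop keeps every thread on whole rows, so the vectorized inner loop always starts on a 64-byte boundary because a 256-element row of uint32_t is 1024 bytes:

#include <stdint.h>
#include <stddef.h>

void sweep(uint32_t* cube) {   // cube assumed 64-byte aligned (e.g. from _mm_malloc)
    #pragma omp parallel for
    for (int z = 0; z < 256; ++z)
        for (int y = 0; y < 256; ++y) {
            uint32_t* row = &cube[((size_t)z * 256 + y) * 256];
            #pragma vector aligned
            for (int x = 0; x < 256; ++x)
                row[x] ^= 1u;   // placeholder body
        }
    // By contrast, collapsing all three loops and letting OpenMP split the
    // flattened index range can hand a thread a chunk that starts mid-row,
    // i.e. not on a 64-byte boundary.
}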
Hi,
thanks for your reply. I probably should have said that I use intrinsics all over the place, so I'm not depending on vectorization by the compiler. I have also attached my source code. I compile it using: icc ising.cpp -O3 -openmp -lrt -std=c++11 -opt-prefetch-distance=6,1
I'm currently getting the best performance with the environment variables KMP_AFFINITY=compact,granularity=fine and KMP_STACKSIZE=32k. This gives me a peak performance of 5.8 picoseconds per spin update on a Xeon Phi 3120. The calculation is memory bound and depends on proper prefetching and cache locality. I hope my source code helps in finding the cause of my problems. If you notice any other opportunities for performance improvement, please let me know.
Allocating the larger, aligned buffers might reduce by one the number of DTLB entries required (per allocation) to run the functions. That reduction of one entry can bring the number used down to less than or equal to the number available, whereas with the non-32KB-aligned data the number of DTLB entries required is exceeded, and the core has to re-fetch DTLB entries in addition to the normal data.
You might also consider experimenting with the DTLB page size (I haven't done this).
Jim Dempsey
There are lots of places where alignment might matter. Some are easier to see than others....
The L1 cache won't see 32KiB alignment as different than any other kind of alignment, but with large pages the L2 cache might see a difference. The L2 cache is described as 512 KiB and 8-way associative, so for addresses on large pages the L2 cache appears to be 64 KiB "tall" with each entry being 8-way set-associative. (The L2 cache is also described as having 2 banks, but I don't recall seeing a description of how the bits are divided between the banks.)
When using 4KiB pages, the data TLB has 64 entries with 4-way associativity, so addresses that are 16 pages apart (64 KiB) will map to the same set. This is no trouble if you only access 4 such arrays (per core), but it could be a problem if you try to access more than 4. Note that the TLB is shared across the threads of a core, so even 2 arrays separated by a multiple of 64 KiB can be a problem when using 3-4 threads. The data TLB is also 4-way associative for 2MiB pages, but with 2MiB pages a smaller alignment like 32KiB should not have any special behavior.
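A small self-contained check of that arithmetic (the set-indexing scheme here is my assumption based on the 64-entry, 4-way description above, i.e. 16 sets indexed by the low bits of the virtual page number):

#include <stdint.h>
#include <stdio.h>

// 64 entries / 4 ways = 16 sets; with 4 KiB pages the set is assumed to be
// the virtual page number modulo 16, so pages 64 KiB apart share a set.
static unsigned dtlb_set(uintptr_t vaddr) {
    return (unsigned)((vaddr >> 12) & 0xF);
}

int main(void) {
    uintptr_t base = 0x10000000;
    // five arrays spaced exactly 64 KiB apart all land in the same set,
    // one more than the 4 ways available
    for (int i = 0; i < 5; ++i)
        printf("array %d at +%3d KiB -> DTLB set %u\n",
               i, 64 * i, dtlb_set(base + (uintptr_t)i * 64 * 1024));
    return 0;
}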
If ECC is enabled, memory should not care about 32KiB alignment (as I will discuss in some detail below), but if ECC is disabled there is a chance that 32 KiB alignment is special. (I have not looked at DRAM mapping with ECC disabled, but 8 memory controllers or 16 channels with 4KiB DRAM page size makes 32KiB and 64 KiB "natural" repeat ranges for many types of mappings.)
I don't recall if I posted this set of results here before, but I recall seeing a significant slowdown in the STREAM benchmark on Xeon Phi when the offset between corresponding data items used by different threads was within a few hundred elements of 8 MiB. So when the total array size divided by the number of threads was divisible by 8 MiB, there was a conflict somewhere that dropped the STREAM Triad memory bandwidth from ~178 GB/s to ~145 GB/s -- almost 19%.
Each thread was working on (a piece of) 2 or 3 vectors, but a bunch of experiments showed that the slowdown was not dependent on the relative alignment of those 2-3 addresses within each thread. Instead, it depended on the distance between the starting points used by the various threads.
In this case it was pretty easy to rule out all the standard conflict models. The code was run with one thread per core, so no conflicts inside the core could be responsible. On most systems I would have blamed a DRAM conflict, but on Xeon Phi when ECC is enabled the DRAM mapping is based on blocks of 62 contiguous cache lines mapped to a repeating permutation of the 16 DRAM channels. (The other 2 cache lines in each 4KiB DRAM page are used to hold the ECC data, so the user's "physical addresses" have to be mapped to skip over those DRAM locations.) This mapping has a factor of 31, which does not correspond to the observed slowdown near large powers of 2.
The only explanation that I can think of cannot be confirmed or denied based on public documentation. We know that each physical cache line address is mapped to a Distributed Tag Directory that assists with managing cache coherence. The mapping of physical cache line addresses to Distributed Tag Directories is unpublished, but it appears to be pseudo-random within each 4KiB page (i.e. address bits 11:6) and slowly varying for consecutive 4KiB pages. If the mapping is based on a hash of a limited number of address bits, addresses that differ only in the *higher* address bits will all map to the same DTD. I interpret these results as suggesting that the hash uses address bits starting at bit 6 (just above the cache line boundary) but ending before bit 23. Then physical addresses that differ only in bits 32:23 will map to the same DTD, and sequences of addresses that start at these locations will access the same sequence of DTDs in the same order, so the conflict will be persistent.
So in the STREAM case, the idea is that at the start of the parallel section, all 60 cores miss in their L2 and send their requests to 1 of the ~60 Distributed Tag Directories. This contention partially serializes the memory accesses and results in the observed slowdown. Even if every core sends out 8 software prefetches, they will all go to the same set of 8 DTDs because of the fixed offset, so you will get some parallelism in the DTDs, but not nearly as much as is needed. During the execution of the parallel section the cores will get out of sync and the accesses will be spread over more DTDs, but at the end of each parallel section the threads are brought back into sync and the conflicts at the DTDs start again.
Thanks for your very extensive answer.
I just did some testing and measured a performance dependence on the relative alignment of my arrays as well. Instead of four separate memory allocations I allocated one large block of memory and placed my four arrays in it with different padding. Again I found that 32KB of padding between the arrays gives the best performance. The absolute alignment is now less critical and also gives good performance with large alignment boundaries.
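A sketch of that layout, assuming the 64MB arrays from earlier in the thread (the names block and cube are placeholders). Since both the array size and the 32KB gap are multiples of 32KB, every array start stays 32KB-aligned:

#include <stdint.h>
#include <mm_malloc.h>

int main() {
    const size_t cube_bytes = 256ull * 256 * 256 * sizeof(uint32_t);  // 64 MB per array
    const size_t pad        = 32 * 1024;                              // 32 KB gap between arrays

    // one large allocation, itself aligned to 32 KB
    char* block = (char*)_mm_malloc(4 * (cube_bytes + pad), 32 * 1024);

    uint32_t* cube[4];
    for (int i = 0; i < 4; ++i)
        cube[i] = (uint32_t*)(block + (size_t)i * (cube_bytes + pad));

    // ... use cube[0..3] ...
    _mm_free(block);
    return 0;
}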
However, I'm still having the problem with allocations outside the offload region, which give a performance degradation of about 20%. I need those allocations because I want to save the whole array to a file after a certain number of timesteps and then continue simulating.
Assume you record once every 100 iterations; then you would be looking at roughly 1% of (20% / 2) for a copy operation, where the copy operation is on the order of 1% of the compute operation.
Also, the output buffers can be allocated inside the parallel region and then used both outside and inside the parallel region. There is no reason to allocate these buffers more than once for the life of the program.
Also look at using non-temporal stores when copying to the output buffers.
An alternative that can pick up more time is to duplicate your inner compute loop: one version as it is now, and another that stores to both the current array and the output buffer. This would save you a read.
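A minimal sketch of the non-temporal copy idea (buffer names are placeholders; the nontemporal hint asks the compiler for streaming stores so the snapshot copy does not evict the compute loop's working set from cache):

#include <stdint.h>
#include <stddef.h>

// copy the current state into the already-allocated output buffer;
// the loop can additionally be split across threads with OpenMP
void snapshot(uint32_t* out, const uint32_t* cur, size_t count) {
    #pragma vector nontemporal
    for (size_t i = 0; i < count; ++i)
        out[i] = cur[i];
}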
Jim Dempsey
OK, I guess with an additional buffer it would be possible to allocate memory in the offload region and copy the data to the host through that buffer. Using the nocopy parameter to keep a pointer that was allocated with _mm_malloc inside an offload region seems to be allowed (I haven't tested it yet). However, I'm still wondering what makes allocations on the host different from allocations inside the offload region.
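For what it's worth, an untested sketch of that nocopy idea (a variable visible on both host and card holds the card-side pointer; the names are placeholders and this is only my reading of the offload model, not verified code):

#include <stdint.h>
#include <stddef.h>
#include <mm_malloc.h>

// pointer exists on both host and card; only the card-side copy is ever set
__attribute__((target(mic))) uint32_t* mic_system;

void allocate_on_card(size_t bytes) {
    #pragma offload target(mic:0)
    {
        mic_system = (uint32_t*)_mm_malloc(bytes, 32 * 1024);
    }
}

void run_iterations() {
    // nocopy keeps the runtime from trying to transfer whatever the
    // pointer refers to; the card-side value set above is reused
    #pragma offload target(mic:0) nocopy(mic_system)
    {
        // ... 500 iterations using mic_system ...
    }
}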
If you are not using nocopy now, then the offload is copying the empty output buffer to the MIC on each offload. I suggest you investigate and ensure that the allocations are performed once and that the copy goes only in the direction needed. This may require an initial offload to perform the allocation, followed by subsequent offloads using the in, out and into parameters as required.
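A sketch of that one-time-allocation pattern using the standard alloc_if/free_if modifiers (the buffer name and length are placeholders; offload_transfer is used for the data-only steps):

#include <stdint.h>
#include <stddef.h>

// 'system' points to host memory holding 'count' uint32_t elements
void simulate(uint32_t* system, size_t count) {
    // first transfer: allocate the card-side buffer and copy the data in once
    #pragma offload_transfer target(mic:0) in(system : length(count) alloc_if(1) free_if(0))

    // compute offload: reuse the card-side buffer, transfer nothing
    #pragma offload target(mic:0) nocopy(system : length(count) alloc_if(0) free_if(0))
    {
        // ... 500 iterations ...
    }

    // when a snapshot is needed: copy back only, keeping the allocation alive
    #pragma offload_transfer target(mic:0) out(system : length(count) alloc_if(0) free_if(0))
}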
Jim Dempsey
If you take a look at my code above you will see that I only use one large offload region, with the simulation loop inside a parallel region. Currently all allocations happen inside the offload region using _mm_malloc, but since I need the data on the host I wanted to allocate the memory on the host, transfer it to the Phi card on the first offload, and only copy it back to the host when needed. However, I'm already failing at the first step, because allocating outside the offload block and using the offload 'in' parameter gives a 20% slowdown, which is unacceptable. Just to be clear, here is a snippet of example code:
// First version, working fine
#pragma offload target(mic:0)
{
    uint32_t* system = _mm_malloc(...);
    // 500 iterations
}

// Second version, 20% slower
uint32_t* system = _mm_malloc(...);
#pragma offload target(mic:0) in(system : length(...))
{
    // 500 iterations
}
Of course I only measure time for the iterations and not the allocations. Using other offload parameters like align doesn't seem to help.
// Third version
uint32_t* system = _mm_malloc(...);
#pragma offload target(mic:0) in(system : length(...) align(32768))
{
    // 500 iterations
}
Note, you wouldn't perform the 2nd or 3rd version unless you were passing data into (and/or out of) the MIC.
Jim Dempsey
