Benefit of aligned memory access

Georg_V_ · ‎03-20-2013

Hi,

I have an application that needs to "stream" through arrays of floats. A lot of time is actually spent loading and storing __m512 vectors of floats. Unfortunately, those arrays are not 64-byte aligned, and changing the code to support alignment would be a lot of work (much more than just adding a few attributes). So to access my data, I am using code like this:

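[cpp]#include <immintrin.h>

// Unaligned load of 16 floats, assembled from the two cache lines the data may span.
inline __m512 vec_loadu_ps(float const *p_p) {
    __m512 v = _mm512_setzero_ps();
    v = _mm512_loadunpacklo_ps(v, p_p);       // elements up to the next 64-byte boundary
    v = _mm512_loadunpackhi_ps(v, p_p + 16);  // remaining elements from the following line
    return v;
}[/cpp]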

Question: How much benefit would be expected by changing the code to have 64 byte aligned data, so that I can use _mm512_load_ps() to access much of the data? Will I see substantial benefit? Or is it rather minor (as I would expect from experience with Nehalem)?

Georg
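One way to use aligned loads for "much of the data" without realigning the arrays themselves is to split each range into an unaligned head, a 64-byte-aligned body, and a tail. The following is only a sketch of that index arithmetic, not code from this thread: the names `Split` and `split_for_alignment` are made up for illustration, and it assumes the pointer is at least 4-byte aligned (true for any valid float*).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: element counts for the three parts of a range.
struct Split {
    std::size_t head;  // leading floats before the first 64-byte boundary
    std::size_t body;  // floats that can be read as whole aligned 16-float vectors
    std::size_t tail;  // leftover floats after the last full vector
};

// Assumption: p is at least 4-byte aligned, so stepping float by float
// eventually lands exactly on a 64-byte boundary.
inline Split split_for_alignment(const float* p, std::size_t n) {
    const std::size_t kAlign = 64;  // bytes per vector / cache line
    const std::size_t kVec = 16;    // floats per __m512
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t misalign = static_cast<std::size_t>(addr % kAlign);
    std::size_t head = (misalign == 0) ? 0 : (kAlign - misalign) / sizeof(float);
    if (head > n) head = n;  // range ends before the boundary is reached
    const std::size_t body = (n - head) / kVec * kVec;
    return Split{head, body, n - head - body};
}
```

The head and tail would still go through the unaligned path (for example vec_loadu_ps above), while the body could use _mm512_load_ps(). Whether this pays off depends on how long the ranges are compared to the head and tail.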

DubitoCogito · ‎03-20-2013

It would presumably depend on the amount and frequency of data movement and the characteristics of the algorithm, but I have noticed a significant performance difference on the MIC. Solving a 6,000x6,000 matrix using DGEMM I have seen a performance increase of approximately 30% using the recommended 64- versus default 16-byte data alignment. The MIC has rather high memory latency. However, it is difficult to say how it would impact your code overall.

Sumedh_N_Intel · ‎03-20-2013

Posting on behalf of Indraneil Gokhale:

Hello,

On the Intel Xeon Phi coprocessor, if the data is 64-byte aligned, the data comes from a single cache line, so we need only one cache access. But if it is unaligned, we may need multiple cache accesses. In the case of unaligned accesses on the Intel Xeon Phi coprocessor, the compiler issues 2 instructions instead of 1 (vmovloadunpackH and vmovloadunpackL). While compiling your application, I suggest you also include the '-align array64byte' switch, which enforces alignment for vectorization. The benefit of alignment on the Intel Xeon Phi coprocessor is large, so aligning data is highly recommended.

Thanks,
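To make the single-cache-line argument concrete, here is a small sketch that counts how many 64-byte cache lines a load touches, given its start address and size. The function name `cache_lines_touched` is hypothetical, and the 64-byte line size is taken from the discussion above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as discussed for the Xeon Phi coprocessor.
constexpr std::size_t kCacheLine = 64;

// Number of cache lines covered by an access of `bytes` bytes starting at `addr`.
inline std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t bytes) {
    if (bytes == 0) return 0;
    const std::uintptr_t first = addr / kCacheLine;              // line of the first byte
    const std::uintptr_t last = (addr + bytes - 1) / kCacheLine; // line of the last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

A 64-byte vector load at an address that is a multiple of 64 gives 1, while the same load at any other address gives 2, which matches the one-versus-two instruction picture described above.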

DubitoCogito · ‎03-20-2013

Does the -align flag affect dynamically allocated memory? Also, I am using v13.0.1 of the Intel compiler and it does not recognize that option.

TimP · ‎03-20-2013

-align array64byte is available only for ifort, and it doesn't (yet?) apply to COMMON. A colleague reports a 60% speedup for an application after changing the COMMON arrays to MODULE arrays, which are aligned by that option. This best case may apply when using VECTOR ALIGNED directives or when the declaration is local, and when the array is a multiple of 64 bytes.

For C or C++, it's necessary to use __declspec(align(64)) or __attribute__((aligned(64))) on the definitions.

32-byte alignment may be useful on host, even when there are only 128-bit data moves.
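As an illustration of aligned definitions in C++, the sketch below uses the standard C++11 `alignas` specifier as a stand-in for the compiler-specific declspec/attribute forms, and posix_memalign for dynamically allocated memory, which declaration attributes do not cover. The names `FloatLine`, `is_aligned_to`, and `alloc_floats_aligned64` are invented for this example, and a POSIX platform is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (assumes a POSIX platform)

// True if p is a multiple of `alignment` bytes.
inline bool is_aligned_to(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical type: 16 floats that always start on a 64-byte boundary
// when declared as a static, local, or member object.
struct alignas(64) FloatLine {
    float v[16];
};

// Heap allocation with 64-byte alignment; release the result with free().
// Returns nullptr if the allocation fails.
inline float* alloc_floats_aligned64(std::size_t count) {
    void* p = nullptr;
    if (posix_memalign(&p, 64, count * sizeof(float)) != 0) return nullptr;
    return static_cast<float*>(p);
}
```

Note that the alignment of a declared object says nothing about pointers derived from it with an offset, so element 3 of an aligned array is still unaligned for a 64-byte load.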

Matthew_G_Intel · ‎03-20-2013

Does the -align flag affect dynamically allocated memory? Also, I am using v13.0.1 of the Intel compiler and it does not recognize that option.

Georg_V_ · ‎03-21-2013

Well, for my C++ program it is probably not as simple as using "-align array64byte" or using some declspec, since the switch only exists for Fortran, plus my data is in std::vector<complex<float>> (which does not use compiler-generated storage but rather new), plus it is accessed as a 2-dimensional array, which often means that line 2 is not aligned even if line 1 is.

I am a bit surprised to hear that this may make a difference of 30%. According to VTune, I have >99% cache hits, and according to https://secure-software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemsoftwaredevelopersguide.pdf Table 2.4 an L1 cache hit requires only 1 cycle for data load. I don't know...

Georg
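For the std::vector and 2-dimensional case described above, one possible approach is a custom allocator that returns 64-byte-aligned storage, combined with a row stride padded to a multiple of 64 bytes so that every row starts on a boundary. This is only a sketch under assumptions: `AlignedAllocator`, `padded_row_length`, and `Matrix2D` are hypothetical names, posix_memalign assumes a POSIX platform, and the padding logic assumes the alignment is a multiple of the element size (64 is a multiple of the 8 bytes of complex<float>).

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free (assumes a POSIX platform)
#include <vector>

// Minimal allocator that hands out storage aligned to `Alignment` bytes.
template <class T, std::size_t Alignment = 64>
struct AlignedAllocator {
    using value_type = T;
    // rebind is needed explicitly because of the non-type template parameter.
    template <class U> struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (posix_memalign(&p, Alignment, n * sizeof(T)) != 0) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { free(p); }
};

template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Row length in elements, rounded up so each row spans whole 64-byte lines.
// Assumption: alignment is a multiple of elem_size.
inline std::size_t padded_row_length(std::size_t cols, std::size_t elem_size,
                                     std::size_t alignment = 64) {
    const std::size_t row_bytes = cols * elem_size;
    const std::size_t padded = (row_bytes + alignment - 1) / alignment * alignment;
    return padded / elem_size;
}

using cfloat = std::complex<float>;

// Row-major 2D array whose rows all start on a 64-byte boundary.
struct Matrix2D {
    std::size_t rows, cols, stride;  // stride = padded row length in elements
    std::vector<cfloat, AlignedAllocator<cfloat>> data;

    Matrix2D(std::size_t r, std::size_t c)
        : rows(r), cols(c), stride(padded_row_length(c, sizeof(cfloat))), data(r * stride) {}

    cfloat* row(std::size_t i) { return data.data() + i * stride; }
};
```

The cost is some wasted memory per row and the fact that the vector type changes (it is no longer a plain std::vector<complex<float>>), which may be part of the "lot of work" mentioned earlier in the thread.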