
FabioL_ · ‎12-17-2012

Hi all

Some preliminary information: I have the latest Intel suite (2013) on a Linux machine. In the code, PADDING is a macro that expands to either 16 or 32, depending on the vector instruction set the program is compiled for (SSE4.2 or AVX).

I am struggling to get aligned accesses in this code:

void f ( double A[restrict 3][4], double x[3][2] ) {

    double det = ... ;

    double W3[3] __attribute__((aligned(PADDING))) = {0.166666666666667, 0.166666666666667, 0.166666666666667};
    double FE0[3][4] __attribute__((aligned(PADDING))) = \
        {{0.666666666666667, 0.166666666666667, 0.166666666666667},
         {... }};

    for ( int ip = 0; ip < 3; ip++ )
    {
        double tmp = W3[ip]*det;
        for ( int j = 0; j < 3; j++ )
        {
            #pragma vector aligned
            for ( int k = 0; k < 4; k++ )
            {
                A[j][k] += FE0[ip][j]*FE0[ip][k]*tmp;
            }
        }
    }
}

In the caller, A is a static array (of size [3][4]) declared with __attribute__((aligned(PADDING))).

Despite extensive use of __attribute__ (aligned) and pragmas, the assembly (generated by icc -O3 -ansi-alias nomefile.c -xSSE4.2 -restrict) clearly shows that unaligned loads and stores (movupd) are used for the innermost loop. Shouldn't I expect movapd?

Note that FE0 and A have been padded from 3 to 4 doubles (and the trip count of loop k extended to 4) so that aligned AVX instructions can be used.

Thanks for considering my request

Fabio

TimP · ‎12-17-2012

The compiler will use movupd even where movapd could have been used, if no performance difference is expected. Successful auto-vectorization of such short loops makes it appear that alignment is expected. For AVX compilation, 256-bit moves (unaligned instructions) will be used only when 32-byte alignment is expected; otherwise 128-bit instructions are used (usually in pairs).

FabioL_ · ‎12-18-2012

Hi Tim, thanks for the immediate answer, much appreciated. In light of your answer, I have three other questions.

1) From my experiments, it is evident that the only way to get the innermost loop efficiently vectorized is to pad the data structures and increase the trip count of the k loop. This way, the loop is vectorized with both SSE4.2 and AVX, and the final performance is slightly better than the original version (the execution time of the function called N times is roughly 17% lower when using SSE). However, it seems the other keywords (restrict, pragma, attribute align) have no effect on the final performance. Is that because the compiler automatically aligns the data structures to the proper boundary, or because in this specific case there is no benefit in having aligned data structures (and if so, why)?

2) (Tied to 1.) For which sizes (order of magnitude) of the iteration space could the impact of the various keywords (restrict, align, ...) become significant, and why?

3) When compiling with AVX, the final performance is worse than with SSE4.2. I would instead expect an improvement, as an entire row of the innermost loop could be computed with single packed instructions. What am I missing?

Thanks again
Fabio

TimP · ‎12-18-2012

I believe the pragma vector aligned doesn't add anything to __attribute__((aligned..... Without the __attribute__((aligned it would assert 16-byte data alignment for SSE4 and 32-byte alignment for AVX. As FE0 and tp are local data structures which can't be aliased to A[], it appears there is no need to define double *restrict A. The local data structures should be 16-byte aligned by default. If the definition of A[] isn't visible in the compilation unit, either attribute align or vector aligned would be needed for the compiler to assume alignment. AVX performance on corei7-2 or -3 would be limited to 16 bytes stored per clock cycle, so you can't expect AVX to improve performance of this fragment over SSE2. If any of the variables were not 32-byte aligned, or not recognized at compile time as 32-byte aligned, AVX-256 code would be slower. In such a simple case, you might be able to find out something by using AVX intrinsics for the inner loop.
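To make the suggestion about AVX intrinsics concrete, here is a minimal sketch of what the inner loop could look like when written by hand. The function name assemble_padded and its parameter list are assumptions made for illustration (the original f computes det internally), and the code assumes A and FE0 are padded to 4 doubles per row and 32-byte aligned. When the compiler is not targeting AVX, it falls back to the plain scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical helper, not from the original post.
   Accumulates A[j][k] += FE0[ip][j] * FE0[ip][k] * W3[ip] * det.
   Assumption: A and FE0 have rows padded to 4 doubles and start on a
   32-byte boundary, so every row is itself 32-byte aligned. */
void assemble_padded(double A[3][4], double FE0[3][4], double W3[3], double det)
{
    /* Check the alignment assumption before issuing aligned loads. */
    assert(((uintptr_t)A % 32) == 0 && ((uintptr_t)FE0 % 32) == 0);

    for (int ip = 0; ip < 3; ip++) {
        double tmp = W3[ip] * det;
#ifdef __AVX__
        /* Aligned 256-bit load of one padded row of FE0; this may fault
           if the address is not 32-byte aligned. */
        __m256d row = _mm256_load_pd(FE0[ip]);
        for (int j = 0; j < 3; j++) {
            /* Broadcast the scalar factor for row j to all four lanes. */
            __m256d scale = _mm256_set1_pd(FE0[ip][j] * tmp);
            __m256d acc = _mm256_load_pd(A[j]);
            acc = _mm256_add_pd(acc, _mm256_mul_pd(row, scale));
            _mm256_store_pd(A[j], acc);   /* aligned 256-bit store */
        }
#else
        /* Scalar fallback when AVX is not enabled at compile time. */
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 4; k++)
                A[j][k] += FE0[ip][j] * FE0[ip][k] * tmp;
#endif
    }
}
```

Building with AVX enabled (for example -xAVX with icc, or -mavx with GCC or Clang) selects the intrinsic path. Because the aligned load and store intrinsics are used, a misaligned array is likely to show up as a fault rather than as a silent slowdown.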

FabioL_ · ‎12-18-2012

Thanks for this. Could you clarify this sentence: "AVX performance on corei7-2 or -3 would be limited to 16 bytes stored per clock cycle"? Actually, I am getting worse performance with AVX than with SSE even though I told the compiler to align the data structures. Your last sentence is ambiguous: if the case is really simple, the compiler's auto-vectorization should be the best choice. What could I gain by using intrinsics? Thanks

TimP · ‎12-18-2012

As your code appears to require 2 data reads and 1 store per clock cycle, if you had sufficient data locality in L1 cache, you might approach the capability of the "Sandy Bridge" core i7-2 to read 32 bytes and store 16 bytes per cycle, regardless of whether you use 2 16-byte memory reads and 1 16-byte write issued in a single cycle or 32-byte memory accesses issued in 2 cycles. Any lower effective memory bandwidth limit due to accessing higher levels of cache or memory would appear to apply regardless of the choice of SSE4 or AVX instructions. If you specified 32-byte vmovaps instructions, you would find out whether the data are actually aligned, as well as checking your original supposition that those perform differently from vmovups. You could also check alignment by testing the low-order bits of the addresses. One would think that it doesn't matter whether tmp is 32-byte aligned, in the case where it is carried in a register for the duration of the inner loop, except that the inner loop isn't long enough to be sure of that.
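The check on the low-order address bits mentioned above can be sketched as follows. The helper name is_aligned is an assumption made for illustration; it relies only on the fact that an address is a multiple of a power-of-two alignment exactly when its low bits are zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment` bytes,
   0 otherwise. `alignment` must be a nonzero power of two (16 for SSE,
   32 for 256-bit AVX). */
int is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power of two, alignment - 1 is a mask of the low-order bits. */
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

Calling is_aligned(A, 32) at the top of f, and on the local arrays, would show at run time whether the addresses really are 32-byte aligned, independently of which move instructions the compiler chose.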

FabioL_ · ‎12-19-2012

Thanks for these answers, you've been very helpful. A last observation: in one of your previous answers, you said that "I believe the pragma vector aligned doesn't add anything to __attribute__((aligned..... Without the __attribute__((aligned it would assert 16-byte data alignment for SSE4 and 32-byte alignment for AVX." Actually, this is not what I found in some of Intel's presentations about vectorization, which clearly state that you need to "Align your data (attribute align..)) *AND* tell the compiler (e.g. by means of pragma or __assume_aligned)".

TimP · ‎12-19-2012

Since the data are defined with alignment in the same compilation unit, #pragma vector aligned shouldn't be needed. I agree that the doc you refer to doesn't make it clear that this pragma would be used when the for() is in a different compilation unit from the data definition, or what the specific alignment assumptions are.
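The cross-compilation-unit case can be sketched as below: the callee sees only a pointer, so the alignment has to be asserted explicitly. The macro ASSUME_ALIGNED and the function scale_rows are assumptions made for illustration. The Intel branch uses the __assume_aligned hint named in the presentations quoted above; the other branch uses the GCC/Clang builtin __builtin_assume_aligned. Neither form checks anything at run time, so passing a misaligned pointer is undefined behavior.

```c
#include <assert.h>

/* Hypothetical portability macro, not part of any compiler's headers. */
#if defined(__INTEL_COMPILER)
#define ASSUME_ALIGNED(p, n) __assume_aligned((p), (n))
#elif defined(__GNUC__)
#define ASSUME_ALIGNED(p, n) ((p) = __builtin_assume_aligned((p), (n)))
#else
#define ASSUME_ALIGNED(p, n) ((void)0)   /* no hint available */
#endif

/* Imagine this function lives in a different file from the definition of
   the array: only the pointer type is visible here, not the aligned
   attribute on the definition. */
void scale_rows(double (*A)[4], int rows, double s)
{
    assert(A && rows > 0);
    ASSUME_ALIGNED(A, 32);   /* promise: A starts on a 32-byte boundary */
    for (int j = 0; j < rows; j++)
        for (int k = 0; k < 4; k++)
            A[j][k] *= s;
}
```

The caller is still responsible for defining the array with __attribute__((aligned(32))), which is the "align your data AND tell the compiler" pairing from the presentation.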

Getting aligned accesses with AVX/SSE