Intel® ISA Extensions

Padding does not help AVX

FabioL_
Beginner

Hi all

I have the following C function:

#include <math.h>   /* for fabs */

void mass_ffc(double A[10][10], double x[3][2])
{
// Compute Jacobian of affine map from reference cell
const double J_00 = x[1][0] - x[0][0];
...
const double J_11 = x[2][1] - x[0][1];

// Compute determinant of Jacobian
double detJ = J_00*J_11 - J_01*J_10;
const double det = fabs(detJ);

// Array of quadrature weights.
const double W12[12] __attribute__((aligned(PADDING))) = { .... };

// Value of basis functions at quadrature points.
const double FE0[12][10] __attribute__((aligned(PADDING))) = \
{{0.0463079953908666, 0.440268993398561, 0.0463079953908666, 0.402250914961474, -0.201125457480737, -0.0145210435563256, -0.0145210435563258, ...0.283453533784293}};

for (int ip = 0; ip < 12; ip++)  {
    double tmp = W12[ip]*det;
    for (int j=0; j<10; ++j)  {
        double tmp2 = FE0[ip][j]*tmp;

        #pragma vector aligned
        for (int k=0; k<10; ++k) {
            A[j][k] += FE0[ip][k]*tmp2;
        }
    } // end loop over 'j'
} // end loop over 'ip'

} // end function

Compiling it with ICC 2013 (flags: -xAVX, -O3), I end up with, let's say, a quite expected result: the innermost loop over k is fully unrolled, the first two iterations are peeled off, and the remaining 8 are performed with AVX instructions (mulpd, addpd). Then I padded the FE0 and A matrices to 12 elements and increased the k trip count to 12. The idea was that this way I would get a fully unrolled k loop carried out with just 3 "groups" of packed AVX instructions (mulpd, addpd), saving the time spent on peeling and, in general, on scalar instructions.
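
For reference, here is a minimal sketch of what the padded variant looks like (not the exact code I compiled: the quadrature weights and tabulated basis values are omitted, the padded inner dimension of 12 and PADDING = 32 are assumptions about what "padded to 12 elements" means, and the J_01/J_10 definitions follow the usual affine-map convention):

#include <math.h>

void mass_ffc_padded(double A[10][12], double x[3][2])
{
    // Jacobian of the affine map from the reference cell
    const double J_00 = x[1][0] - x[0][0];
    const double J_01 = x[2][0] - x[0][0];
    const double J_10 = x[1][1] - x[0][1];
    const double J_11 = x[2][1] - x[0][1];
    const double det = fabs(J_00*J_11 - J_01*J_10);

    // Real weights/basis values omitted; the last two columns of each FE0 row
    // are zero padding, so the extra multiply-adds leave A unchanged. With 12
    // doubles per row, each row is 96 bytes, i.e. a multiple of the 32-byte
    // AVX vector width.
    const double W12[12] __attribute__((aligned(32))) = {0};
    const double FE0[12][12] __attribute__((aligned(32))) = {{0}};

    for (int ip = 0; ip < 12; ip++) {
        double tmp = W12[ip]*det;
        for (int j = 0; j < 10; ++j) {
            double tmp2 = FE0[ip][j]*tmp;

            #pragma vector aligned
            for (int k = 0; k < 12; ++k) {   // trip count raised from 10 to 12
                A[j][k] += FE0[ip][k]*tmp2;  // hoped-for lowering: 3 x (vmulpd + vaddpd) per row
            }
        }
    }
}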

Now the point is that if I compile the function with trip count 12, the compiler inserts a long sequence of movupd instructions both before and after the piece of assembly representing the full unrolling of the loops over j and k. These movupd instructions basically copy the elements of A to the stack (before) and from the stack back to A (after, just before the function returns). For example:

...

vmovupd 32(%r15), %ymm2 
vmovupd 96(%r15), %ymm14 
vmovupd %ymm15, 1280(%rsp) 
vmovupd 608(%r15), %ymm15 
vmovupd %ymm1, 1792(%rsp)
vmovupd %ymm2, 1824(%rsp)

...

# compilation of the loop nests

...

vmovupd 1760(%rsp), %ymm3 
vmovupd %ymm15, 928(%r15)
vmovupd 1600(%rsp), %ymm15 
vmovupd %ymm0, 544(%r15) 
vmovupd %ymm1, 480(%r15)

Of course, you might ask why I care about such a mild (potential?) optimization in such a small function: because the function is invoked millions of times.

My questions are: what does that sequence of movupd instructions represent? And why is it inserted there with trip count 12?

In the end, the version with trip count 10 runs faster than the one with trip count 12.

By the way, if I increase the trip count to, let's say, 16, I don't get this weird behaviour. 

Thanks for considering my (long) request.

Fabio

SergeyKostrov
Valued Contributor II
>>...
>>My questions are: what does that sequence of movupd instructions represent? And why is it inserted there with trip count 12?
>>
>>In the end, the version with trip count 10 runs faster than the one with trip count 12.
>>
>>By the way, if I increase the trip count to, let's say, 16, I don't get this weird behaviour.

Some optimization tricks are left unexplained by software engineers; however, in a "battle" between the numbers 12 and 16, Intel engineers prefer to use 16.

PS: Sorry for going off topic, but here are two examples:

- with the Intel C/C++ compiler, sizeof( long double ) = 16 when the /Qlong-double option is used
- SIMD structures, like:

typedef union __declspec(intrin_type) _CRT_ALIGN(16) __m128 {
    float m128_f32[4];
    unsigned __int64 m128_u64[2];
    __int8 m128_i8[16];
    __int16 m128_i16[8];
    __int32 m128_i32[4];
    __int64 m128_i64[2];
    unsigned __int8 m128_u8[16];
    unsigned __int16 m128_u16[8];
    unsigned __int32 m128_u32[4];
} __m128;
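
A quick way to check those two 16-byte facts is a minimal test program like the sketch below (it assumes the Intel or Microsoft compiler on Windows, where the __alignof keyword and the __m128 definition above are available, and the /Qlong-double option for the first printout to show 16):

#include <stdio.h>
#include <xmmintrin.h>   /* __m128 */

int main(void)
{
    /* With icl /Qlong-double this prints 16; otherwise long double
       usually has the same size as double. */
    printf("sizeof(long double) = %u\n", (unsigned)sizeof(long double));

    /* __m128 is a 16-byte SIMD type aligned on a 16-byte boundary. */
    printf("sizeof(__m128) = %u, __alignof(__m128) = %u\n",
           (unsigned)sizeof(__m128), (unsigned)__alignof(__m128));
    return 0;
}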