Intel® ISA Extensions

## Padding does not help AVX

Hi all

I have the following C function:

#include <math.h>

void mass_ffc(double A[10][10], double x[3][2])
{
  // Compute Jacobian of affine map from reference cell
  const double J_00 = x[1][0] - x[0][0];
  ...
  const double J_11 = x[2][1] - x[0][1];

  // Compute determinant of Jacobian
  double detJ = J_00*J_11 - J_01*J_10;
  const double det = fabs(detJ);

  // Array of quadrature weights.
  const double W12[12] __attribute__((aligned(PADDING))) = { .... };

  // Value of basis functions at quadrature points.
  const double FE0[12][10] __attribute__((aligned(PADDING))) = \
  {{0.0463079953908666, 0.440268993398561, 0.0463079953908666, 0.402250914961474, -0.201125457480737, -0.0145210435563256, -0.0145210435563258, ...0.283453533784293}};

  for (int ip = 0; ip < 12; ip++) {
    double tmp = W12[ip]*det;
    for (int j = 0; j < 10; ++j) {
      double tmp2 = FE0[ip][j]*tmp;

      #pragma vector aligned
      for (int k = 0; k < 10; ++k) {
        A[j][k] += FE0[ip][k]*tmp2;
      }
    } // end loop over 'j'
  } // end loop over 'ip'

} // end function

Compiling it with ICC 2013 (flags: -xAVX -O3), I get roughly what I expected: the innermost loop over k is fully unrolled, the first two iterations are peeled off, and the remaining 8 are executed with packed AVX instructions (mulpd, addpd).

I then padded the FE0 and A matrices to 12 elements per row and increased the k trip count to 12. The idea was to get a fully unrolled k loop that runs as just 3 "groups" of packed AVX instructions (mulpd, addpd), saving the time spent on peeling and, more generally, on scalar instructions.
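The padding idea can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the original kernel: the names `mass_padded`, `NQ`, `NB` and `NB_PAD` are hypothetical, the weights and determinant are passed in as parameters, and the alignment attribute is left out. It assumes the two padding columns of FE0 are zero, so the extra lanes only ever add 0.0 to the padding columns of A.

```c
enum { NQ = 12, NB = 10, NB_PAD = 12 };  /* NB_PAD: hypothetical padded row length */

/* Accumulate into A with every row padded from NB to NB_PAD doubles.
 * Assumption: FE0[ip][NB..NB_PAD-1] == 0.0, so the padded lanes contribute
 * nothing and the first NB columns of A match the unpadded kernel.
 * With NB_PAD == 12 the k loop covers exactly three groups of 4 doubles. */
void mass_padded(double A[NB][NB_PAD], double FE0[NQ][NB_PAD],
                 const double W[NQ], double det)
{
    for (int ip = 0; ip < NQ; ip++) {
        double tmp = W[ip] * det;
        for (int j = 0; j < NB; ++j) {
            double tmp2 = FE0[ip][j] * tmp;
            for (int k = 0; k < NB_PAD; ++k) {   /* trip count 12 instead of 10 */
                A[j][k] += FE0[ip][k] * tmp2;
            }
        }
    }
}
```

Only the k loop is extended to the padded length; the j loop still runs over the 10 real basis functions, so the number of rows that are updated does not change.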

The problem is that when I compile the function with trip count 12, the compiler inserts a long sequence of vmovupd instructions both before and after the assembly for the fully unrolled loops over j and k. These instructions basically copy the elements of A to the stack (before) and from the stack back to A (after, just before the function returns). For example:

...

vmovupd 32(%r15), %ymm2
vmovupd 96(%r15), %ymm14
vmovupd %ymm15, 1280(%rsp)
vmovupd 608(%r15), %ymm15
vmovupd %ymm1, 1792(%rsp)
vmovupd %ymm2, 1824(%rsp)

...

# compilation of the loop nests

...

vmovupd 1760(%rsp), %ymm3
vmovupd %ymm15, 928(%r15)
vmovupd 1600(%rsp), %ymm15
vmovupd %ymm0, 544(%r15)
vmovupd %ymm1, 480(%r15)
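To make the pattern in the listing concrete, here is a source-level sketch of a "copy A to the stack, accumulate there, copy back" structure. It only illustrates the data movement described above and is not a claim about why the compiler emits it. The function `rank1_update_via_local` and the sizes `ROWS` and `COLS` are hypothetical.

```c
#include <string.h>

enum { ROWS = 10, COLS = 12 };  /* hypothetical sizes: 10 rows padded to 12 columns */

/* Adds scale * v[j] * v[k] to A[j][k], but does all the arithmetic on a
 * stack copy of A, mirroring the load/store pairs seen around the loop nest. */
void rank1_update_via_local(double A[ROWS][COLS], const double v[COLS], double scale)
{
    double local[ROWS][COLS];

    memcpy(local, A, sizeof local);      /* "before": A -> stack */

    for (int j = 0; j < ROWS; ++j) {
        for (int k = 0; k < COLS; ++k) {
            local[j][k] += v[j] * v[k] * scale;
        }
    }

    memcpy(A, local, sizeof local);      /* "after": stack -> A */
}
```

The two `memcpy` calls do no arithmetic at all, which is the kind of extra memory traffic the trip-count-12 build appears to pay for on every call.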

Of course, you might ask why I care about such a mild (potential?) optimization in such a small function. The reason is that the function is invoked millions of times.

My questions are: what does that sequence of vmovupd instructions represent? And why is it inserted only with trip count 12?

In the end, the version with trip count 10 runs faster than the one with trip count 12.

By the way, if I increase the trip count to, let's say, 16, I don't get this weird behaviour.

Thanks for considering my (long) request.

Fabio  