FabioL_

Beginner

01-25-2013
01:54 AM

Padding does not help AVX

Hi all

I have the following C function:

void mass_ffc( double A[10][10], double x[3][2])
{
    // Compute Jacobian of affine map from reference cell
    const double J_00 = x[1][0] - x[0][0];
    ...
    const double J_11 = x[2][1] - x[0][1];

    // Compute determinant of Jacobian
    double detJ = J_00*J_11 - J_01*J_10;
    const double det = fabs(detJ);

    // Array of quadrature weights.
    const double W12[12] __attribute__((aligned(PADDING))) = { .... };

    // Value of basis functions at quadrature points.
    const double FE0[12][10] __attribute__((aligned(PADDING))) = \
    {{0.0463079953908666, 0.440268993398561, 0.0463079953908666, 0.402250914961474, -0.201125457480737, -0.0145210435563256, -0.0145210435563258, ...0.283453533784293}};

    for (int ip = 0; ip < 12; ip++) {
        double tmp = W12[ip]*det;
        for (int j=0; j<10; ++j) {
            double tmp2 = FE0[ip][j]*tmp;
            #pragma vector aligned
            for (int k=0; k<10; ++k) {
                A[j][k] += FE0[ip][k]*tmp2;
            }
        } // end loop over 'j'
    } // end loop over 'ip'
} // end function

Compiling it with ICC 2013 (flags: -xAVX, -O3) gives a fairly expected result: the innermost loop over k is fully unrolled, the first two iterations are peeled off, and the remaining 8 are performed with AVX instructions (mulpd, addpd). I then padded the FE0 and A matrices to 12 elements and increased the k trip count to 12. The idea was to get a fully unrolled k loop carried out with just 3 "groups" of packed AVX instructions (mulpd, addpd), saving the time spent on peeling and, more generally, on scalar instructions.
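A minimal sketch of the padding idea described above, not the original generated code. The weights and basis values are hypothetical stand-ins (the real W12 and FE0 tables are elided in the post), the names NBP, fill_inputs, mass_ref, mass_padded and max_padding_error are made up for illustration, and 32-byte alignment is assumed in place of the PADDING macro. It shows that zero-filled padding columns leave the real 10x10 block unchanged while giving the k loop a trip count that is a multiple of 4 doubles.

```c
#include <assert.h>
#include <stdint.h>

enum { NQ = 12, NB = 10, NBP = 12 };  /* NBP: row length padded to a multiple of 4 doubles */

/* Hypothetical stand-in data: the post elides the real W12 and FE0 tables. */
static void fill_inputs(double w[NQ], double fe[NQ][NBP])
{
    for (int ip = 0; ip < NQ; ++ip) {
        w[ip] = 1.0 / NQ;
        for (int k = 0; k < NBP; ++k)
            fe[ip][k] = (k < NB) ? 0.01 * (ip + 1) + 0.1 * k : 0.0;  /* zero padding */
    }
}

/* Reference kernel: inner trip count 10, A is 10x10. */
static void mass_ref(double A[NB][NB], double fe[NQ][NBP],
                     const double w[NQ], double det)
{
    for (int ip = 0; ip < NQ; ++ip) {
        double tmp = w[ip] * det;
        for (int j = 0; j < NB; ++j) {
            double tmp2 = fe[ip][j] * tmp;
            for (int k = 0; k < NB; ++k)
                A[j][k] += fe[ip][k] * tmp2;
        }
    }
}

/* Padded kernel: inner trip count 12, rows of A and fe are 96 bytes long. */
static void mass_padded(double A[NB][NBP], double fe[NQ][NBP],
                        const double w[NQ], double det)
{
    /* With a 32-byte-aligned base and 96-byte rows, every row starts
       on a 32-byte (ymm-sized) boundary. */
    assert((uintptr_t)(void *)A % 32 == 0 && (uintptr_t)(void *)fe % 32 == 0);
    for (int ip = 0; ip < NQ; ++ip) {
        double tmp = w[ip] * det;
        for (int j = 0; j < NB; ++j) {
            double tmp2 = fe[ip][j] * tmp;
            for (int k = 0; k < NBP; ++k)   /* 12 = 3 vectors of 4 doubles */
                A[j][k] += fe[ip][k] * tmp2;
        }
    }
}

/* Largest deviation between the two kernels: the difference on the real
   10x10 block and the magnitude of the two padding columns. */
double max_padding_error(void)
{
    double w[NQ];
    double fe[NQ][NBP] __attribute__((aligned(32)));
    double Ap[NB][NBP] __attribute__((aligned(32))) = {{0}};
    double Ar[NB][NB] = {{0}};
    double worst = 0.0;

    fill_inputs(w, fe);
    mass_ref(Ar, fe, w, 2.0);
    mass_padded(Ap, fe, w, 2.0);

    for (int j = 0; j < NB; ++j) {
        for (int k = 0; k < NBP; ++k) {
            double d = (k < NB) ? Ar[j][k] - Ap[j][k] : Ap[j][k];
            if (d < 0) d = -d;
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

Calling max_padding_error() should return a value at rounding-error level. Whether the padded loop is actually faster depends on the code the compiler generates for it, which is the subject of the rest of this post.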

Now the point is that if I compile the function with trip count 12, the compiler inserts a long sequence of vmovupd instructions both **before** and **after** the assembly code for the **fully unrolled loops over j and k**. These vmovupd instructions basically copy the elements of A to the stack (before) and from the stack back to A (after, just before the function returns). For example:

    ...
    vmovupd 32(%r15), %ymm2
    vmovupd 96(%r15), %ymm14
    vmovupd %ymm15, 1280(%rsp)
    vmovupd 608(%r15), %ymm15
    vmovupd %ymm1, 1792(%rsp)
    vmovupd %ymm2, 1824(%rsp)
    ...
    # compilation of the loop nests
    ...
    ... 1760(%rsp), %ymm3
    vmovupd %ymm15, 928(%r15)
    vmovupd 1600(%rsp), %ymm15
    vmovupd %ymm0, 544(%r15)
    vmovupd %ymm1, 480(%r15)
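One way to read the pattern above in C terms, offered as an assumption rather than a confirmed explanation of what ICC does: the compiler behaves as if it had copied A into a local accumulator, run the unrolled updates on that copy, and written it back once at the end. Assuming only the rows are padded, A[10][12] occupies 960 bytes, i.e. 30 vectors of 32 bytes, which is consistent with the store to 928(%r15) in the listing and is more than the 16 ymm registers available, so part of such an accumulator would have to live on the stack. The names mass_staged and staged_demo_sum and all input values below are hypothetical.

```c
#include <string.h>

enum { NQ = 12, NB = 10, NBP = 12 };

/* Same arithmetic as the padded kernel, but with the accumulator held in a
   local copy: A is read once up front and written once at the end. */
void mass_staged(double A[NB][NBP], double fe[NQ][NBP],
                 const double w[NQ], double det)
{
    double acc[NB][NBP];

    memcpy(acc, A, sizeof acc);          /* "before": A -> local storage */

    for (int ip = 0; ip < NQ; ++ip) {
        double tmp = w[ip] * det;
        for (int j = 0; j < NB; ++j) {
            double tmp2 = fe[ip][j] * tmp;
            for (int k = 0; k < NBP; ++k)
                acc[j][k] += fe[ip][k] * tmp2;
        }
    }

    memcpy(A, acc, sizeof acc);          /* "after": local storage -> A */
}

/* Tiny driver with made-up inputs (all weights and basis values set to 1):
   every entry of A becomes 12, so the sum over the 10x12 array is 1440. */
double staged_demo_sum(void)
{
    double w[NQ], fe[NQ][NBP], A[NB][NBP] = {{0}};
    double sum = 0.0;

    for (int ip = 0; ip < NQ; ++ip) {
        w[ip] = 1.0;
        for (int k = 0; k < NBP; ++k)
            fe[ip][k] = 1.0;
    }
    mass_staged(A, fe, w, 1.0);

    for (int j = 0; j < NB; ++j)
        for (int k = 0; k < NBP; ++k)
            sum += A[j][k];
    return sum;
}
```

Written this way the copy-in and copy-out are explicit, which makes it easier to compare against a version that updates A in place and see whether the extra moves account for the slowdown.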

Of course, you might ask why I care about such a mild (potential?) optimization in such a small function. The reason is that the function is invoked millions of times.

My questions are: what does that sequence of vmovupd instructions represent? And why is it inserted there with trip count 12?

In the end, the version with trip count 10 runs faster than the one with trip count 12.

By the way, if I increase the trip count to, let's say, 16, I don't get this weird behaviour.

Thanks for considering my (long) request.

Fabio

1 Reply

SergeyKostrov

Valued Contributor II

01-25-2013
06:10 AM
