Hi all

I have the following C function:

void mass_ffc( double A[10][10], double x[3][2])

{

// Compute Jacobian of affine map from reference cell

const double J_00 = x[1][0] - x[0][0];

...

const double J_11 = x[2][1] - x[0][1];

// Compute determinant of Jacobian

double detJ = J_00*J_11 - J_01*J_10;

const double det = fabs(detJ);

// Array of quadrature weights.

const double W12[12] __attribute__((aligned(PADDING))) = { .... };

// Value of basis functions at quadrature points.

const double FE0[12][10] __attribute__((aligned(PADDING))) = \

{{0.0463079953908666, 0.440268993398561, 0.0463079953908666, 0.402250914961474, -0.201125457480737, -0.0145210435563256, -0.0145210435563258, ...0.283453533784293}};

for (int ip = 0; ip < 12; ip++) {

double tmp = W12[ip]*det;

for (int j=0; j<10; ++j) {

double tmp2 = FE0[ip][j]*tmp;

#pragma vector aligned

for (int k=0; k<10; ++k) {

A[j][k] += FE0[ip][k]*tmp2;

}

} // end loop over 'j'

} // end loop over 'ip'

} // end function

Compiling it with ICC 2013 (flags: -xAVX -O3), I get a fairly expected result: the innermost loop over k is fully unrolled, the first two iterations are peeled off, and the remaining 8 are performed with AVX instructions (mulpd, addpd). I then padded the FE0 and A matrices to 12 elements and increased the k trip count to 12. The idea was to get a fully unrolled k loop carried out with just 3 "groups" (mulpd, addpd) of packed AVX instructions, saving the time spent on peeling and, more generally, on scalar instructions.
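To make the padding idea concrete, here is a minimal sketch of the two variants side by side. It is only an illustration under stated assumptions: the names `mass_ref`, `mass_padded`, `NQ`, `NB` and `NBP` are hypothetical, the weights and basis values are passed in as arguments instead of being tabulated inside the function, the padding is assumed to be applied to the row length (second dimension), the padded columns of FE0 are assumed to be zero, and the alignment attributes and `#pragma vector aligned` are left out.

```c
#define NQ  12   /* quadrature points, as in the kernel above */
#define NB  10   /* actual number of basis functions */
#define NBP 12   /* assumed padded row length: 3 groups of 4 doubles */

/* Unpadded variant: inner trip count 10. */
static void mass_ref(double A[NB][NB], const double W[NQ],
                     double FE0[NQ][NB], double det)
{
    for (int ip = 0; ip < NQ; ip++) {
        double tmp = W[ip] * det;
        for (int j = 0; j < NB; ++j) {
            double tmp2 = FE0[ip][j] * tmp;
            for (int k = 0; k < NB; ++k)
                A[j][k] += FE0[ip][k] * tmp2;
        }
    }
}

/* Padded variant: rows of A and FE0 hold NBP doubles and the inner
   trip count is NBP. Columns NB..NBP-1 of FE0 are assumed to be zero,
   so the extra iterations only add zeros to the padding columns of A. */
static void mass_padded(double A[NB][NBP], const double W[NQ],
                        double FE0[NQ][NBP], double det)
{
    for (int ip = 0; ip < NQ; ip++) {
        double tmp = W[ip] * det;
        for (int j = 0; j < NB; ++j) {
            double tmp2 = FE0[ip][j] * tmp;
            for (int k = 0; k < NBP; ++k)
                A[j][k] += FE0[ip][k] * tmp2;
        }
    }
}
```

With zero-filled padding columns, both variants should produce the same values in the first 10 columns of each row of A; only the shape of the innermost loop differs.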

Now the point is that when I compile the function with trip count 12, the compiler inserts a long sequence of vmovupd instructions both **before** and **after** the assembly code for the **full unrolling of the loops over j and k**. These instructions basically copy the elements of A to the stack (before) and from the stack back to A (after, just before the function returns). For example:

...

vmovupd 32(%r15), %ymm2

vmovupd 96(%r15), %ymm14

vmovupd %ymm15, 1280(%rsp)

vmovupd 608(%r15), %ymm15

vmovupd %ymm1, 1792(%rsp)

vmovupd %ymm2, 1824(%rsp)

...

# compilation of the loop nests

...

1760(%rsp), %ymm3

vmovupd %ymm15, 928(%r15)

vmovupd 1600(%rsp), %ymm15

vmovupd %ymm0, 544(%r15)

vmovupd %ymm1, 480(%r15)
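Expressed in C, the pattern described above (copy A out before the unrolled loops, accumulate, copy it back afterwards) would look roughly like the sketch below. This is only a hand-written analogue of that load/accumulate/store shape, not a claim about what ICC does internally or why. The names `mass_local_acc`, `acc`, `NQ`, `NB` and `NBP` are hypothetical, and the weights and basis values are passed in as arguments for brevity.

```c
#define NQ  12   /* quadrature points */
#define NB  10   /* actual number of basis functions */
#define NBP 12   /* assumed padded row length */

/* Illustrative only: accumulate into a local copy of A instead of A itself. */
static void mass_local_acc(double A[NB][NBP], const double W[NQ],
                           double FE0[NQ][NBP], double det)
{
    double acc[NB][NBP];   /* local copy of A; may live in registers or on the stack */

    /* "Before": load every element of A into the local copy. */
    for (int j = 0; j < NB; ++j)
        for (int k = 0; k < NBP; ++k)
            acc[j][k] = A[j][k];

    /* The loop nest itself, operating on the local copy only. */
    for (int ip = 0; ip < NQ; ip++) {
        double tmp = W[ip] * det;
        for (int j = 0; j < NB; ++j) {
            double tmp2 = FE0[ip][j] * tmp;
            for (int k = 0; k < NBP; ++k)
                acc[j][k] += FE0[ip][k] * tmp2;
        }
    }

    /* "After": store the accumulated values back to A. */
    for (int j = 0; j < NB; ++j)
        for (int k = 0; k < NBP; ++k)
            A[j][k] = acc[j][k];
}
```

The result is the same as accumulating directly into A; the only difference is that A is read once at the start and written once at the end, which matches the shape of the two vmovupd sequences in the listing.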

Of course, you might ask why I care about such a mild (potential?) optimization in such a small function: it is because the function is invoked millions of times.

My questions are: what does that sequence of vmovupd instructions represent? And why is it inserted there with trip count 12?

In the end, the version with trip count 10 runs faster than the one with trip count 12.

By the way, if I increase the trip count to, let's say, 16, I don't get this weird behaviour.

Thanks for considering my (long) request.

Fabio


>>... **12**?
>>
>>In the end, the version with trip count 10 goes faster than that with trip count **12**.
>>
>>By the way, if I increase the trip count to, let's say, **16**, I don't get this weird behaviour.

Some optimization tricks are left unexplained by software engineers; however, in a "battle" between the numbers **12** and **16**, Intel engineers prefer to use **16**.

PS: Sorry for going off topic; here are two examples:

- With the Intel C/C++ compiler, **sizeof( long double ) = 16** when option /Qlong-double is used.
- SIMD structures, like:

typedef union __declspec(intrin_type) **_CRT_ALIGN(16)** __m128 {
float m128_f32[4];
unsigned __int64 m128_u64[2];
__int8 m128_i8[16];
__int16 m128_i16[8];
__int32 m128_i32[4];
__int64 m128_i64[2];
unsigned __int8 m128_u8[16];
unsigned __int16 m128_u16[8];
unsigned __int32 m128_u32[4];
} __m128;
