Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Inefficient code for Coulomb interaction: why???

Jofre1
Beginner

Dear all,

I was re-testing performance of a Fortran code on my "new" machine (a two-year-old Intel Xeon W-2155), and I was scared by the Intel Advisor results: 12.41% of peak performance (single thread).

I have spent a fair amount of time looking around, and nothing I did improved it. I decided to rewrite it in C, to take manual advantage of AVX-512 instructions. The new code brings things up to 24.41%, which is good, but still far away from peak performance.

I was wondering if any of the local experts could help me understand the reason for the low performance and, if possible, help me get closer to the peak value.

In principle (as reported by Advisor) the problem is an "inefficient memory access pattern" (although I have "unit stride" and a low cache-miss rate, if I understand the information reported by Advisor correctly). The typical splitting ("de-doubling") of the for loops to "improve cache" makes things worse...

I am interested, at this stage, only in single-thread performance. The OpenMP part should be fine (fingers crossed!)

I attach several files:

- 'Verlet_SingleThread_Simplified.c++': the unoptimized code

- 'Summary_SingleThread_Simple.png': Result of the code above as given by Advisor

- 'Verlet_SingleThread_Simplified_RSQRT.c++': the "a bit" optimized code

- 'Summary_SingleThread_with_Intr.png': Result of the code above as given by Advisor

(The forum complained that *.c++ is not a valid extension, so I changed it to *.c)

Keep in mind that they are C versions written by a Fortran person who learnt C in the last two weeks, just for this project!

Thanks in advance,
J.

jimdempseyatthecove
Honored Contributor III

where is the fun when there is no discovery involved....


__m512d _mm512_permute_pd(__m512d a, const int imm8)

conditionally selects within each pair of adjacent lanes.
Bit i of the imm8 controls output lane i:
0 == output the low lane of the pair, 1 == output the high lane,
where lane 0 is paired with lane 1,
lane 2 with lane 3, 4 with 5, 6 with 7.

So, use permute to swap adjacent lanes with imm8 = 0b01010101

__m512d _mm512_permutex_pd (__m512d a, const int imm8)

is slightly different...

The input vector is divided into two sections (256-bit low and 256-bit high).
The same permutation selections are performed on both halves.
Each half holds 4 dp scalars.
The imm8 value holds four 2-bit selection encodings, one for each of the 4 output scalars.
Each 2-bit field is an index into the 4 dp scalars of its half.

You can use this to swap lanes with imm8 = 0b10110001
.OR. to swap double lanes with imm8 = 0b01001110
(or other outputs not required for your program)

So, use permutex to swap double lanes

Swapping quad lanes (efficiently) requires you to put on your thinking cap.

__m512d _mm512_shuffle_f64x2 (__m512d a, __m512d b, const int imm8)

Where a and b can be the same register (and the output can be the same register)

imm8 is split into four 2-bit fields.
The first two fields select (from input a) the two low 128-bit chunks of the output;
the second two select (from input b) the two high chunks.
Each 2-bit field is an index into the four 128-bit chunks of its input.

yourVec = _mm512_shuffle_f64x2(yourVec, yourVec, 0b01001110); // swap quads

I recommend that you experiment with commenting out your compute sections leaving only your vector lane manipulations.

Then start by initializing the inputs with values that identify the lanes.

{0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7 }

Then single step through your lane manipulations to verify that they match the sketch code shown earlier.

Jim Dempsey

Jofre1
Beginner

OK... I would never have found it on my own!!!

After a bit of playing around (just in case it helps somebody else):

#include <stdio.h>
#include <immintrin.h>

// Print the 8 lanes of a vector (stored to memory first, for portability).
static void print_vec(__m512d v){
  double t[8] __attribute__((aligned(64)));
  _mm512_store_pd(t, v);
  for( int i = 0; i < 8; i++ ){ printf("%d ", (int)t[i]); }
  printf("\n");
}

int main(){
  __m512d a, b;
  const int i_swap = 0b01010101; // swap adjacent lanes
  const int i_quad = 0b01001011; // in each 256-bit half: {in[3], in[2], in[0], in[1]}
  const int i_lane = 0b01001110; // swap the 256-bit halves
  double r[8] __attribute__((aligned(64)));
  for( size_t i = 0; i < 8; i++ ){ r[i] = i*1.0e00; }

  a = _mm512_load_pd( &r[0] );
  print_vec(a);

  b = _mm512_permute_pd (a, i_swap);
  print_vec(b);

  b = _mm512_permutex_pd (b, i_quad);
  print_vec(b);

  b = _mm512_permute_pd (b, i_swap);
  print_vec(b);

  b = _mm512_shuffle_f64x2(a, a, i_lane);
  print_vec(b);

  b = _mm512_permute_pd (b, i_swap);
  print_vec(b);

  b = _mm512_permutex_pd (b, i_quad);
  print_vec(b);

  b = _mm512_permute_pd (b, i_swap);
  print_vec(b);

  return 0;
}

Leads to:

0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
2 3 1 0 6 7 5 4
3 2 0 1 7 6 4 5
4 5 6 7 0 1 2 3
5 4 7 6 1 0 3 2
6 7 5 4 2 3 1 0
7 6 4 5 3 2 0 1

I will try to use this approach on the main code tomorrow.

Thanks again Jim!!!

Jofre

jimdempseyatthecove
Honored Contributor III

Nice test code....

Wait, there's more...  (old TV commercial sales pitch)

After you get that working....

Count the AVX-512 registers used out of the 32 available. If you have a sufficient number unused, consider fetching 2 vectors in your outer loop; this will reduce the inner loop's memory fetches by a factor of 2. I do not have your original code in front of me now, but from recollection, this may require an additional 6 registers.

BTW (also from recollection)

Your current algorithm (appears to) calculate the force upon the a (outer-loop) side alone, as opposed to applying the force/-force to both the a side and the b side. The SOP is

for(i=0; i<N-1; ++i) {
  fetch a's
  for(j=i+1; j<N; ++j) {
    fetch b's
    calc interactions
    a's force += interaction force
    b's force -= interaction force
  }
}

Also, at the point in the dual-loop iteration where i==j you have an identity (radius of separation == 0). It appears that you correct for this (or rather, thwart the divide by, and inverse sqrt of, zero) by adding some fudge in the first FMA. While this fixes one problem, it introduces a bias.

As to whether calculating the force once or twice is faster, this is left up to testing.

Jim Dempsey

Jofre1
Beginner

Dear Jim,

I have a working version with the swapping approach. Something surprising happened.

The code is slower: 2.98s vs 3.55s

But it is "more correct"!!! I should explain this. 

As was mentioned in one of the replies, the use of "rsqrt14" is less accurate than "Sqrt" followed by "Div". So I thought that the small differences I was observing in the final results after 1e4 time steps were due to the use of "rsqrt14". I was willing to live with it, as it should be fine for many of the things I want to do, since the importance is on the (r_i - r_j). I say this in case someone suggests going to single precision.

However, the "swapping" approach gives exactly the same result as the "Sqrt + Div" algorithm!!!

What has changed? With the new approach I do not need the "_mm512_reduce_add_pd". I cannot explain it in any other way.

I know it is not black magic, but sometimes...

Regarding your 1/0: the standard way in the field is to use a softening factor, as introducing an "if", or splitting the inner loop in two ( j = 0 : i-1, then j = i+1 : N_ions ), used to kill performance (I have not tried it in a while).

Of course Fij = -Fji, but you need more memory writes, which used to be a killing factor for performance, at least in the way my code used to be structured. From what I have learned in the last few weeks, I may have a look at that again...

 

Regards,

Jofre

 

Jofre1
Beginner

I implemented a version where I re-use the force. However, as I suspected, all the extra writes and swapping back (the algorithm actually becomes quite beautiful, but too complex to explain in a few lines) end up taking the same time as before, when "intuitively" it should take half the time.

For info: without the "swapping back" part, the code is actually faster (2.0s vs 3.2s), although the result is wrong.

For info, my sample code for "swapping back" (it could maybe be optimized):

#include <stdio.h>
#include <immintrin.h>

// Print the 8 lanes of a vector after an optional separator.
static void print_vec(const char *sep, __m512d v){
  double t[8] __attribute__((aligned(64)));
  _mm512_store_pd(t, v);
  printf("%s", sep);
  for( int i = 0; i < 8; i++ ){ printf("%d ", (int)t[i]); }
}

int main(){
  __m512d a, b, c;
  const int i_swap = 0b01010101;
  const int i_quad = 0b01001011;
  const int i_lane = 0b01001110;

  double r[8] __attribute__((aligned(64)));
  for( size_t i = 0; i < 8; i++ ){ r[i] = i*1.0e00; }

  //~ 0 1 2 3 4 5 6 7
  a = _mm512_load_pd( &r[0] );
  printf("Zero \n");
  print_vec("", a); printf("\n");

  //~ 1 0 3 2 5 4 7 6
  printf("First Swap      Swap Back \n");
  b = _mm512_permute_pd (a, i_swap);
  c = _mm512_permute_pd (b, i_swap);
  print_vec("", b); print_vec("|| ", c); printf("\n");

  //~ 2 3 1 0 6 7 5 4
  printf("Second Swap      Swap Back \n");
  b = _mm512_permutex_pd (b, i_quad);
  c = _mm512_permutex_pd (b, i_quad);
  print_vec("", b); print_vec("|| ", c); printf("\n");

  //~ 3 2 0 1 7 6 4 5
  printf("Third Swap      Swap Back \n");
  b = _mm512_permute_pd (b, i_swap);
  c = _mm512_permutex_pd (b, i_quad);
  c = _mm512_permute_pd (c, i_swap);
  print_vec("", b); print_vec("|| ", c); printf("\n");

  //~ 4 5 6 7 0 1 2 3
  printf("Fourth Swap      Swap Back \n");
  b = _mm512_shuffle_f64x2(a, a, i_lane);
  c = _mm512_shuffle_f64x2(b, b, i_lane);
  print_vec("", b); print_vec("|| ", c); printf("\n");

  //~ 5 4 7 6 1 0 3 2
  printf("Fifth Swap      Swap Back \n");
  b = _mm512_permute_pd (b, i_swap);
  c = _mm512_shuffle_f64x2 (b, b, i_lane);
  c = _mm512_permute_pd (c, i_swap);
  print_vec("", b); print_vec("|| ", c); printf("\n");

  //~ 6 7 5 4 2 3 1 0
  printf("Sixth Swap      Swap Back \n");
  b = _mm512_permutex_pd (b, i_quad);
  c = _mm512_permutex_pd (b, i_quad);
  c = _mm512_shuffle_f64x2(c, c, i_lane);
  print_vec("", b); print_vec("|| ", c); printf("\n");

  //~ 7 6 4 5 3 2 0 1
  printf("Seventh Swap      Swap Back \n");
  b = _mm512_permute_pd (b, i_swap);
  c = _mm512_shuffle_f64x2 (b, b, i_lane);
  c = _mm512_permutex_pd (c, i_quad);
  c = _mm512_permute_pd (c, i_swap);
  print_vec("", b); print_vec("|| ", c); printf("\n");

  return 0;
}

 

Output:

Zero 
0 1 2 3 4 5 6 7
First Swap      Swap Back 
1 0 3 2 5 4 7 6 || 0 1 2 3 4 5 6 7
Second Swap      Swap Back 
2 3 1 0 6 7 5 4 || 0 1 2 3 4 5 6 7
Third Swap      Swap Back 
3 2 0 1 7 6 4 5 || 0 1 2 3 4 5 6 7
Fourth Swap      Swap Back 
4 5 6 7 0 1 2 3 || 0 1 2 3 4 5 6 7
Fifth Swap      Swap Back 
5 4 7 6 1 0 3 2 || 0 1 2 3 4 5 6 7
Sixth Swap      Swap Back 
6 7 5 4 2 3 1 0 || 0 1 2 3 4 5 6 7
Seventh Swap      Swap Back 
7 6 4 5 3 2 0 1 || 0 1 2 3 4 5 6 7

 

In any case, it would have been a nightmare for the subsequent parallelization; a nightmare that I would have considered handling if the gains were significant...

Regards,

Jofre

jimdempseyatthecove
Honored Contributor III

Can you post your actual code with the swaps?

Jim Dempsey

Jofre1
Beginner

Dear Jim,

Have you seen the private message I sent you?

Jofre1
Beginner

In any case, I claimed victory too early: my "new" algorithm, where the forces are re-used, is leading to incorrect results (I can't find the problem)...

Jofre1
Beginner

OK. Mistake found. The new time is 3.24s (re-using forces) vs 3.35s (the last version with shuffling, without reusing forces).

The single-thread improvement is small, and the move to OpenMP is going to be very difficult due to writes across threads. I suspect it will probably be a disaster...

Nevertheless, I wanted to have it working!

Viet_H_Intel
Moderator

We will close this at our end. Any further interaction in this thread will be considered community only.

Thanks,

