Inefficient code for Coulomb interaction: why???

Jofre1 · ‎01-20-2021

Dear all,

I was re-testing performance on an FORTRAN code with my "new" machine (2-years old Intel Xeon W-2155), and I was scared by the Intel Advisor results: 12.41% of Peak Performance (Single Thread).

I have spend a fair amount of time looking around and nothing I did improve it. I decided to re-write it on C, to take manual advantage of AVX512 instructions. The new code bring things up to 24.41%, which is good, but still far away from peak performance.

I was wondering if any of the local experts could help me out in understanding why the low performance, and if possible, to help to get closer to the peak value.

In principle (as given by Advisor) the problem is "inefficient memory pattern" (although I have "unit stride" and low cache miss, if I understand correctly the information reported from Advisor). Using the typical de-doubling of the FOR loops, to "improve cache", makes things worst...

I am interested, at this stage, only on Single Thread performance. The OpenMP part should be fine (fingers-crossed!)

I join several files:

- 'Verlet_SingleThread_Simplified.c++' : the codes no-optimized code

- 'Summary_SingleThread_Simple.png': Result of the code above as given by Advisor

- 'Verlet_SingleThread_Simplified_RSQRT.c++' : the code "a bit"-optimized

-'Summary_SingleThread_with_Intr.png': Result of the code above as given by Advisor

(The "forum" complained about the *.c++ not valid, so I changed the extension to to *.c)

Keep in mind that they are C versions written by a FORTRAN person, who learnt C in the last two weeks just for this project!

Thanks in advance,
J.

Jofre1 · ‎01-20-2021

I forgot to mention a few things:

OS: Ubuntu 16.04.7 LTS

Compiler used: icc version 2021.1

Compilation flags used: -O3 -march=native -mkl (plus -g3 when using Advisor)

PrasanthD_intel · ‎01-22-2021

Hi Jofre,

Thanks for reaching out to us.

We are transferring this query to the concerned team.

Regards

Prasanth

Jofre1 · ‎01-22-2021

Thanks! Looking forward to have some insights!

Viet_H_Intel · ‎01-22-2021

I couldn't be able to compile your provided source files with the same command line options.

orspt-le64-103:/tmp$icc -O3 -march=native -mkl Verlet_SingleThread_Simplified.c -V

Intel(R) C Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1.2 Build 20201208_000000

Verlet_SingleThread_Simplified.c(52): error: function call is not allowed in a constant expression

static const __m512d packed_soft = _mm512_set1_pd(softening);

^

Verlet_SingleThread_Simplified.c(54): error: function call is not allowed in a constant expression

static const __m512d packed_alpha = _mm512_set1_pd(alpha);

^

Also, when compare performance, what are results of the other compilers (clang, gcc)?

Thanks,

Jofre1 · ‎01-22-2021

Dear Viet (and others),

It compiles when changing the files extensions to .cpp. Not sure why, as I mention I am a bit of a novice in the C, C++ language.

I did not try any other compiler as I have been using the Intel Fortran compiler for years, so it did not crossed my mind... In any case I am using the Gaussian Random Generator from MKL.

I just tried to compile with gcc, but it complains regarding the way I have declared the functions. I do not why because I follow the notation of "ModernC" by Jens Gustedt.

I'll try later with Clang.

I really miss the easy of array manipulation in Fortran thanks to the language built in intrinsic... but that's another story

Thanks for giving it a go!

J.

Viet_H_Intel · ‎01-22-2021

Hi,

After renamed .c to .cpp I was able to compile with icpc. However, we don't have a way to compare the results, I don't know where to go from here.

If you have questions/concerns about Intel Fortran, we have a Forum https://community.intel.com/t5/Intel-Fortran-Compiler/bd-p/fortran-compiler, our Engineers there would be able to help you out.

Thanks,

Viet_H_Intel · ‎01-22-2021

I compiled your test program and then ran on system supports AVX512.

orspt-le64-103:/tmp$icpc -O3 -march=native -mkl Verlet_SingleThread_Simplified.cpp && time ./a.out

fun() took 11.182771 seconds to execute

Since you called some mkl's functions and clang doesn't recognize it. Thus, I removed those calls and compiled with clang++ (8.1)

orspt-le64-103:/tmp$clang++ -O3 -march=native -mkl Verlet_SingleThread_Simplified.cpp

clang-8: error: unknown argument: '-mkl'

orspt-le64-103:/tmp$clang++ -O3 -march=native Verlet_SingleThread_Simplified.cpp

/tmp/t-6273fd.o: In function `main':

t.cpp:(.text+0xf2): undefined reference to `vslNewStream'

t.cpp:(.text+0x117): undefined reference to `vdRngGaussian'

t.cpp:(.text+0x13c): undefined reference to `vdRngGaussian'

t.cpp:(.text+0x161): undefined reference to `vdRngGaussian'

clang-8: error: linker command failed with exit code 1 (use -v to see invocation)

orspt-le64-103:/tmp$ commented out these 4 lines

/*

vslNewStream( &stream, VSL_BRNG_MT19937, 777 );

vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, n_ions, r_x, 0.0, 10.0e-6 );

vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, n_ions, r_y, 0.0, 10.0e-6 );

vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, n_ions, r_z, 0.0, 10.0e-6 );

*/

orspt-le64-103:/tmp$clang++ -O3 -march=native Verlet_SingleThread_Simplified.cpp && time ./a.out

fun() took 11.425697 seconds to execute

orspt-le64-103:/tmp$icpc -O3 -march=native Verlet_SingleThread_Simplified.cpp && time ./a.out

fun() took 11.102835 seconds to execute

My observations:

1) icpc is a hair faster than clang++ (11.102835 seconds vs. 11.425697 seconds)

2) there isn't much differ in icpc with/without those mkl's calls. (11.182771 seconds vs. 11.102835 seconds)

Jofre1 · ‎01-22-2021

Thanks Viet for the feedback.

I compile with icc without any issues. It compiles also with icpc. Today my time is at 9.7s for the code using AVX-512 intrinsic instructions (I have already seen 8.8s, no idea what the PC was doing few minutes ago), but to be honest it is not just the seconds that I care: what I would like is to really understand why the code is so far from the theoretical peak performance.

- I am hitting some intrinsic limitation due to the algorithm I am using?

- There is some fundamental flaw on the way I have written/implemented the algorithm?

- Is the information being reported by Advisor miss-leading?

Other similar examples that I found on the web use Structure Of Arrays: sure they start with OAS and say it is bad, then go to SOA and it better, and so on. Fine. As I said, I am an experienced Fortran user, and I wanted a code in which I feel comfortable, as it is just the starting point for more complex things. For this very reason, I really would like to know where it is the issue.

By looking at the assembly code (which I am not at all an expert), I saw that the 1/sqrt(x) was not being translated into a call to rsqrt(x) in Fortran, which I was possible using the AVX-512 instruccions. This motivates me to go to C, so I could call the packed 512 intrinsics provided by Intel [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=rsqrt&expand=4804,4803,4806,4812]. I thought that it will allow me to get closer to the theoretical performance, but it did not.

My understanding of "good practices" tells me that the code should be pretty efficient, and my scientist side really, really, would like to know why it is not...

All the best,

J.

jimdempseyatthecove · ‎01-23-2021

I suggest you explore the program using VTune, look at the Bottom-Up view, then expand with Disassembly if/where necessary.

Looking at summary info tells you the end results, but provides no information as to how/why you attained the results.

Also, (fast) inverse square root sacrifices precision for performance. I will assume you are aware of this.

Jim Dempsey

Jofre1 · ‎01-26-2021

Dear Jim,

Thanks for the advice. Indeed Vtune provides more detailed information. I did:

vtune -collect uarch-exploration ./a.out

I attach the result as a file as it is a bit long.

But the main takeovers (assuming I am reading the report correctly) are:

- "A significant portion of pipeline slots are remaining empty. "

- "A significant fraction of cycles was stalled due to Core non-divider-related issues."

I have to admit that I am a bit lost on what to do next... The problem is, of course, in this part:

        
        for(size_t i = 0; i < n_ions; i++ ){
            r_xi = _mm512_set1_pd(r_x[i]);
            r_yi = _mm512_set1_pd(r_y[i]);
            r_zi = _mm512_set1_pd(r_z[i]);
            ax = a2_x[i];
            ay = a2_y[i];
            az = a2_z[i];
            #pragma nounroll
            for(size_t j = 0; j < n_ions; j+=8 ){
                r_xj  = _mm512_load_pd( &r_x[j] );
                rji_x = _mm512_sub_pd( r_xi, r_xj ); 

                r_yj  = _mm512_load_pd( &r_y[j] );
                rji_y = _mm512_sub_pd( r_yi, r_yj );

                r_zj  = _mm512_load_pd( &r_z[j] );
                rji_z = _mm512_sub_pd( r_zi, r_zj );

                r2inv = _mm512_fmadd_pd(rji_x, rji_x, packed_soft);
                r2inv = _mm512_fmadd_pd(rji_y, rji_y, r2inv);
                r2inv = _mm512_fmadd_pd(rji_z, rji_z, r2inv);

                r2inv = _mm512_rsqrt14_pd(r2inv);

                r2inv = _mm512_mul_pd(_mm512_mul_pd(r2inv,packed_alpha),_mm512_mul_pd(r2inv,r2inv) );

                rji_x = _mm512_mul_pd(rji_x,r2inv);
                ax += _mm512_reduce_add_pd( _mm512_mul_pd(rji_x,r2inv));
                
                rji_y = _mm512_mul_pd(rji_y,r2inv);
                ay += _mm512_reduce_add_pd( _mm512_mul_pd(rji_y,r2inv));

                rji_z = _mm512_mul_pd(rji_z,r2inv);
                az += _mm512_reduce_add_pd( _mm512_mul_pd(rji_z,r2inv));
            }
            a2_x[i] = ax;
            a2_y[i] = ay;
            a2_z[i] = az;
        }

I did also re-written it using structures (so they would be closer in memory), but not much change...

I will try the GUI tomorrow once I am in the office, to see if Vtune-GUI can guide me better.

Nevertheless, any suggestions are welcomed!

Regards,

Jofre

jimdempseyatthecove · ‎01-26-2021

The point I was making about using VTune was not to obtain the summary, rather, to use the Bottom Up view to locate the hot spots, then by clicking on the hot spots, open the source, then if necessary, open Disassembly to see what is actually going on. Note, you do not need to fully understand assembly code to get feedback about what is going on with code generation by the compiler.

In looking at your sample code, I expect you will find substantial pipeline stalls when using r2inv for input immediately after you compute and write to it. In particular after the rsqrt14 but also to a lesser extent with the fmadd's. (as indicated by the two items you noted in your report).

To correct for this, I suggest not unrolling the inner loop (as you have done) but rather rewrite the loop such that you process two iterations per pass... in an interleaved manner *** but you may also need to stagger the interleave.

Also,try this first: consider not using r_xj and rji_x after the rsqrt14, instead use additional variables (registers) .AND. then place the memory loads into r_xj and rji_x immediately after the r rsqrt14 (this also requires lifting the first loads out of the inner loop.

Jim Dempsey

Jofre1 · ‎01-26-2021

Thanks Jim!!!

I need to digest your answer. I will be back as soon as I can!

Jofre

jimdempseyatthecove · ‎01-26-2021

Also, perform the reduced adds outside the inner loop (zero accumulators outside the loop, accumulate inside, do reduced add after the inner loop, then accumulate into a2_x,...

Jim Dempsey

Jofre1 · ‎01-26-2021

Regarding your first comment: I see, I read it somewhere that is a technique to improve pipeline:

(FG)^n => F (GN)^(n-1) G

I already tried but in a different way that you suggest. I need to think more about the algorithm itself...

Regarding your second comment: of course, I did not thought about it before! rather than use a reduce, just accumulate into a packed vector, and do the reduce outside! Really good point!

Ok, let's try...

jimdempseyatthecove · ‎01-26-2021

This is a suggestion as to how to buffer the reads:

            r_xj_buff  = _mm512_load_pd( &r_x[0] );
            r_yj_buff  = _mm512_load_pd( &r_y[0] );
            r_zj_buff  = _mm512_load_pd( &r_z[0] );
            #pragma nounroll
            for(size_t j = 0; j < n_ions; j+=8 ){
                r_xj  = r_xj_buff
		r_xj_buff  = _mm512_load_pd( &r_x[j+8] );	// requires one vector of padd
                rji_x = _mm512_sub_pd( r_xi, r_xj ); 

                r_yj  = r_yj_buff
		r_yj_buff  = _mm512_load_pd( &r_y[j] );		// requires one vector of padd
                rji_y = _mm512_sub_pd( r_yi, r_yj );

                r_zj  = r_zj_buff
		r_zj_buff  = _mm512_load_pd( &r_z[j] );		// requires one vector of padd
                rji_z = _mm512_sub_pd( r_zi, r_zj );

Once that is working and you have performance data, try moving the loads (between the fmadd and/or following the sqrt.

I will let you sort out the reduce add and later interleave (the hardest).

You still need to look at the Bottom-up and Disassembly. Note, often, the instruction following the one with the latency gets billed for the time of the latency (iow when the following instruction stalls due to one of the inputs not having been completed with prior operation).

Jim Dempsey

Jofre1 · ‎01-27-2021

Dear Jim,

Changing the inner reduction intrinsic by an addition of mm512, and doing the reduction after loop has lead to huge decrease on cpu time (~8.8s to ~2.9s). That was a great insight!

However, the introduction of the buffer is a great suggestion. I think that I understand what you had in your mind: the rsqrt is taking time to finish, so do a load while it is finishing with it (using the out-of-order ports). So I put the buffer loads right after the rsqrt. However when looking at the assembly code, the loads are not done at the right place!!!:

Address Line Assembly Total Time % Self Time %
0 0x402448 Block 1: 1310720000
0 0x402448 190 vsubpd %zmm11, %zmm15, %k0, %zmm19 31,999ms 0.52% 31,999ms 1.08%
0 0x40244e 188 inc %r10d 208,006ms 3.40% 208,006ms 7.00%
0 0x402451 196 vsubpd %zmm9, %zmm13, %k0, %zmm22 12,000ms 0.20% 12,000ms 0.40%
0 0x402457 193 vsubpd %zmm10, %zmm14, %k0, %zmm20 207,998ms 3.40% 207,998ms 7.00%
0 0x40245d 207 vmovupsz 0x669600(%r9), %k0, %zmm10 8,000ms 0.13% 8,000ms 0.27%
0 0x402467 208 vmovupsz 0x667400(%r9), %k0, %zmm11 75,998ms 1.24% 75,998ms 2.56%
0 0x402471 198 vmovaps %zmm5, %k0, %zmm9 19,999ms 0.33% 19,999ms 0.67%
0 0x402477 198 vfmadd231pd %zmm19, %zmm19, %k0, %zmm9 115,993ms 1.90% 115,993ms 3.91%
0 0x40247d 199 vfmadd231pd %zmm20, %zmm20, %k0, %zmm9 8,000ms 0.13% 8,000ms 0.27%
0 0x402483 200 vfmadd231pd %zmm22, %zmm22, %k0, %zmm9 191,993ms 3.14% 191,993ms 6.47%
0 0x402489 201 vrsqrt14pd %zmm9, %k0, %zmm16 59,991ms 0.98% 59,991ms 2.02%
0 0x40248f 206 vmovupsz 0x66b800(%r9), %k0, %zmm9 327,579ms 5.35% 327,579ms 11.03%
0 0x402499 210 vmulpd %zmm4, %zmm16, %k0, %zmm17 20,000ms 0.33% 20,000ms 0.67%
0 0x40249f 210 vmulpd %zmm16, %zmm16, %k0, %zmm18 247,988ms 4.05% 247,988ms 8.35%
0 0x4024a5 210 vmulpd %zmm18, %zmm17, %k0, %zmm21 28,000ms 0.46% 28,000ms 0.94%
0 0x4024ab 188 add $0x40, %r9 479,996ms 7.84% 479,996ms 16.16%
0 0x4024af 212 vfmadd231pd %zmm19, %zmm21, %k0, %zmm12 20,000ms 0.33% 20,000ms 0.67%
0 0x4024b5 213 vfmadd231pd %zmm20, %zmm21, %k0, %zmm0 496,000ms 8.10% 496,000ms 16.70%
0 0x4024bb 214 vfmadd231pd %zmm22, %zmm21, %k0, %zmm7 73,986ms 1.21% 73,986ms 2.49%
0 0x4024c1 188 cmp $0x80, %r10d 111,999ms 1.83% 111,999ms 3.77%
0 0x4024c8 188 jb 0x402448 <Block 1>

Corresponding code:
Line Source Total Time % Loop/Function Time % Traits
188 for(size_t j = 0; j < n_ions; j+=8 ){ 800,001ms 29.14% 2 745,530ms 100.00%
189 r_xj = r_xj_buff;
190 rji_x = _mm512_sub_pd( r_xi, r_xj ); 32,000ms 1.17%
191
192 r_yj = r_yj_buff;
193 rji_y = _mm512_sub_pd( r_yi, r_yj ); 207,998ms 7.58%
194
195 r_zj = r_zj_buff;
196 rji_z = _mm512_sub_pd( r_zi, r_zj ); 12,000ms 0.44%
197
198 r2inv = _mm512_fmadd_pd(rji_x, rji_x, packed_soft); 135,993ms 4.95% FMA
199 r2inv = _mm512_fmadd_pd(rji_y, rji_y, r2inv); 8,000ms 0.29% FMA
200 r2inv = _mm512_fmadd_pd(rji_z, rji_z, r2inv); 191,993ms 6.99% FMA
201 r2inv = _mm512_rsqrt14_pd(r2inv); 59,991ms 2.19% Appr. Reciprocals(AVX-512F)
202 //~ r2inv = _mm512_invsqrt_pd(r2inv);
203 //~ r2inv = _mm512_sqrt_pd(r2inv);
204 //~ r2inv = _mm512_div_pd(packed_one,r2inv);
205
206 r_zj_buff = _mm512_load_pd( &r_z[j+8] ); 327,579ms 11.93%
207 r_yj_buff = _mm512_load_pd( &r_y[j+8] ); 8,000ms 0.29%
208 r_xj_buff = _mm512_load_pd( &r_x[j+8] ); 75,998ms 2.77%
209
210 r2inv = _mm512_mul_pd(_mm512_mul_pd(r2inv,packed_alpha),_mm512_mul_pd(r2inv,r2inv) ); 295,989ms 10.78%
211
212 ax = _mm512_add_pd(ax,_mm512_mul_pd(rji_x,r2inv)); 20,000ms 0.73% FMA
213 ay = _mm512_add_pd(ay,_mm512_mul_pd(rji_y,r2inv)); 496,000ms 18.07% FMA
214 az = _mm512_add_pd(az,_mm512_mul_pd(rji_z,r2inv)); 73,986ms 2.69% FMA
215 }

}
Any idea on how to stop the compiler re-ordering things? (I tried Google, but no luck...)

I, of course, tried to unroll manually (and with a pragma), but because afterwards the compiler re-order things, it ends up worst.

I found a interesting note which is clearly related but I do not manage to grasp how to transpose it to the present problem:
https://colfaxresearch.com/skl-avx512/#sec-2-1

In any case, thanks a lot for your help so far!
Regards,
Jofre

jimdempseyatthecove · ‎01-27-2021

>>However when looking at the assembly code, the loads are not done at the right place!!!:

The compiler optimizations will reorder the instructions based on well studied behavior. Most of the time the compiler writers do a marvelous job.

you might experiment with inserting

asm volatile ("" ::: "memory")

in front of the loads

Jim Dempsey

Jofre1 · ‎01-27-2021

>> The compiler optimizations will reorder the instructions based on well studied behavior. Most of the time the compiler writers do a marvelous job.

And unfortunately (and unsurprisingly), you are right! The assembly now looks like I thought had to look, but the performance is a bit worst ( 2.92s vs 3.11s).

OK, I think I have hit the bottom then... unless any last minute suggestions?

jimdempseyatthecove · ‎01-27-2021

The Colfax is informative, but not directly applicable to your coding situation.

In your current code, you you have an i loop that extracts scalars and then vector by vector computes the ax, ay, az values. In this method each vector worth of the i loop (8 elements) each experience the memory fetch latencies of the inner j loop.

I suggest you fetch a vector worth of the a's in the i loop, and then within the j loop, perform 8 computations in sequence using combinations of the permute/swizzle that perform the permutations of the a-values.

The sketch is as follows:

loop i
load a's into vectors (ax_vec, ay_vec, az_vec)	7|6|5|4|3|2|1|0
loop j
load b's into vectors
compute interactions between a's and b's (essentially same algorithm as you have now)
swap lanes of a's		6|7|4|5|2|3|0|1
compute interactions between a's and b's (essentially same algorithm as you have now)
swap double lanes of a's	4|5|6|7|0|1|2|3
compute interactions between a's and b's (essentially same algorithm as you have now)
swap lanes of a's		5|4|7|6|1|0|3|2
compute interactions between a's and b's (essentially same algorithm as you have now)
swap quad lanes of a's		1|0|3|2|5|4|7|6
compute interactions between a's and b's (essentially same algorithm as you have now)
swap lanes of a's		0|1|2|3|4|5|6|7
compute interactions between a's and b's (essentially same algorithm as you have now)
swap double lanes of a's	2|3|0|1|6|7|4|5
compute interactions between a's and b's (essentially same algorithm as you have now)
swap lanes of a's		3|2|1|0|7|6|5|4
compute interactions between a's and b's (essentially same algorithm as you have now)
swap quad lanes of a's		7|6|5|4|3|2|1|0 (back to original lanes of loop i)
end loop j
store a's

Now I won't tell you how to do this as you will learn better reading through the

Intel® Intrinsics Guide

To learn how to move lanes about.

The above technique, will take some work the first time you do it, but will give you a template of how to do this for future projects.

The above technique will reduce the memory fetches of the inner loop by a factor of 8. The reduction of the i loop would depend upon if the scalar code fetch cache line remained in cache or not. I assume not.

Jim Dempsey

Jofre1 · ‎01-28-2021

Dear Jim,

I am trying to see how to do the swaping that you suggested. I found _mm512_permutevar_pd which could do the job (it is this that you had in mind?).
However, I do not understand how to use it. The following code give always the same result, independently of values on ii:

__m512d a, b;
__m512i ii = {1,1,0,0, 0,0,0,0};
double r[8] __attribute__((aligned(64)));
for( size_t i = 0; i < 8; i++ ){
r[i] = i*1.0e00;
}

a = _mm512_load_pd( &r[0] );
b = _mm512_permutevar_pd (a, ii);
for( size_t i = 0; i < 8; i++ ){
printf("%d %E %E\n",i, a[i], b[i]);
}

0 0.000000E+00 0.000000E+00
1 1.000000E+00 0.000000E+00
2 2.000000E+00 2.000000E+00
3 3.000000E+00 2.000000E+00
4 4.000000E+00 4.000000E+00
5 5.000000E+00 4.000000E+00
6 6.000000E+00 6.000000E+00
7 7.000000E+00 6.000000E+00

What I am missing?