Outer loop OpenMP + inner loop vectorization vs MKL

pilot117 — Sun, 13 Dec 2009 03:51:32 GMT

Hi,

I have written a simple code that implements b=A*x and tested it on a machine with two quad-core CPUs. When compiled, its outer loop is parallelized with OpenMP and its inner loop is vectorized.

#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        b[i] += A[i*N+j] * x[j];
    }
}

With the number of threads set to 8, N=20000, and KMP_AFFINITY=verbose, I get:

KMP_AFFINITY: Affinity capable, using global cpuid instr info
KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7}

vec mv time: 90.882004 ms

which is a little faster than MKL sgemv. With 4 threads it takes ~102 ms. Why is there only a slight improvement when the number of threads is doubled?

However, when I set KMP_AFFINITY=verbose,compact (in this case MKL sgemv has its best performance, ~92 ms), the timing of the above code changes a lot:

KMP_AFFINITY: Affinity capable, using global cpuid instr info
KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
KMP_AFFINITY: OS proc 1 maps to package 0 core 1 [thread 0]
KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]
KMP_AFFINITY: OS proc 3 maps to package 0 core 3 [thread 0]
KMP_AFFINITY: OS proc 4 maps to package 1 core 0 [thread 0]
KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]
KMP_AFFINITY: OS proc 6 maps to package 1 core 2 [thread 0]
KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {4}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {6}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}

vec mv time: 284.753998

With KMP_AFFINITY=verbose,scatter the time improves from 284 ms to 137 ms, but that is still much worse than 90 ms!

KMP_AFFINITY: Affinity capable, using global cpuid instr info
KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
KMP_AFFINITY: OS proc 1 maps to package 0 core 1 [thread 0]
KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]
KMP_AFFINITY: OS proc 3 maps to package 0 core 3 [thread 0]
KMP_AFFINITY: OS proc 4 maps to package 1 core 0 [thread 0]
KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]
KMP_AFFINITY: OS proc 6 maps to package 1 core 2 [thread 0]
KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {4}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {1}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {5}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {2}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {6}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {3}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}

vec mv time: 137.539993

Why does performance get so much worse when I specify how the threads are distributed among the cores?

With KMP_AFFINITY set to scatter or compact, it seems that OpenMP + vectorization performs WORSE than vectorizing the inner loop alone!

Grant_H_Intel — Fri, 12 Feb 2010 16:26:07 GMT

pilot117,

First, please let me apologize that nobody has responded to your post sooner.

Although it is difficult to be sure, I would guess that you are timing the entire program, including the time it takes to bind the threads to processors using KMP_AFFINITY. If you put an empty parallel region before your code (or maybe a parallel region that just prints the result of omp_get_thread_num(), to prevent the compiler from removing the region), then the binding will happen at that first (dummy) parallel region. Then, by putting timing calls immediately before and after the parallel region that does the actual work, you should see much better times, because you will no longer be including the time it takes to bind each thread to its processor.
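
A minimal sketch of the warm-up idea above, assuming single-precision data (as the sgemv comparison suggests) and a row-major N x N matrix. The helper names now_ms, warm_up_threads, matvec and time_matvec_ms are hypothetical and are not taken from the original program. A reduction stands in for the printing suggested above so that the dummy region has an observable result, and C11 timespec_get is used for timing so the sketch also builds without OpenMP; with OpenMP enabled, omp_get_wtime() would be the more usual choice.

```c
#include <stddef.h>
#include <time.h>

/* Wall-clock time in milliseconds (C11 timespec_get). */
static double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Dummy parallel region: makes the OpenMP runtime create and bind its
   threads before anything is timed. The reduction gives the region an
   observable result so the compiler cannot drop it.
   Returns the number of threads that took part (1 without OpenMP). */
static int warm_up_threads(void) {
    int count = 0;
    #pragma omp parallel reduction(+:count)
    {
        count += 1;
    }
    return count;
}

/* b = A*x for a row-major n x n matrix (assumed layout): outer loop
   parallel, inner loop left for the compiler to vectorize. */
static void matvec(const float *A, const float *x, float *b, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += A[(size_t)i * n + j] * x[j];
        b[i] = sum;
    }
}

/* Times only the work region; thread start-up and binding are paid for
   in the warm-up call, outside the timed interval. */
static double time_matvec_ms(const float *A, const float *x, float *b, int n) {
    warm_up_threads();
    double t0 = now_ms();
    matvec(A, x, b, n);
    return now_ms() - t0;
}
```

Calling time_matvec_ms(A, x, b, N) once with and once without the warm_up_threads() line should show how much of the measured time, if any, comes from thread start-up rather than from the computation.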

Another option is to make the b=A*x arrays much larger to amortize the time it takes to bind the threads to processors. I think this should give you a better idea of the effect that the thread binding has on the computation, rather than the time it takes to do the binding itself.

Hope this helps,

- Grant
