Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Significant Overhead if threaded MKL is called from OpenMP parallel region

Felix__K_
Beginner
1,925 Views

Hello,

my aim is to diagonalize square matrices of different sizes d x d in parallel. To this end I wrote a for loop. In each iteration, aligned memory (depending on the dimension d) is allocated with mkl_malloc(), the matrix is filled, and dsyev is called to determine the optimal workspace size. Then I allocate the (aligned) workspace with mkl_malloc(), call dsyev again to diagonalize the matrix, and deallocate both the workspace and the matrix storage with mkl_free().

Since the diagonalizations are independent of each other, I want to run them in parallel with OpenMP, so the loop is parallelized with #pragma omp parallel for and appropriate scheduling. The memory used by one diagonalization is never accessed by a different thread.
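Schematically, each loop iteration follows this pattern (a simplified sketch, not the actual code; the placeholder fill and the function name diagonalize_all are illustrative only):

#include <mkl.h>
#include <omp.h>

void diagonalize_all(const int *dims, int n)
{
    #pragma omp parallel for schedule(dynamic, 1)
    for (int k = 0; k < n; ++k)
    {
        int d = dims[k];
        double *a = (double *)mkl_malloc(sizeof(double) * d * d, 64); // aligned matrix storage
        double *w = (double *)mkl_malloc(sizeof(double) * d, 64);     // eigenvalues

        // placeholder fill; the real code stores application-specific data
        for (int i = 0; i < d; ++i)
            for (int j = 0; j <= i; ++j)
                a[i + j * d] = 1.0 / (1.0 + i + j);

        char job = 'V', uplo = 'L';
        int lwork = -1, info = 0;
        double wkopt;
        DSYEV(&job, &uplo, &d, a, &d, w, &wkopt, &lwork, &info);      // workspace query

        lwork = (int)wkopt;
        double *work = (double *)mkl_malloc(sizeof(double) * lwork, 64);
        DSYEV(&job, &uplo, &d, a, &d, w, work, &lwork, &info);        // actual diagonalization

        mkl_free(work);
        mkl_free(w);
        mkl_free(a);
    }
}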

I run the code with OMP_NESTED=true, MKL_DYNAMIC=false and OMP_DYNAMIC=false. With OMP_NUM_THREADS=1 and MKL_NUM_THREADS=4,8,16 no significant overhead (%sys in the Linux top command) is observed. With OMP_NUM_THREADS=4 and MKL_NUM_THREADS=1, i.e. calling the sequential version of MKL dsyev, there is also no significant overhead, and roughly the same performance is achieved as in the opposite case (MKL_NUM_THREADS=4, OMP_NUM_THREADS=1).

BUT, if I now try to exploit my OpenMP parallelization with, for example, OMP_NUM_THREADS=2,4 and MKL_NUM_THREADS=4, I get a huge slowdown. Up to 30% of the processor capacity is spent in system calls (kernel time), and the more OpenMP threads I use, the greater the slowdown. I tried different scheduling techniques to ensure the best possible load balancing, but changing the scheduling does not remove the overhead.

Are the frequent calls to mkl_malloc() and mkl_free() from different threads the reason for this? If so, I could allocate the maximum memory needed as one big block before entering the parallel region. Unfortunately, the MKL routines also have their own internal memory management to tune their performance. Is it likely that the internal memory management of the threaded MKL dsyev causes such a large overhead? Are there any other possible reasons for this slowdown?

Best regards,

Felix Kaiser

0 Kudos
21 Replies
Felix__K_
Beginner
1,774 Views

**** UPDATE ****

I've moved all calls to mkl_malloc() and mkl_free() outside the parallel region and set the MKL_DISABLE_FAST_MM environment variable. This did not help. With OMP_NUM_THREADS=2, the overhead also increases as I raise the number of MKL threads (MKL_NUM_THREADS=2,4,8).
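For reference, the same memory-manager controls also have programmatic counterparts; a rough sketch (assuming the standard MKL support functions):

#include <mkl.h>

void configure_mkl_memory()
{
    // Same effect as setting MKL_DISABLE_FAST_MM before the first MKL call:
    // MKL then uses the system malloc/free instead of its internal allocator.
    mkl_disable_fast_mm();
}

void release_mkl_buffers()
{
    // Releases memory held by MKL's internal memory manager
    // (mkl_thread_free_buffers() does the same for the calling thread only).
    mkl_free_buffers();
}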

Best regards,

Felix Kaiser

0 Kudos
Ying_H_Intel
Employee
1,774 Views

Hi Felix,

What OpenMP compiler are you using? How many physical processors do you have? When you set OMP_NUM_THREADS=2 and increase the number of MKL threads to MKL_NUM_THREADS=2,4,8 etc., do the threads oversubscribe the physical processors?

As I understand it, you have your own OpenMP threads on the "for" loop, and the MKL function may use MKL threading internally, and you hope that nested threading will help, right? But there is a precondition: you must have enough physical cores. Otherwise nested threading doesn't help.

I dug up some discussions about MKL nested threading issues for your reference.

For example, see the MKL User's Guide and https://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library/

It is recommended that Intel MKL run on a single thread when called from a threaded region of an application, to avoid oversubscription of system resources.

And in most cases MKL threading is not needed: when the threads of your application already utilize all physical cores of the system, MKL threading would only lead to oversubscription.

Only in some cases should you enable MKL threading: use it when you are sure that there are enough resources (physical cores) for MKL threading in addition to your own threads, and choose the number of threads carefully.

https://software.intel.com/en-us/articles/recommended-settings-for-calling-intelr-mkl-routines-from-multi-threaded-applications

and some discussions also in

https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-using-intel-mkl-with-threaded-applications

https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application

Best Regards,

Ying

 

0 Kudos
Felix__K_
Beginner
1,774 Views

Hello Ying,

I'm using Intel C++ Composer (if that answers your question); typing icpc in the terminal reports: icpc (ICC) 11.1 20100414. I fully agree with you, I've already read all the literature you posted, and yes, you understand me correctly. When using OMP_NUM_THREADS=2 and MKL_NUM_THREADS=2,4,8, I need at most 16 physical cores. I ensured this in my tests by explicitly setting KMP_AFFINITY='proclist=[{list of IDs for 16 physical cores}],explicit'. For my problem sizes, the threaded MKL routines dgemm and dsyev give the best performance with 8 threads. Hence I would like to call dsyev (using 8 cores) from a parallel for loop (parallelized with my own OpenMP threads, i.e. 2, 4 or even more) to get a huge speedup, provided proper load balancing is ensured. However, it turned out that even when enough physical cores are available and the recommended environment variables are set, the code slows down significantly, which I don't understand.

Best regards,

Felix

0 Kudos
Ying_H_Intel
Employee
1,774 Views

Hi Felix,

is it the same for dgemm, or just for dsyev?

When you run the application, do you have a tool like VTune to show how many OpenMP threads are active in the system?

One thing that comes to my mind (and may be related): dsyev is a LAPACK function and will call BLAS functions, so the nested parallelism inside may cause some issues.

How about trying

OMP_NESTED=false while keeping OMP_NUM_THREADS=2 and MKL_NUM_THREADS=2,4,8?

or export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=2" (or 4, 8, etc.; see the sketch below for the same control from code),

and let me know if you get any result.
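The same domain control can also be set programmatically; a rough sketch (the thread counts are examples, as in the suggestion above):

#include <mkl.h>

void set_domain_threads()
{
    mkl_domain_set_num_threads(1, MKL_DOMAIN_ALL);   // default for all domains
    mkl_domain_set_num_threads(4, MKL_DOMAIN_BLAS);  // BLAS-specific count (2, 4, 8, ...)
}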

Best Regards,

Ying

 

0 Kudos
Felix__K_
Beginner
1,774 Views

Hello Ying,

so far I have checked only dsyev. I got the following results:

Setting 1: OMP_NESTED=false, OMP_NUM_THREADS=2, MKL_NUM_THREADS=2,4,8: No overhead observed. Only 2 threads are active (the OpenMP ones).

Setting 2: OMP_NESTED=true, MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=2,4,8, OMP_NUM_THREADS=2: No overhead observed, but again only the 2 OpenMP threads are used. Why does the setting MKL_DOMAIN_BLAS=2,4,8 have no impact?

Setting 3: OMP_NESTED=true, OMP_NUM_THREADS=1, MKL_NUM_THREADS=2,3,4: runs with 4, 6, 16 threads. No overhead introduced. It seems that each MKL thread spawns another MKL_NUM_THREADS threads itself. The actual number of threads created by the internal nested regions of the MKL functions can be restricted with OMP_THREAD_LIMIT.

Finally, I found out that the problem was the internal nested parallelism of the MKL functions. With 16 usable cores, and setting OMP_NUM_THREADS=2, OMP_NESTED=true and e.g. MKL_NUM_THREADS=4, one gets 2*4*4 = 32 threads due to the nested parallelism inside dsyev. Hence two threads run on each core, leading to overhead. Even worse, setting OMP_NUM_THREADS=4 one gets 4*4*4 = 64 threads, so four threads run on each core, leading to a significant overhead.

One possible workaround would be to set OMP_THREAD_LIMIT=16. The other is to restrict the number of active levels of nested parallelism to 2 by calling omp_set_max_active_levels(2), since MKL dsyev is called from a simple (not nested) OpenMP region of my program. Setting omp_set_max_active_levels(2) and using OMP_NUM_THREADS=4, MKL_NUM_THREADS=4, OMP_NESTED=true gives the desired result: 16 threads run on 16 physical cores and no overhead (besides scheduling) is generated. A sketch of this configuration is shown below.
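A minimal sketch of this working configuration (illustrative only; do_one_diagonalization() stands in for the per-matrix call to the threaded dsyev):

#include <mkl.h>
#include <omp.h>

void do_one_diagonalization(int i);   // placeholder for the per-matrix dsyev call

void run_all(int n)
{
    omp_set_nested(1);              // OMP_NESTED=true
    omp_set_max_active_levels(2);   // my level plus one level of MKL threads, no deeper
    mkl_set_dynamic(0);             // MKL_DYNAMIC=false
    mkl_set_num_threads(4);         // MKL_NUM_THREADS=4

    #pragma omp parallel for num_threads(4) schedule(dynamic, 1)   // OMP_NUM_THREADS=4
    for (int i = 0; i < n; ++i)
        do_one_diagonalization(i);  // each iteration runs the 4-thread dsyev
}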

My final question: if I just enable OMP_NESTED=true in a serial program, how do the MKL functions that don't use nested parallelism perform compared to those that do? Is there a significant performance difference?

Best regards,

Felix

0 Kudos
Alexander_K_Intel3
1,774 Views

Hello Felix,

MKL BLAS and LAPACK run better when deeper nested threading is disabled (MKL spawns only one level of threads).
So the recommended approach is to use omp_set_max_active_levels(2), so that MKL spawns only the first level of nested threads.

Best regards,
Alexander

0 Kudos
Felix__K_
Beginner
1,774 Views

Hello Alexander, Hello Ying,

thanks a lot for your help! I finally got the desired parallelization. I wonder why the recommended settings from

https://software.intel.com/en-us/articles/recommended-settings-for-calling-intelr-mkl-routines-from-multi-threaded-applications

do not work out. With MKL_DYNAMIC=true, only my own OpenMP threads are used and MKL runs in sequential mode. With MKL_DYNAMIC=false, the nested threading of dsyev again creates more threads than there are physical cores available, and setting OMP_DYNAMIC=true does not prevent this. Did I miss something?

I finally did some (very rough) benchmarking on a test problem. Running dsyev with 8 cores, without my own OpenMP parallelization, gives the best performance. I then expected to observe a speedup when calling dsyev with 8 cores from 2 (own) OpenMP threads. Unfortunately, this is slower. What could be the reason for that?

In the numerically expensive steps, the number of diagonalizations that have to be performed is about 127, and the dimensions of the real symmetric matrices that need to be diagonalized range from 1 to about 5500.

Best regards,

Felix Kaiser

0 Kudos
Ying_H_Intel
Employee
1,774 Views

Hi Felix,

Thank you for the exploration. Right, we need to remove or modify the statement about OMP_DYNAMIC in that article.

Inside a parallel region, only MKL_DYNAMIC controls the number of MKL threads.

Here is the description from the MKL User's Guide:

The MKL_DYNAMIC environment variable enables Intel MKL to dynamically change the number of threads.
The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.
When MKL_DYNAMIC is TRUE, Intel MKL tries to use what it considers the best number of threads, up to the
maximum number you specify.

So when MKL_DYNAMIC=true, MKL is able to detect that it is in a parallel region and change its number of threads. Only your own OpenMP threads are used and MKL runs in sequential mode because MKL finds itself inside an Intel OpenMP parallel region, so it chooses to run sequentially to avoid oversubscription of system resources. OMP_DYNAMIC=true or false cannot control the MKL threads.
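A small sketch of the programmatic counterparts of this control, for reference (mkl_set_dynamic(0) corresponds to MKL_DYNAMIC=false; the printed values simply report what MKL currently targets):

#include <cstdio>
#include <mkl.h>

int main()
{
    mkl_set_num_threads(4);
    std::printf("dynamic=%d, target threads=%d\n",
                mkl_get_dynamic(), mkl_get_max_threads());

    mkl_set_dynamic(0);   // disable MKL's dynamic thread adjustment
    std::printf("dynamic=%d, target threads=%d\n",
                mkl_get_dynamic(), mkl_get_max_threads());
    return 0;
}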

Regarding the performance of the two cases, "running dsyev with 8 cores without my own OpenMP parallelization" versus calling dsyev with 8 cores from 2 (own) OpenMP threads: what does the performance look like? And could you tell me the processor type?

Best Regards,

Ying  

 

0 Kudos
Felix__K_
Beginner
1,774 Views

Hello Ying,

sorry for the delay, but the benchmarks took some time. So finally, here are my results. As mentioned before, I diagonalize 127 real symmetric matrices whose dimensions range from 1 to 5718; hence my loop has 127 iterations. The time (measured to the second) needed to diagonalize all these matrices with MKL's DSYEV was determined for different setups (OMP_NUM_THREADS x MKL_NUM_THREADS | scheduling):

(1x8| none): 8min 24sec, (2x8| dynamic,1): 1h 26min 15sec, (2x8| static,1): 1h 22min 26sec, (2x8| guided): 1h 25min 52sec

(1x4| none): 12min 25sec, (2x4| dynamic,1): 8min 6sec, (2x4| static,1): 8min 5sec, (4x4| guided): 22min 30sec

So if I use 8 cores within the MKL function, I get execution times that are larger by a factor of 10 (!) when running with two (own) OpenMP threads instead of a single thread, i.e. instead of calling DSYEV from a sequential region. This significant decrease in performance seems to be independent of the scheduling type.

If I compare the performance of DSYEV using 2x4 threads, I get performance comparable to the 1x8 case. This speedup also does not depend on the scheduling in a significant way. Finally, if I try to gain even more speedup by calling the 4-core DSYEV from 4 (own) OpenMP threads, I once again get a huge slowdown, this time by a factor of 2 compared to the sequential run (1x4).

The results were checked. All runs give the same results (max absolute difference is 10^{-13}). These calculations were performed on 16 Intel Xeon X5550 CPU's @ 2.67GHz.

Any idea what could be the reason for this ?

Best regards,

Felix

0 Kudos
Ying_H_Intel
Employee
1,774 Views

Hi Felix,

A further check: you mentioned 16 Intel Xeon X5550 CPUs @ 2.67GHz,

and an Intel Xeon X5550 CPU has 4 hardware cores and 8 logical threads when Hyper-Threading (HT) is on.

Do you mean 4 CPUs or 2 CPUs in the system? Is HT on or off?

Seeing from your test results, the best is (2x4| static,1): 8min 5sec, while (4x4| guided) is much worse. I guess you may have 2 CPUs, with 8 hardware cores in total, and HT on. Then only 8 threads are effective, and with 16 threads (2x8) all threads battle for the hardware resources, which impacts performance badly.

Regarding the Hyper-Threading result and the overhead, the Intel® Math Kernel Library 11.3 User's Guide provides some explanation:

Using Intel® Hyper-Threading Technology

 

Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.

Intel Optimized LINPACK Benchmark is threaded to effectively use multiple processors. So, in multi-processor systems, best performance will be obtained with the Intel® Hyper-Threading Technology turned off, which ensures that the operating system assigns threads to physical processors only.

Best Regards,

Ying


 

0 Kudos
Felix__K_
Beginner
1,774 Views

Hello Ying,

sorry for the error in my information. The code runs on 4 Intel Xeon X5550 CPUs with 4 cores each. /proc/cpuinfo tells me that each Intel Xeon processor has 4 cores and 4 siblings, i.e. Hyper-Threading is off.

Best regards,

Felix

0 Kudos
Ying_H_Intel
Employee
1,774 Views

Hi Felix, 

Thanks for the hardware information. When you run the test, could you please try

> export KMP_AFFINITY=verbose 

> your exe (for example the 4x4 case)

and copy the output? Or provide your exe and we may try it here.

Best Regards,

Ying

0 Kudos
Felix__K_
Beginner
1,774 Views

 

Hello Ying,

using > export KMP_AFFINITY='proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],explicit,verbose' with (MKL_NUM_THREADS=4 and OMP_NUM_THREADS=4, OMP_NESTED=true, MKL_DYNAMIC=false and OMP_ACTIVE_LEVELS=2) and then typing > ./exe results in the following output:

OMP: Warning #2: Cannot open message catalog "libiomp5.cat":

OMP: System error #2: No such file or directory

OMP: Hint: Check NLSPATH environment variable, its value is "/opt/intel/mkl/10.2.3.029/lib/em64t/locale/%l_%t/%N:/opt/intel/Compiler/11.1/072/lib/intel64/locale/%l_%t/%N:/opt/intel/Compiler/11.1/072/ipp/em64t/lib/locale/%l_%t/%N:/opt/intel/Compiler/11.1/072/mkl/lib/em64t/locale/%l_%t/%N:/opt/intel/Compiler/11.1/072/idb/intel64/locale/%l_%t/%N".

OMP: Hint: Check LANG environment variable, its value is "de_DE.UTF-8".

OMP: Info #3: Default messages are used.

OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid instr info

OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127}

OMP: Info #156: KMP_AFFINITY: 128 available OS procs

OMP: Info #157: KMP_AFFINITY: Uniform topology

OMP: Info #159: KMP_AFFINITY: 32 packages x 4 cores/pkg x 1 threads/core (128 total cores)

OMP: Info #160: KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):

OMP: Info #168: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 4 maps to package 1 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 6 maps to package 1 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 8 maps to package 2 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 9 maps to package 2 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 10 maps to package 2 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 11 maps to package 2 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 12 maps to package 3 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 13 maps to package 3 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 14 maps to package 3 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 15 maps to package 3 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 16 maps to package 4 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 17 maps to package 4 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 18 maps to package 4 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 19 maps to package 4 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 20 maps to package 5 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 21 maps to package 5 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 22 maps to package 5 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 23 maps to package 5 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 24 maps to package 6 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 25 maps to package 6 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 26 maps to package 6 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 27 maps to package 6 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 28 maps to package 7 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 29 maps to package 7 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 30 maps to package 7 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 31 maps to package 7 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 32 maps to package 8 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 33 maps to package 8 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 34 maps to package 8 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 35 maps to package 8 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 36 maps to package 9 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 37 maps to package 9 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 38 maps to package 9 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 39 maps to package 9 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 40 maps to package 10 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 41 maps to package 10 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 42 maps to package 10 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 43 maps to package 10 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 44 maps to package 11 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 45 maps to package 11 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 46 maps to package 11 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 47 maps to package 11 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 48 maps to package 12 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 49 maps to package 12 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 50 maps to package 12 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 51 maps to package 12 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 52 maps to package 13 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 53 maps to package 13 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 54 maps to package 13 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 55 maps to package 13 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 56 maps to package 14 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 57 maps to package 14 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 58 maps to package 14 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 59 maps to package 14 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 60 maps to package 15 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 61 maps to package 15 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 62 maps to package 15 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 63 maps to package 15 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 64 maps to package 16 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 65 maps to package 16 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 66 maps to package 16 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 67 maps to package 16 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 68 maps to package 17 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 69 maps to package 17 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 70 maps to package 17 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 71 maps to package 17 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 72 maps to package 18 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 73 maps to package 18 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 74 maps to package 18 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 75 maps to package 18 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 76 maps to package 19 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 77 maps to package 19 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 78 maps to package 19 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 79 maps to package 19 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 80 maps to package 20 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 81 maps to package 20 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 82 maps to package 20 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 83 maps to package 20 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 84 maps to package 21 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 85 maps to package 21 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 86 maps to package 21 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 87 maps to package 21 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 88 maps to package 22 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 89 maps to package 22 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 90 maps to package 22 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 91 maps to package 22 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 92 maps to package 23 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 93 maps to package 23 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 94 maps to package 23 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 95 maps to package 23 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 96 maps to package 24 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 97 maps to package 24 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 98 maps to package 24 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 99 maps to package 24 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 100 maps to package 25 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 101 maps to package 25 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 102 maps to package 25 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 103 maps to package 25 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 104 maps to package 26 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 105 maps to package 26 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 106 maps to package 26 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 107 maps to package 26 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 108 maps to package 27 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 109 maps to package 27 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 110 maps to package 27 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 111 maps to package 27 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 112 maps to package 28 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 113 maps to package 28 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 114 maps to package 28 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 115 maps to package 28 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 116 maps to package 29 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 117 maps to package 29 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 118 maps to package 29 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 119 maps to package 29 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 120 maps to package 30 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 121 maps to package 30 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 122 maps to package 30 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 123 maps to package 30 core 3 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 124 maps to package 31 core 0 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 125 maps to package 31 core 1 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 126 maps to package 31 core 2 [thread 0]

OMP: Info #168: KMP_AFFINITY: OS proc 127 maps to package 31 core 3 [thread 0]

OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}

OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}

OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}

OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}

OMP: Info #147: KMP_AFFINITY: Internal thread 9 bound to OS proc set {9}

OMP: Info #147: KMP_AFFINITY: Internal thread 8 bound to OS proc set {8}

OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {4}

OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {6}

OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}

OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}

OMP: Info #147: KMP_AFFINITY: Internal thread 11 bound to OS proc set {11}

OMP: Info #147: KMP_AFFINITY: Internal thread 12 bound to OS proc set {12}

OMP: Info #147: KMP_AFFINITY: Internal thread 10 bound to OS proc set {10}

OMP: Info #147: KMP_AFFINITY: Internal thread 14 bound to OS proc set {14}

OMP: Info #147: KMP_AFFINITY: Internal thread 13 bound to OS proc set {13}

OMP: Info #147: KMP_AFFINITY: Internal thread 15 bound to OS proc set {15}

 

I could provide the source code, but the diagonalization is just part of a larger code, so I would like to create a minimal example and then post its source code. Unfortunately, I don't know how the symmetric matrices have to be set up in order to be well suited for benchmark tests. Are there any advantageous properties (besides being symmetric and diagonalizable) I should be aware of?

 

Best regards,

 

Felix 

0 Kudos
Ying_H_Intel
Employee
1,774 Views

Hi Felix, 

Thanks a lot. So you have a much bigger Xeon CPU cluster than we have here :). The problem does not seem to be HT then. When you run with

export KMP_AFFINITY='proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],explicit,verbose' and (MKL_NUM_THREADS=2 and OMP_NUM_THREADS=8, OMP_NESTED=true, MKL_DYNAMIC=false and OMP_ACTIVE_LEVELS=2), i.e. the cases

(2x8| dynamic,1): 1h 26min 15sec, (2x8| static,1): 1h 22min 26sec , (2x8| guided): 1h 25min 52sec

what does the CPU activity look like? Are the first 16 cores active?

Another factor: I noticed you are using /opt/intel/mkl/10.2.3.029, while the latest version is MKL 11.2.3. Is it possible to try the new version?

There is no special requirement on the symmetric matrices. You may write out 10 of them from your larger code and test with MKL_NUM_THREADS=2 and OMP_NUM_THREADS=8. That should be OK.

Best Regards,

Ying 

0 Kudos
Felix__K_
Beginner
1,774 Views

Hello Ying,

yes, the first 16 are active, and one thread runs on each of the 16 cores. As additional information, the operating system for the tests I posted before was Red Hat 5.8 (Tikanga) with icpc (ICC) 11.1. The user activity is roughly >95% (Linux top). I tried OMP_NUM_THREADS=1,2 and MKL_NUM_THREADS=4,8 (guided scheduling) on a different system where the code runs on 2 Intel Xeon E5-2680 @ 2.7GHz with 8 cores per processor. I get the following results (this time the operating system was SUSE Linux Server 11 patch level 3, the MKL version 11.2, and the compiler version icpc (ICC) 15.0.2):

(1x8|none): 3min 49sec , (2x8|guided): 2min 52sec, (4x4|guided): 2min 29sec

I also tested the minimal example (compiled with icpc -O3 -Wall example.cpp -o example $MKL_INC $MKL_LIB -openmp -openmp_report2):

#include <omp.h>
#include <mkl.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <ctime>
#include <sstream>
#include <iomanip>
#include <string>
#include <cstring>
#include <algorithm>

class A
{
    public:
        int Dim;
        double *Memory;
        double *Matrix;
        double *EigenValues;

        A() : Dim(1), Memory(NULL), Matrix(NULL), EigenValues(NULL) {};

        A(const A &other) : Dim(other.Dim)
        {
            Memory = (double*)mkl_malloc(sizeof(double)*(Dim * Dim + Dim), 64);
            if(NULL != other.Memory) std::copy(other.Memory, other.Memory + Dim * Dim + Dim, Memory);

            Matrix = Memory;
            EigenValues = Memory + Dim * Dim;
        }

        ~A()
        {
            if(NULL != Memory)
            {
                mkl_free(Memory);
                Matrix = NULL;
                EigenValues = NULL;
                Memory = NULL;
            }
        }

        void Create(int Dim)
        {
            this->Dim = Dim;
            Memory = (double*)mkl_malloc(sizeof(double)*(Dim * Dim + Dim), 64);
            std::memset(Memory, 0, sizeof(double)*(Dim * Dim + Dim));

            Matrix = Memory;
            EigenValues = Memory + Dim * Dim;

            // fill one triangle with arbitrary test data
            for(int i = 0; i < Dim; ++i)
                for(int j = i+1; j < Dim; ++j)
                    Memory[i+j*Dim] = sin((double)i) * j;
        }

        int Diagonalize()
        {
            char Job = 'V';
            char UpLo = 'L';

            int DimWorkspace = -1;
            int Info = 0;

            // workspace query
            double WorkspaceQuery;
            DSYEV(&Job, &UpLo, &Dim, Matrix, &Dim, EigenValues, &WorkspaceQuery, &DimWorkspace, &Info);

            DimWorkspace = (int)WorkspaceQuery;
            double *Workspace = (double*)mkl_malloc(sizeof(double) * DimWorkspace, 64);
            std::memset(Workspace, 0, sizeof(double) * DimWorkspace);

            // actual diagonalization
            DSYEV(&Job, &UpLo, &Dim, Matrix, &Dim, EigenValues, Workspace, &DimWorkspace, &Info);

            mkl_free(Workspace);

            return Info;
        }
};

int main()
{
    omp_set_max_active_levels(2);

    int Info = 0;
    int N = 30;

    int D[30] = {100, 100, 200, 200, 700, 700, 1000, 1000, 1050, 1050, 1500, 1500, 1750, 1800, 1800, 1800, 2000, 2000, 2500, 2500, 2700, 2700, 2950, 4000, 4000, 5500, 5500, 5500, 5500, 5750};

    std::vector<A> H(N);

    time_t start;
    time(&start);
    std::cout << "Time at Start: " << ctime(&start) << std::endl;

    std::stringstream *Buffers = new std::stringstream[N];

    #pragma omp parallel for shared(H) schedule(static,1)
    for(int i = 0; i < N; ++i)
    {
        int ID = omp_get_thread_num();
        H[i].Create(D[i]);
        Buffers[i] << "Diagonalize matrix with dimension " << D[i] << ". Thread " << ID << std::endl;
        Info = H[i].Diagonalize();
        if(Info != 0) std::cout << "Full diagonalization of matrix with dimension " << D[i] << " failed." << std::endl;
    }
    for(int i = 0; i < N; ++i) std::cout << Buffers[i].str() << std::endl;
    delete[] Buffers;

    time_t end;
    time(&end);
    std::cout << "Time at End: " << ctime(&end) << std::endl;

    return 0;
}

 

On both systems I got the same qualitative result: the source of the problem could be either the underlying operating system or the MKL library itself. For further investigation I will try the newest MKL version on the Red Hat system.

Best regards,

Felix

0 Kudos
Ying_H_Intel
Employee
1,774 Views
Hi Felix,

Is there anything wrong with the result? (1x8|none): 3min 49sec, (2x8|guided): 2min 52sec, (4x4|guided): 2min 29sec

I tested on one system with 2 Intel Xeon E5-2680 @ 2.7GHz and 8 cores per processor:

cat /etc/issue
Red Hat Enterprise Linux Server release 6.3 (Santiago)

source /opt/intel/composer_xe_2015.2.164/bin/compilervars.sh intel64
icpc -O3 -Wall example_dsyev.cpp -o example -mkl -openmp -openmp_report2

The result looks fine; 2x8 and 4x4 are faster than 1x16. With the different static, dynamic and guided schedules there are load-imbalance issues (dimensions 100 vs. 5000), so guided and dynamic give better results:

(1x16| none): 1m0.370s, (1x8| none): 1m33.253s, (1x8| dynamic): 1m37.697s, (2x8| guided,2): 0m50.530s, (2x8| static,1): 1m2.764s, (2x8| dynamic,1): 0m54.719s, (4x4| guided,2): 0m49.588s, (4x4| static,1): 0m55.362s, (4x4| dynamic,1): 0m54.256s

#pragma omp parallel for shared(H) private(Info) // schedule(dynamic,1), schedule(guided,2)

KMP_AFFINITY="verbose,compact" OMP_ACTIVE_LEVELS="2" OMP_NESTED="true" MKL_DYNAMIC="false"

[yhu5@snb04 MKL_forum]$ export MKL_NUM_THREADS=4
[yhu5@snb04 MKL_forum]$ export OMP_NUM_THREADS=4
[yhu5@snb04 MKL_forum]$ echo $MKL_NUM_THREADS
4
[yhu5@snb04 MKL_forum]$ echo $OMP_NUM_THREADS
4
[yhu5@snb04 MKL_forum]$ time ./example
real 0m55.362s
user 11m53.768s
sys 0m7.580s

Best Regards,

Ying
0 Kudos
Felix__K_
Beginner
1,774 Views

Hello Ying,

to clarify this: the results (1x8|none): 3min 49sec, (2x8|guided): 2min 52sec, (4x4|guided): 2min 29sec were not obtained by running example.cpp. They were obtained by running the original (larger) code on a different machine (the SUSE one). These results are correct.

I tested the example.cpp on the SUSE machine (see above)

(1x8|none): 1m 3sec, (1x16|none): 50sec, (2x8|guided): 53sec, (4x4|guided):52sec

as well as on the Red Hat machine (see above)

(1x8|none): 2m 48sec, (1x16|none): 3m 1sec, (2x8|guided):4m 42sec, (4x4|guided):3m 31sec

Did I misunderstand you?

Best regards,

Felix

0 Kudos
Ying_H_Intel
Employee
1,774 Views

Hi Felix,

Let's summarize: there are two test platforms, Red Hat and SUSE, and two test codes, the large problem (127 iterations) and example.cpp (30 iterations).

Case A: the Red Hat machine (32 packages x 4 cores/pkg x 1 thread/core, 128 total cores, MKL 10.2.3.029), where the job uses 4 Intel Xeon X5550 CPUs with 4 cores each.

large problem:

(1x8 | none): 8min 24sec, (2x8| dynamic,1): 1h 26min 15sec, (2x8| static,1): 1h 22min 26sec , (2x8| guided): 1h 25min 52sec

(1x4| none): 12min 25sec, (2x4| dynamic, 1): 8min 6sec (2x4| static, 1): 8min 5sec, (4x4| guided): 22min 30sec

example.cpp:

(1x8|none): 2m 48sec, (1x16|none): 3m 1sec, (2x8|guided):4m 42sec, (4x4|guided):3m 31sec

Case B: the SUSE machine (2 Intel Xeon E5-2680 @ 2.7GHz with 8 cores per processor, SUSE Linux Server 11 patch level 3, MKL version 11.2, compiler version icpc (ICC) 15.0.2).

large problem: (1x8|none): 3min 49sec , (2x8|guided): 2min 52sec, (4x4|guided): 2min 29sec

example.cpp: (1x8|none): 1m 3sec, (1x16|none): 50sec, (2x8|guided): 53sec, (4x4|guided):52sec

So the problem is on the Red Hat machine, while all results (both the large problem and example.cpp) on the SUSE machine are as expected, right?

If yes, please upgrade your MKL version on that Red Hat machine.

Another small issue on the Red Hat machine: you mentioned 4 Intel Xeon X5550 CPUs with 4 cores each, but KMP_AFFINITY shows 32 packages x 4 cores/pkg x 1 thread/core (128 total cores). So the 4 Intel Xeon X5550 CPUs are part of the 32 packages x 4 cores/pkg?

And you may need "compact" to keep the OpenMP threads from migrating across cores: export KMP_AFFINITY='verbose,compact' and see if anything changes.

Best Regards,

Ying

0 Kudos
Felix__K_
Beginner
1,573 Views

Hello Ying,

yes, you are right. The 4 Intel Xeon X5550 CPUs are part of the 32 packages x 4 cores/pkg. I managed to install a newer version of the Intel icc compiler as well as a newer version of MKL on the Red Hat machine. Here are the results with KMP_AFFINITY='compact' -- this should have the same effect as binding the threads explicitly to the cores, right?

RedHat 5.8, icpc 11.1, MKL 10.2.3, example.cpp: (1x8|none): 1m44sec, (1x16|none): 2m37sec, (2x8|guided):4m19sec, (4x4|guided):2m58sec

RedHat 5.8, icpc 14.0.4, MKL 11.1, example.cpp (first try): (1x8|none): 1m46,sec (1x16|none): 3m21sec , (2x8|guided): 5m8sec , (4x4|guided):3m32sec

So the problem still persists and does not seem to be an issue with the MKL or Intel compiler versions, but rather a problem with the underlying operating system or with the hardware design of the cluster, right? The hardware design is such that each circuit board holds 2 Intel Xeon X5550 CPUs, i.e. 8 cores and 26GB RAM. Hence, when running an application with more than 8 threads, some of the threads have to run on another circuit board, while the RAM they use might still be located on a different board; hence the overhead, I guess. Furthermore, I noticed a strong variation in execution times of about 40% for, e.g., the 2x8 case. This is strange, but maybe related to the kernel and to other jobs running in parallel, so the precise execution times should not be taken too seriously. But the overall message is clear, I think: newer versions of MKL and ICC do not solve the problem.

Best regards,

Felix

0 Kudos
Reply