Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

maximum no of threads from OMP and MKL

LRaim
New Contributor I
1,849 Views

Some time ago I opened a similar issue with Premier Support. In that case the difference was between MKL and some OS subroutines.
Now in the MKL documentation I can find:
___________________________________________
mkl_get_max_threads
Gets the number of OpenMP* threads targeted for parallelism
____________________________________________________
The following piece of code:
================================================
      NMAXTH = OMP_GET_MAX_THREADS()
!
      NPROC = OMP_GET_NUM_PROCS()
!
      ITH = OMP_GET_THREAD_NUM()
!
!     MX_THREADS = NMAXTH
      NMAXTH0 = NMAXTH
!
      NTMKL = MKL_GET_MAX_THREADS()
======================================================
gives:
NMAXTH = 8, NPROC = 8, NTMKL = 4.
The workstation is: Intel Core i7-4810MQ CPU, 2.80 GHz.
Compiler:
Intel® Parallel Studio XE 2016 Update 4 Composer Edition for Fortran Windows* Integration for Microsoft Visual Studio* 2015, Version 16.0.0063.14, Copyright © 2002-2016 Intel Corporation. All rights reserved.
Intel should clarify the difference.

 

 

 

11 Replies
Steve_Lionel
Honored Contributor III

This would be better asked in the MKL forum.

SergeyKostrov
Valued Contributor II

Specs for your CPU are at:
http://ark.intel.com/products/78937/Intel-Core-i7-4810MQ-Processor-6M-Cache-up-to-3_80-GHz

# of Cores = 4 - reported by MKL ( MKL_GET_MAX_THREADS )
Note: Cores is a hardware term that describes the number of independent central processing units in a single computing component (die or chip).

# of Threads = 8 - reported by OpenMP ( OMP_GET_MAX_THREADS )
Note: A Thread, or thread of execution, is a software term for the basic ordered sequence of instructions that can be passed through or processed by a single CPU core.

My understanding is that Intel never claimed that MKL_GET_MAX_THREADS and OMP_GET_MAX_THREADS should return the same values. What value would I use to get maximum from parallelization using OpenMP or MKL? The answer is 4 because Intel Core i7 4810MQ processor has 4 cores.

Jing_Xu
Employee

mkl_get_max_threads returns the number of OpenMP threads for Intel MKL to use in internal parallel regions. This number depends on whether dynamic adjustment of the number of threads by Intel MKL is disabled (by an environment setting or in a function call):

  • If the dynamic adjustment is disabled, the function inspects the environment settings and return values of the function calls below in the order they are listed until it finds a non-zero value:

  • If the dynamic adjustment is enabled, the function returns the number of physical cores on your system.

The number of threads returned by this function is a hint, and Intel MKL may actually use a different number.

Reference:

https://software.intel.com/en-us/node/471142
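
To see which case applies on a given machine, something like the following minimal sketch can print the OpenMP and MKL numbers side by side, together with MKL's dynamic-adjustment flag. The program name and variable names are made up for illustration, and the MKL service functions are declared by hand here as an assumption that the default (LP64) integer interface is in use. It needs to be built with OpenMP enabled and linked against MKL.

```fortran
      program query_threads
      use omp_lib
      implicit none
!     MKL service functions, declared by hand for this sketch.
      integer, external :: mkl_get_max_threads, mkl_get_dynamic
      integer :: nmaxth, nproc, ntmkl, idyn
!
!     OpenMP view: default team size and logical processor count.
      nmaxth = omp_get_max_threads()
      nproc = omp_get_num_procs()
!
!     MKL view: thread-count hint and dynamic-adjustment flag
!     (non-zero means dynamic adjustment is enabled).
      ntmkl = mkl_get_max_threads()
      idyn = mkl_get_dynamic()
!
      print *, 'OMP max threads =', nmaxth
      print *, 'OMP num procs   =', nproc
      print *, 'MKL max threads =', ntmkl
      print *, 'MKL dynamic     =', idyn
      end program query_threads
```

If the dynamic flag comes back non-zero, the MKL number would be expected to follow the physical core count rather than the OpenMP value, as described above.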

SergeyKostrov
Valued Contributor II
>>...The number of threads returned by this function is a hint, and Intel MKL may actually use a different number.

By default, the number of threads used by MKL is equal to the number of cores.

Gregg_S_Intel
Employee

Sergey Kostrov wrote:

>>...The number of threads returned by this function is a hint, and Intel MKL may actually use a different number.

By default, the number of threads used by MKL is equal to the number of cores.

Yes, although I am hoping this will change soon for Intel Xeon Phi x200 processors, where best performance may often be 2 threads per core.

Mikhail_K_
Beginner

What is the exact reason why this cannot be fixed to what everyone would obviously expect -- use full power of your CPU/all threads by default? Independently of threads per core?

This is the most tiresome MKL problem.

Gregg_S_Intel
Employee

MKL performance is usually best with 1 hardware thread per core.  For many HPC kernels and applications it is best not to use all available hardware threads.

Mikhail_K_
Beginner

For memory-bound functions like vdMul, I double the performance by setting the number of threads manually.

SergeyKostrov
Valued Contributor II
>>...What is the exact reason why this cannot be fixed to what everyone would obviously expect -- use full power of your CPU/all
>>threads by default? Independently of threads per core?

Mikhail,

That default "problem" can be easily fixed: a different number of threads for MKL can be set with the mkl_set_num_threads function.

>>...What is the exact reason why this cannot be fixed...

Access to the L2 cache is a primary reason, because each core's L2 is shared between the hardware threads running on that core. If you look at the specs of any Intel CPU you will see something like:
...
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
...
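
A minimal sketch of the mkl_set_num_threads approach, under some assumptions: the value 8 is just the logical processor count of the CPU in this thread, the program name is made up, and the MKL service routines are declared by hand for the default (LP64) interface. Dynamic adjustment is switched off first because, as far as I understand the documentation, MKL may otherwise scale a request above the physical core count back down.

```fortran
      program set_mkl_threads
      implicit none
!     MKL service routines, declared by hand for this sketch.
      integer, external :: mkl_get_max_threads
      external :: mkl_set_dynamic, mkl_set_num_threads
      integer :: nt
!
!     Assumed value: 8 logical processors on this CPU.
      nt = 8
!
!     Turn off dynamic adjustment, then request nt threads.
      call mkl_set_dynamic(0)
      call mkl_set_num_threads(nt)
!
!     Report the hint MKL gives after the request.
      print *, 'MKL max threads now =', mkl_get_max_threads()
      end program set_mkl_threads
```

The same request can also be made without recompiling through the MKL_DYNAMIC and MKL_NUM_THREADS environment variables. Whether the extra threads help is something to measure for each routine.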

Mikhail_K_
Beginner

For VM functions I don't think there is any cache reuse...

Is it not possible to differentiate between different parallelization regimes, one for BLAS functions and something else for the VM/VSL pack?

I turned off hyper-threading for now...

 

"Intel® Math Kernel Library (Intel® MKL) accelerates math processing routines that increase application performance and reduce development time."

Seriously, VM functions running at 50% capacity on an Intel processor with hyper-threading (Intel's invention)?

If there is one company that should be able to solve this, it is definitely Intel.

TimP
Honored Contributor III

Mikhail Kovalev wrote:

For VM functions I don't think there is any cache reuse...

Is it not possible to differentiate between different parallelization regimes, one for BLAS functions and something else for the VM/VSL pack?

I turned off hyper-threading for now...

 

"Intel® Math Kernel Library (Intel® MKL) accelerates math processing routines that increase application performance and reduce development time."

Seriously, VM functions running at 50% capacity on an Intel processor with hyper-threading (Intel's invention)?

If there is one company that should be able to solve this, it is definitely Intel.

Intel never intended hyperthreading to give a major boost to floating-point applications with normal cache locality. Intel tried a slightly different tack with the MIC KNC, but the current KNL returns to favoring 1 thread per core with MKL. There have also been CPUs without HT, but Intel didn't consider the benefit large enough to continue (if you consider that a solution).

If you read these forums carefully, you will see reports about how hyperthreading gives a small benefit in many cases even though just 1 thread per core is active at the MKL level.

You may have a point that specialized applications which use VML may have low cache locality, but certainly not all of them would be that way.
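
On the question of separate settings for BLAS and VM: MKL does have a per-domain thread control, mkl_domain_set_num_threads, which might be one way to try more threads for VML only while leaving BLAS at its default. The following is only a sketch under assumptions: the interfaces and the MKL_DOMAIN_VML constant are taken from MKL's mkl.fi include file, the value 8 is the logical processor count of the CPU discussed here, and the program and variable names are made up.

```fortran
      program vml_threads
      implicit none
!     Interfaces and MKL_DOMAIN_* constants (assumed: mkl.fi).
      include 'mkl.fi'
      integer :: ierr, nvml
!
!     Turn off dynamic adjustment so the request is not scaled back.
      call mkl_set_dynamic(0)
!
!     Request 8 threads (assumed value) for the VML domain only.
      ierr = mkl_domain_set_num_threads(8, MKL_DOMAIN_VML)
      if (ierr .ne. 1) print *, 'thread request was not accepted'
!
!     Report the thread-count hint for the VML domain.
      nvml = mkl_domain_get_max_threads(MKL_DOMAIN_VML)
      print *, 'VML max threads =', nvml
      end program vml_threads
```

Whether this actually recovers the performance reported for vdMul would need to be timed on the machine in question.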
