What is the exact reason why

LRaim · ‎03-21-2017

Some time ago I opened a similar issue with Premier Support. That one concerned a difference between MKL and some OS subroutines.
Now in the MKL documentation I can find:
___________________________________________
mkl_get_max_threads
Gets the number of OpenMP* threads targeted for parallelism
____________________________________________________
The following piece of code:
================================================
      NMAXTH = OMP_GET_MAX_THREADS()
!
      NPROC = OMP_GET_NUM_PROCS()
!
      ITH = OMP_GET_THREAD_NUM()
!
!     MX_THREADS = NMAXTH
      NMAXTH0 = NMAXTH
!
      NTMKL = MKL_GET_MAX_THREADS()
======================================================
gives:
NMAXTH=8, NPROC = 8, NTMKL=4.
The workstation is an Intel Core i7-4810MQ CPU at 2.80 GHz.
Running compiler:
Intel® Parallel Studio XE 2016 Update 4 Composer Edition for Fortran Windows* Integration for Microsoft Visual Studio* 2015, Version 16.0.0063.14, Copyright © 2002-2016 Intel Corporation. All rights reserved.
Intel should clarify why these values differ.

Steve_Lionel · ‎03-21-2017

This would be better asked in the MKL forum.

SergeyKostrov · ‎03-21-2017

Specs for your CPU are at: http://ark.intel.com/products/78937/Intel-Core-i7-4810MQ-Processor-6M-Cache-up-to-3_80-GHz

# of Cores = 4, reported by MKL ( MKL_GET_MAX_THREADS )
Note: Cores is a hardware term that describes the number of independent central processing units in a single computing component (die or chip).

# of Threads = 8, reported by OpenMP ( OMP_GET_MAX_THREADS )
Note: A Thread, or thread of execution, is a software term for the basic ordered sequence of instructions that can be passed through or processed by a single CPU core.

My understanding is that Intel never claimed that MKL_GET_MAX_THREADS and OMP_GET_MAX_THREADS should return the same values. What value should you use to get the most out of parallelization with OpenMP or MKL? The answer is 4, because the Intel Core i7-4810MQ processor has 4 cores.

Jing_Xu · ‎03-22-2017

mkl_get_max_threads returns the number of OpenMP threads for Intel MKL to use in internal parallel regions. This number depends on whether dynamic adjustment of the number of threads by Intel MKL is disabled (by an environment setting or in a function call):

If the dynamic adjustment is disabled, the function inspects the environment settings and return values of the function calls below in the order they are listed until it finds a non-zero value:
- A call to mkl_set_num_threads_local
- The last of the calls to mkl_set_num_threads or mkl_domain_set_num_threads( …, MKL_DOMAIN_ALL)
- The MKL_DOMAIN_NUM_THREADS environment variable with the MKL_DOMAIN_ALL tag
- The MKL_NUM_THREADS environment variable
- A call to omp_set_num_threads
- The OMP_NUM_THREADS environment variable
If the dynamic adjustment is enabled, the function returns the number of physical cores on your system.

The number of threads returned by this function is a hint, and Intel MKL may actually use a different number.
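To see this on your own machine, here is a minimal sketch that repeats the queries from the original post and then disables dynamic adjustment before asking for one thread per logical processor. It assumes the program is built with OpenMP and MKL enabled (for example /Qopenmp /Qmkl with Intel Fortran on Windows) and that the MKL include directory providing mkl_service.fi is on the include path. The program and variable names are only illustrative.

```fortran
      program mkl_thread_query
      use omp_lib
      implicit none
!     Interfaces for the MKL support functions (assumed on the path)
      include 'mkl_service.fi'
      integer :: nmaxth, nproc, ntmkl

      nmaxth = omp_get_max_threads()
      nproc  = omp_get_num_procs()
      ntmkl  = mkl_get_max_threads()
      print *, 'OpenMP max threads  :', nmaxth
      print *, 'Logical processors  :', nproc
      print *, 'MKL max threads     :', ntmkl

!     Turn off dynamic adjustment, then request one thread per
!     logical processor. MKL treats the request as a hint only.
      call mkl_set_dynamic(0)
      call mkl_set_num_threads(nproc)
      ntmkl = mkl_get_max_threads()
      print *, 'MKL max after request:', ntmkl
      end program mkl_thread_query
```

On the i7-4810MQ from the original post, the first three lines should match the reported 8, 8 and 4. The last value depends on the environment settings and on what MKL decides to honor.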

Reference:

https://software.intel.com/en-us/node/471142

SergeyKostrov · ‎03-24-2017

>>...The number of threads returned by this function is a hint, and Intel MKL may actually use a different number.

By default, the number of threads used by MKL is equal to the number of cores.

Gregg_S_Intel · ‎03-31-2017

Sergey Kostrov wrote:

>>...The number of threads returned by this function is a hint, and Intel MKL may actually use a different number.

By default, the number of threads used by MKL is equal to the number of cores.

Yes, although I am hoping this will change soon for Intel Xeon Phi x200 processors, where best performance may often be 2 threads per core.

Mikhail_K_ · ‎04-02-2017

What is the exact reason why this cannot be fixed to what everyone would obviously expect -- use the full power of your CPU/all threads by default, independently of threads per core?

This is the most tired MKL problem

Gregg_S_Intel · ‎04-11-2017

MKL performance is usually best with 1 hardware thread per core. For many HPC kernels and applications it is best not to use all available hardware threads.

Mikhail_K_ · ‎04-11-2017

For memory-bound functions like vdMul, I double the performance by setting the number of threads manually.

SergeyKostrov · ‎04-13-2017

>>...What is the exact reason why this cannot be fixed to what everyone would obviously expect -- use full power of your CPU/all threads by default? Independently of threads per core?

Mikhail, that default "problem" is easy to work around: a different number of threads for MKL can be set with the mkl_set_num_threads function.

>>...What is the exact reason why this cannot be fixed...

Access to the L2 cache is a primary reason, because each core's L2 is shared between the hardware threads running on that core. If you look at the specs of any Intel CPU you will see something like:
...
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
...

Mikhail_K_ · ‎04-17-2017

For VM functions I don't think there is any cache reuse...

Is it not possible to differentiate between different parallelization regimes, one for BLAS functions and something else for the VM/VSL pack?

I turned off hyper-threading for now...

"Intel® Math Kernel Library (Intel® MKL) accelerates math processing routines that increase application performance and reduce development time."

Seriously, VM functions running at 50% capacity on an Intel processor with hyper-threading (Intel's invention)?

If there is one company that should be able to solve this, it is definitely Intel.
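On the question of separate regimes for BLAS and VM: the MKL threading controls include a per-domain call, mkl_domain_set_num_threads, with domain constants such as MKL_DOMAIN_BLAS and MKL_DOMAIN_VML. The sketch below is one possible way to raise the thread count for VM only while leaving BLAS at its default. It assumes mkl_service.fi is on the include path and that one thread per logical processor is what you want for VM. Whether that actually helps a given VM function has to be measured.

```fortran
      program mkl_domain_threads
      use omp_lib
      implicit none
!     Interfaces and MKL_DOMAIN_* constants (assumed on the path)
      include 'mkl_service.fi'
      integer :: nlogical, ierr, nblas, nvml

      nlogical = omp_get_num_procs()

!     Disable dynamic adjustment so the request is not overridden
      call mkl_set_dynamic(0)

!     Request all logical processors for the VM domain only.
!     The function returns 1 on success and 0 on failure.
      ierr = mkl_domain_set_num_threads(nlogical, MKL_DOMAIN_VML)
      if (ierr .eq. 0) print *, 'VM thread request was rejected'

      nblas = mkl_domain_get_max_threads(MKL_DOMAIN_BLAS)
      nvml  = mkl_domain_get_max_threads(MKL_DOMAIN_VML)
      print *, 'BLAS max threads:', nblas
      print *, 'VM max threads  :', nvml
      end program mkl_domain_threads
```

The same split can also be requested without code changes through the MKL_DOMAIN_NUM_THREADS environment variable mentioned earlier in the thread.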

TimP · ‎04-17-2017

Intel never intended hyperthreading to give a major boost to floating point applications with normal cache locality. Intel tried a slightly different tack with the MIC KNC but the current KNL returns to the favoritism for 1 thread per core with MKL. There have also been CPUs without HT, but Intel didn't consider the benefit large enough to continue (if you consider that a solution).

If you read these forums carefully, you will see reports about how hyperthreading gives a small benefit in many cases even though just 1 thread per core is active at the MKL level.

You may have a point that specialized applications which use VML may have low cache locality, but certainly not all of them would be that way.

maximum no of threads from OMP and MKL