I am trying to do a more detailed analysis of parallelism options and optimization in a dynamic application that uses PARDISO to solve a sparse linear system. This simple usage dates back more than a year. Now I am trying to find out whether setting the maximum number of threads dynamically can lower the total elapsed time, so I added the following statement:
nt = mkl_get_max_threads()
On the workstation used for development (Intel(R) Core(TM) i7-4810MQ CPU, 2.80 GHz), the function returns 4.
In the C++ Windows driver that launches the Fortran application with PARDISO, the system query:
numCPU = SysInfo.dwNumberOfProcessors ;
returns a value of 8 for numCPU.
Any explanation?
As you have a 4 core CPU with HyperThreading, MKL will default to use of 4 cores in the hope of maximizing performance. You could follow the usual path of trying MKL_DYNAMIC settings and the like to see if that default is right for your application.
Recent OpenMP versions provide functions to set the number of threads to the number of cores, as well as the OMP_PLACES=cores setting, so you could see whether that works best for you with PARDISO.
Various vendors make different choices on whether OMP_PLACES=cores or OMP_PLACES=threads is the default, or whether their OpenMP even supports those. You definitely need to watch this when trying to optimize number of threads, particularly if you observe that you don't get repeatable performance without these settings.
Thanks for your kind answer.
However, I hope to get an answer from the Intel people. The issue here is not OpenMP but MKL, which is an Intel product, so someone should explain why MKL answers 4 on a workstation with 8 CPUs. Task Manager shows 8 CPUs. In a test made about a month ago using auto-parallelization with the Intel Fortran compiler (described in the related forum), all 8 CPUs were running at 99% saturation, but with very poor results in terms of elapsed time to complete the job.
The Core i7-4810MQ processor has 4 physical cores. It supports HyperThreading, so when HyperThreading is enabled it appears to the operating system as 8 "logical processors". Some software may incorrectly refer to these as "cores", but that nomenclature is wrong and confusing. Reference: http://ark.intel.com/products/78937/Intel-Core-i7-4810MQ-Processor-6M-Cache-up-to-3_80-GHz
On most Intel processors, maximum performance for MKL routines is obtained with one thread per physical core. This will show up as 4 logical processors in use and 4 "idle", but for well-tuned code this approach can deliver the best performance. Using 1 logical processor per physical processor also results in less performance variability, because OS routines can run on the "idle" logical processors and cause less interference with the parallel job.
When running 1 thread per physical processor, it is important to make sure that the threads are actually running on different processors. This is typically controlled by the KMP_AFFINITY environment variable. If the software does not manage thread affinity properly, there is a pretty good chance that the OS will make a mistake and schedule two threads on one core and no threads on another.
is a longer Intel writeup on the subject of MKL choice of thread numbers under OpenMP.
We assume you are using the usual OpenMP based MKL rather than the TBB one.
Right, nt = mkl_get_max_threads() returning the number of physical cores is expected behavior. The main reason is given in the official MKL Developer Guide: https://software.intel.com/en-us/node/528551 => Using Intel® Hyper-Threading Technology
Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.
If you run with Intel HT Technology enabled, performance may be especially impacted if you run on fewer threads than physical cores. Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation.
Improving Performance with Threading
Intel® Math Kernel Library (Intel® MKL) is extensively parallelized. See OpenMP* Threaded Functions and Problems and Functions Threaded with Intel® Threading Building Blocks for lists of threaded functions and problems that can be threaded.
Intel MKL is thread-safe, which means that all Intel MKL functions (except the LAPACK deprecated routine ?lacon) work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access for multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Therefore, you can call Intel MKL from multiple threads and not worry about the function instances interfering with each other.
If you are using OpenMP* threading technology, you can use the environment variable OMP_NUM_THREADS to specify the number of threads or the equivalent OpenMP run-time function calls. Intel MKL also offers variables that are independent of OpenMP, such as MKL_NUM_THREADS, and equivalent Intel MKL functions for thread management. The Intel MKL variables are always inspected first, then the OpenMP variables are examined, and if neither is used, the OpenMP software chooses the default number of threads.
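The lookup order described in the paragraph above can be sketched in a few lines of C++. This is only an illustration of the ordering (MKL variable first, then the OpenMP variable, then the runtime default), not MKL's actual implementation. The names `Env`, `positive_setting` and `resolve_thread_count` are hypothetical, the environment is modelled as a plain map, and details such as MKL_DYNAMIC or comma-separated OMP_NUM_THREADS lists are deliberately ignored.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical stand-in for the process environment, so the ordering can be
// tried out without changing real settings.
using Env = std::map<std::string, std::string>;

// Return the positive integer stored under `name`, or 0 if the variable is
// missing or does not hold a plain positive integer.
int positive_setting(const Env& env, const std::string& name) {
    Env::const_iterator it = env.find(name);
    if (it == env.end()) return 0;
    const char* text = it->second.c_str();
    char* end = nullptr;
    long value = std::strtol(text, &end, 10);
    if (end == text || *end != '\0') return 0;    // empty or trailing junk
    if (value <= 0 || value > INT_MAX) return 0;  // not a usable thread count
    return static_cast<int>(value);
}

// Assumed ordering, taken from the quoted guide text: the MKL variable wins,
// then the OpenMP variable, then whatever default the runtime would choose.
int resolve_thread_count(const Env& env, int runtime_default) {
    assert(runtime_default > 0);
    if (int n = positive_setting(env, "MKL_NUM_THREADS")) return n;
    if (int n = positive_setting(env, "OMP_NUM_THREADS")) return n;
    return runtime_default;
}
```

With both variables set, the MKL one decides; with only OMP_NUM_THREADS set, that value is used; with neither, the caller-supplied default is returned.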
Thanks for all kind answers from which I can gather quite a lot of information.
However as a product developer I don't want to deal with processor details and do not want to optimize code with respect to one particular processor.
What I am pointing out is that the following answers are, in my opinion, not consistent.
The Intel spec sheet for the Core i7-4810MQ processor says: 4 cores, 8 threads.
The Windows operating system returns numCPU = 8.
So I would expect MKL_GET_MAX_THREADS to return 8.
Moreover, it seems that Intel auto-parallelization behaves differently from MKL when deciding the maximum number of threads to use.
Right, I think I understand your points.
Yes, you could say that MKL_GET_MAX_THREADS and Intel auto-parallelization (which is mainly based on the Intel OpenMP library) are not consistent. MKL_GET_MAX_THREADS is designed to return the number of threads that Intel MKL will use in its internal parallel regions, chosen for better performance for the reasons given above. In most cases, MKL controls all the threads it uses automatically; developers don't need to set or change them.
If you'd like to control them yourself, note that the Intel MKL threading controls take precedence over the OpenMP controls. By the way, you can use the function omp_get_max_threads() to get the thread count you expected (the number of threads reported by the OS).
For details, please see https://software.intel.com/en-us/node/583576
You should be able to get similar results to what MKL does by default by setting OMP_PLACES=cores (either by environment variable or function call) and calling omp_get_num_places() for the number of threads to be set. According to earlier posts on this subject, this may have been implemented in Intel 16.0.2 compilers and was announced for 16.0.3 (and 17.0). The reference Ying provided must have been for 16.0.0.
As the discussion showed, the case for making 1 thread per core the default is stronger in MKL, which emphasizes floating point performance, than for OpenMP in general, where there may be applications which don't use primarily optimized floating point code. According to the docs I've seen, only the IBM OpenMP library defaults to OMP_PLACES=cores along with the corresponding number of threads (apparently following the MKL precedent).
Not setting affinity at all is a typical OpenMP default, so as to support multiple uncoordinated applications running together on the same CPUs. MPI implementations for thread funneled OpenMP typically provide for affinity at 1 thread per core as a default.
The extremely poor performance you report for the case where your application was using all the logical threads could have been produced by competition from another application, or possibly by suffering from splitting cache or fill buffers among too many threads. MKL docs never claimed more than 15% performance gain from the MKL default settings.
Thanks for all the answers. I have only one last question for the Intel people.
It seems that MKL does not use operating system calls to get the number of cores, so where does MKL get the information that the Core i7-4810MQ processor has 4 cores? Is there a specific library for obtaining this detailed information?
We do use system calls, but we take the number of physical cores rather than the number of logical cores.
In detail: on Linux we get the information from syscalls and by reading /proc/cpuinfo; on Windows we use APIs provided by kernel32.dll, such as GetLogicalProcessorInformationEx, GetActiveProcessorCount, etc.
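For the Linux side, here is a rough C++ sketch of how physical cores can be told apart from logical processors using text in the /proc/cpuinfo format. It is not MKL's code. It assumes the x86 layout in which each "processor" record carries "physical id" and "core id" fields, and the names `CpuCounts` and `count_cpus` are made up for illustration.

```cpp
#include <cstdlib>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <utility>

struct CpuCounts {
    int logical = 0;   // number of "processor" records
    int physical = 0;  // number of distinct (physical id, core id) pairs
};

// Strip spaces, tabs and carriage returns from both ends of a string.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Count logical processors and physical cores in cpuinfo-style text.
// Assumption: x86 field names "processor", "physical id" and "core id".
CpuCounts count_cpus(std::istream& in) {
    CpuCounts counts;
    std::set<std::pair<int, int> > cores;
    int package = -1, core = -1;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;  // blank or unrelated line
        std::string key = trim(line.substr(0, colon));
        std::string value = trim(line.substr(colon + 1));
        if (key == "processor") {
            // A new record starts: store the core seen in the previous one.
            if (package >= 0 && core >= 0) cores.insert(std::make_pair(package, core));
            package = core = -1;
            ++counts.logical;
        } else if (key == "physical id") {
            package = std::atoi(value.c_str());
        } else if (key == "core id") {
            core = std::atoi(value.c_str());
        }
    }
    if (package >= 0 && core >= 0) cores.insert(std::make_pair(package, core));
    counts.physical = static_cast<int>(cores.size());
    return counts;
}
```

Passing a `std::ifstream` opened on /proc/cpuinfo to `count_cpus` on a machine like the one discussed here (4 cores with HyperThreading enabled) would be expected to show more logical processors than physical cores. On Windows the equivalent information would come from the kernel32.dll calls mentioned above, which this sketch does not cover.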