LAPACK memory problem

Zhaoqiang_B_ · 02-05-2016

Hi all,

We have a small workstation equipped with four 16-core AMD Opteron 6376 processors running at 2.3 GHz, for a total of 64 cores and 256 GB of memory. While testing the Intel MKL package, we ran into a problem. When we submit a single job (requiring one core) that is compiled with ifort and calls MKL LAPACK, it runs much faster than a similar program compiled with gfortran and calling the open-source LAPACK. However, when we submit four copies of the same program simultaneously (each requiring one core, four cores in total), the speed of each job drops to about 1/3. The jobs compiled with gfortran and calling the open-source LAPACK do not have this problem.

I heard from others that this may be due to some memory consumption problem. Could anyone suggest what exactly the problem is? Thanks in advance.

baizq

TimP · 02-05-2016

If you run multiple MKL jobs simultaneously, with each one set to use all the "cores" (possibly meaning all the supported hardware thread contexts), you will surely run into issues with cache. Consider running each one in a separate shell, setting a number of threads appropriate to a single CPU, with the affinity set, e.g. by OMP_PROC_BIND, to the thread context numbers of a single CPU. Unless you have an appropriate resource manager, this means the submitters of each task will need to agree on which CPU each one uses.
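A minimal sketch of that setup on Linux. It assumes (this is not confirmed anywhere in the thread) that logical CPUs are numbered contiguously per socket, i.e. 0-15 on the first CPU, 16-31 on the second, and so on; check with lscpu before relying on it. The executable name ./a.out and the helper cpu_range are placeholders. The loop only prints the commands it would run.

```shell
# Assumptions, not from the thread: 4 sockets x 16 cores, with logical CPUs
# numbered contiguously per socket (0-15, 16-31, ...). Verify with lscpu.
CORES_PER_SOCKET=16
PROGRAM=./a.out                            # placeholder executable name

export OMP_NUM_THREADS=$CORES_PER_SOCKET   # threads per job; 8 is worth trying too
export OMP_PROC_BIND=true                  # threads are not migrated between places

# Print the logical CPU range of one socket, e.g. socket 1 -> 16-31.
cpu_range() {
    first=$(( $1 * CORES_PER_SOCKET ))
    last=$(( first + CORES_PER_SOCKET - 1 ))
    echo "${first}-${last}"
}

# One job per socket. Remove "echo" (and append "&" to run the jobs
# concurrently) once the CPU ranges have been checked.
for socket in 0 1 2 3; do
    echo taskset -c "$(cpu_range "$socket")" "$PROGRAM"
done
```

With taskset restricting each job to one socket's CPUs, the submitters only need to agree on which socket index each job takes.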

According to my limited knowledge of current-style AMD CPUs, you might want 16 threads per CPU if running single precision, or 8 if double.

You didn't say whether you run Linux or Windows, nor whether you are comparing single-threaded and multi-threaded cases. On Linux, there are more options to accomplish this, such as submitting the tasks under taskset. Typical Linux distributions of LAPACK are not well optimized, even relative to the capabilities of current gfortran (as well as being single threaded), so it would be rather embarrassing if you can't get better performance by appropriate use of Intel software capabilities.

If you have such huge problems that each needs more than 64 GB, you would expect running them simultaneously to be inefficient.

Zhaoqiang_B_ · 02-09-2016

Hi Tim,

Thank you for your reply.

As for your question, I am running the job on Linux (CentOS 6.5). It is not quite clear to me how to set up the single-threaded and multi-threaded cases. This is how we compile and link our source code:

ifort test.f90 -I/opt/intel/mkl/include/ -L/opt/intel/mkl/lib/intel64_lin -L -static-intel -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_lapack95_lp64 -liomp5 -lpthread

Here is the information on our CPUs. The number of threads is 16. http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206376%20-%20OS6376WKTGGHK.html

I followed your suggestion and set the environment variable OMP_PROC_BIND to 8, but it did not fix the problem. The time consumed is the same. I also got the warning message "OMP: Warning #42: OMP_PROC_BIND: "8" is an invalid value; ignored".

Could you please help me look into the problem, or suggest some resources where I can read and learn about multi-threading in the MKL package? Thank you in advance.
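For reference: OMP_PROC_BIND takes a binding policy (true or false, plus master, close and spread in OpenMP 4.0), not a thread count, which is why "8" was rejected. The number of threads is set separately, through OMP_NUM_THREADS or the MKL-specific MKL_NUM_THREADS. A minimal sketch, assuming one thread per job is what is wanted here:

```shell
# Thread count: one thread per job, matching "each requiring one core".
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1    # MKL-specific; takes precedence over OMP_NUM_THREADS inside MKL

# Binding policy: a keyword, not a number.
export OMP_PROC_BIND=true

echo "threads=$MKL_NUM_THREADS bind=$OMP_PROC_BIND"
```

These variables must be set in the shell (or job script) that launches the program, before it starts.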

baizq

Ying_H_Intel · 02-13-2016

Hi baizq,

Which LAPACK function are you calling?

1) Threads, and "the speed was lowered to about 1/3 for each of the jobs"

Generally speaking, the link line -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_lapack95_lp64 -liomp5 -lpthread invokes MKL's internal OpenMP threading. This means that ./a.out, on a workstation with four 16-core AMD Opteron 6376 processors and without any core affinity setting, will run with 64 threads.
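If each job is really meant to use a single core, one option is to link MKL's sequential layer instead of the threaded one, so that no OpenMP threads are created at all. The sketch below takes the link line from the earlier post and swaps -lmkl_intel_thread and -liomp5 for -lmkl_sequential. The install path and test.f90 come from that post and may differ on other installations; the variable names MKL_DIR and MKL_LIBS are placeholders. The compile step is skipped when ifort or the source file is not available.

```shell
MKL_DIR=/opt/intel/mkl    # install path as given earlier in the thread

# Sequential MKL: no OpenMP runtime, so each process stays single-threaded.
MKL_LIBS="-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm"

if command -v ifort >/dev/null 2>&1 && [ -f test.f90 ]; then
    # MKL_LIBS is left unquoted on purpose so it splits into separate flags.
    ifort test.f90 -I"$MKL_DIR/include" -L"$MKL_DIR/lib/intel64_lin" $MKL_LIBS
else
    echo "would run: ifort test.f90 -I$MKL_DIR/include -L$MKL_DIR/lib/intel64_lin $MKL_LIBS"
fi
```

Keeping the threaded link line and setting MKL_NUM_THREADS=1 at run time should have a similar effect without relinking.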

Could you please run "export KMP_AFFINITY=verbose" and let us know the output for a single job and for four jobs, respectively?

(I guess that when you run four jobs simultaneously, each job may start 64 threads and oversubscribe the cores, so the speed drops to about 1/3 for each job. But it depends on the second question.)
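One way to collect that output for both cases is sketched below. The log file names are made up for illustration, and ./a.out stands for the compiled program; both stdout and stderr are captured so the affinity report is not lost. Nothing is run if the executable is missing.

```shell
export KMP_AFFINITY=verbose   # Intel OpenMP runtime reports its thread/core mapping
PROGRAM=./a.out               # placeholder executable name

if [ -x "$PROGRAM" ]; then
    # Case 1: a single job.
    "$PROGRAM" > single.log 2>&1

    # Case 2: four jobs started at the same time.
    for i in 1 2 3 4; do
        "$PROGRAM" > "multi_$i.log" 2>&1 &
    done
    wait    # block until all four background jobs have finished
else
    echo "$PROGRAM not found; nothing was run"
fi
```

Comparing single.log with the multi_*.log files should show how many threads each process creates and which cores they are allowed to run on.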

2) You mentioned, "when we submitted four of this same program (each requiring one core, totally four cores) simultaneously".

Could you please describe the details, such as how you bind one core to each a.out? As I understand it, you may want to run four jobs on four processors (16 cores each)?

There is some discussion about CPU usage at https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789

Best Regards,

Ying