mkl running slow on multi-cpu server

Jason_Jones · 09-29-2011

I have an application that uses quite a few MKL functions. It runs VERY fast on most machines in most situations. However, I have run into a couple of situations where it runs 5 to 20 TIMES slower. This happens when a newer machine (I. a single-CPU quad-core Core i5-2500K Sandy Bridge, or II. a two-CPU, twelve-core machine with Xeon X5680 CPUs) is running multiple jobs.

The original code is quite complicated, but I have reproduced it with only a couple of calls to dgemm in a loop. The data size is somewhat large ( around 4 million elements per matrix ).

On an older ( just over 1 year old ) i7 laptop, I see only very minor slowdowns with many jobs running -- maybe 1-2%.

I assume I need to look into pinning my process to specific CPUs/threads by setting its affinity.

1) Does this seem a reasonable working diagnosis?
2) If this does seem reasonable, are there function calls for this to add to the program? Or does this need to be handled by system utilities? We run on Linux and Windows (and are seeing the problem on both OSes).

Thank you to the community for taking a look at my message!

Gennady_F_Intel · 09-29-2011

Jason,

What MKL version are you using when running on a newer machine?

>>The data size is somewhat large ( around 4 million elements per matrix ).

Do I understand correctly that the size of the matrix would be ~2000x2000?

TimP · 09-29-2011

This is possible. If you wish to run multiple jobs together efficiently on a multi-CPU machine, you should look into running each in a separate environment window, with KMP_AFFINITY and OMP_NUM_THREADS settings that keep each job on its own CPU, in order to avoid thrashing cache.
You could handle it by scripting to set the environment variables for each job, or by omp_set_num_threads and putenv function calls from the application.
Once you use affinity settings to pin a job to certain cores, you must avoid pinning another application to the same cores, as you have defeated the efforts the OS makes to allocate available cores. The job may be more complicated when you run with Hyper-Threading enabled.
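
As a rough illustration of the scripting route, here is a minimal Python sketch of a launcher that gives each job its own block of cores. It assumes the jobs use the Intel OpenMP runtime (so KMP_AFFINITY applies) and that logical CPU ids are numbered contiguously per socket, which is not guaranteed and should be checked on the actual machine. The function names and the `./my_job` executable are hypothetical.

```python
import os
import subprocess


def job_environment(job_index, cores_per_job):
    """Build an environment that keeps one job on its own block of cores.

    Assumption: cores job_index*cores_per_job .. +cores_per_job-1 belong
    together (for example, one socket). Verify the numbering on your machine.
    """
    first = job_index * cores_per_job
    cores = range(first, first + cores_per_job)
    env = dict(os.environ)  # copy, so the launcher's own environment is untouched
    env["OMP_NUM_THREADS"] = str(cores_per_job)
    env["KMP_AFFINITY"] = (
        "granularity=fine,proclist=[%s],explicit" % ",".join(map(str, cores))
    )
    return env


def launch_job(command, job_index, cores_per_job):
    """Start one job with its per-job environment; returns the Popen handle."""
    return subprocess.Popen(command, env=job_environment(job_index, cores_per_job))


# Example (hypothetical executable name), two 6-core jobs on a 12-core machine:
#   jobs = [launch_job(["./my_job"], i, 6) for i in range(2)]
#   for job in jobs:
#       job.wait()
```

Because each job gets a disjoint core list, no two jobs launched this way are pinned to the same cores, which is the condition described above.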

Jason_Jones · 10-03-2011

Gennady: I am running version 10.2.6 of MKL. Yes, the data size is around 2000 x 2000. This is for my stripped down example. Other jobs with which I am seeing slowdowns are of varying sizes, 200 x 5, 200 x 200, 1000 x 10, etc.

On Linux we compile with GCC (for now; we are testing icc and hope to find improvements with it) and link to libgomp. Is KMP_AFFINITY still the correct environment variable? Or do I need to use the GCC versions of these?

Also, is this the standard way to handle this? Or should my application be pinning itself to cores internally? I ask because the guys sending the jobs out are not savvy in low-level computer issues, and we do not want them to have to deal with setting various environment variables for each job that they queue up.

Thanks again!
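
One way to keep this away from the people queuing jobs is a small wrapper that pins itself and then starts the real program. The following is only a sketch under stated assumptions: it is Linux-only (`os.sched_setaffinity` is not available on Windows, where something like the Win32 `SetProcessAffinityMask` call would be needed instead), the job index is assumed to come from the queuing system, and `cpus_for_job`, `pin_and_exec`, and `./my_job` are hypothetical names.

```python
import os


def cpus_for_job(job_index, cores_per_job, available):
    """Return the job_index-th block of cores_per_job CPU ids from `available`."""
    ordered = sorted(available)
    start = job_index * cores_per_job
    block = ordered[start:start + cores_per_job]
    if len(block) < cores_per_job:
        raise ValueError("not enough CPUs left for job %d" % job_index)
    return set(block)


def pin_and_exec(job_index, cores_per_job, argv):
    """Pin this process to its CPU block, then replace it with the real job."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is not available on this platform")
    cpus = cpus_for_job(job_index, cores_per_job, os.sched_getaffinity(0))
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    os.environ["OMP_NUM_THREADS"] = str(len(cpus))
    # The affinity mask and the environment carry over into the new program.
    os.execvp(argv[0], argv)


# Example (hypothetical executable name): second job, 6 cores per job
#   pin_and_exec(1, 6, ["./my_job", "input.dat"])
```

The block selection takes consecutive CPU ids, which may or may not correspond to one physical CPU depending on how the OS numbers cores and hyperthreads.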

TimP · 10-03-2011

When running libmkl_intel_thread and Intel libiomp5, the GOMP_CPU_AFFINITY environment variable is recognized and translated to the equivalent KMP_AFFINITY, so you can take your pick. If you are running libmkl_gnu_thread and libgomp, you would use GOMP_CPU_AFFINITY.
libiomp5 supports the gnu OpenMP function calls; in most cases the performance of libiomp5 and current libgomp is similar, although there are outliers where "your mileage may vary."
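
To make the choice of variable concrete, here is a small Python sketch that a job-launching script might use to pick the affinity variable for a given OpenMP runtime. The function name and the runtime labels are hypothetical; the value formats shown (a space-separated CPU list for GOMP_CPU_AFFINITY, an explicit proclist for KMP_AFFINITY) are common forms of those settings, not the only ones.

```python
def affinity_setting(openmp_runtime, cpus):
    """Return {variable: value} pinning threads to `cpus` for the given runtime."""
    ids = [str(c) for c in sorted(cpus)]
    if openmp_runtime == "libgomp":
        # GNU runtime: a space-separated list of CPU ids.
        return {"GOMP_CPU_AFFINITY": " ".join(ids)}
    if openmp_runtime == "libiomp5":
        # Intel runtime: an explicit processor list.
        return {
            "KMP_AFFINITY": "granularity=fine,proclist=[%s],explicit" % ",".join(ids)
        }
    raise ValueError("unknown OpenMP runtime: %r" % openmp_runtime)


# Example: settings for a GCC/libgomp build pinned to CPUs 0 through 3
#   affinity_setting("libgomp", [0, 1, 2, 3])
```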