running MKL in multithreaded functions

zhangj5 · 06-08-2011

Our multithreaded program uses the cvm library, which is built on Intel MKL. We observed that as we increased the number of running threads, the program spent more time locking and unlocking. We use pthreads, not OpenMP, and we found the hot spots with VTune.

The test machine has 8 cores. Our program uses 1 thread for main(), 1 loading thread, and several computation threads that call MKL. We got the best performance with 4 computation threads. Once we went above 4 computation threads, performance began to degrade and "__lll_unlock_wake" and "__lll_lock_wait" became the top 2 hot spots. The data passed to the computation threads were independent of each other.

We also tried a dummy thread function in place of MKL that just calls sqrt() in a loop, to make sure the computation ran long enough. With the dummy function we did not see "__lll_unlock_wake" or "__lll_lock_wait". The author of the cvm library told us that the locking/unlocking calls come from MKL.

Why does performance degrade before the number of logical threads reaches the number of physical cores? How does MKL use locking/unlocking in multithreaded functions? Is there any limitation on the number of threads and/or cache? How can we optimize the MKL configuration for a multithreaded program?

Thank you very much for your help.

TimP · 06-08-2011

If you are running individual pthreads on a majority of the cores, threaded MKL is unlikely to run as well as the "sequential" version. If you do have enough cores available to accommodate both your pthreads and the MKL threads, you will likely run into random affinity conflicts and so continue to see excessive lock activity. I've been told that you should be able to bring your Linux pthreads under the control of KMP_AFFINITY, e.g. by setting up a tiny OpenMP parallel region and calling omp_get_num_threads. This would enable you to spread out your pthreads and MKL OpenMP threads.
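A minimal sketch of the "tiny OpenMP parallel region" idea, assuming each pthread worker calls it once before its first MKL call. The helper name touch_openmp_runtime is hypothetical, and whether this actually places the calling pthread under KMP_AFFINITY depends on the OpenMP runtime in use (KMP_AFFINITY is read by the Intel OpenMP runtime). The code falls back to doing nothing when compiled without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not part of MKL or cvm): run one tiny parallel
 * region so the OpenMP runtime initializes in the calling thread.
 * Returns the team size seen inside the region, or 1 when the code is
 * compiled without OpenMP. */
int touch_openmp_runtime(void)
{
    int team_size = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Exactly one thread of the team records the team size. */
        #pragma omp single
        team_size = omp_get_num_threads();
    }
#endif
    return team_size;
}
```

A worker would call touch_openmp_runtime() at the top of its thread function, with KMP_AFFINITY set in the environment before the program starts.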

zhangj5 · 06-09-2011

Thank you very much for your response. I have tried many combinations of those parameters, but I have not seen any improvement in performance. Could you please send me an example of how to avoid random affinity conflicts?

Thank you.
Jason
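One way to rule out random placement on Linux, independent of KMP_AFFINITY, is to pin each computation thread to its own core explicitly. The following is a sketch under that assumption, not something proposed in this thread: the helper names first_allowed_cpu and pin_current_thread are hypothetical, and it relies on the Linux behaviour that sched_setaffinity with a pid of 0 applies to the calling thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: return the lowest-numbered CPU the calling
 * thread is allowed to run on, or -1 if the mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Hypothetical helper: restrict the calling thread to a single CPU.
 * Returns 0 on success and -1 on failure (errno is set). */
int pin_current_thread(int cpu)
{
    cpu_set_t one;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one);
}
```

Each computation thread would call pin_current_thread() with a distinct core number as the first statement of its thread function. Threads created afterwards by a pinned thread inherit its mask on Linux, so this arrangement fits best when MKL runs sequentially inside each worker.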

AndrewC · 06-09-2011

It is my understanding that if you call MKL inside an OpenMP threaded region, MKL reverts to single-threading because nested threading is off. This is certainly what I see.

That is, on a 4-core machine, I only ever see my 4 (omp) threads running when my OpenMP code is calling MKL functions. Outside OpenMP threaded regions, I see 4 threads when calling MKL, as MKL is starting its own threads.
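The distinction between "inside" and "outside" a threaded region can be observed with the standard OpenMP call omp_in_parallel(). The sketch below only illustrates that condition; the helper name probe_parallel_state is hypothetical, and whether MKL uses this exact call to decide is an assumption.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical probe: report whether the caller is inside an active
 * parallel region, first outside and then inside a parallel block.
 * Both results are normalized to 0 or 1, and both are 0 when the code
 * is compiled without OpenMP. */
void probe_parallel_state(int *outside, int *inside)
{
    *outside = 0;
    *inside = 0;
#ifdef _OPENMP
    *outside = omp_in_parallel() ? 1 : 0;
    #pragma omp parallel
    {
        /* One thread of the team records the state for the region. */
        #pragma omp single
        *inside = omp_in_parallel() ? 1 : 0;
    }
#endif
}
```

Note that omp_in_parallel() reports an active region only when the team has more than one thread, so on a single-core machine both values can be 0.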