Our multithreaded program uses the cvm library, which is built on Intel MKL. We observed that as we increased the number of running threads, the program spent more time on locking and unlocking. We use pthreads, not OpenMP, and we identified the hot spots with VTune. The test machine has 8 cores, and our program uses 1 thread for main(), 1 loading thread, and several computation threads (which call MKL). We got the best performance with 4 computation threads. Once we increased the number of computation threads beyond 4, performance began to degrade, and "__lll_unlock_wake" and "__lll_lock_wait" became the top 2 hot spots. The data passed to the computation threads were independent of each other. We also tried a dummy thread function in place of MKL, which just calls sqrt() in a loop so that the computation time is long enough. We didn't see "__lll_unlock_wake" or "__lll_lock_wait" with the dummy function. The author of the cvm library told us that the locking/unlocking functions came from MKL.
Why does performance degrade before the number of logical threads reaches the number of physical cores? How does MKL use locking/unlocking in its multithreaded functions? Is there any limitation on the number of threads and/or cache? How can we optimize the MKL configuration for a multithreaded program?
If you are running individual pthreads on a majority of the cores, it's unlikely that you would find threaded MKL running as well as the "sequential" MKL. If you do have enough cores available to accommodate both your pthreads and the MKL threads, you will likely run into random affinity conflicts, and so continue to see excessive lock activity. I've been told that you should be able to bring your Linux pthreads under the control of KMP_AFFINITY, e.g. by setting up a tiny OpenMP parallel region and calling omp_get_num_threads. This would enable you to spread out your pthreads and MKL OpenMP threads.
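A different way to keep computation threads from landing on the same core is to pin each pthread explicitly. The sketch below is not the KMP_AFFINITY route mentioned above; it uses the Linux-specific pthread_setaffinity_np and sched_getaffinity calls instead, and it assumes each thread then calls MKL in sequential mode. The names worker_arg, pin_self_to_kth_cpu, pinned_worker and run_pinned_workers are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for CPU_* macros and pthread_setaffinity_np */
#endif
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define MAX_PINNED_THREADS 64

typedef struct {
    int index;  /* worker number, 0-based */
    int cpu;    /* CPU id the worker was pinned to, or -1 on failure */
} worker_arg;

/* Pin the calling thread to the k-th CPU in its current affinity mask,
 * wrapping around if k exceeds the number of allowed CPUs.
 * Returns the chosen CPU id, or -1 on error. */
static int pin_self_to_kth_cpu(int k)
{
    cpu_set_t allowed;

    if (k < 0 || sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    int n = CPU_COUNT(&allowed);
    if (n <= 0)
        return -1;

    int target = k % n;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        if (target-- == 0) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (pthread_setaffinity_np(pthread_self(), sizeof(one), &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}

static void *pinned_worker(void *p)
{
    worker_arg *a = (worker_arg *)p;
    a->cpu = pin_self_to_kth_cpu(a->index);
    /* Assumption: the sequential MKL computation on this thread's own,
     * independent data would go here. */
    return NULL;
}

/* Start nthreads workers, each pinned to its own allowed CPU when
 * enough CPUs exist. Returns 0 if every worker was pinned, else -1. */
static int run_pinned_workers(int nthreads, worker_arg *args)
{
    pthread_t tid[MAX_PINNED_THREADS];
    int started = 0, rc = 0;

    assert(args != NULL);
    if (nthreads < 1 || nthreads > MAX_PINNED_THREADS)
        return -1;

    for (int i = 0; i < nthreads; ++i) {
        args[i].index = i;
        args[i].cpu = -1;
        if (pthread_create(&tid[i], NULL, pinned_worker, &args[i]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    for (int i = 0; i < started; ++i)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < started; ++i)
        if (args[i].cpu < 0)
            rc = -1;
    return rc;
}
```

New threads inherit the creator's affinity mask, so the workers should be started from a thread that has not been pinned itself. Whether pinning actually removes the lock contention seen in VTune would need to be measured; it only rules out two computation threads competing for one core.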
Thank you very much for your response. I have changed those parameters in many combinations, but I have not seen any improvement in performance. Could you please send me an example of how to avoid random affinity conflicts?
It is my understanding that if you are calling MKL in an OpenMP threaded region, MKL will revert to single-threading, as nested threading is off. This is certainly what I see.
That is, on a 4 core machine, I only ever see my 4 (omp) threads running when my OpenMP code is calling MKL functions. Outside OpenMP threaded regions, I see 4 threads when calling MKL, as MKL is starting its own threads.