Multithread performance in MKL 7.0 (sgemm)

laurent-zanoni · 04-19-2004

I'm evaluating the MKL 7.0 library. My project is in MS VC6.

I've replaced my own matrix-matrix multiplication (used in neural net computation) with SGEMM in our DLL (it's an SDK). On a P4, I get a gain of nearly 30% :-)

As it's an SDK, its user can choose to create many threads, each using one of our objects. I also can't make any assumptions about the multithreading library the user will choose (Win32, OMP...).

I've seen that there is a default limit of 32 threads in SGEMM (controlled by the ENV variable KMP_ALL_THREADS).

I've raised this limit to 64 with a 'putenv' in the initialization portion of my DLL.
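Roughly what I do is sketched below. The helper name `raise_kmp_thread_limit` is made up for illustration; the variable name and the value 64 are the ones mentioned above. I assume the call has to happen before the threading runtime reads the variable, which is why it sits in the initialization code.

```c
#include <stdlib.h>

/* Hypothetical helper, meant to be called once from the DLL's
 * initialization code, before any SGEMM call is made. */
int raise_kmp_thread_limit(void)
{
#ifdef _WIN32
    /* The MS C runtime spells it _putenv and copies the string. */
    return _putenv("KMP_ALL_THREADS=64");
#else
    /* POSIX putenv keeps a pointer to the string, so it must stay alive. */
    static char setting[] = "KMP_ALL_THREADS=64";
    return putenv(setting);
#endif
}
```

Both variants return 0 on success, so the initialization code can check the result.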

But I wonder: does setting this limit higher (let's say 64 or 128 at most) cause a performance penalty when that many threads are not required? Let's say I create 16 threads: will changing KMP_ALL_THREADS to 128 cause any harm compared to leaving it at 32?

Also, in the case of a multi-CPU machine, what are the best MKL/KMP/OMP environment variable settings? Again, I can't make any assumptions about the multithreading library our user will finally choose.

Intel_C_Intel · 04-19-2004

There are two issues, in a sense. The first is whether the routine that calls sgemm is threaded. If it is, there may be good reason to do the threading there even if the hardware is oversubscribed.

If the threading is done within MKL, you should not use more threads than there are processors, as this can slow performance down a lot. The slowdown arises because sgemm uses cache very effectively: when sgemm is spread across multiple processors, switching threads causes the data used by one thread to be discarded from the cache(s) and new data to be loaded.

Most of MKL sets the default number of threads to 1 (the direct sparse solver is, at the moment, an exception to that rule) and the user needs to set the thread count higher if multiple processors are to be used. With the information I have, I would suggest recommending the user not set the number of threads to a value greater than the number of processors in the system.
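As a minimal sketch of that recommendation, the SDK could cap whatever thread count the user asks for at the processor count. The helper `capped_thread_count` is hypothetical, not an MKL function; the processor count would come from the operating system (on Win32, for example, the `dwNumberOfProcessors` field filled in by `GetSystemInfo`).

```c
/* Hypothetical helper: never hand MKL more threads than processors.
 * requested  - thread count the user asked for
 * processors - processor count reported by the operating system */
int capped_thread_count(int requested, int processors)
{
    if (processors < 1)
        processors = 1;   /* defensive: treat a bad report as one CPU */
    if (requested < 1)
        return 1;         /* fall back to the single-thread default */
    return requested < processors ? requested : processors;
}
```

For example, a request for 8 threads on a 2-processor machine would be reduced to 2.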

Bruce

laurent-zanoni · 04-20-2004

Thanks for your reply.

I must assume that the routine calling sgemm will be threaded by the user: the expected application is speech recognition for telecom servers, so it's likely to have 32 to 64 channels running in parallel, each one doing its own sgemm. I have to compute one 'forward' pass in the neural net (2 matrix-matrix multiplications, typical size being [1 to 4,200] by [200,1000] for the first layer) every 30 ms.
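To put rough numbers on that load, here is a back-of-the-envelope count. It assumes the sizes above mean an input of at most 4 rows by 200 columns multiplied by a 200 x 1000 weight matrix, and it covers the first layer only. The function names are made up for illustration.

```c
/* Floating-point operations for an (m x k) by (k x n) product:
 * one multiply and one add per accumulated term. */
double gemm_flops(int m, int k, int n)
{
    return 2.0 * m * k * n;
}

/* Sustained rate needed when `channels` such products must each
 * finish once every `period_s` seconds. */
double required_flops_per_second(int channels, double flops_per_pass,
                                 double period_s)
{
    return channels * flops_per_pass / period_s;
}
```

With m = 4, k = 200, n = 1000 the first layer costs 1.6 million operations per pass, and 64 channels every 30 ms come to about 3.4 billion operations per second in total.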

If I assumed no threading in the routine, my standard matrix code is more than fast enough :-)

If I get it right, you'd say to let the user set MKL_NUM_PROCS to the number of real CPUs (is it x2 if the CPU is Hyperthreaded?), and to raise KMP_ALL_THREADS to a suitable number like 64 or 128 according to the expected number of user threads running in parallel (telephony channels).

Is performance greatly reduced if a thread switch occurs while in SGEMM? Is it possible to force SGEMM to complete without a thread switch when there is a high number of user threads (64)?

The complete neural net can be quite large (a little more than 1 MB in total), so I suppose it does not fit completely in the cache of most P4 CPUs, but perhaps the load-to-cache time is small compared to the actual multiplication process?

Message Edited by Laurent-Zanoni on 04-20-2004 01:22 AM