MKL DFT: multi-threading by default?

mpbro · 12-11-2007

As I mentioned in another post, I'm using the FFTW wrappers for MKL's DFT. I tested a 500x500x500 real-to-complex-even DFT (forward and inverse) with both native FFTW and MKL. In FFTW, I forced multi-threading on my 4-CPU node by calling sfftw_init_threads() and sfftw_plan_with_nthreads(). MKL, I noticed, seems to use 4 threads by default. This is not a complaint; that's pretty smart code. But could someone confirm that this is indeed the case?

FWIW, on this problem, the 4-thread FFTW result was about 2x slower than the 4-thread MKL result. Very impressive!
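For a sense of scale, the storage needed by a 500x500x500 real-to-complex transform can be estimated with a few lines of arithmetic. This is only a rough sketch: it assumes single precision (inferred from the sfftw_ prefix on the routines named above) and counts just the input and output arrays, not any workspace a library may allocate internally.

```python
# Rough storage estimate for a 500x500x500 real-to-complex transform.
# Assumption: single precision (4-byte reals), inferred from the sfftw_ prefix.
N0 = N1 = N2 = 500
BYTES_REAL = 4      # one 32-bit float
BYTES_COMPLEX = 8   # two 32-bit floats (real and imaginary parts)

real_elems = N0 * N1 * N2
# A real-to-complex transform keeps only N/2 + 1 complex values along one axis,
# because the remaining values follow from conjugate symmetry.
complex_elems = (N0 // 2 + 1) * N1 * N2

real_mb = real_elems * BYTES_REAL / 1e6
complex_mb = complex_elems * BYTES_COMPLEX / 1e6

print(f"real input:     {real_mb:.0f} MB")
print(f"complex output: {complex_mb:.0f} MB")
```

Under these assumptions an out-of-place transform holds roughly 1 GB in the two arrays alone, so any additional per-thread scratch space would come on top of that.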

mpbro · 12-11-2007

Section 6-8 of the user's guide just answered this question (quoted below). This is cool, unless the extra threads increase peak memory usage and cause the system to bog down!

"The value of MKL_DYNAMIC is by default set to TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE. MKL_DYNAMIC being TRUE means that Intel MKL will always try to pick what it considers the best number of threads, up to the maximum specified by the user."
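One way to cap the thread count when memory is a concern is to set the MKL threading environment variables before the program starts. The sketch below builds such an environment in Python for launching a child process. MKL_NUM_THREADS and MKL_DYNAMIC are the variables described in the MKL user's guide; the helper name mkl_thread_env and the ./dft_test binary in the comment are hypothetical, used only for illustration.

```python
import os


def mkl_thread_env(num_threads, dynamic=False, base=None):
    """Return a copy of an environment with MKL threading variables set.

    Hypothetical helper: num_threads is the upper bound MKL may use;
    dynamic=True lets MKL pick fewer threads than that bound.
    """
    env = dict(os.environ if base is None else base)
    env["MKL_NUM_THREADS"] = str(num_threads)
    env["MKL_DYNAMIC"] = "TRUE" if dynamic else "FALSE"
    return env


# Force a single-threaded run to compare peak memory against the default.
env = mkl_thread_env(1)
print(env["MKL_NUM_THREADS"], env["MKL_DYNAMIC"])

# The environment could then be passed to a child process, for example:
#   subprocess.run(["./dft_test"], env=env)   # ./dft_test is a hypothetical binary
```

Setting the same two variables in the shell before launching the program should have the same effect; the Python wrapper is just a convenient way to script repeated runs at different thread counts.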

TimP · 12-12-2007

I guess you may be using a fairly old MKL. In MKL 9 there were serial (non-threaded) libraries as well as the primary threaded ones. In MKL 10 you choose threaded by -lmkl_intel_thread, or non-threaded by -lmkl_sequential.

g_f_thomas · 12-12-2007

See Fig. 2 in

http://softwaredispatch.intel.com/?pin=ON29EE32&xid=0A2&t=3&lid=2114

Gerry

mpbro · 12-12-2007

Gerry,

Thanks for the link--the results are impressive.

So far, I've found the following for a large (500x500x500) forward/inverse DFT pair:

Run-time ratios:
FFTW 1 thread / FFTW 4 threads ~ 3.8x
FFTW 4 threads / MKL 4 threads ~ 2x
MKL 1 thread / MKL 4 threads ~ 2.5x

So for this 3D problem, FFTW multi-threaded scales very nicely. However, the factor of 2 between FFTW and MKL is incredibly tantalizing for me. If I could just lick this memory problem...
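The three ratios above also imply a parallel efficiency for each library and a single-thread comparison between them. A short sketch of that arithmetic, using only the approximate ratios quoted in this post (so the results are equally approximate):

```python
# Run-time ratios as reported above (approximate values from the post).
CPUS = 4
fftw1_over_fftw4 = 3.8   # FFTW 1 thread / FFTW 4 threads
fftw4_over_mkl4 = 2.0    # FFTW 4 threads / MKL 4 threads
mkl1_over_mkl4 = 2.5     # MKL 1 thread / MKL 4 threads

# Parallel efficiency = speedup divided by the number of CPUs.
fftw_eff = fftw1_over_fftw4 / CPUS
mkl_eff = mkl1_over_mkl4 / CPUS

# Chain the ratios to compare the two libraries on a single thread:
# (FFTW1/FFTW4) * (FFTW4/MKL4) / (MKL1/MKL4) = FFTW1/MKL1
fftw1_over_mkl1 = fftw1_over_fftw4 * fftw4_over_mkl4 / mkl1_over_mkl4

print(f"FFTW parallel efficiency: {fftw_eff:.1%}")
print(f"MKL parallel efficiency:  {mkl_eff:.1%}")
print(f"FFTW 1 thread / MKL 1 thread: {fftw1_over_mkl1:.2f}x")
```

In other words, FFTW scales at about 95% efficiency on 4 CPUs versus roughly 62.5% for MKL, while single-threaded MKL comes out about 3x faster than single-threaded FFTW on this problem.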

mpbro · 12-12-2007

Hi Tim,

Thanks for the response. I'm actually using MKL v10, and I am linking with -lmkl_intel_thread.