Re: FFT routines and threading

skan95 · ‎04-25-2008

Hello,

I am a newbie to MKL and am trying out the 10.0.011 FFT routines with gcc as my compiler. My machine is an Intel Core 2 PC, and MKL indeed reports that the maximum number of threads is 2. My test code is not itself threaded.

I've run FFTs ranging from 8192 points to 262144 points. When the batch size is 1 and I use mkl_set_num_threads to change the number of threads, I do not see any performance change. I've tried settings of 1, 2, and 4 threads.

If I change the batch size to 2, 4, 8, and 16, I see better performance with the 2-thread setting. I am not surprised by this, as there are only 2 cores on my PC. However, if I monitor CPU usage with gnome-system-monitor, I only see one core at a time being used at or close to 100%. The other CPU core only very occasionally shows high usage.

First, can someone tell me whether a batch size of 1 should also experience some mkl threading? From the manual I assumed that the only time that a 1D FFT of batch size 1 would not thread is if its size is not a power of 2.

Second, can you tell me why one of my CPU cores is barely being utilized? Do I need to link to the omp libs to get both cores going with mkl? I do not set any thread related env vars as I call mkl_set_num_threads directly from my code.

Thank you,
skb

Dmitry_B_Intel · ‎04-25-2008

Hello,

A one-dimensional FFT with DFTI_NUMBER_OF_TRANSFORMS set to 1 will run on the two cores of your system unless its size is not a power of 2, it is single-precision, it is an in-place transform, or its stride is non-unit. Of course, you should have linked the application to the OpenMP library (-lguide). A greater performance benefit from running on two cores will show up with 2D and higher-dimensional problems.

Thanks
Dima
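
The conditions listed in the reply above can be written down as a small predicate. This is only a sketch of the rule as stated in this thread for MKL 10.0; the struct and function names are hypothetical, are not part of the MKL API, and the rule itself may differ in other MKL versions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a single 1D transform (not an MKL type). */
typedef struct {
    size_t    length;            /* number of points */
    bool      double_precision;  /* false means single precision */
    bool      in_place;          /* false means out-of-place */
    ptrdiff_t stride;            /* distance between consecutive elements */
} fft1d_case;

/* True when n is a power of two (n = 1, 2, 4, 8, ...). */
static bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Encodes the rule from the reply: a single 1D transform is threaded
 * only if it is power-of-2 sized, double precision, out-of-place,
 * and has unit stride. */
bool single_fft1d_may_thread(const fft1d_case *c)
{
    return is_power_of_two(c->length)
        && c->double_precision
        && !c->in_place
        && c->stride == 1;
}
```

For example, an 8192-point, double-precision, out-of-place, unit-stride case passes the check, while the same case in single precision or in-place does not.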

bonniegb · ‎06-19-2008

I too am a new user of the MKL libraries. I have a Linux box with 4 dual-core processors that reports Max Num Threads = 16. When my application runs the FFT, it only uses 1 CPU at 100%, leaving the other 15 CPUs idle (monitored via the "top" tool). The software does meet 2 of your criteria for keeping it on 1 CPU, namely it is single precision and an in-place FFT. How can I get it to spread across the other CPUs? Should I switch it to not be an in-place FFT? Switching to a double-precision FFT would seem to slow it down and take double the memory, but if this would spread the FFT across CPUs, would it be faster overall? Or would adding threading elsewhere in the application be more useful, such as double buffering the FFT, the manipulation of the data after the FFT, and the final reverse FFT?

Thanks,
Bonnie

amath · ‎06-20-2008

Hi,

I am using MKL (version 8.1) on Mac OS X and trying a simple FFT example using multi-threading. My Mac has 2 quad-core processors. I have a couple of questions:

(1) To use multi-threading for the FFT routine (DftiComputeForward) only, do I need to compile with -openmp?

(2) If I define the environment variable OMP_NUM_THREADS=8, do I still need to set DFTI_NUMBER_OF_USER_THREADS?

Thanks

Dmitry_B_Intel · ‎06-20-2008

Bonnie,

You can utilize the other CPUs by doing non-1D transforms, by doing transforms in batches (see the DFTI_NUMBER_OF_TRANSFORMS configuration parameter), or by threading in a coarser way, at the application level. In the latter case, refer to the DFTI_NUMBER_OF_USER_THREADS configuration parameter. Moving to double precision will also utilize the other CPUs on 1D out-of-place power-of-2 transforms.

Thanks,
Dima
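
The application-level approach mentioned in the reply above can be sketched with plain pthreads: a batch of independent transforms is split into contiguous ranges and each worker thread handles one range. This is a minimal sketch under stated assumptions, not MKL code. The names cplx, job, dft4, and batch_dft4 are hypothetical, and dft4 is a fixed 4-point DFT used only as a stand-in for the library FFT call, so the example needs nothing beyond pthreads.

```c
#include <pthread.h>
#include <stddef.h>

#define DFT_LEN 4  /* points per transform; 4 keeps the twiddle factors exact */

typedef struct { double re, im; } cplx;

/* Stand-in for the library call: out-of-place forward 4-point DFT. */
static void dft4(const cplx *x, cplx *X)
{
    X[0].re = x[0].re + x[1].re + x[2].re + x[3].re;
    X[0].im = x[0].im + x[1].im + x[2].im + x[3].im;
    /* X1 = (x0 - x2) - i*(x1 - x3) */
    X[1].re = (x[0].re - x[2].re) + (x[1].im - x[3].im);
    X[1].im = (x[0].im - x[2].im) - (x[1].re - x[3].re);
    X[2].re = x[0].re - x[1].re + x[2].re - x[3].re;
    X[2].im = x[0].im - x[1].im + x[2].im - x[3].im;
    /* X3 = (x0 - x2) + i*(x1 - x3) */
    X[3].re = (x[0].re - x[2].re) - (x[1].im - x[3].im);
    X[3].im = (x[0].im - x[2].im) + (x[1].re - x[3].re);
}

/* Work assigned to one thread: transforms in the half-open range [first, last). */
typedef struct {
    const cplx *in;
    cplx       *out;
    size_t      first, last;
} job;

static void *worker(void *arg)
{
    job *j = arg;
    for (size_t t = j->first; t < j->last; ++t)
        dft4(j->in + t * DFT_LEN, j->out + t * DFT_LEN);
    return NULL;
}

/* Runs `batch` independent transforms on `nthreads` threads.
 * Returns 0 on success, -1 on a bad thread count or if a thread
 * could not be created (the output is then incomplete). */
int batch_dft4(const cplx *in, cplx *out, size_t batch, int nthreads)
{
    enum { MAX_THREADS = 64 };
    pthread_t tid[MAX_THREADS];
    job jobs[MAX_THREADS];
    int started = 0, rc = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    for (int i = 0; i < nthreads; ++i) {
        jobs[i].in = in;
        jobs[i].out = out;
        /* Contiguous split; ranges differ in size by at most one transform. */
        jobs[i].first = batch * (size_t)i / (size_t)nthreads;
        jobs[i].last  = batch * (size_t)(i + 1) / (size_t)nthreads;
        if (pthread_create(&tid[i], NULL, worker, &jobs[i]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    for (int i = 0; i < started; ++i)
        pthread_join(tid[i], NULL);
    return rc;
}
```

Each worker writes only to its own slice of the output, so no locking is needed. In a real program the call to dft4 would be replaced by the library's FFT on that slice.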

Dmitry_B_Intel · ‎06-20-2008

amath,

Setting the environment variable OMP_NUM_THREADS=8 should be enough for many transforms to be done in parallel when you call DftiComputeForward. The configuration parameter DFTI_NUMBER_OF_USER_THREADS refers to a different way of parallelization: you will need it if you parallelize your application and want several threads of the application to share the same descriptor in calls to DftiComputeForward.

Thanks,
Dima

bonniegb · ‎06-23-2008

I am still not having any luck. I have changed my code to use a double-precision, out-of-place forward FFT. The return value from mkl_get_max_threads is 16. I have set every environment variable I can find: MKL_NUM_THREADS=16, OMP_NUM_THREADS=16, MKL_DYNAMIC=TRUE & FALSE, OMP_DYNAMIC=TRUE & FALSE, and MKL_DOMAIN_NUM_THREADS="MKL_FFT=16", and I have tried other thread counts. I have also tried to set these in the code using MKL_Set_Num_Threads(16) & MKL_Domain_Set_Num_Threads (16,MKL_FFT). Through all these variations I am still only utilizing 1 CPU and leaving the other 15 idle. What am I missing?

Thanks,
Bonnie
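
When many variables are being toggled as in the post above, it can help to print, from inside the program, the values the process actually sees. The following is a small sketch using only the C standard library; the helper names are hypothetical and nothing here calls MKL. It accepts plain positive integers, as used for OMP_NUM_THREADS and MKL_NUM_THREADS, and returns the fallback for anything else, including the name=value form used for MKL_DOMAIN_NUM_THREADS.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parses a thread count given as text. Returns `fallback` when the text
 * is missing, empty, not a plain integer, or not in the range 1..INT_MAX. */
int parse_thread_count(const char *text, int fallback)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return fallback;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 1 || value > INT_MAX)
        return fallback;
    return (int)value;
}

/* Reads the named environment variable as seen by this process. */
int thread_count_from_env(const char *name, int fallback)
{
    return parse_thread_count(getenv(name), fallback);
}
```

Calling thread_count_from_env("OMP_NUM_THREADS", 1) at program start and printing the result shows whether the setting reached the process at all, which is a separate question from whether the library honors it.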

Todd_R_Intel · ‎06-26-2008

Have you tried submitting a simple test case at premier support? That is often the best way for us to get all the details needed to determine if there is something that can be done to get you the scaling you're looking for, or if the case you have is just not parallelized for one reason or another.

-Todd

bonniegb · ‎07-10-2008

I tried building your example code using your makefile & it too only uses 1 of 16 CPUs (monitored using 'top').
REAL_1D_CSS_DOUBLE_EX1.OUT
REAL_1D_CSS_DOUBLE_EX2.OUT
I tried setting the environment variables MKL_NUM_THREADS, MKL_DOMAIN_NUM_THREADS, and OMP_NUM_THREADS to 16 and there was no change. I also toggled the MKL_DYNAMIC and OMP_DYNAMIC variables from 0 to 1.

I put some quick pthreads threading in my code, and it correctly spread across CPUs without setting or changing any environment variable.

Is there some system installation parameter or account parameter that is set wrong?

Still baffled,
Bonnie

k_doshi · ‎09-17-2008

Hi,

So there is no way I can run the 1D FFT (single-precision) on two cores? Even mkl_set_num_threads(2) won't guarantee that the DFT will run on two cores?

Thanks
Kavi

Gennady_F_Intel · ‎09-24-2008

Hello Kavi,

The short answer is yes: there is no way to do that right now.

As Todd recommended, could you please submit a simple test case at premier support?

This really is the best and fastest way to resolve this kind of issue.

Gennady