<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic "Can you set a size other than" in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048741#M49017</link>
    <description>&lt;P&gt;Can you set a size other than a power of two, say 504x504 or 520x520 (assuming doubles)? Use a multiple of the cache line size, but not a power of 2.&lt;/P&gt;

&lt;P&gt;John D. McCalpin has written good explanations on this forum of why to avoid power-of-two sizes. Search for his name and you should find a link to the posting.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Sat, 08 Nov 2014 03:01:29 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2014-11-08T03:01:29Z</dc:date>
    <item>
      <title>Poor FFT mkl performance</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048738#M49014</link>
      <description>&lt;P&gt;I am optimizing a new application (written with Xeon Phi in mind) which performs a lot of FFT transforms.&lt;/P&gt;

&lt;P&gt;The transforms are done on 512x512 arrays separately in each thread. This works quite well on Xeon host. When running on Xeon Phi in native mode the performance is much slower than expected.&lt;/P&gt;

&lt;P&gt;After profiling (screenshot attached) I see that a lot of time is spent in mkl_dft_grasp_user_thread(). Can anyone tell me what this function does (I was not able to find anything on Google) and whether there is any way to mitigate the performance issue?&lt;/P&gt;

&lt;P&gt;thank you very much&lt;/P&gt;

&lt;P&gt;Vladimir Dergachev&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 06 Nov 2014 20:09:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048738#M49014</guid>
      <dc:creator>Vladimir_Dergachev</dc:creator>
      <dc:date>2014-11-06T20:09:06Z</dc:date>
    </item>
    <item>
      <title>Let me state something that</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048739#M49015</link>
      <description>&lt;P&gt;Let me state something that to the uninitiated will seem counter-intuitive.&lt;/P&gt;

&lt;P&gt;When using a multi-threaded program in which each thread calls MKL, you are supposed to link with the &lt;EM&gt;&lt;STRONG&gt;single-threaded&lt;/STRONG&gt;&lt;/EM&gt; version of MKL. Linking with the multi-threaded MKL can cause each calling thread to spawn its own thread pool.&lt;/P&gt;

&lt;P&gt;Let me add a caveat here. This situation has come up often enough that MKL may have been modified to detect it and instantiate a single thread pool. While that may prove satisfactory when each user thread calls MKL intermittently, it may be detrimental when many user threads call MKL concurrently.&lt;/P&gt;

&lt;P&gt;From the name mkl_dft_grasp_user_thread() it is not clear which of the two cases is the cause. However, linking with the single-threaded MKL (in this instance) may produce the results you seek.&lt;/P&gt;

&lt;P&gt;You may want to experiment using 2, 3, and 4 threads per core.&lt;/P&gt;
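
&lt;P&gt;As a sketch (assuming the Intel compiler on Linux for native Xeon Phi builds; exact library names vary by MKL version, so check the link-line advisor), the single-threaded link looks like this:&lt;/P&gt;

&lt;PRE&gt;# single-threaded MKL via the convenience flag:
icc -mmic myapp.c -mkl=sequential

# or with an explicit link line (library names vary by MKL version):
icc -mmic myapp.c -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm&lt;/PRE&gt;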

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2014 20:52:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048739#M49015</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-11-07T20:52:11Z</dc:date>
    </item>
    <item>
      <title>I did try linking by</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048740#M49016</link>
      <description>&lt;P&gt;I did try linking by specifying -mkl=sequential or -mkl=parallel, but I get essentially the same trace. Also, in that particular case the application was using only 30 threads, leaving plenty of room for extra threads if needed.&lt;/P&gt;

&lt;P&gt;best&lt;/P&gt;

&lt;P&gt;Vladimir Dergachev&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2014 21:17:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048740#M49016</guid>
      <dc:creator>Vladimir_Dergachev</dc:creator>
      <dc:date>2014-11-07T21:17:37Z</dc:date>
    </item>
    <item>
      <title>Can you set a size other than</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048741#M49017</link>
      <description>&lt;P&gt;Can you set a size other than a power of two, say 504x504 or 520x520 (assuming doubles)? Use a multiple of the cache line size, but not a power of 2.&lt;/P&gt;

&lt;P&gt;John D. McCalpin has written good explanations on this forum of why to avoid power-of-two sizes. Search for his name and you should find a link to the posting.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sat, 08 Nov 2014 03:01:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048741#M49017</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-11-08T03:01:29Z</dc:date>
    </item>
    <item>
      <title>You would want each copy of</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048742#M49018</link>
      <description>&lt;P&gt;You would want each copy of MKL to use a small team of threads, with the total number less than 4 times the number of cores. By default, nested parallelism (OMP_NESTED) is off, so you would likely not use enough threads.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Nov 2014 05:57:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048742#M49018</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-11-08T05:57:53Z</dc:date>
    </item>
    <item>
      <title>Dear Vladimir,</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048743#M49019</link>
      <description>&lt;P&gt;Dear Vladimir,&lt;/P&gt;

&lt;P&gt;You may find the following article useful.&lt;/P&gt;

&lt;P&gt;Please also consider upgrading to the latest MKL version.&lt;/P&gt;

&lt;P&gt;The FFT performance has been improved since MKL 11.1 was released a year ago.&lt;/P&gt;

&lt;P&gt;Evgueni.&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors"&gt;https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 03:37:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048743#M49019</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2014-11-10T03:37:07Z</dc:date>
    </item>
    <item>
      <title>For multi-dimensional FFTs</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048744#M49020</link>
      <description>&lt;P&gt;For multi-dimensional FFTs you want to transform vectors with lengths that are powers of 2 for performance, but you also want to pad the data storage so that independent transforms are not accessing vectors that are separated by powers of two.&lt;/P&gt;

&lt;P&gt;This is discussed at &lt;A href="https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors" target="_blank"&gt;https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors&lt;/A&gt;&lt;/P&gt;
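
&lt;P&gt;To make the padding concrete, here is a minimal sketch using the MKL DFTI C API (not taken from the article; the row stride of 520 is an illustrative choice of a cache-line multiple that is not a power of two):&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;/* 512x512 complex double transform stored with a padded row stride of 520,
   so consecutive rows are not separated by a power-of-two number of elements */
DFTI_DESCRIPTOR_HANDLE h;
MKL_LONG sizes[2]   = {512, 512};
MKL_LONG strides[3] = {0, 520, 1};  /* offset, row stride, element stride */
DftiCreateDescriptor(&amp;amp;h, DFTI_DOUBLE, DFTI_COMPLEX, 2, sizes);
DftiSetValue(h, DFTI_INPUT_STRIDES, strides);
DftiCommitDescriptor(h);
DftiComputeForward(h, data);  /* data allocated as 512*520 complex doubles */
DftiFreeDescriptor(&amp;amp;h);&lt;/PRE&gt;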

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 15:36:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048744#M49020</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-11-10T15:36:17Z</dc:date>
    </item>
    <item>
      <title>Great, thanks for the</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048745#M49021</link>
      <description>&lt;P&gt;Great, thanks for the suggestions !&lt;/P&gt;

&lt;P&gt;I am going to try using non-power of 2 image.&lt;/P&gt;

&lt;P&gt;However, I would have expected if cache aliasing was a problem I would see a lot of time spent in a function that does computation and a lot of cache misses. But what I see instead is that most of the time is spent in mkl_dft_grasp_user_thread() and it increases sharply with number of threads allocated to the process. Which suggests that the problem is contention of some sort, but why do we need to "grasp" threads even in case of a sequential library ?&lt;/P&gt;

&lt;P&gt;It's too bad the source is not available as it is for fftw.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 18:03:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048745#M49021</guid>
      <dc:creator>Vladimir_Dergachev</dc:creator>
      <dc:date>2014-11-10T18:03:52Z</dc:date>
    </item>
    <item>
      <title>The observed "hot" routine</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048746#M49022</link>
      <description>&lt;P&gt;The observed "hot" routine certainly seems strange for this use case.&lt;/P&gt;

&lt;P&gt;Until someone from Intel comments, it might be useful to look at this a couple of different ways:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Can you give us an idea of the absolute performance (e.g., seconds per 512x512 transform) for a few different numbers of independent threads?&lt;/LI&gt;
	&lt;LI&gt;Does the execution time change much when you don't profile with VTune?&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Mon, 10 Nov 2014 22:46:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048746#M49022</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-11-10T22:46:50Z</dc:date>
    </item>
    <item>
      <title>It would also be helpful to</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048747#M49023</link>
      <description>&lt;P&gt;It would also be helpful to understand what threading model you are using and how the environment is set up. For example, it is important to make sure that the independent threads calling MKL routines don't end up getting bound (in MKL) to the same core.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 22:50:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048747#M49023</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-11-10T22:50:01Z</dc:date>
    </item>
    <item>
      <title>Can you check if this covers</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048748#M49024</link>
      <description>&lt;P&gt;Can you check if this covers your use case?&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/" target="_blank"&gt;https://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;If the FFT grid is the same for all the FFTs you need, performing multiple FFTs simultaneously is an option.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;status=DftiSetValue(my_handle,DFTI_NUMBER_OF_TRANSFORMS,howmany);
...
DftiCommitDescriptor(my_handle); //commit the handle

for(int i=0; i&amp;lt;num_fft; i+=howmany)
  fft(handle,data&lt;I&gt;); // data&lt;I&gt;=starting address of the i-th data on a FFT grid&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;For instance, with 240 threads one can use howmany=60, which is equivalent to doing 1 FFT on 1 core / 4 threads. The optimal howmany will depend on the FFT grid size, memory, and speed.&lt;/P&gt;

&lt;P&gt;Alternatively, you can use nested OpenMP which looks like this&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;status=DftiSetValue(my_handle,DFTI_NUMBER_OF_USER_THREADS,num_user_threads);
...
DftiCommitDescriptor(my_handle); //commit the handle

#pragma omp parallel num_threads(num_user_threads);
for(int i=0; i&amp;lt;howmany; ++i)
  fft(my_handle,data&lt;I&gt;); // threaded MKL&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;One needs to set the property of the handle so that multiple threads can share the same plan (Case 4 in the URL above). I'm concerned that the performance of nested OpenMP is not going to be great unless the environment variables are set very carefully.&lt;/P&gt;
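
&lt;P&gt;As a rough starting point (illustrative values, to be tuned for your workload), the environment for the nested case might look like:&lt;/P&gt;

&lt;PRE&gt;export OMP_NESTED=true
export OMP_MAX_ACTIVE_LEVELS=2
export MKL_DYNAMIC=false
export OMP_NUM_THREADS=60,4      # outer user threads, inner MKL threads
export KMP_AFFINITY=compact,granularity=fine&lt;/PRE&gt;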

&lt;P&gt;As John mentioned, data alignment and padding can have a big impact on performance. Use the DftiSetValue API to fine-tune these. See&amp;nbsp;https://software.intel.com/en-us/node/521959&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 23:18:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048748#M49024</guid>
      <dc:creator>Jeongnim_K_Intel1</dc:creator>
      <dc:date>2014-11-10T23:18:32Z</dc:date>
    </item>
    <item>
      <title>Looks like the problem goes</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048749#M49025</link>
      <description>&lt;P&gt;Looks like the problem goes away if I create a separate plan for each thread, rather than using the same plan.&lt;/P&gt;

&lt;P&gt;Given that MKL descriptors are lighter-weight than FFTW plans, this is not so bad.&lt;/P&gt;
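
&lt;P&gt;For reference, the per-thread-descriptor pattern looks roughly like this (a sketch assuming OpenMP; data[i] stands in for each transform's array):&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#pragma omp parallel
{
  /* each thread owns its own descriptor, so nothing is shared or contended */
  DFTI_DESCRIPTOR_HANDLE h;
  MKL_LONG sizes[2] = {512, 512};
  DftiCreateDescriptor(&amp;amp;h, DFTI_DOUBLE, DFTI_COMPLEX, 2, sizes);
  DftiCommitDescriptor(h);
  for (int i = omp_get_thread_num(); i &amp;lt; num_fft; i += omp_get_num_threads())
    DftiComputeForward(h, data[i]);
  DftiFreeDescriptor(&amp;amp;h);
}&lt;/PRE&gt;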

&lt;P&gt;Thank you for all the suggestions! Off to optimize it further...&lt;/P&gt;

&lt;P&gt;best&lt;/P&gt;

&lt;P&gt;Vladimir Dergachev&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2014 01:51:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048749#M49025</guid>
      <dc:creator>Vladimir_Dergachev</dc:creator>
      <dc:date>2014-11-11T01:51:30Z</dc:date>
    </item>
  </channel>
</rss>

