
Vladimir_Dergachev · ‎11-06-2014

I am optimizing a new application (written with Xeon Phi in mind) which performs a lot of FFT transforms.

The transforms are done on 512x512 arrays separately in each thread. This works quite well on the Xeon host. When running on Xeon Phi in native mode, the performance is much lower than expected.

After profiling (screen shot attached) I see that a lot of time is spent in mkl_dft_grasp_user_thread(). Can anyone tell me what this function does (I was not able to find anything on Google) and whether there is any way to mitigate the performance issue?

thank you very much

Vladimir Dergachev

jimdempseyatthecove · ‎11-07-2014

Let me state something that to the uninitiated will seem counter-intuitive.

When a multi-threaded program has each thread calling MKL, you are supposed to link with the single-threaded (sequential) version of MKL. Linking with the multi-threaded MKL will cause each calling thread's instance to spawn a new thread pool.

Let me add a caveat here. This has happened enough times that MKL may have been modified to detect this and instantiate a single thread pool. While this may prove satisfactory when each user thread calls MKL intermittently, it may hurt when many user threads call MKL concurrently.

From the name mkl_dft_grasp_user_thread() it is not clear which of the two cases is the cause. However, linking with the single-threaded MKL (in this instance) may produce the results you seek.

You may want to experiment using 2, 3, and 4 threads per core.

Jim Dempsey

Vladimir_Dergachev · ‎11-07-2014

I did try linking with --mkl=sequential and with --mkl=parallel, but I get essentially the same trace. Also, in that particular case the application was using only 30 threads, leaving plenty of room for extra threads if needed.

best

Vladimir Dergachev

jimdempseyatthecove · ‎11-07-2014

Can you try a size other than a power of two, say 504x504 or 520x520 (assuming doubles)? Use a multiple of the cache line size, but not a power of 2.

John D. McCalpin has explained on this forum why power-of-two sizes are best avoided. Search for his name and you should find a link to the posting.

Jim Dempsey

TimP · ‎11-07-2014

You would want each copy of MKL to use a small team of threads, with the total number of threads less than 4 times the number of cores. By default, OMP_NESTED is off, so you would likely not use enough threads.

Evgueni_P_Intel · ‎11-09-2014

Dear Vladimir,

You may find the following article useful:

https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors

Please also consider upgrading to the latest MKL version.

The FFT performance has been improved since MKL 11.1 was released a year ago.

Evgueni.

McCalpinJohn · ‎11-10-2014

For multi-dimensional FFTs you want to transform vectors with lengths that are powers of 2 for performance, but you also want to pad the data storage so that independent transforms are not accessing vectors that are separated by powers of two.

This is discussed at https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors
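To make the padding point concrete, here is a small sketch of how the starts of successive rows map onto cache sets when the row length is a power of two versus when each row is padded by one cache line. The cache geometry (64-byte lines and 64 sets, as in a 32 KB 8-way L1) and the helper names `cache_set_of_row` and `distinct_sets` are assumptions made for illustration; none of this comes from MKL.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache geometry (illustrative only): 64-byte lines and 64 sets,
   i.e. a 32 KB, 8-way set-associative L1 data cache. */
enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Hypothetical helper: cache set that the first element of `row` maps to,
   for an array of doubles whose rows are `ld` elements apart and whose
   base address is cache-line aligned (treated as byte offset 0). */
size_t cache_set_of_row(size_t row, size_t ld)
{
    assert(ld > 0);
    size_t byte_offset = row * ld * sizeof(double);
    return (byte_offset / LINE_BYTES) % NUM_SETS;
}

/* Hypothetical helper: number of distinct cache sets touched by the
   starts of the first `rows` rows (a column walk down the array). */
size_t distinct_sets(size_t rows, size_t ld)
{
    int seen[NUM_SETS] = {0};
    size_t count = 0;
    for (size_t r = 0; r < rows; ++r) {
        size_t s = cache_set_of_row(r, ld);
        if (!seen[s]) {
            seen[s] = 1;
            ++count;
        }
    }
    return count;
}
```

Under these assumptions, `distinct_sets(512, 512)` reports a single set for a column walk, while `distinct_sets(512, 520)` spreads the same walk over all 64 sets. The transform length stays 512 in both cases; only the storage stride changes.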

Vladimir_Dergachev · ‎11-10-2014

Great, thanks for the suggestions!

I am going to try using a non-power-of-2 image size.

However, if cache aliasing were the problem, I would have expected to see a lot of time spent in a function that does computation, along with a lot of cache misses. What I see instead is that most of the time is spent in mkl_dft_grasp_user_thread(), and that time increases sharply with the number of threads allocated to the process. This suggests that the problem is contention of some sort, but why do we need to "grasp" threads even with the sequential library?

It's too bad the source is not available as it is for fftw.

McCalpinJohn · ‎11-10-2014

The observed "hot" routine certainly seems strange for this use case.

Until someone from Intel comments, it might be useful to look at this a couple of different ways:

Can you give us an idea of the absolute performance (e.g., seconds per 512x512 transform) for a few different numbers of independent threads?
Does the execution time change much when you don't profile with VTune?

McCalpinJohn · ‎11-10-2014

It would also be helpful to understand what threading model you are using and how the environment is set up. For example, it is important to make sure that the independent threads calling MKL routines don't end up getting bound (in MKL) to the same core.

Jeongnim_K_Intel1 · ‎11-10-2014

Can you check if this covers your use case?

https://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/

If the FFT grid is the same for all the FFTs you need, performing multiple FFTs simultaneously is an option.

status = DftiSetValue(my_handle, DFTI_NUMBER_OF_TRANSFORMS, howmany);
...
status = DftiCommitDescriptor(my_handle); // commit the handle

// fft() stands for your own wrapper around the DFTI compute call
for (int i = 0; i < num_fft; i += howmany)
  fft(my_handle, data); // data = starting address of the i-th data set on the FFT grid

For instance, with 240 threads one can use howmany=60, which is equivalent to doing 1 FFT on 1 core/4 threads. The optimal howmany will depend on the FFT grid size, since it affects both memory use and speed.
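The index arithmetic behind that batched loop can be sketched in plain C. The helper names `num_batches`, `batch_offset` and `batch_size` are hypothetical, and the sketch assumes the transforms are stored back to back, `distance` elements apart (the value one would normally also pass to the descriptor as DFTI_INPUT_DISTANCE).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of batched calls needed to cover num_fft
   transforms when each call processes `howmany` of them (ceiling division). */
size_t num_batches(size_t num_fft, size_t howmany)
{
    assert(howmany > 0);
    return (num_fft + howmany - 1) / howmany;
}

/* Hypothetical helper: element offset of the first transform in batch b,
   assuming consecutive transforms are `distance` elements apart. */
size_t batch_offset(size_t b, size_t howmany, size_t distance)
{
    return b * howmany * distance;
}

/* Hypothetical helper: number of transforms in batch b. It equals howmany
   except for a final partial batch, which is smaller. */
size_t batch_size(size_t b, size_t num_fft, size_t howmany)
{
    size_t done = b * howmany;
    assert(done < num_fft);
    size_t left = num_fft - done;
    return left < howmany ? left : howmany;
}
```

For 250 transforms with howmany = 60 this gives five calls, the last covering only 10 transforms. A descriptor committed with DFTI_NUMBER_OF_TRANSFORMS set to 60 always processes 60 transforms, so a partial tail needs either its own descriptor or storage padded up to a full batch.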

Alternatively, you can use nested OpenMP, which looks like this:

status = DftiSetValue(my_handle, DFTI_NUMBER_OF_USER_THREADS, num_user_threads);
...
status = DftiCommitDescriptor(my_handle); // commit the handle

// "parallel for" distributes the iterations; the pragma takes no trailing semicolon
#pragma omp parallel for num_threads(num_user_threads)
for (int i = 0; i < howmany; ++i)
  fft(my_handle, data); // threaded MKL

One needs to set the property of a handle so that multiple threads can share the same plan (Case 4 in the URL above). I'm concerned that the performance of nested OpenMP is not going to be great unless the environment variables are set very carefully.

As John mentioned, data alignment and padding can have a big impact on performance. Use the DftiSetValue API to fine-tune these. See https://software.intel.com/en-us/node/521959

Vladimir_Dergachev · ‎11-10-2014

Looks like the problem goes away if I create a separate plan for each thread, rather than using the same plan.

Given that MKL descriptors are lighter than FFTW plans, this is not so bad.

Thank you for all the suggestions! Off to optimize it further.

best

Vladimir Dergachev

Poor FFT mkl performance