I am attempting to add an OpenMP layer of parallelism to an MPI code written in Fortran. However, I am not doing this correctly, as the result does not appear to be thread-safe. I can say this with some confidence because adding an !$OMP CRITICAL region around the library call produces a correct answer. I believe the problem lies in the FFTW wrappers.
Having looked at the MKL documentation, I see a mention of "fftw3_mkl.number_of_user_threads". I imagine this would resolve the issue, but I haven't been able to implement it successfully. Does anyone have a working sample code (written in Fortran) that shows the correct use of this object?
To clarify: I am attempting to execute iterations of a loop that contains calls to the FFT routines in parallel; essentially, serial FFTs run in parallel. I realise there is an argument for using parallel FFTs here, but that would require a substantially larger coding effort given the code surrounding the FFT library calls. That will be my next job.
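For concreteness, here is a minimal sketch of the pattern described: one FFTW plan shared by several OpenMP threads, each executing serial FFTs on its own slice of data. The subroutine name, array names, and sizes are illustrative only, not taken from the actual code.

```fortran
! Hypothetical sketch of "serial FFTs in parallel": a shared plan applied
! concurrently to disjoint arrays by each OpenMP thread.
subroutine batched_ffts(n, nbatch, in, out, plan)
   implicit none
   integer, intent(in)       :: n, nbatch
   integer*8, intent(in)     :: plan                 ! FFTW plan handle
   complex*16, intent(inout) :: in(n, nbatch), out(n, nbatch)
   integer :: i
!$omp parallel do private(i)
   do i = 1, nbatch
      ! dfftw_execute_dft is the "new-array execute" form of the legacy
      ! Fortran interface, so one plan can be applied to different arrays
      call dfftw_execute_dft(plan, in(:, i), out(:, i))
   end do
!$omp end parallel do
end subroutine batched_ffts
```

In genuine FFTW this pattern is thread-safe; the thread raises the question of whether the same holds for the MKL FFTW3 wrappers.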
The following workaround may help if you use a compiler supporting the BIND statement. First, the Fortran program should declare its use of the global structure declared in the file mkl/include/fftw/fftw3_mkl.h, like this:
integer*4 :: ignore, number_of_user_threads
common /fftw3_mkl/ ignore, number_of_user_threads
bind (c) :: /fftw3_mkl/
Second, the number of threads that will share an FFTW plan concurrently should be set *before* the plan is created with any of the *fftw_plan_* functions:
number_of_user_threads = nthreads
call dfftw_plan_dft(fft, ...)
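Putting the pieces above together, a complete program might look like the sketch below. Note the assumptions: the common block is taken to mirror the leading fields of the fftw3_mkl structure in mkl/include/fftw/fftw3_mkl.h (with number_of_user_threads as the second integer*4 field), and the plan-creation call is only indicative.

```fortran
! Hedged sketch of the full workaround: bind a Fortran common block to the
! global fftw3_mkl structure and set number_of_user_threads *before*
! creating the plan. Field layout is assumed, not verified here.
program fftw3_mkl_workaround
   use omp_lib
   implicit none
   integer*4 :: ignore, number_of_user_threads
   common /fftw3_mkl/ ignore, number_of_user_threads
   bind (c) :: /fftw3_mkl/
   integer*8 :: plan
   integer   :: nthreads

   nthreads = omp_get_max_threads()
   ! This assignment must precede any dfftw_plan_* call
   number_of_user_threads = nthreads

   ! ... create the plan, e.g. (illustrative only):
   ! call dfftw_plan_dft_1d(plan, n, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
   ! ... then execute it concurrently from an OpenMP parallel region
end program fftw3_mkl_workaround
```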
Please let me know if this helps.
Thanks for the response. The workaround solves the immediate problem. It remains that an FFTW user who counts on the thread-safety of the fftw_execute function and reads the statement from the MKL documentation
The only change required to use Intel MKL through the FFTW3 wrappers is to link your application using FFTW3 against Intel MKL
is in for a surprise. Deeper in the docs I do see the thread safety caveat, though it lacks instructions for Fortran.
In my case I was actually totally ignorant of the MKL FFTW interface, but I happened to put MKL before FFTW on my link line and ended up spending a lot of time trying to figure out why my program was crashing. Eventually I realised I was using MKL and simply changed my link order. It would be great if the MKL FFTW interface matched both the syntax and the semantics of FFTW.
I was wondering if you could give me some feedback on progress with this issue? We have implemented the workaround detailed above but see the following warning when compiling for MIC:
x86_64-k1om-linux-ld: Warning: alignment 8 of symbol `fftw3_mkl' in /local/software/intel/2013.2.146/composer_xe_2013.2.146/mkl/lib/mic/libmkl_intel_lp64.so is smaller than 32 in fourier_mod.o
Implementation of the fix is in progress. The warning that the linker issues is annoying, I agree. The application should work nevertheless, because assignment to an integer*4 only needs 4-byte alignment, which both 8 and 32 satisfy. For the Intel compiler, the following should remove this warning, as well as the warning about the differing size of the symbol:
!dec$ attributes align : 8 :: fftw3_mkl
integer*4 :: ignore, number_of_user_threads, ignore2
common /fftw3_mkl/ ignore, number_of_user_threads, ignore2
bind (c) :: /fftw3_mkl/
I thought I would put this here rather than create a new thread:
I've reached the point where I am trying to call threaded FFTs from multiple OpenMP threads; unfortunately, it is not working. Each level of parallelism works on its own and gives a noticeable increase in performance. When the two are combined, however, the "normal" OpenMP code nests as expected, but the FFTs appear to execute sequentially, giving performance similar to the code when only the upper level of parallelism is used.
I am creating a stripped down code at the moment and will post it once finished if necessary.
To enable nested OpenMP parallelism in MKL, the environment variable MKL_DYNAMIC should be set to FALSE (or, equivalently, a call to mkl_set_dynamic(0) should be made); otherwise MKL will not go parallel internally when called from an OpenMP parallel region. The OpenMP runtime itself must also be allowed to nest parallelism (OMP_NESTED=true or omp_set_nested(1)).
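The two settings described above can also be made from the code itself, near the start of the program. A minimal sketch, using the standard MKL service routine and the OpenMP runtime routine (in Fortran, omp_set_nested takes a logical rather than the integer shown in the C-style form above):

```fortran
use omp_lib
! Allow MKL to spawn threads even when called from an omp parallel region
call mkl_set_dynamic(0)
! Allow the OpenMP runtime to create nested parallel regions
call omp_set_nested(.true.)
```

Doing this in code rather than via environment variables keeps the behaviour consistent regardless of how the job is launched.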
If that does not work, please give more details.
Do you know if there is a particular reason this wouldn't work on a Xeon Phi? I haven't looked at it in great detail yet but an initial test with a native Phi binary didn't work. I've been told by a colleague that parallel FFTs on the Phi are not currently supported so I just wanted to quickly confirm this before pushing any further.
All the best,
A priori your code should work just fine on Xeon Phi. MKL supports parallel FFTs on Xeon Phi :)
Should you have further questions, please provide us with a build/link line and sample code that behaves unexpectedly.
I'm afraid I haven't been able to get this to work and I'm struggling to create a test case.
I just want to confirm the problem I am seeing: parallel FFTs work on the Phi when called from a serial region, but not from a parallel region, even when the advice given above is followed. The code does work as expected on a normal processor, although not on all machines.
I will continue to work on a test case but any other advice you may have in the meantime will be gratefully received.
All the best,
The MKL FFTW3 wrappers are built from the source code in mkl/interfaces/fftw3xf on both Xeon and Xeon Phi.
The MKL FFTW3 wrappers are expected to behave identically on Xeon and on Xeon Phi (native execution.)
Dear Evgueni and Dima,
I have managed to see some good scaling in my nested OpenMP code using the following environment variables :D
Essentially it came down to thread affinity. Unfortunately, none of the standard affinity types gives satisfactory performance on a Phi, and I had to resort to using a proclist. As you can see, this ended up quite messy. Would it be possible to request a format such as "n,scattered,m,subcompact" that would automatically generate such affinities?
export OMP_NUM_THREADS=8
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY="VERBOSE,granularity=fine,proclist=[0,1,2,3,4,5,6,7],explicit"
mpirun -np 4 ../../bin/onetep.iridis4.mic onetep_8loop_1data.dat > manual_4mpi_8loop_1data.out

export OMP_NUM_THREADS=16
export KMP_AFFINITY="VERBOSE,granularity=fine,proclist=[0,2,4,6,8,10,12,14,1,3,5,7,9,11,13],explicit"
mpirun -np 4 ../../bin/onetep.iridis4.mic onetep_8loop_2data.dat > manual_4mpi_8loop_2data.out

export OMP_NUM_THREADS=32
export KMP_AFFINITY="VERBOSE,granularity=fine,proclist=[0,4,8,12,16,20,24,28,1,2,3,5,6,7,9,10,11,13,14,15,17,18,19,21,22,23,25,26,27,29,30,31],explicit"
mpirun -np 4 ../../bin/onetep.iridis4.mic onetep_8loop_4data.dat > manual_4mpi_8loop_4data.out
Any comments you may have to improve performance here would be gratefully received.
All the best and thank you for your help,
In theory, the affinity for OMP_NUM_THREADS=8 is equivalent to verbose,compact.
The affinity for OMP_NUM_THREADS=16 lists only 15 CPUs -- this is likely an error.
Such irregular affinities may indicate data alignment issues.
What rank and dimension(s) of the FFTs are you doing?