I am attempting to add an OpenMP layer of parallelism to an MPI code written in Fortran. However, I am not doing this correctly, as the result does not appear to be thread-safe. I can say this with some confidence because adding an !$OMP CRITICAL region around the library call produces a correct answer. I believe the problem lies in the FFTW wrappers.
Having looked at the MKL documentation, I see a mention of "fftw3_mkl.number_of_user_threads". I imagine this would resolve the issue, but I haven't been able to implement it successfully. Does anyone have a working sample code (written in Fortran) that shows the correct use of this object?
To clarify: I am attempting to execute iterations of a loop that contains calls to the FFT routines in parallel; essentially, serial FFTs run in parallel. I realise there is an argument for using parallel FFTs here, but that would require a substantially larger coding effort given the code surrounding the FFT library calls. That will be my next job.
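For concreteness, here is a minimal sketch of the pattern described: one FFTW plan shared by several OpenMP threads, each executing serial FFTs on its own slice of data. The subroutine name, array names, and sizes are illustrative only, not taken from the actual code.

```fortran
! Hypothetical sketch of "serial FFTs in parallel": a shared plan applied
! concurrently to disjoint arrays by each OpenMP thread.
subroutine batched_ffts(n, nbatch, in, out, plan)
   implicit none
   integer, intent(in)       :: n, nbatch
   integer*8, intent(in)     :: plan                 ! FFTW plan handle
   complex*16, intent(inout) :: in(n, nbatch), out(n, nbatch)
   integer :: i
!$omp parallel do private(i)
   do i = 1, nbatch
      ! dfftw_execute_dft is the "new-array execute" form of the legacy
      ! Fortran interface, so one plan can be applied to different arrays
      call dfftw_execute_dft(plan, in(:, i), out(:, i))
   end do
!$omp end parallel do
end subroutine batched_ffts
```

In genuine FFTW this pattern is thread-safe; the thread raises the question of whether the same holds for the MKL FFTW3 wrappers.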
The following workaround may help if you use a compiler supporting the BIND statement. First, the Fortran program should declare its use of the global structure declared in the file mkl/include/fftw/fftw3_mkl.h, like this:
integer*4 :: ignore, number_of_user_threads
common /fftw3_mkl/ ignore, number_of_user_threads
bind (c) :: /fftw3_mkl/
Second, the number of threads that will share an FFTW plan concurrently should be set *before* the plan is created with any of the *fftw_plan_* functions:
number_of_user_threads = nthreads
call dfftw_plan_dft(fft, ...)
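Putting the pieces above together, a complete program might look like the sketch below. Note the assumptions: the common block is taken to mirror the leading fields of the fftw3_mkl structure in mkl/include/fftw/fftw3_mkl.h (with number_of_user_threads as the second integer*4 field), and the plan-creation call is only indicative.

```fortran
! Hedged sketch of the full workaround: bind a Fortran common block to the
! global fftw3_mkl structure and set number_of_user_threads *before*
! creating the plan. Field layout is assumed, not verified here.
program fftw3_mkl_workaround
   use omp_lib
   implicit none
   integer*4 :: ignore, number_of_user_threads
   common /fftw3_mkl/ ignore, number_of_user_threads
   bind (c) :: /fftw3_mkl/
   integer*8 :: plan
   integer   :: nthreads

   nthreads = omp_get_max_threads()
   ! This assignment must precede any dfftw_plan_* call
   number_of_user_threads = nthreads

   ! ... create the plan, e.g. (illustrative only):
   ! call dfftw_plan_dft_1d(plan, n, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
   ! ... then execute it concurrently from an OpenMP parallel region
end program fftw3_mkl_workaround
```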
Please let me know if this helps.
Thanks for the response. The workaround solves the immediate problem. It remains that an FFTW user who counts on the thread-safety of the fftw_execute function and reads the statement from the MKL documentation
The only change required to use Intel MKL through the FFTW3 wrappers is to link your application using FFTW3 against Intel MKL
is in for a surprise. Deeper in the docs I do see the thread safety caveat, though it lacks instructions for Fortran.
In my case I was actually totally ignorant of the MKL FFTW interface, but I happened to put MKL before FFTW on my link line and ended up spending a lot of time trying to figure out why my program was crashing. Eventually I realised I was using MKL and simply changed my link order. It would be great if the MKL FFTW interface matched both the syntax and the semantics of FFTW.
I was wondering if you could give me some feedback on progress with this issue? We have implemented the workaround detailed above but see the following warning when compiling for MIC:
x86_64-k1om-linux-ld: Warning: alignment 8 of symbol `fftw3_mkl' in /local/software/intel/2013.2.146/composer_xe_2013.2.146/mkl/lib/mic/libmkl_intel_lp64.so is smaller than 32 in fourier_mod.o
Implementation of the fix is in progress. The warning that the linker issues is annoying, I agree. The application should work nevertheless, because assignment to an integer*4 only needs 4-byte alignment, which both 8 and 32 satisfy. For the Intel compiler, the following should remove this warning, as well as the warning about the differing size of the symbol:
!dec$ attributes align : 8 :: fftw3_mkl
integer*4 :: ignore, number_of_user_threads, ignore2
common /fftw3_mkl/ ignore, number_of_user_threads, ignore2
bind (c) :: /fftw3_mkl/
I thought I would put this here rather than create a new thread:
I've reached the point where I am trying to call threaded FFTs from multiple OpenMP threads; unfortunately, it is not working. Each level of parallelism works on its own and gives a noticeable increase in performance. When the two are combined, however, the "normal" OpenMP code nests as expected, but the FFTs appear to execute sequentially, giving performance similar to the code when only the upper level of parallelism is used.
I am creating a stripped down code at the moment and will post it once finished if necessary.
To enable nested OpenMP parallelism in MKL, the environment variable MKL_DYNAMIC should be set to FALSE (or, equivalently, a call to mkl_set_dynamic(0) should be made); otherwise MKL will not go parallel internally when called from an OpenMP parallel region. The OpenMP runtime itself must also be allowed to nest parallelism (OMP_NESTED=true or omp_set_nested(1)).
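The two settings described above can also be made from the code itself, near the start of the program. A minimal sketch, using the standard MKL service routine and the OpenMP runtime routine (in Fortran, omp_set_nested takes a logical rather than the integer shown in the C-style form above):

```fortran
use omp_lib
! Allow MKL to spawn threads even when called from an omp parallel region
call mkl_set_dynamic(0)
! Allow the OpenMP runtime to create nested parallel regions
call omp_set_nested(.true.)
```

Doing this in code rather than via environment variables keeps the behaviour consistent regardless of how the job is launched.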
If that does not work, please give more details.
Do you know if there is a particular reason this wouldn't work on a Xeon Phi? I haven't looked at it in great detail yet but an initial test with a native Phi binary didn't work. I've been told by a colleague that parallel FFTs on the Phi are not currently supported so I just wanted to quickly confirm this before pushing any further.
All the best,
A priori your code should work just fine on Xeon Phi. MKL supports parallel FFTs on Xeon Phi :)
Should you have further questions, please provide us with a build/link line and sample code that behaves unexpectedly.
I'm afraid I haven't been able to get this to work and I'm struggling to create a test case.
I just want to confirm the problem I am seeing: parallel FFTs work on the Phi when called from a serial region, but not from a parallel region, even when the advice given above is followed. The code does work as expected on a normal processor, although not on all machines.
I will continue to work on a test case but any other advice you may have in the meantime will be gratefully received.
All the best,
The MKL FFTW3 wrappers are built from the source code in mkl/interfaces/fftw3xf on both Xeon and Xeon Phi.
The MKL FFTW3 wrappers are expected to behave identically on Xeon and on Xeon Phi (native execution.)
Dear Evgueni and Dima,
I have managed to see some good scaling in my nested OpenMP code using the following environment variables :D
Essentially it came down to thread affinity. Unfortunately, none of the standard affinity types gives satisfactory performance on a Phi, and I had to resort to using a proclist. As you can see, this ended up quite messy. Would it be possible to request a format such as "n,scattered,m,subcompact" that would automatically generate such affinities?
export OMP_NUM_THREADS=8
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY="VERBOSE,granularity=fine,proclist=[0,1,2,3,4,5,6,7],explicit"
mpirun -np 4 ../../bin/onetep.iridis4.mic onetep_8loop_1data.dat > manual_4mpi_8loop_1data.out

export OMP_NUM_THREADS=16
export KMP_AFFINITY="VERBOSE,granularity=fine,proclist=[0,2,4,6,8,10,12,14,1,3,5,7,9,11,13],explicit"
mpirun -np 4 ../../bin/onetep.iridis4.mic onetep_8loop_2data.dat > manual_4mpi_8loop_2data.out

export OMP_NUM_THREADS=32
export KMP_AFFINITY="VERBOSE,granularity=fine,proclist=[0,4,8,12,16,20,24,28,1,2,3,5,6,7,9,10,11,13,14,15,17,18,19,21,22,23,25,26,27,29,30,31],explicit"
mpirun -np 4 ../../bin/onetep.iridis4.mic onetep_8loop_4data.dat > manual_4mpi_8loop_4data.out
Any comments you may have to improve performance here would be gratefully received.
All the best and thank you for your help,
In theory, the affinity for OMP_NUM_THREADS=8 is equivalent to verbose,compact.
The affinity for OMP_NUM_THREADS=16 lists only 15 CPUs -- this is likely an error.
Such irregular affinities may indicate data alignment issues.
What rank and dimension(s) of the FFTs are you doing?