When MKL's distributed (cluster) DFT library is called repeatedly through the FFTW3 interface library, it fails with a floating-point exception for certain combinations of grid size and MPI process count (e.g., a 1024 x 256 x 1 grid running with 17 MPI processes). This is repeatable, and I have uploaded an example code that demonstrates the problem. I am compiling with the "Composer XE 2015" tools (MKL), e.g.:
[jshaw@cl4n074]: make fft_mpi
echo "Building ... fft_mpi"
Building ... fft_mpi
g++ -I/home/jshaw/XCuda-2.4.0/inc -DNDEBUG -O2 -I/home/jshaw/XCuda-2.4.0/inc -I/opt/lic/intel13/impi/4.1.0.024/include64 -I/opt/lic/intel15/composer_xe_2015.1.133/mkl/include -I/opt/lic/intel15/composer_xe_2015.1.133/mkl/include/fftw fft_mpi.cpp -o fft_mpi -L/home/jshaw/XCuda-2.4.0/lib/x86_64-CentOS-6.5 -lXCuda -L/home/jshaw/XCuda-2.4.0/lib/x86_64-CentOS-6.5 -lXCut -L/opt/lic/intel13/impi/4.1.0.024/lib64 -lmpi -L/home/jshaw/fftw-libs-1.1/x86_64-CentOS-6.5/lib -lfftw3x_cdft_lp64 -L/opt/lic/intel15/composer_xe_2015.1.133/mkl/lib/intel64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_lp64 -lpthread -lm -ldl -lirc
[jshaw@cl4n074]: mpirun -np 17 fft_mpi
FFT-MPI for Complex Multi-dimensional Transforms (float)
MPI procs: 17, dim: 1024x256x1, loops: 100 (2 MBytes)
Allocating FFT memory ...
rank: 0 (16592 61 0)
rank: 2 (16592 61 122)
rank: 7 (16592 60 424)
rank: 8 (16592 60 484)
rank: 9 (16592 60 544)
rank: 10 (16592 60 604)
rank: 12 (16592 60 724)
rank: 13 (16592 60 784)
rank: 15 (16592 60 904)
rank: 16 (16592 60 964)
rank: 1 (16592 61 61)
rank: 3 (16592 61 183)
rank: 4 (16592 60 244)
rank: 5 (16592 60 304)
rank: 6 (16592 60 364)
rank: 11 (16592 60 664)
rank: 14 (16592 60 844)
loop: <1fnr> <2fnr> <3fnr> <4fnr> <5fnr> <6fnr> <7fnr> <8fnr> <9fnr> <10fnr> <11fnr> <12fnr> <13fAPPLICATION TERMINATED WITH THE EXIT STRING: Floating point exception (signal 8)
Note that the XCuda libraries referenced in the compile/link line are NOT required for this example to work! The code runs correctly with various other grid sizes and numbers of MPI processes. It's not as simple as having too many or too few MPI processes, or the FFT grid being small or large; I regularly run codes with large 3D grids. The issue is that one cannot pick arbitrary grid sizes or numbers of MPI ranks (as one should be able to). Any help or thoughts on how this can be fixed are appreciated. Is this a legitimate bug in MKL?
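For readers following the log above: the per-rank lines are consistent with a balanced slab split of the first grid dimension, where the first (1024 mod 17) = 4 ranks hold 61 rows and the remaining 13 hold 60. The sketch below reproduces that arithmetic. It assumes the second and third numbers on each "rank:" line are the local row count and the starting row (as an FFTW-style local-size query would report them); the reproducer source is not shown in this thread, so that reading is an assumption, and the names Slab and balanced_slabs are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type: the rows owned by one MPI rank.
struct Slab {
    std::ptrdiff_t local_n0;       // number of rows on this rank
    std::ptrdiff_t local_0_start;  // index of the first row on this rank
};

// Balanced split of n0 rows over nprocs ranks: the first (n0 % nprocs)
// ranks receive one extra row, so row counts differ by at most one.
std::vector<Slab> balanced_slabs(std::ptrdiff_t n0, int nprocs)
{
    std::vector<Slab> slabs;
    slabs.reserve(static_cast<std::size_t>(nprocs));
    const std::ptrdiff_t base  = n0 / nprocs;
    const std::ptrdiff_t extra = n0 % nprocs;
    std::ptrdiff_t start = 0;
    for (int rank = 0; rank < nprocs; ++rank) {
        const std::ptrdiff_t rows = base + (rank < extra ? 1 : 0);
        slabs.push_back(Slab{rows, start});
        start += rows;
    }
    return slabs;
}
```

Calling balanced_slabs(1024, 17) gives 61 rows starting at 0 for rank 0, 60 rows starting at 244 for rank 4, and 60 rows starting at 964 for rank 16, which agrees with the log.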
Thank you for the clear reproducer. I was able to reproduce the problem.
Yes... it looks like an inconsistency in MKL CFFT for in-place 2D+ transforms. We need more time to investigate exactly where the problem occurs.
Anyway, you can work around the problem in two ways (both rely on the same idea: changing the algorithm used inside CFFT):
- use an out-of-place transform
- set a workspace for CFFT (unfortunately, this means the MKL FFTW3 MPI wrappers have to be modified). In this case CFFT still does an in-place transform, but with a buffer of the same size (supplied on the user side). The problem disappears in this case too.
BTW, I would recommend using out-of-place CFFT since it performs much better. The thing is that when you ask CFFT for an in-place transform, it really does one, i.e. it uses only a constant amount of additional memory and hence requires a lot of small MPI communications. This really slows down Cluster FFT.
I verified that using out-of-place FFT transforms works. I found little difference in the run times, but this was on a shared-memory machine. It may be different on a distributed-memory cluster. Thanks.