MKL's (distributed) FFT library fails with a floating-point error

John_G__S_ · 02-26-2015

When MKL's distributed (cluster) DFT library is called repeatedly through the FFTW3 interface library, it fails with a floating-point error for certain combinations of grid sizes and MPI process counts (e.g., a 1024 x 256 x 1 grid running with 17 MPI processes). This is repeatable, and I have uploaded example code that demonstrates the problem. I am compiling with the "Composer XE 2015" tools (MKL), e.g.:

[jshaw@cl4n074]: make fft_mpi
echo "Building ... fft_mpi"
Building ... fft_mpi
g++ -I/home/jshaw/XCuda-2.4.0/inc -DNDEBUG -O2 -I/home/jshaw/XCuda-2.4.0/inc -I/opt/lic/intel13/impi/4.1.0.024/include64 -I/opt/lic/intel15/composer_xe_2015.1.133/mkl/include -I/opt/lic/intel15/composer_xe_2015.1.133/mkl/include/fftw fft_mpi.cpp -o fft_mpi -L/home/jshaw/XCuda-2.4.0/lib/x86_64-CentOS-6.5 -lXCuda -L/home/jshaw/XCuda-2.4.0/lib/x86_64-CentOS-6.5 -lXCut -L/opt/lic/intel13/impi/4.1.0.024/lib64 -lmpi -L/home/jshaw/fftw-libs-1.1/x86_64-CentOS-6.5/lib -lfftw3x_cdft_lp64 -L/opt/lic/intel15/composer_xe_2015.1.133/mkl/lib/intel64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_lp64 -lpthread -lm -ldl -lirc

[jshaw@cl4n074]: mpirun -np 17 fft_mpi

FFT-MPI for Complex Multi-dimensional Transforms (float)
MPI procs: 17, dim: 1024x256x1, loops: 100 (2 MBytes)

Allocating FFT memory ...
rank:   0 (16592     61      0)
rank:   2 (16592     61    122)
rank:   7 (16592     60    424)
rank:   8 (16592     60    484)
rank:   9 (16592     60    544)
rank: 10 (16592     60    604)
rank: 12 (16592     60    724)
rank: 13 (16592     60    784)
rank: 15 (16592     60    904)
rank: 16 (16592     60    964)
rank:   1 (16592     61     61)
rank:   3 (16592     61    183)
rank:   4 (16592     60    244)
rank:   5 (16592     60    304)
rank:   6 (16592     60    364)
rank: 11 (16592     60    664)
rank: 14 (16592     60    844)
Initializing ...
Planning ...
loop: <1fnr> <2fnr> <3fnr> <4fnr> <5fnr> <6fnr> <7fnr> <8fnr> <9fnr> <10fnr> <11fnr> <12fnr> <13fAPPLICATION TERMINATED WITH THE EXIT STRING: Floating point exception (signal 8)
[jshaw@cl4n074]:

Note that the XCuda libraries referenced in the compile/link line are NOT required for this example to work! The code runs correctly with various other grid sizes and numbers of MPI processes. It's not as simple as having too many or too few MPI processes, or the FFT grid being small or large. I regularly run codes with large 3D grids. The issue is that one cannot pick arbitrary grid sizes or MPI process counts (as one should be able to). Any help or thoughts on how this can be fixed are appreciated. Is this a legitimate bug in MKL?
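The per-rank numbers in the log above are consistent with a plain block split of the 1024 rows: 1024 = 17 x 60 + 4, so four ranks own 61 rows and the other thirteen own 60. The sketch below reproduces those numbers. It assumes the second and third columns of the log are the local row count and the starting row, and that the first ranks take the extra rows; the function name block_distribution is hypothetical, and this is not MKL's own code.

```python
def block_distribution(n0, nprocs):
    """Return a (local_n0, local_0_start) pair per rank for a row-wise slab split.

    Assumption: the first n0 % nprocs ranks take one extra row. This matches
    the log in this post, but is not confirmed to be MKL's internal rule.
    """
    base, extra = divmod(n0, nprocs)
    slabs = []
    start = 0
    for rank in range(nprocs):
        rows = base + (1 if rank < extra else 0)
        slabs.append((rows, start))
        start += rows
    return slabs


if __name__ == "__main__":
    # Same layout as the "rank:" lines in the log, minus the first column.
    for rank, (rows, start) in enumerate(block_distribution(1024, 17)):
        print(f"rank: {rank:3d} ({rows:6d} {start:6d})")
```

Running it for other grid sizes and process counts is a quick way to see which combinations split unevenly before launching the real job.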

Evarist_F_Intel · 02-27-2015

Hi John,

Thank you for the clear reproducer. I was able to reproduce the problem.

Yes... it looks like an inconsistency in MKL CFFT (Cluster FFT) for in-place 2D+ transforms. It will take more time to investigate exactly where the problem occurs.

Anyway, you can work around the problem in two ways (both rest on the same idea: changing the algorithm used inside CFFT):

1. Try using an out-of-place transform.
2. Try setting a workspace for CFFT (unfortunately, this means the MKL FFTW3 MPI wrappers would have to be modified). In this case CFFT still does an in-place transform, but with a buffer of the same size supplied on the user side. The problem disappears in this case too.

BTW, I would recommend using out-of-place CFFT, since it performs much better. When you ask CFFT for an in-place transform, it really does one, i.e. it uses only a constant amount of additional memory, and hence it requires a lot of small MPI communications. This really slows down Cluster FFT.
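The price of going out-of-place is a second data buffer on every rank. As a rough sketch of what that costs for the reproducer, assuming the first column of the log (16592) is the per-rank allocation counted in single-precision complex elements (the helper name buffer_bytes is hypothetical):

```python
COMPLEX_FLOAT_BYTES = 8  # two 4-byte floats: real and imaginary parts


def buffer_bytes(alloc_elements, out_of_place):
    """Bytes of FFT data buffers held by one rank.

    In-place uses one buffer; out-of-place uses separate input and output
    buffers of the same size.
    """
    n_buffers = 2 if out_of_place else 1
    return n_buffers * alloc_elements * COMPLEX_FLOAT_BYTES


if __name__ == "__main__":
    alloc = 16592  # assumed: complex elements allocated per rank, from the log
    print(1024 * 256 * COMPLEX_FLOAT_BYTES)  # whole grid: 2097152 bytes (2 MBytes)
    print(buffer_bytes(alloc, out_of_place=False))  # 132736
    print(buffer_bytes(alloc, out_of_place=True))   # 265472
```

For a grid of this size the extra buffer is small; for large 3D grids the doubling is worth checking against the memory available per node.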

John_G__S_ · 03-04-2015

I verified that using out-of-place FFT transforms works. I found little difference in the run times, but this was on a shared-memory machine. It may be different on a distributed-memory cluster. Thanks.