Poor scaling for real-to-real FFT with OpenMP

Jyotirmoy_B_ · 05-13-2017

In the attached file I use MKL to compute a real-to-real FFT using OpenMP for multithreading.

The code is compiled with

icpc -o bench-fft -Wall -O3 -g -march=native -fopenmp bench-fft.cxx -mkl

The machine has 4 cores.

It seems that the code does not scale well with the number of threads.

When run with

OMP_NUM_THREADS=1 ./bench-fft 4194304

the total time taken is 0.1640 user, 0.0440 sys while with

OMP_NUM_THREADS=2 ./bench-fft 4194304

the total time taken is 0.3000 user, 0.0560 sys. The total CPU time almost doubles when going from one thread to two, so there seems to be a large synchronization overhead.

Is this to be expected, or am I doing something wrong in my code?

SergeyKostrov · 05-15-2017

>>...Is this to be expected or am I doing something wrong in my code.

Try setting KMP_AFFINITY to scatter or compact, and use more OpenMP threads. On Linux, use the htop utility to verify how the threads are pinned to cores.
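The suggestion above could be tried with a small script along these lines. This is only a sketch: `./bench-fft` and the size 4194304 are taken from the original post, while the helper name `run_pinned`, the `BENCH` and `SIZE` override variables, and the thread counts 1, 2 and 4 are assumptions made for illustration. It uses the shell's `time` so that wall-clock ("real") time is printed next to user and sys time. User time is summed over all threads, so it can grow with the thread count even when the run finishes sooner.

```shell
#!/usr/bin/env bash
# Sketch: run the benchmark with pinned threads at several thread counts.
# Assumed names: BENCH, SIZE and run_pinned are illustrative, not from the post.

bench=${BENCH:-./bench-fft}   # benchmark binary from the original post
size=${SIZE:-4194304}         # transform size from the original post

# run_pinned THREADS AFFINITY COMMAND [ARGS...]
# Runs COMMAND with OMP_NUM_THREADS and KMP_AFFINITY set for that one command.
run_pinned() {
    threads=$1
    affinity=$2
    shift 2
    OMP_NUM_THREADS=$threads KMP_AFFINITY=$affinity "$@"
}

if [ -x "$bench" ]; then
    for affinity in scatter compact; do
        for threads in 1 2 4; do
            echo "threads=$threads KMP_AFFINITY=$affinity"
            # `time` reports real (wall-clock), user and sys times on stderr
            time run_pinned "$threads" "$affinity" "$bench" "$size"
        done
    done
else
    echo "$bench not found; build it first"
fi
```

If the "real" time drops as threads are added while user time rises, the FFT is scaling and the extra CPU time is overhead spread across cores. If the "real" time does not drop, pinning or the problem size is the next thing to look at.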