In the attached file I use MKL to compute a real-to-real FFT using OpenMP for multithreading.
The code is compiled with
icpc -o bench-fft -Wall -O3 -g -march=native -fopenmp bench-fft.cxx -mkl
The machine has 4 cores.
It seems that the code does not scale well with the number of threads.
When run with
OMP_NUM_THREADS=1 ./bench-fft 4194304
the total time taken is 0.1640 user, 0.0440 sys while with
OMP_NUM_THREADS=2 ./bench-fft 4194304
the total time taken is 0.3000 user, 0.0560 sys. The total CPU time almost doubles when going from one thread to two, which suggests a large synchronization overhead.
Is this to be expected, or am I doing something wrong in my code?
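One caveat on the numbers above: user and sys times are summed over all threads, so they show how much CPU work was done, not how long the run took. To see whether the transform itself gets faster, the elapsed time would have to be measured alongside the CPU time. Below is a minimal sketch of how that could be done. It is not the code from the attached file: `Timing` and `time_region` are hypothetical names, and it assumes a POSIX system, where `std::clock` reports processor time for the whole process across all its threads.

```cpp
#include <chrono>
#include <ctime>
#include <utility>

// Hypothetical helper types, not taken from bench-fft.cxx.
struct Timing {
    double wall_seconds;  // elapsed real time for the region
    double cpu_seconds;   // processor time used by the process (all threads, on POSIX)
};

// Runs `work` once and reports both elapsed time and CPU time.
// With perfect scaling on 2 threads, wall_seconds would halve while
// cpu_seconds stays roughly constant.
template <typename F>
Timing time_region(F&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    std::forward<F>(work)();  // e.g. the call that executes the FFT

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Wrapping only the FFT call in `time_region`, and leaving allocation and initialization outside it, would show whether the extra CPU time comes from the transform or from the rest of the program.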