Hi there. I have some questions about linking against the MKL FFT libraries on a cluster. I wrote my code using the open-source FFTW, but on the cluster I am trying to use the MKL FFT routines instead. I have been looking at the examples in the dftf folder and have tried to adapt them to my code.
I would like to clarify some things.
First, is what the MKL examples call "hand" (the descriptor handle) the same thing that classic FFTW calls a plan?
Since I am going to make intensive use of this FFT, I set the descriptor values and commit the descriptors once, in a separate subroutine, and share the "hand" pointers through a common block with the routine where I actually make the DftiComputeForward and DftiComputeBackward calls. Is this correct?
The problem is that the FFT calculations on the cluster don't seem to run in the time I would expect from running the same transforms with the open-source FFTW on my desktop computer. On my desktop I don't have the Intel compiler, so I use gfortran there, which means the comparison is between different systems and different compilers. It is hard to say on an absolute scale which one performs better, but what I do see is that on my desktop the FFTW run time scales as expected, as N log_2(N), and that is not what I obtain when I run the code on the cluster.
My understanding is that the Intel compiler and MKL should perform better than the open-source versions, so I would like to know what could be the root of this loss of performance. It is possible that I am not using the MKL routines properly, since I am still learning how to use them. It is also possible that I am not compiling the code the way I should to get the optimized performance I expected.
Since I did not know how to compile the code, I looked at how the makefile provided with Intel's examples builds them. I also tried adding my code to the list file provided with the examples and building it with that makefile. That worked, but I wonder whether something in the way the code is being compiled is causing the loss of performance.
Right now I am compiling with something like:
mpiifort -module _results/intel_lp64_parallel_iomp5_libintel64 -I/opt/ohpc/pub/compiler/intel/compilers_and_libraries_2020.2.254/linux/mkl/2021.1.1/include -fpp -qopenmp \
/opt/ohpc/pub/compiler/intel/compilers_and_libraries_2020.2.254/linux/mkl/2021.1.1/lib/intel64/libmkl_intel_lp64.a -Wl,--start-group /opt/ohpc/pub/compiler/intel/compilers_and_libraries_2020.2.254/linux/mkl/2021.1.1/lib/intel64/libmkl_intel_thread.a /opt/ohpc/pub/compiler/intel/compilers_and_libraries_2020.2.254/linux/mkl/2021.1.1/lib/intel64/libmkl_core.a -Wl,--end-group \
-L/opt/ohpc/pub/compiler/intel/compilers_and_libraries_2020.2.254/linux/mkl/2021.1.1/../compiler/lib/intel64 -liomp5 -lpthread -lm -ldl -o _results/intel_lp64_parallel_iomp5_libintel64/mycode.x -mkl
Is there a way to do this more concisely? I am also using MPI; I don't know whether that matters.
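To make the question concrete, this is roughly the kind of shorter command I am hoping for. It is only a sketch and I have not checked that it reproduces the link line above. It assumes that MKLROOT is set by the cluster's Intel environment (the fallback path below is a made-up placeholder), that my source file is called mycode.f90 (also a placeholder), and that the compiler's -mkl=parallel option can stand in for the explicit list of threaded MKL libraries. The script only assembles and prints the command, and the actual build is left commented out.

```shell
#!/bin/sh
# Sketch of a shorter build line. Nothing here is verified against the
# long command above.

# MKLROOT is normally exported by the Intel environment script or module;
# the fallback path is a hypothetical placeholder.
MKLROOT="${MKLROOT:-/path/to/mkl}"

SRC="mycode.f90"   # placeholder name for the source file
OUT="mycode.x"     # same executable name as in the long command

# -mkl=parallel asks the Intel compiler driver to link the threaded MKL
# libraries, in place of listing libmkl_intel_lp64, libmkl_intel_thread,
# libmkl_core and iomp5 by hand (assumption: same libraries, not checked).
BUILD_CMD="mpiifort -fpp -qopenmp -mkl=parallel -I${MKLROOT}/include ${SRC} -o ${OUT}"

# Print the command so it can be inspected before running it.
echo "${BUILD_CMD}"

# Uncomment on the cluster to actually build:
# eval "${BUILD_CMD}"
```

If I understand the option correctly, -mkl=sequential would select the non-threaded MKL instead, which might be the safer choice if each MPI rank should run its FFTs on a single thread.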