Dear all,

Adding -mkl to my version of ifort (details below) causes problems. That is why I read through the docs and used the Intel link line advisor, configured for an MPI DFT over a computing cluster, to come up with the following line:

-lmkl_cdft_core -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_lp64 -lpthread -lm

I am curious, though: is `mkl_sequential' what I want when working with MPI fast Fourier transforms? Or should I use the threaded libraries (`-lmkl_intel_thread')? If I should use threads, do I need to set the number of threads manually? To be clear, I have a computing cluster consisting of many 8-core processors, so this is definitely not an OpenMP use case, at least not externally.

Additionally, if I pass `-align array8byte' to the compiler, will this automagically align all arrays, including those in `type' definitions and any dynamically `allocate'-d ones, on 64 bit boundaries? If not, I guess I need to use the provided `malloc' functions to make the most of the MKL, right?

COMPILER, MPI, MKL:
ifort: 13.1.3 (used as mpif90)
mkl: 11.0u5
MPI system: Intel MPI, 4.1.0.024

Thanks in advance!
If you are specifying the "-mkl" option for ifort, then you do not need "-lmkl_cdft_core -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_lp64 -lpthread -lm". Your compile/link command line simply looks like:
ifort -mkl myprog.f ...
The "-mkl" option is a shorthand for spelling out all necessary MKL libraries. The compiler figures out the correct way to link and does it. By default, "-mkl" links your code to parallel MKL libraries. If you want to link to sequential MKL libraries then you specify "-mkl=sequential". If you want to link to cluster MKL libraries (for example, if you are calling the "cluster FFT" routines), then you specify "-mkl=cluster".
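As a minimal sketch of the three variants just described, the shell snippet below maps a usage model to the matching "-mkl" flag and prints the resulting compile line. The function name mkl_flag and the labels parallel/sequential/cluster are made up for illustration; only the flags themselves come from the description above. The script prints the command and does not invoke ifort.

```shell
#!/bin/sh
# mkl_flag is a hypothetical helper, not part of any Intel tool.
# It maps a usage model to the "-mkl" variant described above.
mkl_flag() {
  case "$1" in
    parallel)   echo "-mkl" ;;             # default: threaded MKL
    sequential) echo "-mkl=sequential" ;;  # no threading inside MKL
    cluster)    echo "-mkl=cluster" ;;     # e.g. cluster FFT routines
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the compile line for a program calling cluster FFT.
echo "ifort $(mkl_flag cluster) myprog.f"
```

Running it prints "ifort -mkl=cluster myprog.f".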
On a single compute node, using parallel MKL libraries typically gives better performance because multiple threads are spawned to use the multiple cores. By default MKL decides how many threads to spawn, but you can control this by setting the environment variable OMP_NUM_THREADS or MKL_NUM_THREADS. There are also API controls for the same purpose that programmers can call from within their code. On a cluster of multicore processors (this is the environment you have), it makes sense to use MPI to distribute the computation across all nodes, and on each node use OpenMP to take advantage of the multiple cores.
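A rough sketch of setting the thread count by hand, assuming one 8-core processor per node and one MPI process per node (both numbers are assumptions to adjust for your cluster). When both variables are set, MKL uses MKL_NUM_THREADS for its own routines.

```shell
#!/bin/sh
# Assumed layout; change these two values to match your nodes.
cores_per_node=8   # assumption: one 8-core processor per node
ranks_per_node=1   # assumption: one MPI process per node

# Give each MPI process an equal share of the cores on its node.
threads_per_rank=$((cores_per_node / ranks_per_node))

# OMP_NUM_THREADS applies to all OpenMP code in the process;
# MKL_NUM_THREADS applies to MKL only.
export OMP_NUM_THREADS="$threads_per_rank"
export MKL_NUM_THREADS="$threads_per_rank"

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS MKL_NUM_THREADS=$MKL_NUM_THREADS"
```

With ranks_per_node=8 the same arithmetic gives 1 thread per process, which is the pure MPI case.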
The "-align" option of the Fortran compiler aligns all Fortran arrays except for arrays in COMMON blocks and elements within derived types. However, this is not sufficient for generating aligned vector code for loops. You should also tell the compiler, for each loop, which arrays will be accessed in an aligned fashion by using directives (!dir$ vector aligned, !dir$ assume_aligned). I highly recommend this article on how to align Fortran arrays for vectorization: http://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization.
By the way, you want to align against 64-byte boundaries, not 64-bit boundaries. So, it should be "-align array64byte".
Thank you for the reply.
The vectorisation documentation you linked looks quite useful. Still, the bulk of the computation in my code happens within the MKL routines. Are you telling me that if I pass -align array64byte to ifort, any array (except in derived types) in my code will be correctly aligned and, if I use `!dir$ assume_aligned' on those arrays prior to passing them as arguments to MKL FFT routines, the MKL FFT routines will be able to take advantage of vectorisation? I am asking because somewhere in the MKL docs the "mkl_malloc" functions are mentioned as a tool to align arrays.
My worry when using the threaded model over MPI was that I'd spawn (via mpirun) N processes on N cores and then each of those N processes would spawn 8 threads (8 being the number of cores on each processor), leading to a number of threads larger than the number of physical cores on any given processor. But you are saying I shouldn't worry about such a scenario, and MKL is smart enough to balance the OpenMP and MPI models correctly, right? :-)
Thanks for correcting my usage of bits vs bytes. That reminded me of how vectorisation actually works :-)
For some reason `-mkl' causes linking problems on my system. This is why I'm using the link advisor to pick out the correct libraries.
The "-align" option of the compiler aligns every array (static or dynamically allocated) in your Fortran code. That could be overkill, for example when not all arrays in your code need to be aligned. The "mkl_malloc" functions dynamically allocate arrays and align them. They do not help with static arrays.
MPI and OpenMP can coexist in the same application. They are tools for exploiting parallelism at different levels. But if you've decided to spawn N MPI processes on an N-core node, then you probably don't need multithreading. MKL doesn't have the magic of automatically balancing OpenMP threads and MPI processes. It relies on users to tell it what to do.
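Since the balancing is left to the user, a quick sanity check is to multiply the MPI processes per node by the threads per process and compare against the core count. The sketch below does only that arithmetic; the function names threads_on_node and check_layout are hypothetical, and the 8-core figure is taken from the question.

```shell
#!/bin/sh
# Hypothetical helpers for checking an MPI x OpenMP layout on one node.

# threads_on_node RANKS_PER_NODE THREADS_PER_RANK
threads_on_node() {
  echo $(( $1 * $2 ))
}

# check_layout CORES_PER_NODE RANKS_PER_NODE THREADS_PER_RANK
check_layout() {
  cores=$1; ranks=$2; threads=$3
  total=$(threads_on_node "$ranks" "$threads")
  if [ "$total" -gt "$cores" ]; then
    echo "oversubscribed: $total threads on $cores cores"
  else
    echo "ok: $total threads on $cores cores"
  fi
}

check_layout 8 8 8   # 8 processes, each spawning 8 threads
check_layout 8 8 1   # pure MPI with sequential MKL
check_layout 8 1 8   # one process per node, threaded MKL
```

The first call reports "oversubscribed: 64 threads on 8 cores", which is the scenario described in the question; the other two report "ok: 8 threads on 8 cores".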
If the "-mkl" option doesn't work well in your situation, you can use the link line advisor to get the correct libraries. Based on the usage model you described, you should link with sequential MKL.
Dear Todor K.,
I want to slightly clarify the situation with MKL CDFT, MPI and OpenMP.
Usually people run their MPI application on the whole cluster using one MPI process per physical core (i.e. N*8 processes in your case). This is the so-called pure MPI version. They use the following command line: mpiexec -n 8N -perhost 8 ./mpiapp.out. In this case it is a good idea to link with sequential MKL (i.e. without OpenMP).
There is another scenario that combines MPI and OpenMP (the so-called hybrid version), and it may give better performance for some programs. I recommend exactly this scenario if you use MKL CDFT (although it works quite well in the pure MPI version too). In this case you have to link with parallel MKL to enable OpenMP support. If your cluster consists of single-socket machines, it is a good idea to run N MPI processes with 8 OpenMP threads per process. If the machines are 2-socket, you may try either N MPI processes x 8 OpenMP threads per process or 2*N MPI processes (one per socket) x 4 OpenMP threads per process. E.g., with Intel MPI the command line for the last variant could be the following:
mpiexec -n 2N -perhost 2 -genv OMP_NUM_THREADS 4 -genv KMP_AFFINITY compact -genv I_MPI_PIN_DOMAIN socket ./mpiapp.out
Further tuning parameters can be found in the Intel MPI documentation (such as different I_MPI_PIN_DOMAIN values, etc.).
Also, please do not forget to include -liomp5 in the link line if you use parallel MKL.
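To make the one-process-per-socket hybrid layout concrete, here is a small sketch that fills in the numbers for a given cluster and prints the resulting mpiexec line. The function name hybrid_cmd and its arguments are hypothetical; the mpiexec options are the ones from the command line above, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# hybrid_cmd is a hypothetical helper: it prints an Intel MPI launch
# line for one MPI process per socket, one OpenMP thread per core.
# Usage: hybrid_cmd NODES SOCKETS_PER_NODE CORES_PER_SOCKET
hybrid_cmd() {
  nodes=$1; sockets=$2; cores=$3
  ranks=$((nodes * sockets))   # total MPI processes
  echo "mpiexec -n $ranks -perhost $sockets -genv OMP_NUM_THREADS $cores -genv KMP_AFFINITY compact -genv I_MPI_PIN_DOMAIN socket ./mpiapp.out"
}

# Example: 4 two-socket nodes with 4 cores per socket (assumed values).
hybrid_cmd 4 2 4
```

For those assumed values it prints the command with -n 8, -perhost 2 and OMP_NUM_THREADS 4, i.e. the 2N x 4 variant with N = 4.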