I hope you are doing well. I am using Intel MKL to develop a multi-CPU version of my linear system solver. The setup is as follows:
I have, say, 8 nodes connected via InfiniBand. Each node is fitted with a dual quad-core Xeon. I divide my computation (spmv's, ddots, daxpys) into equal chunks across all these nodes. The algorithm (preconditioned CG) runs on all the nodes, and the nodes have to communicate often inside the iteration loop to update their information and collaborate to arrive at a solution.
My question is this: I use Intel MKL to perform all the computations on each of these nodes. How can I make sure that each node (with 8 cores per node) makes use of all of its cores when running, say, spmv, ddot, daxpy, or even dnrm2?
To run 2 ranks of 4 threads each on every node, with your application linked against mkl_thread, you would set the number of MPI processes to 2x the number of nodes and pass e.g. -env OMP_NUM_THREADS=4 in your mpiexec.hydra command. By default, MKL would automatically restrict itself to 1 thread per core even if you have HT enabled.
If you assign a single MPI process to each node, threaded MKL functions would automatically spread across all the cores, but you would need to pass an appropriate KMP_AFFINITY setting so as to place contiguous OpenMP chunks on each CPU.
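As a rough sketch of the arithmetic behind these two layouts (not a launch script), the following assumes 8 nodes with 8 cores each, as in the question. The function name launch_layout is hypothetical and only illustrates how the rank count and OMP_NUM_THREADS follow from the number of ranks you place on each node.

```python
def launch_layout(nodes, cores_per_node, ranks_per_node):
    """Return (total MPI ranks, OpenMP threads per rank) for a hybrid run."""
    if cores_per_node % ranks_per_node != 0:
        raise ValueError("ranks_per_node must divide cores_per_node evenly")
    total_ranks = nodes * ranks_per_node
    # One thread per physical core, split evenly among the ranks on a node.
    threads_per_rank = cores_per_node // ranks_per_node
    return total_ranks, threads_per_rank


if __name__ == "__main__":
    # Assumed setup from the question: 8 nodes, dual quad-core = 8 cores/node.
    for ranks_per_node in (2, 1):
        ranks, threads = launch_layout(8, 8, ranks_per_node)
        print(f"{ranks_per_node} rank(s) per node: {ranks} MPI ranks, "
              f"OMP_NUM_THREADS={threads}")
```

With 2 ranks per node this gives 16 ranks of 4 threads each. With 1 rank per node it gives 8 ranks of 8 threads each, which is the case where the KMP_AFFINITY setting matters.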
You would have to look up the docs for your version of MKL to see if the level 1 BLAS functions you mentioned are threaded. Internal threading in level 1 BLAS functions is likely to be useful only for large vector lengths, such as where the cores will be operating on distinct 4KB pages of data.
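To put a number on the "distinct 4KB pages" rule of thumb, here is a minimal sketch that counts how many 4 KB pages a vector of doubles spans and how many whole pages each thread would get. It ignores alignment and is only a back-of-the-envelope estimate, not a threshold that MKL itself uses. The name pages_per_thread is hypothetical.

```python
PAGE_BYTES = 4096    # 4 KB page, as in the rule of thumb above
DOUBLE_BYTES = 8     # size of one double-precision value


def pages_per_thread(n_doubles, n_threads):
    """Return (pages spanned by the vector, whole pages per thread)."""
    total_bytes = n_doubles * DOUBLE_BYTES
    # Round up: a partially filled last page still counts as a page.
    total_pages = (total_bytes + PAGE_BYTES - 1) // PAGE_BYTES
    return total_pages, total_pages // n_threads


if __name__ == "__main__":
    # Illustrative lengths only: a short vector and a 2-million-element one.
    for n in (1000, 2_000_000):
        pages, per_thread = pages_per_thread(n, 8)
        print(f"{n} doubles: {pages} pages, {per_thread} per thread")
```

A 1000-element vector spans only 2 pages, so 8 threads cannot each work on their own page. A vector of 2 million doubles spans 3907 pages, roughly 488 per thread.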
If you linked against MKL shared objects, you would have to ensure that these are visible on each node. This is normally required for libiomp5.so even if you link against static MKL objects.
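One quick way to check visibility is to run something like the sketch below on each node. It only scans the directories listed in LD_LIBRARY_PATH, which is an assumption about how your environment is set up; the dynamic loader can also find libraries through other mechanisms (rpath, the system loader cache), so an empty result here is a hint rather than proof. The function name find_on_library_path is hypothetical.

```python
import os


def find_on_library_path(lib_name, search_path=None):
    """Return the directories in search_path that contain lib_name."""
    if search_path is None:
        # Assumption: shared libraries are located via LD_LIBRARY_PATH.
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries produced by leading/trailing separators.
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    found = find_on_library_path("libiomp5.so")
    if found:
        print("libiomp5.so found in:", ", ".join(found))
    else:
        print("libiomp5.so not found on LD_LIBRARY_PATH")
```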
Thanks, timp. I certainly don't have the KMP_AFFINITY parameter set, and I use static linking with MKL, as selected in the link line advisor.
My vector lengths are at least 2 million doubles, so I suppose there will be some improvement. Also, as I mentioned, I use level 2 and 3 routines (spmv), which I suppose will also benefit, but do you have some pointers on predicting what kind of sizes will give an advantage?
Thanks again for your quick response.