I am running experiments on a cluster whose nodes have two 18-core Intel Xeon processors (36 cores per node).
I have two questions regarding the execution times I get from two sets of codes.
First: I wrote two versions of the same code, which uses both MPI and OpenMP. The code is fairly complex, but it makes one call to the LAPACK routine dgesv and one to the CBLAS routine dgemv. The two versions differ only in the headers they include: one uses the reference lapacke.h and cblas.h, while the other uses the MKL headers (mkl.h, mkl_blas.h, mkl_cblas.h, mkl_lapacke.h). As far as I know, the MKL version should be faster because these two functions are threaded with OpenMP (https://software.intel.com/en-us/mkl-linux-developer-guide-openmp-threaded-functions-and-problems).

I tested both codes on the same input data (basically a matrix) and with the same configuration: I usually run on N compute nodes of the cluster, with N tasks (i.e., distributed processes) per node, for a total of N*N distributed processes, and I assign 4 CPUs to each process for multithreading. Here are some results:

- N = 4 (16 distributed processes): code 1 takes 0.52 s, code 2 takes 0.16 s.
- N = 7 (49 distributed processes): code 1 takes 0.5 s, code 2 takes 0.36 s.
- N = 9 (81 distributed processes): code 1 again takes 0.5 s, while code 2 takes 0.12 s.

These execution times are averaged over several runs, and the variation between runs is very small.
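For reference, the only difference between the two builds is the set of headers (and the link line); the call sites are identical. Schematically it looks like this sketch, where USE_MKL is an illustrative compile-time flag, not necessarily how my build is organized:

```c
/* Sketch of how the two versions differ: same source, same LAPACKE/CBLAS
   call signatures, only the included headers change.  USE_MKL is a
   hypothetical flag for illustration. */
#ifdef USE_MKL
#include <mkl.h>
#include <mkl_blas.h>
#include <mkl_cblas.h>
#include <mkl_lapacke.h>
#else
#include <lapacke.h>
#include <cblas.h>
#endif

/* ... elsewhere, identical in both versions:
   LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, nrhs, a, lda, ipiv, b, ldb);
   cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n, 1.0, a, lda,
               x, 1, 0.0, y, 1);
*/
```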
I was expecting the MKL version to be faster, but I am surprised to see such a large difference. How would you explain it? Execution times are measured with MPI_Wtime().
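For completeness, the timing is taken roughly as in this simplified sketch (not my exact code; the dummy solver_section stands for the dgesv/dgemv part, and the barrier plus MPI_MAX reduction are just one reasonable way to aggregate per-rank times):

```c
/* Simplified timing sketch: time the solver section on every rank and
   reduce to the maximum, so the reported time is the slowest rank's
   wall-clock time. */
#include <mpi.h>
#include <stdio.h>

/* Placeholder: in the real code this contains the dgesv/dgemv calls. */
static void solver_section(void) { }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
    double t0 = MPI_Wtime();
    solver_section();
    double elapsed = MPI_Wtime() - t0;

    double tmax;
    MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("elapsed (max over ranks): %f s\n", tmax);

    MPI_Finalize();
    return 0;
}
```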
Second: I am using the MKL function cluster_sparse_solver, which integrates the PARDISO routines. Again, execution time is measured with MPI_Wtime(). I ran experiments with the same configuration as described above (for N*N distributed processes, I reserve N compute nodes of the cluster, with N tasks per node and 4 CPUs per task). Then I read the notes more carefully (page 1741 of the MKL reference manual for C), where, speaking of this function, it is stated that "A hybrid implementation combines Message Passing Interface (MPI) technology for data exchange between parallel tasks (processes) running on different nodes, and OpenMP* technology for parallelism inside each node of the cluster."

Does this mean that the configuration I used is not valid? I ran more experiments with N*N compute nodes and 1 task per node (still 4 CPUs per task), and they were much faster than the corresponding runs with the previous configuration. This can only partially be explained by the greater amount of memory available when a whole node is reserved for a single process, since the input data is not that big.
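To double-check what the scheduler actually grants me in the two configurations, I can run a small placement check along these lines (a sketch; node names and thread counts are whatever the runtime reports):

```c
/* Placement-check sketch: print where each MPI rank runs and how many
   OpenMP threads it may use, to verify that the "N tasks per node,
   4 CPUs per task" layout is what was actually granted.
   Build e.g. with: mpicc -fopenmp placement_check.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, len;
    char node[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node, &len);

    printf("rank %d/%d on %s, omp_get_max_threads() = %d\n",
           rank, size, node, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```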
Thank you in advance.