How to confirm the direct path is actually taken

dkokron · ‎01-17-2018

I have built an application that uses dgemm, ddot and daxpy via the PETSc library, which was itself configured to use MKL (see below). I also used the MKL_VERBOSE option to confirm that the DGEMM calls use very small matrices (9x9), so I figured that disabling error checking would improve performance.

I built PETSc with and without the -DMKL_DIRECT_CALL_SEQ flag.

icc -fPIC -wd1572 -g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -shared -Wl,-soname,libpetsc.so

icc -fPIC -wd1572 -g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ -shared -Wl,-soname,libpetsc.so

Yet a performance profile shows no change in any of the dgemm, ddot or daxpy calls.

How can I prove that the direct path is actually being taken?

icc version 15.0.3.187

MKL version=11.2.3

PETSc configure command

./configure --prefix=${PETSC_DIR}/${PETSC_ARCH}/install --with-debugging=0 --with-shared-libraries=1 --with-cc=icc --with-fc=ifort --with-cxx=icpc --with-blas-lapack-dir=/nasa/intel/Compiler/2015.3.187/mkl/lib/intel64 --with-scalapack-include=/nasa/intel/Compiler/2015.3.187/mkl/include --with-scalapack-lib="/nasa/intel/Compiler/2015.3.187/mkl/lib/intel64/libmkl_scalapack_lp64.so /nasa/intel/Compiler/2015.3.187/mkl/lib/intel64/libmkl_blacs_sgimpt_lp64.so" --with-cpp=/usr/bin/cpp --with-gnu-compilers=0 --with-vendor-compilers=intel -COPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ" -CXXOPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ" -FOPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -fpp -DMKL_DIRECT_CALL_SEQ" --with-mpi-exec=mpiexec --with-mpi-compilers=0 --with-precision=double --with-sclar-type=real --with-dynamic-loading --with-x=0 --with-x11=0 --download-mumps --download-ptscotch --download-hypre

Konstantin_A_Intel · ‎01-18-2018

Hi there,

I have a few suggestions:

- MKL 11.2.3 is quite an outdated release. Newer releases contain a lot of improved functionality, including better performance with MKL_DIRECT_CALL. Please try MKL 2018.1 if possible. The same applies to the compiler.

- MKL_VERBOSE will not log a function call if it goes down the direct call code path. So you can compare the verbose output of two runs (with and without direct call): the calls missing from the direct call run are the ones covered by direct call.

- There's no guarantee that 9x9 matrices will be significantly faster. The smaller the matrix, the bigger the improvement should be. So for 9x9 you will probably still see some improvement, but not a dramatic one.

https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call
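The comparison of two verbose logs suggested above could be scripted roughly as follows. This is only a minimal sketch, assuming that each logged call appears on its own line beginning with `MKL_VERBOSE ROUTINE(` (for example `MKL_VERBOSE DGEMM(N,N,9,9,9,...)`) and that each run's output was saved to a file. The function names and the log filenames are hypothetical, not part of MKL or PETSc.

```python
import re
from collections import Counter

# Matches call lines such as "MKL_VERBOSE DGEMM(N,N,9,9,9,...) 1.2us ...".
# The version banner ("MKL_VERBOSE Intel(R) MKL ...") does not match,
# because the routine name must be followed directly by "(".
CALL_RE = re.compile(r"^MKL_VERBOSE\s+([A-Z][A-Z0-9_]*)\(")


def count_calls(lines):
    """Count logged MKL calls per routine name."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def missing_calls(baseline, direct):
    """Per routine, how many logged calls disappeared in the direct call run."""
    return {
        name: n - direct.get(name, 0)
        for name, n in baseline.items()
        if n > direct.get(name, 0)
    }


def compare_logs(baseline_path, direct_path):
    """Read two saved MKL_VERBOSE logs and report the calls that vanished."""
    with open(baseline_path) as f:
        baseline = count_calls(f)
    with open(direct_path) as f:
        direct = count_calls(f)
    return missing_calls(baseline, direct)
```

Calling `compare_logs("verbose_baseline.log", "verbose_direct.log")` (hypothetical filenames) returns a dictionary of routine names and the number of calls that were logged in the baseline run but not in the direct call run. An empty result would suggest that no call took the direct path in that build.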

Regards,

Konstantin