Garcia__Juan
Beginner
86 Views

MKL 5 times slower when called from a MEX function in Octave

Hi,

I'm trying to use the Intel MKL library in an Octave MEX function, but the performance I get from some MKL functions such as cblas_cgemm is about 5 times slower when called from Octave than from a compiled C executable. I use the same compilation flags for both the C code and the MEX function in my testing, where I basically compare the speed of a very simple C matrix-multiplication program against the same program wrapped in a MEX function (a short example is attached).

This is how I compile the C code:

gcc -I/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include -Wall -L/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64 -o cgemm_test_c matmult_c.c -lmkl_gnu_thread -lmkl_rt -lmkl_core -lmkl_intel_ilp64 -lgomp -lpthread -lm -ldl

This is how I compile the MEX function:

mex -v -I/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include -Wall -L/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64 -o cgemm_test_mex matmult_c.c matmult_mex.c -lmkl_gnu_thread -lmkl_rt -lmkl_core -lmkl_intel_ilp64 -lgomp -lpthread -lm -ldl

And this is what the mex command actually runs:

gcc -c -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/octave-4.2.2/octave/.. -I/usr/include/octave-4.2.2/octave -I/usr/include/hdf5/serial  -pthread -fopenmp -g -O2 -fdebug-prefix-map=/build/octave-DtqyIg/octave-4.2.2=. -fstack-protector-strong -Wformat -Werror=format-security  -Wall  -I. -I/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include  -DMEX_DEBUG matmult_c.c -o matmult_c.o

gcc -c -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/octave-4.2.2/octave/.. -I/usr/include/octave-4.2.2/octave -I/usr/include/hdf5/serial  -pthread -fopenmp -g -O2 -fdebug-prefix-map=/build/octave-DtqyIg/octave-4.2.2=. -fstack-protector-strong -Wformat -Werror=format-security  -Wall  -I. -I/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include  -DMEX_DEBUG matmult_mex.c -o matmult_mex.o

g++ -I/usr/include/octave-4.2.2/octave/.. -I/usr/include/octave-4.2.2/octave -I/usr/include/hdf5/serial -I/usr/include/mpi  -pthread -fopenmp -g -O2 -fdebug-prefix-map=/build/octave-DtqyIg/octave-4.2.2=. -fstack-protector-strong -Wformat -Werror=format-security -shared -Wl,-Bsymbolic  -Wall -o cgemm_test_mex.mex  matmult_c.o matmult_mex.o   -L/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64 -lmkl_gnu_thread -lmkl_rt -lmkl_core -lmkl_intel_ilp64 -lgomp -lpthread -lm -ldl -L/usr/lib/x86_64-linux-gnu/octave/4.2.2 -L/usr/lib/x86_64-linux-gnu -loctinterp -loctave -Wl,-Bsymbolic-functions -Wl,-z,relro

Test results:

C code: Elapsed time per multiplication: ~1.86 ms

MEX code: Elapsed time per multiplication: ~8.55 ms


I have tested different optimisation flags, but the results are virtually the same. I have run this on two Intel machines, one with Ubuntu 18.04 and one with Ubuntu 14.04, with very similar results in all cases. The MKL environment variables are set via "source /opt/intel/mkl/bin/mklvars.sh intel64".
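One way to narrow this down is to check which MKL libraries the MEX file actually resolves at load time, and to let MKL report its own per-call timings and thread counts. The commands below are a sketch: the binary names come from the compile lines above, and the `octave --eval` invocation assumes the MEX file is on Octave's path.

```shell
# Which MKL shared objects does the MEX file pull in at load time?
# Linking both libmkl_rt and the explicit layer libraries can resolve
# differently inside Octave's process than in the standalone binary.
ldd cgemm_test_mex.mex | grep mkl

# MKL_VERBOSE=1 makes MKL print routine name, parameters, elapsed time,
# and the number of threads for every BLAS call. Compare both runs:
MKL_VERBOSE=1 ./cgemm_test_c
MKL_VERBOSE=1 octave --eval "cgemm_test_mex"
```

If the verbose output shows a different thread count (or a different MKL library) inside Octave, that would account for the gap.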


Many thanks in advance,

Juan.

2 Replies
Gennady_F_Intel
Moderator

That's interesting... Could you check whether the performance gap is the same in the case of square matrices? m = n = k = 8000, for example.

Gennady_F_Intel
Moderator

And which version of Octave are you using?
