I followed the article http://software.intel.com/en-us/articles/using-intel-mkl-in-gnu-octave to compile and link the MKL libraries with Octave.
After installing Octave I've checked with "ldd /usr/local/bin/octave" that all mkl libraries are correctly linked.
- Then I exported the environment variables to enable MIC automatic offload, following the documentation "Setting Environment Variables for Automatic Offload".
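For reference, the variables I exported follow the pattern from that documentation (MKL_MIC_ENABLE is the switch for automatic offload; OFFLOAD_REPORT is optional and only adds diagnostics):

```shell
# Turn on MKL Automatic Offload to the Xeon Phi coprocessor
export MKL_MIC_ENABLE=1
# Optional: have MKL print a report for each offloaded call (levels 0-2)
export OFFLOAD_REPORT=2
echo "MKL_MIC_ENABLE=$MKL_MIC_ENABLE OFFLOAD_REPORT=$OFFLOAD_REPORT"
```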
- Finally I executed Octave and ran a simple matrix multiplication (3000x3000 matrix size, using DGEMM from the MKL BLAS libraries). Using the micsmc tool, we can see that no coprocessor core is working, so automatic offload isn't working properly.
To ensure that Octave is using the MKL dgemm function, I've debugged the execution of a simple matrix multiplication, and as expected the function is correctly called: Breakpoint 2, 0x00007ffff1d67980 in dgemm_ () from /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_lp64.so
But all the work is done on the host processor and the Xeon Phi coprocessor doesn't do anything.
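For reference, the Octave test can be driven from the shell roughly like this (a sketch; it assumes octave is on the PATH and the MKL environment has already been sourced):

```shell
# Sketch: time a 3000x3000 matrix product in Octave; A*B resolves to DGEMM
octave -q --eval 'A = rand(3000); B = rand(3000); tic; C = A * B; toc'
```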
I've performed one more test using an example dgemm program: the dgemm example included in <install-dir>/Samples/en-US/mkl/tutorials.zip -> dgemm_example.c. I modified the code to call the dgemm function instead of cblas_dgemm, then compiled and linked it with the MKL libraries. Debugging the application with the environment variable MKL_MIC_ENABLE set to 1, we can see the following line:
0x00007ffff77ad980 in dgemm_ () from /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_lp64.so
So the simple program dgemm_example.c is calling exactly the same MKL function from libmkl_intel_lp64.so. And this execution is performed on the coprocessor with no problem!
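For reference, a compile line matching these libraries looks roughly like this (a sketch; I'm assuming icc and an MKLROOT variable set by the compilervars/mklvars scripts, and the exact paths depend on the installation):

```shell
# Sketch: compile dgemm_example.c and link dynamically against MKL (LP64, threaded)
icc dgemm_example.c -o dgemm_example \
    -L"$MKLROOT/lib/intel64" \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
    -liomp5 -lpthread -lm
```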
Can you help us understand why automatic offload is not working in Octave?
Thanks for the help.
I've checked both environment variables, and they include all the required paths. Before running Octave I need to set these variables by executing the command "source /opt/intel/parallel_studio_xe_2013_update3/mkl/bin/mklvars.sh intel64".
The environment variables were set as shown below:
# echo $LD_LIBRARY_PATH
# echo $MIC_LD_LIBRARY_PATH
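As a quick sanity check that the MIC path is actually populated, something like this can be used (the paths below are illustrative of the standard MKL layout, not copied from my machine):

```shell
# Illustrative check: MIC_LD_LIBRARY_PATH must contain the coprocessor libs
MKL_ROOT=/opt/intel/parallel_studio_xe_2013_update3/mkl
export LD_LIBRARY_PATH="$MKL_ROOT/lib/intel64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export MIC_LD_LIBRARY_PATH="$MKL_ROOT/lib/mic${MIC_LD_LIBRARY_PATH:+:$MIC_LD_LIBRARY_PATH}"
case "$MIC_LD_LIBRARY_PATH" in
  */lib/mic*) echo "MIC library path present" ;;
  *)          echo "MIC library path missing" ;;
esac
```

If the MIC path is missing, automatic offload cannot load the MKL libraries on the coprocessor side.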
I would expect your Octave configure options to correspond to the link command you used successfully.
I don't think the start-group...end-group options are necessary when linking with dynamic libraries (they will be needed for a static link, but that's potentially more complicated). As Nikita pointed out, the posted white paper has formatting problems for which it is good to have corrections.
Hello Nikita and TimP,
I've made the changes, removing the sequential libraries and setting the MKL threading libraries. In addition, I changed these lines in the configure.in file:
AC_CHECK_LIB(mkl_intel_lp64, fftw_plan_dft_1d, [FFTW_LIBS="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"; with_fftw3=yes],AC_MSG_RESULT("MKL library not found. Octave will use the FFTW3 instead."),[-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread])
AC_CHECK_LIB(mkl_intel_lp64, fftw_plan_dft_1d, [BLAS_LIBS="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"; with_fftw3=yes],AC_MSG_RESULT("MKL library not found. Octave will use the FFTW3 instead."),[-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread])
Now after running:
./configure --with-blas="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" --with-lapack="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"
We can check that BLAS libraries are correctly linked (BLAS libraries: -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread)
After rebuilding and reinstalling Octave, setting the environment variables to use automatic offload mode on the Xeon Phi, and performing the matrix multiplication test (10000x10000 matrix size), we can say that one step towards the solution has been made.
The Xeon Phi starts to take on some work: all cores begin working at 100% and memory usage increases, but only for a second; then the Xeon Phi cores stop working, although the multiplication is not finished yet. Running "top" we can see that the host CPU is working at 100%, so I suppose it is doing the matrix multiplication rather than the coprocessor, as we intended.
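To get more detail on what MKL decides, the next thing I will try is enabling the offload report (these variables are from the MKL Automatic Offload documentation; the work-division value of 1.0 is just an example that requests all work go to the card):

```shell
# Ask MKL to report each automatic-offload decision on stderr (levels 0-2)
export OFFLOAD_REPORT=2
# Example only: request that 100% of the DGEMM work go to the coprocessor
export MKL_MIC_WORKDIVISION=1.0
echo "OFFLOAD_REPORT=$OFFLOAD_REPORT MKL_MIC_WORKDIVISION=$MKL_MIC_WORKDIVISION"
```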
Any idea why the Xeon Phi isn't doing the matrix multiplication work?
Thanks for the help!!