Using MKL libs in Octave to run on Xeon PHI

Marcelo_V_ · 07-09-2013

Hello,

I followed the article http://software.intel.com/en-us/articles/using-intel-mkl-in-gnu-octave to compile and link the MKL libraries with Octave.

After installing Octave, I checked with "ldd /usr/local/bin/octave" that all MKL libraries are correctly linked.

- Then I exported the environment variable that enables MIC automatic offload,
following the documentation section "Setting Environment Variables for Automatic Offload":
export MKL_MIC_ENABLE=1

- Finally, I started Octave and ran a simple matrix multiplication (3000x3000 matrix size, using DGEMM from the MKL BLAS libraries). The micsmc tool shows that no coprocessor core is working, so automatic offload is not functioning properly.
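
As a complement to watching micsmc, the export step above can be extended with offload reporting. This is only a sketch: MKL_MIC_ENABLE comes from the post, while OFFLOAD_REPORT is an assumption here (it is documented for MKL automatic offload, but nobody in this thread has checked that this particular MKL version honours it).

```shell
# Enable MKL automatic offload for this shell session (from the post above).
export MKL_MIC_ENABLE=1

# Assumption: this MKL version honours OFFLOAD_REPORT. Level 2 requests
# timing and data-transfer figures for each offloaded call.
export OFFLOAD_REPORT=2

# Show what a program started from this shell will inherit.
env | grep -E '^(MKL_MIC_ENABLE|OFFLOAD_REPORT)='
```

If a report is printed for the dgemm call, the work was offloaded; if nothing is printed, the call most likely ran on the host only.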

To make sure that Octave is using the MKL dgemm function, I debugged the execution of a simple matrix multiplication. As expected, the function is called correctly:

Breakpoint 2, 0x00007ffff1d67980 in dgemm_ () from /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_lp64.so

But all the work is done on the host processor and the Xeon Phi coprocessor does nothing.

I performed one more test using an example dgemm program:

I took the dgemm example included in <install-dir>/Samples/en-US/mkl/tutorials.zip (dgemm_example.c) and modified the code to call the dgemm function instead of cblas_dgemm. After compiling it, linking it with the MKL libraries, and running a first test, I debugged the application with the environment variable MKL_MIC_ENABLE set to 1 and saw the following line:

0x00007ffff77ad980 in dgemm_ () from /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_lp64.so

So the simple program dgemm_example.c calls exactly the same MKL function from the libmkl_intel_lp64.so library, and the execution is performed on the coprocessor with no problem!

Can you help us understand why automatic offload is not working in Octave?

Thanks for the help.

Nikita_S_Intel · 07-10-2013

Hello,

Could you please provide details about your environment? Did you set the following environment variables?

export LD_LIBRARY_PATH="/opt/intel/mic/coi/host-linux-release/lib:${LD_LIBRARY_PATH}"
export MIC_LD_LIBRARY_PATH="/opt/intel/mic/coi/device-linux-release/lib:${MKLROOT}/lib/mic:${MIC_LD_LIBRARY_PATH}"

Thanks,
--Nikita

Marcelo_V_ · 07-10-2013

Hello,

I've checked both environment variables, and they include all the paths. Before running Octave I need to set these variables by executing the command "source /opt/intel/parallel_studio_xe_2013_update3/mkl/bin/mklvars.sh intel64"

The environment variables were set as shown below:

# echo $LD_LIBRARY_PATH
/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/compiler/lib/intel64:/opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64

# echo $MIC_LD_LIBRARY_PATH
/opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/compiler/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/mic
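
Whether the directories Nikita asked about are present can also be checked mechanically instead of by eye. A small sketch, in which path_list_contains and check_offload_paths are hypothetical helpers (not part of MKL or the coprocessor tools) and the directories are the ones quoted in this thread:

```shell
# Hypothetical helper: succeed if colon-separated list $1 contains entry $2.
path_list_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report the two COI directories named in the thread as present or MISSING.
check_offload_paths() {
    if path_list_contains "${LD_LIBRARY_PATH:-}" "/opt/intel/mic/coi/host-linux-release/lib"; then
        echo "host COI: present"
    else
        echo "host COI: MISSING"
    fi
    if path_list_contains "${MIC_LD_LIBRARY_PATH:-}" "/opt/intel/mic/coi/device-linux-release/lib"; then
        echo "device COI: present"
    else
        echo "device COI: MISSING"
    fi
}

check_offload_paths
```

Matching on ":entry:" rather than on a plain substring avoids a false positive when one directory name is a prefix of another.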

Thanks.

Nikita_S_Intel · 07-10-2013

Hello,

Could you please provide your Octave test case?

Thanks,
--Nikita

Marcelo_V_ · 07-11-2013

Hello,

I performed some simple matrix multiplication tests.

In Octave:

a = rand (3000,3000)

b = rand (3000,3000)

c = a * b
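
A possible way to repeat this test non-interactively is sketched below. It assumes that octave is on the PATH and reuses the 3000x3000 size from the post. The trailing semicolons and the tic/toc timing are additions to the original commands: the semicolons stop Octave from printing the matrices, so only the elapsed time of the multiplication is reported.

```shell
# Octave statements for a timed multiply. The trailing semicolons suppress
# printing of the 3000x3000 matrices; toc prints the elapsed time.
OCTAVE_TEST='a = rand(3000,3000); b = rand(3000,3000); tic; c = a * b; toc'

# Run only if octave is available (assumed to be on the PATH).
# MKL_MIC_ENABLE=1 is set for this one command, as in the first post.
if command -v octave >/dev/null 2>&1; then
    MKL_MIC_ENABLE=1 octave --quiet --eval "$OCTAVE_TEST"
fi
```

Timing only the multiplication keeps the cost of generating and displaying the matrices, which always runs on the host, out of the measurement.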

In the attached file you can see the Octave session used to check which function and library perform the multiplication (libmkl_intel_lp64.so).

Thanks for the help.

Nikita_S_Intel · 07-11-2013

Hello Marcelo,

There is a misprint in the article http://software.intel.com/en-us/articles/using-intel-mkl-in-gnu-octave. Octave has to be configured with the MKL threading libraries. Please configure Octave as shown below and rebuild it:

./configure --with-blas="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" --with-lapack="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"

Please let me know if automatic offload mode doesn’t work.

Thanks,
--Nikita
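
Whether a given build picked up the threaded or the sequential MKL layer can be read from the ldd output. A rough sketch, in which mkl_threading_layer is a hypothetical helper and the Octave path is the one from the first post:

```shell
# Hypothetical helper: classify the MKL threading layer named in `ldd`
# output supplied on stdin. If both layers appear, "threaded" is reported.
mkl_threading_layer() {
    deps=$(cat)
    case "$deps" in
        *libmkl_intel_thread*) echo "threaded" ;;
        *libmkl_sequential*)   echo "sequential" ;;
        *)                     echo "no MKL threading layer found" ;;
    esac
}

# Inspect the installed binary if it exists (path assumed from the thread).
if [ -x /usr/local/bin/octave ]; then
    ldd /usr/local/bin/octave | mkl_threading_layer
fi
```

A result of "sequential" would mean the build still uses libmkl_sequential, which is the situation the corrected configure line is meant to fix.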

TimP · 07-11-2013

I would think that your Octave configure could correspond to the link command you used successfully.

I don't think the start-group...end-group stuff is necessary when linking with dynamic libraries (it will be needed for a static link, but that's potentially more complicated). As Nikita pointed out, the posted white paper has formatting problems, for which it is good to have corrections.

Marcelo_V_ · 07-12-2013

Hello Nikita and TimP,

I've made the changes, removing the sequential libraries and setting the MKL threading libraries. In addition, I changed this line of the configure.in file:

AC_CHECK_LIB(mkl_intel_lp64, fftw_plan_dft_1d, [FFTW_LIBS="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -WI, --end-group -liomp5 -lpthread"; with_fftw3=yes],AC_MSG_RESULT("MKL library not found. Octave will use the FFTW3 instead."),[-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread])

to

AC_CHECK_LIB(mkl_intel_lp64, fftw_plan_dft_1d, [BLAS_LIBS="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -WI, --end-group -liomp5 -lpthread"; with_fftw3=yes],AC_MSG_RESULT("MKL library not found. Octave will use the FFTW3 instead."),[-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread])

Now, after running:

./configure --with-blas="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" --with-lapack="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"

we can confirm that the BLAS libraries are correctly linked (BLAS libraries: -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread)

After rebuilding and reinstalling Octave, setting the environment variables for automatic offload mode on the Xeon Phi, and performing the matrix multiplication test (10000x10000 matrix size), we can say that one step toward the solution has been made.

The Xeon Phi starts to take on some work: all cores run at 100% and memory usage increases, but only for a second. Then the Xeon Phi cores stop working, although the multiplication has not finished yet. Running "top" shows that the host CPU is working at 100%, so I suppose the host is doing the matrix multiplication rather than the coprocessor, which is not what we want.

Any idea why Xeon Phi isn't doing the matrix multiplication work?

Thanks for the help!!