Matrix multiplication in Matlab runs slower on the Xeon Phi

Mihail_C_ · 11-15-2015

I have a system with the following configuration:

- Xeon Phi 31S1P (57 cores, 8 GB RAM) with MPSS 3.6 installed;

- Intel Core i7-3820 @ 3.6 GHz;

- Asus P9X79 WS (BIOS version 4802, the most recent);

- 8 GB RAM;

- Sapphire Radeon HD 7990 video card.

The operating system is Windows 10 x64.

I've tried the Intel tutorial about automatically offloading work from Matlab to the Phi (Using Intel® Math Kernel Library with MathWorks* MATLAB* on Intel® Xeon Phi™ Coprocessor System).

Since the 31S1P has only 8 GB of RAM, I used set MKL_MIC_MAX_MEMORY=8G instead of the tutorial's set MKL_MIC_MAX_MEMORY=16G.

The MATLAB version is R2015a.

The problem is that the plain vanilla matrix multiplication code below runs significantly faster on the i7 than on the Phi:

A = rand(10000, 10000);
B = rand(10000, 10000);
tic
C = A*B;
toc

The Phi takes anywhere from 300 to 600 seconds, during which the system is almost frozen. Sometimes micsmc shows a short spike on the Phi; many other times it shows nothing. (Things may be happening on the Phi even when the interface shows nothing, but because the system is frozen the display does not get refreshed.)

On the other hand, when I run the LAPACK benchmark in native mode on the Phi (no Matlab involved), everything goes as per the tutorial and I get 700+ GFLOP/s.

Any idea what is going wrong?

Thanks!

Mihail_C_ · 11-15-2015

The thing is, we don't have an alternative to the current configuration. This is the only machine that can fit a Xeon Phi, and it is the only Xeon Phi we have.

What do you suspect when you ask about another processor?

Thanks!

TimP · 11-15-2015

In view of your reduced memory allotment, you may wish to try reduction in problem size.
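As a rough sketch of that suggestion (the sizes are arbitrary illustrative choices, not values from the tutorial), the loop below times C = A*B at several sizes and prints the memory that the three double-precision matrices need. At n = 10000 that is 3 x 10000^2 x 8 bytes = 2.4 GB.

```matlab
% Sketch: time C = A*B at several problem sizes.
% The sizes are assumptions chosen for illustration only.
footprintGB = @(n) 3 * n^2 * 8 / 1e9;   % A, B and C stored as doubles

sizes = [2000 4000 6000 8000 10000];
for n = sizes
    A = rand(n, n);
    B = rand(n, n);
    tic
    C = A*B;
    t = toc;
    fprintf('n = %5d  memory = %.2f GB  time = %.2f s\n', n, footprintGB(n), t);
end
```

If the time jumps sharply at one particular size rather than growing smoothly, that size is a reasonable place to start looking.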

Mihail_C_ · 11-15-2015

@Tim Prince:

Well, the same matrix multiplication takes about 20 seconds on average on the host (i7 with 8 GB RAM), so something truly strange happens when Matlab offloads it to the Phi.

So while the problem size always influences the running time, the main suspect looks like the way the automatic offload happens inside the MKL library.

@Amr S:

When that simple matrix multiplication code runs on the host (on the i7 processor), it takes ~20 s. When it runs on the Xeon Phi, it takes anywhere between 200 s and 600 s.

In both cases, Matlab uses the BLAS functions from the Intel MKL to perform the matrix multiplication.

When the code runs on the i7, Matlab uses MKL version 11.1.1, which ships with Matlab 2015a.

When the code runs on the Phi, it runs with Intel MKL version 11.2.4 (the latest version).

If I disable the Phi (by setting MKL_MIC_ENABLE=0) and run the code on the i7, but with Matlab using Intel MKL 11.2.4, it again takes about 20 s on average.

Therefore the issue does seem to be the way the offloading happens, not the size of the problem or the version of MKL.

Based on the Intel document referenced in my first post, the same code running on the Xeon Phi should take between 2 s and 4 s.
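To put those times on the same scale as the native benchmark figure: a dense n-by-n multiplication costs roughly 2n^3 floating-point operations, so n = 10000 is about 2 x 10^12 operations. A small sketch of the conversion follows; the variable names are illustrative, and the times are the ones quoted in this thread.

```matlab
% Convert the wall-clock times quoted in this thread to GFLOP/s.
n = 10000;
flops = 2 * n^3;                          % approximate operation count for C = A*B
gflops = @(seconds) flops ./ seconds / 1e9;

fprintf('host, 20 s         : %.1f GFLOP/s\n', gflops(20));
fprintf('offload, 200-600 s : %.1f to %.1f GFLOP/s\n', gflops(600), gflops(200));
fprintf('tutorial, 2-4 s    : %.1f to %.1f GFLOP/s\n', gflops(4), gflops(2));
```

By this estimate the host sustains about 100 GFLOP/s, while the offloaded run reaches 10 GFLOP/s at best, far below both the host and the 700+ figure from the native benchmark.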

It would be great if somebody who has Matlab 2015a and a Xeon Phi 31S1P could test the same code and report the result.

Mihail_C_ · 11-16-2015

@Amr S:

Matlab 2015a comes with MKL 11.1.1 while the most recent Intel version for MKL is 11.2.4.

One can get the most recent MKL either by updating the Intel products which include it (like the Parallel Studio) or by getting the free version from here: https://software.intel.com/sites/campaigns/nest/

After downloading and installing the latest version of MKL we can make Matlab use it by following the steps in the tutorial: Using Intel® Math Kernel Library with MathWorks* MATLAB* on Intel® Xeon Phi™ Coprocessor System

If we launch Matlab like in the tutorial, it uses the most recent MKL version (11.2.4).

If we launch Matlab normally, it uses the MKL version it came with (11.1.1).
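One way to confirm which library a given session actually loaded is to ask Matlab itself. This is a minimal sketch that assumes the -blas and -lapack options of the version command are available in the release being used.

```matlab
% Report the BLAS and LAPACK libraries loaded by this Matlab session.
blasInfo = version('-blas');
lapackInfo = version('-lapack');
disp(blasInfo)
disp(lapackInfo)
```

Running this once in a session started through the tutorial's launcher and once in a normally started session should show whether the two really pick up different MKL builds.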

This is how I could test the host code (the code running on the i7) with both versions (11.1.1 and 11.2.4). I had to test the host code with both versions anyway, to be sure the newer version 11.2.4 works with Matlab 2015a.

Seeing that the host code works with 11.2.4 when offloading is disabled let me narrow the problem down to the offloading mechanism.

In your case it takes 73 seconds to run the matrix multiplication code offloaded on the Xeon Phi? Or is it 73s running on the i7?

Because if it takes 73 s offloaded to the Xeon Phi from Matlab 2011, that would give even more weight to the hypothesis that the problem lies in how Matlab 2015a offloads to the Phi.

Frances_R_Intel · 11-18-2015

You could try decreasing the memory size. You don't actually have a full 8 GB, because you lose some of the memory to the RAM disk. But I doubt that is the problem. Have you tried using the OFFLOAD_REPORT environment variable? It can be set to 1, 2 or 3 depending on how much information you want it to give you. At least that will tell you for sure whether it really is the offload that is taking up all that time.

Mihail_C_ · 11-19-2015

Frances Roth (Intel) wrote:

You could try decreasing the memory size. You don't actually have a full 8 GB, because you lose some of the memory to the RAM disk. But I doubt that is the problem.

Yes, you are right, the RAM disk is not the issue. I did try other values, both for the memory limit and for the matrix sizes.

Frances Roth (Intel) wrote:

Have you tried using the OFFLOAD_REPORT environment variable? It can be set to 1, 2 or 3 depending on how much information you want it to give you. At least that will tell you for sure whether it really is the offload that is taking up all that time.

I'd like to try that, but googling "OFFLOAD_REPORT Xeon Phi Matlab" only turned up examples for Linux, while I am running on a Windows machine.

I suppose the first step would be to set the environment variable with

set OFFLOAD_REPORT=3

but I am at a loss about what to do after that. Could you kindly advise me about the rest of the things I need to add to the .bat file?

Thanks!
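Whatever ends up in the .bat file, it is possible to check from the Matlab prompt whether the variables actually reached the Matlab process. This is a minimal sketch using getenv; the list of names is simply the set of variables mentioned in this thread. Note that in a Windows command prompt, spaces around the = sign in a set command become part of the variable name and value, in which case the lookup below would report the variable as not set.

```matlab
% Show which offload-related variables this Matlab session inherited.
names = {'OFFLOAD_REPORT', 'MKL_MIC_ENABLE', 'MKL_MIC_MAX_MEMORY'};
for k = 1:numel(names)
    value = getenv(names{k});            % returns '' when the variable is not set
    if isempty(value)
        fprintf('%s is not set in this Matlab session\n', names{k});
    else
        fprintf('%s = %s\n', names{k}, value);
    end
end
```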

Mihail_C_ · 11-23-2015

Could this be because the installed OpenCL runtime is 14.2 (the latest that supports Xeon Phi on Windows, and which requires MPSS 3.2 or 3.2.3) while my MPSS is 3.6 (the latest available)?

At first glance there should be no reason for MKL to behave like that, since it has nothing to do with OpenCL. But I have noticed that OpenCL also misbehaves when I attempt to execute code on the Xeon Phi: the code compiles without problems, but then the application crashes.