Software Archive

MATLAB Automatic Offload Elementwise Operations

Christopher_R_3
Beginner

Hey,

I got Automatic Offload running for MATLAB and was able to test matrix multiplication and some LAPACK functions on the Phi. However, I was wondering whether it is also possible to run element-wise operations on the Phi, for example:

A=rand(10000,10000);

B=rand(10000,10000);

C=A.*B;

Note that the command C=A*B is a BLAS call, so it automatically offloads, but C=A.*B does not. I could change one of the two matrices to a large diagonal matrix, but that seems inefficient. Is there an MKL or GSL call that would do this, so that I could then write a MEX (C) file specifically for it?
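Something along the lines of the following MEX gateway is roughly what I have in mind. This is just an untested sketch: it uses MKL's VML routine vdMul for the element-wise product, and the function name and error checks are placeholders.

#include "mex.h"
#include "mkl.h"

/* Rough, untested sketch of a MEX gateway: C = elemmul(A, B), with the
 * element-wise product computed by MKL's VML routine vdMul. Assumes A and B
 * are real double matrices of the same size. */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 2)
        mexErrMsgTxt("Expected two input matrices.");

    mwSize m = mxGetM(prhs[0]);
    mwSize n = mxGetN(prhs[0]);
    if (m != mxGetM(prhs[1]) || n != mxGetN(prhs[1]))
        mexErrMsgTxt("A and B must have the same size.");

    plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);

    const double *A = mxGetPr(prhs[0]);
    const double *B = mxGetPr(prhs[1]);
    double       *C = mxGetPr(plhs[0]);

    /* Element-wise product of the two arrays, treated as flat vectors. */
    vdMul((MKL_INT)(m * n), A, B, C);
}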

3 Replies
Andrey_Vladimirov
New Contributor III

You can transfer data to and from a Xeon Phi card at a rate of approximately 7 GB/s. To do element-wise matrix-matrix multiplication, you read each element of A and B once and write each element of C once. This can be done at approximately 130-170 GB/s on Xeon Phi, depending on the coprocessor model, so the offload traffic will take roughly 130/7 ≈ 20 times longer than the computation itself. In comparison, on the CPU you can read/write matrices at anywhere from 50 to 100 GB/s, so running the element-wise matrix-matrix multiplication on the CPU is going to be at least 50/7 ≈ 7 times faster than offloading it to Xeon Phi.
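As a quick sanity check on these numbers for the 10000x10000 example above (double precision, and taking roughly 150 GB/s for the card):

data moved (read A, read B, write C) = 3 * 10000^2 * 8 bytes ≈ 2.4 GB
PCIe transfer time ≈ 2.4 GB / 7 GB/s ≈ 0.34 s
compute time on the card ≈ 2.4 GB / 150 GB/s ≈ 0.016 s

So the PCIe transfer alone takes about 20 times longer than the element-wise multiply does on the coprocessor.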

Similar reasoning applies to random numbers. For every random number sent to or from the coprocessor you perform only a fixed number of FLOPs. FLOPs are cheap and communication is expensive, so it is faster to generate the numbers on the CPU than to offload.

You can generalize this reasoning and see that algorithms with O(N) memory complexity and O(N) time complexity usually do not offload well. Offload pays off only when, for every number sent to Xeon Phi, you perform many operations on the card. So, for instance, algorithms with time complexity of O(N) or O(N log N) will be bottlenecked by communication, but O(N^a) with a > 1 will pay off - as long as you have a good parallel algorithm and N is large enough.

We discuss this issue in Episode 5.16 of our video course: https://www.youtube.com/watch?v=7ltqiCGttOs

The full course is free and available here: http://colfaxresearch.com/cdt-v01/

 

Christopher_R_3
Beginner

I can see how that reasoning applies in many cases; however, I am looking to use this for solving nonlinear (S)PDEs. There, most of the matrices only need to be transferred once, at the beginning of the simulation, and O(T*N) operations are then performed on them, where T, the number of time steps, is very large (with the nonlinear evaluations done in parallel).

With MATLAB's CUDA support this is easy to do, and simply keeping the matrices on the GPU gives a remarkable speedup, since the problem is entirely parallel: the function evaluations at each point in space are independent, and everything else is matrix and element-wise operations. Using parallel random number generators to produce the random matrices, the randomness is incorporated with a large speedup as well.

However, to do this with the Xeon Phi I would have to (A) run the element-wise operations on the Phi, (B) tell the Phi to keep the data there until all computations are done, and (C) do parallel random number generation. I am finding that the AO controls are too limited for this, so to accomplish it one has to get down and dirty with C/Fortran code, which I was hoping to avoid.

Andrey_Vladimirov
New Contributor III

I agree: with automatic offload you cannot have data persistence from one offload to the next. Doing it with LEO or OpenMP 4.0 from C or Fortran is not that hard, though, and MKL has sparse solvers.
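For what it's worth, here is a rough, untested sketch of the LEO persistence pattern in C (compile with the Intel compiler; the element-wise kernel, sizes, and loop counts are just placeholders): send the data over once with alloc_if(1) free_if(0), reuse it across offloads with alloc_if(0) free_if(0), and release it at the end with free_if(1).

#include <stdlib.h>

int main(void)
{
    const long n = 10000L * 10000L;
    double *A = malloc(n * sizeof(double));
    double *C = malloc(n * sizeof(double));
    for (long i = 0; i < n; i++) A[i] = (double)rand() / RAND_MAX;

    /* Send A once; it stays resident on the coprocessor for all time steps. */
    #pragma offload_transfer target(mic:0) in(A : length(n) alloc_if(1) free_if(0))

    for (int t = 0; t < 1000; t++) {   /* time-stepping loop */
        /* A is reused in place on the card; C is allocated there on the first
           step, kept resident, and copied back to the host every step. */
        #pragma offload target(mic:0) \
                nocopy(A : length(n) alloc_if(0) free_if(0)) \
                out(C : length(n) alloc_if(t == 0) free_if(0))
        {
            /* Placeholder element-wise operation, run in parallel on the card. */
            #pragma omp parallel for
            for (long i = 0; i < n; i++)
                C[i] = A[i] * A[i];
        }
    }

    /* Release the card-side buffers. */
    #pragma offload_transfer target(mic:0) \
            nocopy(A : length(n) alloc_if(0) free_if(1)) \
            nocopy(C : length(n) alloc_if(0) free_if(1))

    free(A);
    free(C);
    return 0;
}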
