Is there any benchmark for matrix multiplication on MIC?

Jesmin_Jahan_T_ · ‎04-25-2013

Hi,

Is there any benchmark for Matrix Multiplication on MIC? If yes, please share a link with me.

Also, I am seeing a very weird phenomenon in my application.

I am trying to develop an O(n^3) matrix-multiplication-like application.

Suppose I have a function A, and I am only interested in the timing of function A. Function A takes its input from a routine that preprocesses the input for A. However, whatever values are supplied to function A, it always performs O(n^3) operations.
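A minimal sketch of the timing setup described here, assuming a C kernel. The names `time_kernel`, `kernel_fn`, and `naive_matmul` are hypothetical stand-ins (the real function A is not shown in this thread), and `timespec_get` requires a C11 library. The point is only that the timer brackets the kernel call and nothing else, so preprocessing cost cannot leak into the measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds, using the C11 timespec_get call. */
static double wall_seconds(void)
{
    struct timespec ts;
    int ok = timespec_get(&ts, TIME_UTC);
    assert(ok == TIME_UTC);
    (void)ok;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical signature for the kernel being timed ("function A"). */
typedef void (*kernel_fn)(double *c, const double *a, const double *b, size_t n);

/* Stand-in O(n^3) kernel: C = A * B for row-major n x n matrices. */
void naive_matmul(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Time only the kernel call; any preprocessing must already be finished. */
double time_kernel(kernel_fn f, double *c, const double *a, const double *b, size_t n)
{
    assert(f != NULL);
    double t0 = wall_seconds();
    f(c, a, b, n);
    return wall_seconds() - t0;
}
```

Calling `time_kernel(naive_matmul, c, a, b, n)` after the preprocessing routine has returned gives the elapsed seconds for the kernel alone.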

Initially, function A ran in 24 s for size 16384x16384. But the result was incorrect because I had made a mistake in the preprocessing routine, which filled the input matrices with wrong values.

However, when I fixed the preprocessing step to produce correct input, function A started running in 45 s, although it was doing the same amount of computation as before!

Why might that happen?

If anyone has experience with this kind of problem, please share that with me.

Thanks in advance,

Jesmin

TimP · ‎04-26-2013

This whole MIC program is overrun with matrix multiplication benchmarks (Linpack, HPL, ...). You're lucky if you were able to see through the forest.

The MKL ?gemm in current versions is set up, even in automatic offload mode, to avoid extra overhead in the case of pure matrix multiplication (beta == 0).

You would want to be certain you are running with the -ftz option (flush denormal results to zero) enabled, e.g. when compiling your main program. It's on by default, but options you may need, such as -fp-model source, will require you to append -ftz.

Also, when you put new data in, if you incurred overflow or produced NaN, that would slow it down.
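One way to check the inputs and outputs for the value classes mentioned above is to scan them with the standard `fpclassify` macro. This is only a sketch: `fp_audit` and `audit_matrix` are hypothetical names, and it assumes the matrices are stored as contiguous arrays of `double`.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Counts of the value classes that can push floating-point code onto slow paths. */
typedef struct {
    size_t nan_count;
    size_t inf_count;
    size_t subnormal_count;
} fp_audit;

/* Scan n doubles and classify each one with the standard fpclassify macro. */
fp_audit audit_matrix(const double *a, size_t n)
{
    fp_audit r = {0, 0, 0};
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        switch (fpclassify(a[i])) {
        case FP_NAN:       ++r.nan_count;       break;
        case FP_INFINITE:  ++r.inf_count;       break;
        case FP_SUBNORMAL: ++r.subnormal_count; break;
        default:           break; /* FP_ZERO and FP_NORMAL need no attention */
        }
    }
    return r;
}
```

Running `audit_matrix` on each input after preprocessing, and on the result after the multiplication, shows whether the corrected data introduced NaN, infinite, or denormal values that the earlier (wrong) data did not contain.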

Jesmin_Jahan_T_ · ‎04-26-2013

Thanks TimP.

Just to make sure, do you mean that I have to compile with the -ftz and -fp-model options?

Actually, to debug, I tried a test case where I assigned 0 to all matrix entries and multiplied them. That also took 45 s instead of 24 s!

It is very mysterious to me, and I cannot figure out what is behind this drastic increase in running time.

Thanks again,

Jesmin

TimP · ‎04-26-2013

-ftz is a default, but it's overridden by some other options, so it seems like something to check, e.g. by adding it to your options. You've indicated nothing to show you need a -fp-model option, but you didn't show your options.

I shouldn't have assumed that you compiled consistently with -O3 or whatever you chose, but that's also something to check. Remember the default is -O2 without -g, or -O0 with -g.

TimP · ‎04-26-2013

If you're relying on -parallel, remember that this option isn't recommended for MIC. I have seen it work, but it's not expected to be as effective as other threading options. You will need to fiddle with the number of threads and affinity placement options when using this or any option involving the OpenMP library.

If you are doing a standard matrix multiplication, of course you would compare yours with those based on the provided matrix multiplication intrinsics and libraries, as described in the references.

Jesmin_Jahan_T_ · ‎04-26-2013

Thanks TimP.

I am not relying on -parallel. I am using Cilk Plus, not OpenMP. The algorithm is a divide-and-conquer matrix multiplication.

In my job script I am using:
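For readers unfamiliar with the approach, a divide-and-conquer multiplication of this general shape can be sketched in serial C as follows. This is not the code being discussed in this thread; `dc_matmul`, `dc_base`, and the `DC_BASE` cutoff are hypothetical, and the sketch assumes square row-major matrices whose size is a power of two. In a Cilk Plus version, the four calls in each group could be launched with `cilk_spawn` and followed by `cilk_sync`, since they write to disjoint quadrants of C.

```c
#include <assert.h>
#include <stddef.h>

#define DC_BASE 32 /* hypothetical recursion cutoff; tune for the target cache */

/* Base case: C += A * B on n x n blocks, row-major with leading dimension ld. */
static void dc_base(double *c, const double *a, const double *b, size_t n, size_t ld)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            double aik = a[i * ld + k];
            for (size_t j = 0; j < n; ++j)
                c[i * ld + j] += aik * b[k * ld + j];
        }
}

/* Recursive C += A * B; n must be a power of two. */
void dc_matmul(double *c, const double *a, const double *b, size_t n, size_t ld)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    if (n <= DC_BASE) {
        dc_base(c, a, b, n, ld);
        return;
    }
    size_t h = n / 2;
    /* Offsets of the four quadrants: 11, 12, 21, 22. */
    size_t o11 = 0, o12 = h, o21 = h * ld, o22 = h * ld + h;

    /* Group 1: four updates to disjoint quadrants of C (parallelizable). */
    dc_matmul(c + o11, a + o11, b + o11, h, ld);
    dc_matmul(c + o12, a + o11, b + o12, h, ld);
    dc_matmul(c + o21, a + o21, b + o11, h, ld);
    dc_matmul(c + o22, a + o21, b + o12, h, ld);
    /* A parallel version would synchronize here before reusing the C quadrants. */

    /* Group 2: the remaining four products accumulate into the same quadrants. */
    dc_matmul(c + o11, a + o12, b + o21, h, ld);
    dc_matmul(c + o12, a + o12, b + o22, h, ld);
    dc_matmul(c + o21, a + o22, b + o21, h, ld);
    dc_matmul(c + o22, a + o22, b + o22, h, ld);
}
```

The caller zeroes C first and then calls `dc_matmul(C, A, B, n, n)`. The operation count is the same O(n^3) regardless of the input values, which matches the behaviour described in the original question.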

export OFFLOAD_INIT=on_start

export MIC_ENV_PREFIX=MIC

export MIC_CILK_NWORKERS=244

export CILK_NWORKERS=16

export KMP_AFFINITY=scatter

export MKL_MIC_ENABLE=1

export MKL_MIC_WORKDIVISION0=1 MKL_MIC_WORKDIVISION1=0

It may sound stupid, but where can I get those implementations based on the provided matrix multiplication intrinsics and libraries?

Thanks a lot for help,

Best Regards,

Jesmin