Any thoughts on this? Is

Bharat_N_ · ‎03-17-2013

I am trying to use the Larrabee intrinsics to mimic MKL's ZGEMM performance on the MIC and my main bottleneck is this:

For C=A*B, if I assume 'A' is in row major, 'B' is in column major and 'C' is in column major, I can use register tiling by taking 4 rows of 'A' and mutliply them with 1 column of 'B' i.e read 4 blocks of 'A' and 'B' into two 512-bit vectors(with each block containing two doubles, the real and imaginary parts), multiply them using a combination of swizzles and FMA/S and get the result in a 512-bit vector.

So this will be of the form: |im40|re40|im30|re30|im20|re20|im10|re10| /*A[0]*B[0]*/. Similarly I'll get 3 more vectors for A[1]*B[0], A[2]*B[0], A[3]*B[0].

So, I'll have the following 4 vectors:

|im40|re40|im30|re30|im20|re20|im10|re10| /*A[0]*B[0]*/

|im41|re41|im31|re31|im21|re21|im11|re11| /*A[1]*B[0]*/

|im42|re42|im32|re32|im22|re22|im21|re21| /*A[2]*B[0]*/

|im43|re43|im33|re33|im23|re23|im13|re13| /*A[3]*B[0]*/

which need to be permuted into:

|im13|re13|im21|re21|im11|re11|im10|re10|

|im23|re23|im22|re22|im21|re21|im20|re20|

|im33|re33|im32|re32|im31|re31|im30|re30|

|im43|re43|im42|re42|im41|re41|im40|re40|

I can now add these 4 vectors to get one 512-bit vector, which I can then store as 4 contiguous blocks of C[0]. However, the above permutation causes a significant overhead and my performance is just ~180 Gflops.

If however, I assume 'A' also as column major, I can simply broadcast the first block of 'B' into one vector, and multiply this with the first column of 'A' i.e.

|im3|re3|im2|re2|im1|re1im0|re0| /*|A[3][0]|A[2][0]|A[1][0]|A[0][0]|*/

im0|re0|im0|re0|im0|re0|im0|re0| /*B[0][0] broadcasted*/

When these two get multiplied I get the first 4 blocks of C[0] and I continue in a similar manner.

This gives a much better performance of ~360 Gflops.

Assuming that MKL's ZGEMM gives performance > 800 Gflops, how can I improve my code algorithmically to mimic MKL's performance? Any ideas?

Bharat_N_ · ‎03-19-2013

Any thoughts on this? Is there any graph which shows the performance of ZGEMM using Intel's MKL similar to:

http://software.intel.com/sites/default/files/MKL_1101_KNC_GEMM.png

Sumedh_N_Intel · ‎03-20-2013

Posting on behalf of zhang zhang:

We don't have a MKL ZGEMM chart. It would just be very similar to the DGEMM chart.

The 800 GFLOPS performance is a result of Intel’s proprietary optimizations technologies, which include aggressive optimizations targeting maximum vectorization and FP units utilization, cache utilization, minimized TLB misses, minimal memory access, multithreading, among many other things. We appreciate your effort of programming using MIC intrinsics, which is for sure the way to go to implement performance critical algorithms. But for simple algorithms like GEMM, we strongly suggest you to use existing proved solutions, for example, MKL, instead of re-inventing the wheel and taking on your own all the burdens and risks of maintaining the code for many generations of MIC architectures to come in future. If by any chance you have problems using MKL, please do let us know. We will help.

Bharat_N_ · ‎03-20-2013

Thanks to both of you! This was helpful.

ZGEMM -> How does MKL do it?