Reproducing Xeon Phi Linpack (GEMM) results

David_F_8 · ‎05-06-2016

Hello all,

I am trying to reproduce the Matrix Multiply results presented on the following website, but I am not getting the same results.

http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-linpack-stream.html

Attached is the modified file. I am starting from the code that comes with the MKL library (under: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\examples), with no buffer reuse, and initially doing single-precision computations.

Does anyone know if this is the code used for the benchmark or if there is a specific linpack library that I should be using, like the one found here:

https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite

The Xeon Phi Model I am using is the 7200P with 61 cores and 16GB RAM.

Also, it is curious that at 30000 rank matrices (~10.1 GB for the three matrices) the MIC reserves the memory (checked with micsmc and by ssh-ing into the MIC and using the top command) but performs no computations and seems to hang.

Best regards,

David

Ying_H_Intel · ‎05-08-2016

Hi David,

Yes, when talking about Linpack performance, we usually mean the Linpack benchmark, which can be downloaded from https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite and which you can also find under the MKL install folders.

You mentioned that with 30000 rank matrices (~10.1 GB for the three matrices) the MIC reserves the memory (checked with micsmc and by ssh-ing into the MIC and using the top command) but performs no computations and seems to hang. Do you mean the Linpack benchmark or the examples? Either way, if it doesn't work, you could try a smaller size and see whether that works.

Best Regards,
Ying H.

Intel MKL Support

David_F_8 · ‎05-09-2016

Hello Ying,

When I mentioned the 30000 rank matrices I was referring to the Compiler Assisted offload code found in the examples that come with the MKL library (not the automatic offload ones).

I am running the "sgemm.c" program on an Intel Xeon Phi 7200P (with 16GB of RAM) and after reaching this matrix size it hangs.

I also experience this earlier, at 25000, which at double precision yields ~15GB, still within the memory capacity of the MIC. The memory consumption mentioned takes into account that the GEMM operation uses doubles and 3 matrices of the same rank.

I thought this was curious, since I am still not occupying all the memory in the MIC, and the OS and other related processes take up around 300K in the MIC.

Best regards,

David Fernandez

TimP · ‎05-09-2016

The matrices are of rank 2 (according to Fortran terminology). I believe MKL may allocate a temporary working matrix, which would prevent the coprocessor from using all of the on-board memory for your matrices and the MPSS and offloaded code, even if using the beta==0 option to suppress downloading the output C matrix of dgemm. The benchmark quotations should state what size was found to give the quoted Gflops rating, and you wouldn't expect to be able to go much beyond that.

David_F_8 · ‎05-09-2016

Hello Tim,

So if I understand correctly you believe that MKL is allocating temporary matrix space, which sounds reasonable, even though I would have thought that MKL would be doing some sort of blocking on the matrices to overlap computation/communication times and thus would require only small buffers inside the MIC (granted that the matrices might eventually be stored completely in the MIC, hence the memory allocation procedure).
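To make the blocking idea concrete, here is a minimal sketch of a tiled GEMM, C = alpha*A*B + beta*C, in plain Python. It is only an illustration of the general technique and not what MKL or the Linpack benchmark actually does (that source is not public). The function name `blocked_gemm` and the `block` parameter are hypothetical, and square matrices stored as lists of lists are assumed. The point is that each step touches only one tile of A, one tile of B, and one tile of C, so a device would only need buffers for those tiles rather than for the three full matrices.

```python
def blocked_gemm(A, B, C, alpha, beta, block):
    """Compute C = alpha*A*B + beta*C in place, one block x block tile of C at a time.

    Illustrative sketch only (hypothetical helper, square n x n matrices assumed).
    """
    n = len(A)
    for i0 in range(0, n, block):
        i1 = min(i0 + block, n)
        for j0 in range(0, n, block):
            j1 = min(j0 + block, n)
            # Accumulator for the current tile of A*B; this is the only
            # output buffer that has to be resident for this step.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, n, block):
                k1 = min(k0 + block, n)
                # Only the A[i0:i1, k0:k1] and B[k0:k1, j0:j1] tiles are read here.
                for i in range(i0, i1):
                    for k in range(k0, k1):
                        a = A[i][k]
                        for j in range(j0, j1):
                            acc[i - i0][j - j0] += a * B[k][j]
            for i in range(i0, i1):
                for j in range(j0, j1):
                    # With beta == 0 the old contents of C are never read,
                    # so C would not have to be uploaded beforehand.
                    old = beta * C[i][j] if beta != 0 else 0.0
                    C[i][j] = alpha * acc[i - i0][j - j0] + old
    return C
```

For example, `blocked_gemm(A, B, C, 1.0, 0.0, 2)` on 3x3 inputs walks over four tiles of C (2x2, 2x1, 1x2 and 1x1) and gives the same result as an unblocked triple loop.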

Your observation brings me back to my original question about reproducing the Matrix-Matrix multiplication results published on the following webpage:

http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-linpack-stream.html

The matrices there go up to (43072 x 43072), which comes to ~14GB in double precision; that made me think that the MIC would be able to hold my 25Kx25K matrices.

Would it be possible to see the Linpack code used to generate these results (the SGEMM & DGEMM)?

BTW, my beta is not 0, so I always assume that there will be an update and need to load onto the MIC the three matrices.

Best regards,

David

David_F_8 · ‎05-09-2016

Hello all,

I have been putting some numbers on memory consumption and I realize they are disorganized, so here is a more organized version:

For Matrix Rank: 25000 (Double precision) <- this fails

Memory 1 matrix: 25000^2 * 8 / 1e9 = ~5GB
Memory 3 matrices (required in DGEMM ) = ~15GB

Of course, the 30Kx30K (~7.2GB) matrices should then also fail, since they would involve around 21.6GB of memory, which exceeds the MIC memory. This is the reason why I thought that the Linpack (MKL based?) version was doing some kind of data blocking, since the reported size of (43072 x 43072) would not fit as is in MIC memory.
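The arithmetic above can be checked with a few lines of Python. This is just a sketch for the numbers in this thread; the helper name `gemm_bytes` is made up for illustration, and it only counts the A, B and C operands, not any temporary workspace the library might allocate.

```python
GB = 1e9        # decimal gigabyte, as used in the figures above
GIB = 2 ** 30   # binary gibibyte, which some tools report as "GB"

def gemm_bytes(n, bytes_per_element, matrices=3):
    """Bytes needed to hold `matrices` square n x n operands (hypothetical helper)."""
    return matrices * n * n * bytes_per_element

cases = [
    (25000, 8, 3, "25K double, 3 matrices"),
    (30000, 8, 3, "30K double, 3 matrices"),
    (30000, 4, 3, "30K single, 3 matrices"),
    (43072, 8, 1, "43072 double, 1 matrix"),
]
for n, elem, count, label in cases:
    size = gemm_bytes(n, elem, count)
    print(f"{label}: {size / GB:.1f} GB ({size / GIB:.1f} GiB)")
```

Note that the single-precision 30K case comes to 10.8 GB in decimal units, which is about 10.1 when divided by 2^30, matching the ~10.1 GB figure mentioned earlier in the thread.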

Bests,

David

TimP · ‎05-09-2016

The only viewable relevant source code is the public BLAS, e.g. on netlib.org. Intel keeps its own modifications proprietary, probably including data blocking not in the reference source, and translation to C++ with SIMD intrinsics.