I am trying to reproduce the Matrix Multiply results presented on the following website, but I am not getting the same results.
Attached is the modified file. I am starting from the code that comes with the MKL library (under: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\examples), with no buffer reuse, and initially doing single-precision computations.
Does anyone know if this is the code used for the benchmark or if there is a specific linpack library that I should be using, like the one found here:
The Xeon Phi Model I am using is the 7200P with 61 cores and 16GB RAM.
Also, it is curious that with 30000 rank matrices (~10.1 GB for the three matrices) the MIC reserves the memory (I checked with micsmc and by ssh-ing into the MIC and using the top command) but performs no computations and seems to hang.
Yes, when talking about Linpack performance, we usually mean the Linpack benchmark, which can be downloaded from https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite and which you can also find under the MKL install folders.
You mentioned that with 30000 rank matrices (~10.1 GB for the three matrices) the MIC reserves the memory (checking with micsmc and ssh-ing into the MIC and using the top command) but performs no computations and seems to hang. Do you mean the Linpack or the examples? Anyway, if it doesn't work, you may try a smaller size and see if that works.
Intel MKL Support
The matrices are of rank 3 (according to Fortran terminology). I believe MKL may allocate a temporary working matrix, which would prevent the coprocessor from using all of the on-board memory for your matrices and the MPSS and offloaded code, even if using the beta==0 option to suppress downloading the output C matrix of dgemm. The benchmark quotations should state what size was found to give the quoted Gflops rating, and you wouldn't expect to be able to go much beyond that.
So if I understand correctly, you believe that MKL is allocating temporary matrix space, which sounds reasonable. Still, I would have thought that MKL would do some sort of blocking on the matrices to overlap computation and communication, and thus would require only small buffers inside the MIC (granted that the matrices might eventually be stored completely in the MIC, hence the memory allocation procedure).
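To make clear what I mean by blocking, here is a minimal sketch in plain Python. It is not MKL's implementation, which is not public; the function name `blocked_gemm` and the `tile` parameter are hypothetical, and the sketch assumes square matrices stored as lists of rows. It only shows that C = alpha*A*B + beta*C can be computed while touching one small tile of each matrix at a time.

```python
def blocked_gemm(A, B, C, alpha, beta, tile):
    """Update C in place with alpha*A*B + beta*C, one tile at a time.

    Hypothetical illustration only: A, B, C are square lists of rows and
    `tile` is the block edge length. This is not how MKL is implemented.
    """
    n = len(A)
    for i0 in range(0, n, tile):
        i1 = min(i0 + tile, n)
        for j0 in range(0, n, tile):
            j1 = min(j0 + tile, n)
            # Accumulator for a single tile of the product A*B.
            # Its size depends on `tile`, not on n.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, n, tile):
                k1 = min(k0 + tile, n)
                # Only the (i, k) tile of A and the (k, j) tile of B
                # are read in this innermost block step.
                for i in range(i0, i1):
                    for k in range(k0, k1):
                        a = A[i][k]
                        for j in range(j0, j1):
                            acc[i - i0][j - j0] += a * B[k][j]
            # Apply the alpha/beta update for this tile of C.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    C[i][j] = alpha * acc[i - i0][j - j0] + beta * C[i][j]
    return C
```

In a scheme like this, the device would only need room for a few tiles at once, which is why I expected the footprint on the MIC to be much smaller than the three full matrices.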
Your observation brings me back to my original question about trying to reproduce the Matrix-Matrix multiplication results published on the following webpage:
The matrices there go up to (43072 x 43072), which comes to ~14GB per matrix in double precision. That made me think that the MIC would be able to hold my 25Kx25K matrices.
Would it be possible to see the Linpack code used to generate these results (the SGEMM & DGEMM)?
BTW, my beta is not 0, so I always assume that there will be an update and that all three matrices need to be loaded onto the MIC.
I have been putting some numbers on memory consumption and I realize they are disorganized, so here is a more organized version:
For Matrix Rank: 25000 (Double precision) <- this fails
- Memory 1 matrix: 25000^2 * 8 / 1e9 = ~5GB
- Memory 3 matrices (required in DGEMM ) = ~15GB
Of course, the 30Kx30K matrices (~7.2GB each) should then also fail, since they would involve around 21.6GB of memory, which exceeds the MIC memory. This is the reason why I thought that the Linpack (MKL based?) version was doing some kind of data blocking, since the reported size of (43072 x 43072) would not fit as is in MIC memory.
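The arithmetic above can be checked with a short Python sketch. The helper name `gemm_footprint_gb` is hypothetical, 1 GB is taken as 1e9 bytes as in the figures above, and the estimate counts only the matrices themselves, not any temporary workspace MKL or MPSS may allocate.

```python
CARD_MEMORY_GB = 16  # on-board memory of the coprocessor, as stated earlier


def gemm_footprint_gb(n, bytes_per_element, matrices=3):
    """Memory in GB (1e9 bytes) for `matrices` square n x n matrices."""
    return matrices * n * n * bytes_per_element / 1e9


# (matrix order, bytes per element, label) for the cases discussed
cases = [
    (25000, 8, "double"),
    (30000, 8, "double"),
    (30000, 4, "single"),
    (43072, 8, "double"),
]

for n, size, label in cases:
    total = gemm_footprint_gb(n, size)
    fits = "fits" if total <= CARD_MEMORY_GB else "does not fit"
    print(f"n={n} ({label}): {total:.1f} GB for A, B and C, {fits} in {CARD_MEMORY_GB} GB")
```

By this estimate, only the single-precision 30Kx30K case fits within 16GB, and that is before counting any extra buffers.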
The only viewable relevant source code is the public BLAS, e.g. on netlib.org. Intel keeps its own modifications proprietary, probably including data blocking not in the reference source, and translation to C++ with SIMD intrinsics.