Regarding MKL performance and GFLOPS of processor

Ying_H_Intel · ‎10-20-2013

We have some articles discussing performance testing with MKL, for example, Tips to measure the performance of Intel® MKL with small matrix sizes.

SG has several interesting questions regarding MKL performance and the GFLOPS of processors under that article. I am copying the comments here so more developers can get involved.

The performance reported here seems to be half as good as it should be. Here's what I mean: the graph on page http://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library is for a comparable CPU without AVX. The graph on this page is with AVX. Both graphs report similar performance, but one would expect a chip with AVX to perform twice as well as a chip without AVX.

Here's my guess about the discrepancy: for the graph on this page, the "F" in "GFlops" refers to double-precision floating point, but for the graph on the other page, the "F" is single-precision floating point. One can confirm the interpretation of "F" for this page based on the equation provided on this page. For the interpretation of "F" on the other page, note that the other page says the peak performance of a 3.2 GHz SSE3 chip using 8 threads is 102.4 GFlops; dividing 102.4 by 3.2 GHz gives 32 floating-point operations per clock across 8 threads, or 4 floating-point operations per clock per core; 4 floating-point operations per clock per core on an SSE3 chip means single-precision floating-point operations. But this paragraph is just my guess; it would be nice if someone from Intel confirmed it or clarified the apparently missing performance when using AVX.
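The division in the paragraph above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name is made up for illustration, and it assumes one thread per core, as the post does.

```python
def flops_per_cycle_per_core(peak_gflops, clock_ghz, cores):
    """Back out floating-point operations per clock per core from a quoted peak.

    Assumes one thread per core (hypothetical helper, not from the article).
    """
    return peak_gflops / clock_ghz / cores


# Figures quoted in the post: 102.4 GFlops, 3.2 GHz, 8 threads.
per_core = flops_per_cycle_per_core(102.4, 3.2, 8)
print(round(per_core, 6))
```

The result is about 4 operations per clock per core, which is the number the guess above is built on.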

Ying_H_Intel · ‎10-20-2013

Hi Sergey,

Thanks for your comments. The sample is attached to the paper (please click the Download button at the end of the paper). You may test it with 2048x2048.

@SG.
Thank you for asking. I checked them in detail.
First, right, the GFLOPS in the article is based on double-precision floating point.
Actually, the i7-2600K can issue two AVX (256-bit) instructions in one cycle, one MUL and one ADD, so in total 2x4 = 8 double-precision floating-point operations per cycle, and the peak performance of the 3.4 GHz AVX chip using 4 threads is 3.4x8x4 = 108.8 GFlops.
For that paper, the processor is the W5580: one SSE3 (128-bit) can do 4 double-precision floating-point operations per cycle, so the peak performance of the 3.2 GHz SSE3 chip using 4 threads is 3.2x4x4 = 51.2 GFlops.
But the figure in that paper uses 8 threads, which should mean that the test was run on 2 packages (we will ask the author to confirm it). With 8 cores, 3.2x4x8 = 102.4 GFlops, which matches the peak quoted there.

So you can expect a chip with AVX to deliver about 2x the FLOPS of a chip without AVX.
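As a quick check of the numbers above, here is a small Python sketch of the peak formula (clock in GHz x double-precision operations per cycle per core x cores). The function name is hypothetical; the inputs are the figures quoted in this post.

```python
def peak_gflops(clock_ghz, dp_flops_per_cycle, cores):
    """Theoretical peak in GFlops: GHz x flops per cycle per core x cores."""
    return clock_ghz * dp_flops_per_cycle * cores


i7_2600k = peak_gflops(3.4, 8, 4)       # AVX, 4 cores
w5580_one_pkg = peak_gflops(3.2, 4, 4)  # SSE, one package (4 cores)
w5580_two_pkg = peak_gflops(3.2, 4, 8)  # SSE, two packages (8 cores, assumed)

print(round(i7_2600k, 1), round(w5580_one_pkg, 1), round(w5580_two_pkg, 1))
print(round(i7_2600k / w5580_one_pkg, 3))  # per-package ratio, AVX vs SSE
```

The per-package ratio comes out slightly above 2 only because the two chips also differ in clock speed (3.4 GHz vs 3.2 GHz).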

Best Regards,
Ying
Please see the official documents. Peak FLOPS for the i7-2600K: http://www.intel.com/support/processors/sb/CS-032814.htm?wapkw=peak+flops

and peak FLOPS for the Xeon 5580: http://www.intel.com/support/processors/xeon/sb/CS-020863.htm?wapkw=peak+flops

and the forum discussion at http://software.intel.com/en-us/forums/topic/291765

Ying_H_Intel · ‎10-20-2013

Hello Ying,

Thanks for the response, but please clarify: SSE3 does _not_ have fused multiply-add, so how can it perform 4 double floating point operations?

one sse3 (128bit) have 4 double floating point operation

Ying_H_Intel · ‎10-20-2013

Hello SG,

The issue is not about FMA (officially, that is supported starting with AVX2). As Max and Tim said, current processors can issue 1 multiply and 1 add in one cycle (or you can think of it as two SSE units). Each 128-bit unit works on 2 doubles, so that is 2x2 = 4 operations per cycle. The W5580 is from the Nehalem (Core i7) family, so it can perform 4 double-precision floating-point operations per cycle.

Best Regards,
Ying

Ying_H_Intel · ‎10-20-2013

Hello Ying,

Thanks for the info on being able to issue 1 multiply and 1 add in one cycle. A few more questions: (1) Is it also possible to issue two FMA instructions in the same clock? If so, wouldn't the FMA-based implementation be 4 times faster than the SSE-based implementation? (2) You mention

.. as Max and Tim said, ...

where exactly is their statement?

Ying_H_Intel · ‎10-20-2013

Hi SG,

Max and Tim's discussion is at http://software.intel.com/en-us/forums/topic/291765.

Regarding FMA in AVX2: right, it is possible to issue 2 FMAs in one clock, so at peak it can be 4 times faster than the SSE-based implementation.
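The 4x figure follows from counting double-precision operations per cycle. The sketch below is only illustrative arithmetic: the function name is hypothetical, and it assumes two execution units per core in every case and counts one FMA as 2 operations per 64-bit lane.

```python
def dp_flops_per_cycle(vector_bits, units, flops_per_lane):
    """Double-precision flops per cycle per core: lanes x units x flops per lane."""
    lanes = vector_bits // 64  # number of doubles in one vector register
    return lanes * units * flops_per_lane


sse = dp_flops_per_cycle(128, 2, 1)  # one MUL unit + one ADD unit
avx = dp_flops_per_cycle(256, 2, 1)  # one MUL unit + one ADD unit
fma = dp_flops_per_cycle(256, 2, 2)  # two FMA units, multiply + add per lane

print(sse, avx, fma)
print(fma // sse)
```

These are theoretical peaks; measured MKL performance will be lower and depends on matrix size, memory bandwidth, and clock behavior.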

The GFLOPS of processors is listed in this documentation: http://www.intel.com/support/processors/sb/CS-017346.htm.

Unfortunately, for the latest processors (Haswell, which supports FMA), we can't find a corresponding document (it may be buried in the manual or other spec documents). I did search, and this thread may be helpful: http://software.intel.com/en-us/forums/topic/394248

You can also run the test with the sample if you have such a processor.

Best Regards,

Ying

Bernard · ‎10-20-2013

>>> Is it also possible to issue two FMA instructions in the same clock? >>>

Yes, it is possible. The best performance is achieved when the instructions do not depend on each other, so that two ports can service FMA code in parallel.
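To illustrate why independence matters, here is a very rough Python cost model, not a measurement. It assumes two FMA ports and an FMA latency of 5 cycles; both numbers and the function name are assumptions made for this sketch, so check them against the documentation for your processor.

```python
import math


def estimated_cycles(n_ops, chains, ports=2, latency=5):
    """Rough cycle estimate for n_ops FMAs split into independent dependency chains.

    Assumed model: each chain runs its operations back to back, one per
    `latency` cycles, while the core can start at most `ports` FMAs per cycle.
    """
    ops_per_chain = math.ceil(n_ops / chains)
    latency_bound = ops_per_chain * latency      # limited by the dependency chain
    throughput_bound = math.ceil(n_ops / ports)  # limited by the number of ports
    return max(latency_bound, throughput_bound)


one_chain = estimated_cycles(1000, chains=1)    # every FMA waits for the previous one
ten_chains = estimated_cycles(1000, chains=10)  # enough independent work for both ports
print(one_chain, ten_chains)
```

In this model a single dependent chain is bound by latency, while ten independent chains (for example, ten separate accumulators) are enough to keep both ports busy.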