Solved: Haswell GFLOPS - Page 4

caosun · ‎06-26-2013

Hi Intel Experts:

I cannot find GFLOPS figures for the latest Intel Haswell CPUs. Could you please point me to them?

I want to understand the performance difference between Haswell and Ivy Bridge, for example between the i7-4700HQ and the i7-3630QM. From the Intel website I know that the i7-3630QM is rated at 76.8 GFLOPS (base). Could you please tell me the figure for the i7-4700HQ?

I found the following information on the internet:

Intel Sandy Bridge and Ivy Bridge have the following floating-point performance: 16 SP FLOPS/cycle --> one 8-wide AVX addition and one 8-wide AVX multiplication.

Intel Haswell has the following floating-point performance: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions.

I have two questions here:

1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) x 4 (cores) x 2.4 GHz (clock) = 153.6 GFLOPS = 76.8 x 2. Does this mean that one operation is a combined addition and multiplication?

2. Does Haswell have two FMA units?

Thank you very much for any comments.

Best Regards,

Sun Cao
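The arithmetic in question 1 can be sketched in a few lines. This is only an illustration, assuming the per-cycle figures quoted in the post, 256-bit vectors (8 SP or 4 DP lanes), and the same 2.4 GHz base clock for the i7-4700HQ as for the i7-3630QM; `peak_gflops` is a hypothetical helper name. Note that the double-precision result for the i7-3630QM is 8 x 4 x 2.4 = 76.8, which suggests that the figure on Intel's site is a double-precision number rather than a count of combined operations.

```python
def peak_gflops(flops_per_cycle, cores, clock_ghz):
    """Theoretical peak GFLOPS = FLOPs per cycle per core x cores x clock in GHz."""
    return flops_per_cycle * cores * clock_ghz

# FLOPs per cycle per core with 256-bit vectors (8 SP lanes or 4 DP lanes).
# Sandy Bridge / Ivy Bridge: one AVX add + one AVX multiply per cycle.
SNB_IVB = {"SP": 8 * 2, "DP": 4 * 2}
# Haswell: two FMAs per cycle, each FMA counted as 2 FLOPs (multiply + add).
HASWELL = {"SP": 8 * 2 * 2, "DP": 4 * 2 * 2}

CORES = 4
BASE_GHZ = 2.4  # i7-3630QM base clock from the post; assumed equal for the i7-4700HQ

for name, table in (("i7-3630QM", SNB_IVB), ("i7-4700HQ", HASWELL)):
    for precision, fpc in table.items():
        print(f"{name} {precision}: {peak_gflops(fpc, CORES, BASE_GHZ):.1f} GFLOPS")
```

These are theoretical peaks at base clock; measured numbers such as Linpack results will be lower.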

caosun · ‎07-01-2013

Hi Sergey:

You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm


SergeyKostrov · ‎07-09-2013

>>...my results are for double precision... while you appear to running single precision...

I used cblas_sgemm because I needed to compare performance for the 4Kx4K and 8Kx8K cases on a Pentium 4 system with just 1GB of physical memory. 16Kx16K exceeds the 2GB limitation of a 32-bit system. I'll do a quick comparison of cblas_sgemm and cblas_dgemm performance later; however, it is not my top priority.

Speaking of these 5 algorithms ( Classic, Strassen HBC, MKL's cblas_sgemm, Fortran's MATMUL and Kronecker Based ), I've finally done the comparison I had wanted to do for a long time. By the way, the Fortran MATMUL and Kronecker Based cases use double-precision floating-point data types.

SergeyKostrov · ‎07-09-2013

Attached is a txt-file with test results. Thanks.

perfwise · ‎07-09-2013

Sergey, in DGEMM you are performing the matrix computation C = C + A x B. You didn't include the addition of C. BLAS exists to standardize linear algebra operations, and that is my focus. So if you measured DGEMM in MKL, it will be 1/2 as fast as SGEMM. HPL uses DGEMM, and the title of this thread is Haswell GFLOPs. Is your Kronecker routine doing what DGEMM does, explicitly? I googled it but found that the Kronecker product is not DGEMM:

http://en.m.wikipedia.org/wiki/Kronecker_product

I must confess I don't understand what you are trying to achieve in your study. My focus is purely on understanding what the DGEMM performance of Haswell is, with whatever algorithm you use, so long as it does the matrix computation C = C + A x B.

perfwise · ‎07-09-2013

Sergey,

The title of this post was Haswell GFLOPs. My interest is in "standardized BLAS routines", which drive LAPACK and many other high-performance applications. DGEMM does the matrix operation C = C + A * B. When you update C, you have M * N addition operations, which yields the formula I told you earlier: 2 * M * K * N in the generic sense. That's the FLOP count for a traditional matrix multiplication algorithm, and it's how the industry measures FLOPs.

Now, running SGEMM is not comparable to running DGEMM when comparing the time to do arithmetic. I don't know what Kronecker Based DGEMM you're running, or whether you're quoting the timing for a Kronecker product, which isn't DGEMM. HPL, which is what the scientific community uses to measure GFLOPs, is driven by DGEMM.

So my recommendation to you is to standardize the problem you're running. Are all these results at the same precision and for the same operation? MATMUL is doing what DGEMM does (close enough), and so is MKL (if you were running DGEMM rather than SGEMM). The other results you quote, if they're not DGEMM, are not comparable to my results. It's just common sense. If your Kronecker operation is DGEMM, then you've got something interesting, but I suspect you're not doing a traditional matrix multiplication; in that case it's not a 1:1 correspondence and it's less interesting to me.

Perfwise
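The 2 * M * K * N count above can be checked with a small sketch. This is a plain triple-loop version of C = C + A * B written only to count operations, not a representation of how MKL or any tuned BLAS is implemented; `gemm_accumulate` and `gflops` are hypothetical names. Each inner iteration does one multiply and one add, and starting the accumulator from C[i][j] is what supplies the extra M * N additions.

```python
import time

def gemm_accumulate(A, B, C):
    """Compute C = C + A x B in place with the classic triple loop.

    Returns the number of floating-point operations performed.
    """
    M, K, N = len(A), len(B), len(B[0])
    flops = 0
    for i in range(M):
        for j in range(N):
            acc = C[i][j]                 # start from C, so the update is included
            for k in range(K):
                acc += A[i][k] * B[k][j]  # one multiply + one add
                flops += 2
            C[i][j] = acc
    return flops

def gflops(M, N, K, seconds):
    """Conventional GEMM rate: 2*M*N*K operations divided by elapsed time."""
    return 2.0 * M * N * K / seconds / 1e9

M, K, N = 40, 30, 20
A = [[1.0] * K for _ in range(M)]
B = [[2.0] * N for _ in range(K)]
C = [[0.5] * N for _ in range(M)]

start = time.perf_counter()
counted = gemm_accumulate(A, B, C)
elapsed = time.perf_counter() - start

print(counted == 2 * M * N * K)
print(f"{gflops(M, N, K, elapsed):.4f} GFLOPS (pure Python, for illustration only)")
```

The same `gflops` formula applies to any timed GEMM call, which is why results are only comparable when the operation and the precision are the same.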

SergeyKostrov · ‎07-09-2013

Let's finalize our discussion about matrix multiplication algorithms.

>>...DGEMM does the matrix operation of C = C + A * B...

?GEMM does more multiplications and additions by design:

C = alpha*A*B + beta*C

However, this is ?GEMM specific, and I'm talking about a generic case, like C = A * B, and nothing else. I don't know of any ISO-like standard accepted in the industry for measuring software performance, and everybody has their own solution(s). ( In reality I know how ISO 8001 works for X-Ray imaging software... Very, very strict... )

>>...I don't know what Kronecker Based DGEMM you're running or if you're quoting the timing for a Kronecker Product...

This is not a regular Kronecker product; the algorithm is described at a weblink I gave you earlier ( see one of my previous posts ). The Kronecker Based algorithm for matrix multiplication is a really high-performance algorithm implemented in Fortran by another software developer ( Vineet Y - http://software.intel.com/en-us/user/798062 ).

>>... I suspect you're not doing a traditional matrix multiplication...

Once again, take a look at the document posted on the webpage I've mentioned; a description of the algorithm is available there.
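To make the two forms under discussion concrete, here is a minimal reference sketch of the ?GEMM update C = alpha*A*B + beta*C, without the transpose options of the real BLAS interface. The function name `gemm_reference` is hypothetical. Setting beta = 0 gives the plain product C = A * B, and alpha = beta = 1 gives the accumulating form C = C + A * B.

```python
def gemm_reference(alpha, A, B, beta, C):
    """Return alpha*A*B + beta*C as a new matrix (no transposes, no blocking)."""
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            s = 0.0
            for k in range(K):
                s += A[i][k] * B[k][j]
            out[i][j] = alpha * s + beta * C[i][j]
    return out

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

plain = gemm_reference(1.0, A, B, 0.0, C)        # C = A * B
accumulated = gemm_reference(1.0, A, B, 1.0, C)  # C = C + A * B

print(plain)        # [[19.0, 22.0], [43.0, 50.0]]
print(accumulated)  # [[20.0, 23.0], [44.0, 51.0]]
```

The scaling by alpha and beta touches each of the M * N output entries once, so for large K it is a small addition to the roughly 2 * M * N * K operations of the product itself.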

perfwise · ‎07-10-2013

Sergey... the Kronecker algorithm you point to says one of the matrices needs to be represented as a Kronecker product of 2 smaller matrices. While that may be applicable in some cases, it is not generally applicable.
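For readers unfamiliar with the term, a Kronecker product can be sketched as follows. This shows only the standard definition, not the Fortran algorithm referenced in the thread; `kron` is a hypothetical helper name. A rough counting argument illustrates the restriction: a general 4x4 matrix has 16 independent entries, while a product of two 2x2 factors is determined by only 8 numbers, so most matrices cannot be written in this form.

```python
def kron(A, B):
    """Kronecker product: each entry a_ij of A is replaced by the block a_ij * B."""
    p, q = len(B), len(B[0])
    rows, cols = len(A) * p, len(A[0]) * q
    return [[A[i // p][j // q] * B[i % p][j % q] for j in range(cols)]
            for i in range(rows)]

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]

for row in kron(A, B):
    print(row)
# [0, 1, 0, 2]
# [1, 0, 2, 0]
# [0, 3, 0, 4]
# [3, 0, 4, 0]
```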

SergeyKostrov · ‎07-10-2013

>>?GEMM does more multiplications and additions by design:
>>
>>C = alpha*A*B + beta*C

Would I consider that as a generic case? No.

Have we reached the bottom of the ocean? Yes.

levicki · ‎07-26-2013

Sergey Kostrov wrote:

>>>>Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.
>>
>>Igor, I've used Linpack and these numbers are more consistent with Intel's numbers

Igor, did you get the 116 GFlops number from some website ( 1st ) or after real testing on a Haswell system ( 2nd )? In the 2nd case, how many cores were used during the test?

No, I did not get the result off the web; I ran the test myself using LinX AVX.

The number of cores is 4 (Haswell 4770K with HTT disabled).

perfwise · ‎07-29-2013

I just ran my SB/IV DGEMM and measured 98.7 GFLOPs @ 3.4 GHz. If you scale it to 4.0 GHz, I get the same performance you quoted: 116 GFLOPs. Just another data point, Sergey.

Perfwise
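The scaling in the post above is a simple proportion. A minimal sketch follows, assuming throughput scales linearly with core clock, which ignores memory bandwidth and other limits; `scale_gflops` is a hypothetical helper name.

```python
def scale_gflops(measured_gflops, measured_ghz, target_ghz):
    """Scale a measured rate to another clock, assuming linear scaling with frequency."""
    return measured_gflops * target_ghz / measured_ghz

scaled = scale_gflops(98.7, 3.4, 4.0)
print(f"{scaled:.1f} GFLOPS at 4.0 GHz")  # 116.1 GFLOPS at 4.0 GHz
```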

Abhishek_J_ · ‎08-18-2016

Where can I find the GFLOPS figures for Haswell (Xeon E5-2697 v3)?
I had a look at the numbers in http://www.intel.com/content/dam/support/us/en/documents/processors/xeon/sb/xeon_E5-2600.pdf , but it does not list the E5-2697.