- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Intel Experts:
I cannot find the latest Intel Haswell CPU GFlops, could you please let me know that?
I want to understand the performance difference between Haswell and Ivy-bridge, for example, i7-4700HQ and i7-3630QM. From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base). Could you please let me know that of i7-4700HQ?
I get some information from internet that:
Intel SandyBridge and Ivy-Bridge have the following floating-point performance: 16-SP FLOPS/cycle --> 8-wide AVX addition and 8-wide AVX multiplication.
Intel Haswell have the following floating-point performance: 32-SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions
I have two questions here:
1. Take i7-3632QM as an example: 16 (SP FLOPS/cycle) X 4 (Quad-core) X 2.4G (Clock) = 153.6 GFLOPS = 76.8 X 2. Does it mean that one operation is a combined addition and multiplication operation?
2. Does Haswell have TWO FMA?
Thank you very much for any comments.
Best Regards,
Sun Cao
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Sergey:
You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm
Link Copied
- « Previous
- Next »
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey.. in DGEMM.. you are performing the matrix computation..C = C + A x B. You didn't incude the addition of C. BLAS exists for a purpose to standardize Linear Algebra operations and that is my focus. So if you measured DGEMM Iin MKL it will be 1/2 as fast as DGEMM. HPL uses DGEMM.. and the title of this thread is Haswell GFLOPs. Is your Kroneker routine doing what DGEMM does... explicitly. I googled it but found that the Kroneker product is not dgemm
I must confess I don't understand what you are achieving in your study. My focus is purely in understanding what the DGEMM performance of Haswell is.. with whatever algorithm you use, so long as it does the matrix computation C = C + A x B.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey,
The title of this post was Haswell GFLOPs. My interest is in "standardized BLAS routines" which drive LAPACK and many other high-performance applications. DGEMM does the matrix operation of C = C + A * B. When you update C, you have M * N addition operations which yields the formula I told you earlier which is 2 * M * K * N in the generic sense. That's the FLOP count for a traditional matrix mulitplication algorithm, and it's how the industry measures FLOPs. Now.. running SGEMM is completely not comparable to running DGEMM when comparing the time to do arithmetic, so it's just not comparable at all. I don't know what Kroneker Based DGEMM you're running or if you're quoting the timing for a Kroneker Product, which isn't DGEMM. DGEMM runs HPL which is what the scientific community uses to measure GFLOPs. So my recommendations to you are to standardize the problem you're running. Are all these results at the same precision and the same operation. MATMUL is doing what DGEMM is (close enough) and so is MKL (if you were running DGEMM rather than SGEMM). The other results you quote, if they're not DGEMM then they're not comparable to my results. It's just common sense. If your Kroneker operation is DGEMM, then you've got something interesting, but I suspect you're not doing a traditional matrix mulitplication and thus it's not a 1:1 correspondence and it's less interesting to me.
Perfwise
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey... the Kroneker algorithm you point to says one of the matrices needs to be represented as a Kroneker product of 2 smaller matrices. While that may be applicable in some cases it is not generally applicable.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey Kostrov wrote:
>>>>Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.
>>
>>Igor, I've used Linpack and these numbers are more consistent with Intel's numbersIgor, Did you get 116 GFlops number from some website ( 1st ) or after real testing on a Haswell system ( 2nd )? In the 2nd case How many cores were used during the test?
No I did not get the result off the web, I run the test myself using LinX AVX.
Number of cores is 4 (Haswell 4770K with HTT disabled).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I just ran my SB/IV dgemm and I measured 98.7 GFLOPs @ 3.4 GHz. If you scale it to 4.0 GHz then I get the same performance you quoted.. 116 GFOPs. Just another data point Sergey..
Perfwise
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Where can I find the GFlops for Haswell , (Xeon E5 2697 v3)?
I had a look at the numbers http://www.intel.com/content/dam/support/us/en/documents/processors/xeon/sb/xeon_E5-2600.pdf , but it does not show for E5-2697
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page
- « Previous
- Next »