Intel® ISA Extensions

Haswell GFLOPS

caosun
New Contributor I
9,056 Views

Hi Intel Experts:

    I cannot find GFLOPS figures for the latest Intel Haswell CPUs. Could you please tell me where to find them?

    I want to understand the performance difference between Haswell and Ivy Bridge, for example between the i7-4700HQ and the i7-3630QM. From the Intel website, I know that the i7-3630QM's GFLOPS is 76.8 (base). Could you please tell me the figure for the i7-4700HQ?

    I found the following information on the internet:

        Intel Sandy Bridge and Ivy Bridge have the following floating-point performance: 16 SP FLOPS/cycle --> one 8-wide AVX addition and one 8-wide AVX multiplication.

        Intel Haswell has the following floating-point performance: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions.

    I have two questions here:

    1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) x 4 (quad-core) x 2.4 GHz (clock) = 153.6 GFLOPS = 76.8 x 2. Does this mean that Intel counts a combined addition and multiplication as one operation?

    2. Does Haswell have TWO FMA units?
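
    For reference, the arithmetic in question 1 can be sketched as below. This is only an illustration: the helper name `peak_gflops` is made up, and the double-precision figure of 8 FLOPS/cycle is my assumption (a 256-bit AVX register holds 4 doubles instead of 8 floats, so one addition plus one multiplication gives 4 + 4).

```python
def peak_gflops(flops_per_cycle: int, cores: int, clock_ghz: float) -> float:
    """Theoretical peak = FLOPs per cycle per core x cores x clock in GHz."""
    return flops_per_cycle * cores * clock_ghz


# i7-3630QM: 4 cores at 2.4 GHz base clock (numbers taken from the post).
# 16 SP FLOPS/cycle = 8-wide add + 8-wide multiply.
sp_peak = peak_gflops(flops_per_cycle=16, cores=4, clock_ghz=2.4)

# ASSUMPTION: 8 DP FLOPS/cycle = 4-wide add + 4-wide multiply.
dp_peak = peak_gflops(flops_per_cycle=8, cores=4, clock_ghz=2.4)

print(f"SP peak: {sp_peak:.1f} GFLOPS")
print(f"DP peak: {dp_peak:.1f} GFLOPS")
```

    Under that assumption the double-precision peak comes out at the same value as the published 76.8, so the factor of 2 may simply be the SP/DP difference rather than a different way of counting operations.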

    Thank you very much for any comments.

Best Regards,

Sun Cao

0 Kudos
72 Replies
SergeyKostrov
Valued Contributor II
4,713 Views
>>...Intel SandyBridge and Ivy-Bridge have the following floating-point performance: 16-SP FLOPS/cycle --> 8-wide AVX addition and 8-wide AVX multiplication...

If you have Haswell and Ivy Bridge systems, you can easily evaluate their real performance with the Vec_samples.zip sample from Intel Parallel Studio XE 2013.
0 Kudos
caosun
New Contributor I
4,713 Views

Hi Sergey:

    I do not have a Haswell system at the moment.

    Even if I had one, it would be very helpful if Intel could provide more information.

Best Regards,

Sun Cao

0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
>>...Does Haswell have TWO FMA?..

There are 6 different groups of FMA instructions (60 instructions in total). Please take a look at: software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available
0 Kudos
Bernard
Valued Contributor I
4,713 Views

Haswell's execution engine has two ports (Port 0 and Port 1) that can each execute FMA instructions (one FMA per port per cycle), so the peak FLOPS per cycle is doubled.

On Haswell, one FMA instruction combines a multiplication and an addition. On the previous architecture, the same work needed a separate multiplication and a separate addition, which could tie up two ports when executing at the same time.
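
A rough way to see where the per-cycle numbers come from is units x lanes x FLOPs per lane. The sketch below assumes 256-bit vectors and counts an FMA as 2 FLOPs per lane; the function name `flops_per_cycle` and its parameters are invented for this illustration.

```python
VECTOR_BITS = 256  # width of an AVX / AVX2 register


def flops_per_cycle(units: int, element_bits: int, flops_per_lane: int) -> int:
    """Peak FLOPs per cycle for one core.

    units          -- vector units that can issue in the same cycle
    element_bits   -- 32 for single precision, 64 for double precision
    flops_per_lane -- 1 for a plain add or multiply, 2 for a fused multiply-add
    """
    lanes = VECTOR_BITS // element_bits
    return units * lanes * flops_per_lane


# Ivy Bridge: one add unit plus one multiply unit, 1 FLOP per lane each.
ivb_sp = flops_per_cycle(units=2, element_bits=32, flops_per_lane=1)  # 16
ivb_dp = flops_per_cycle(units=2, element_bits=64, flops_per_lane=1)  # 8

# Haswell: two FMA units, each lane does a multiply and an add.
hsw_sp = flops_per_cycle(units=2, element_bits=32, flops_per_lane=2)  # 32
hsw_dp = flops_per_cycle(units=2, element_bits=64, flops_per_lane=2)  # 16

print("Ivy Bridge SP/DP:", ivb_sp, ivb_dp)
print("Haswell    SP/DP:", hsw_sp, hsw_dp)
```

The doubling only shows up when the code can actually be expressed as fused multiply-adds; a loop made of pure additions would not benefit from it in this simple model.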

0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
>>I do not have Haswell systems now.
>>
>>Even I have it, it will be very helpful if Intel could provide me more information...

I agree with that. As soon as you have a Haswell system, you can do a very quick performance evaluation with Vec_samples.zip from the ..\Composer XE\Samples\en_US\C++ folder (on a Windows platform).

Here are some additional technical details:

Compiler options: /O3 /Qstd=c99 /Qrestrict /Qipo
...
#define ALIGNED
#define NOALIAS
#define NOFUNCCALL // Note: Inlining
...

[ Test 1 - No Vectorization & No Inlining & No IPO & /O2 are used - Release ]
ROW:256 COL: 256
Execution time is 12.750 seconds
GigaFlops = 0.673720
Sum of result = 1279224.000000

[ Test 2 - Vectorization & Alignment & Inlining & IPO & /O3 are used - Release ]
ROW:256 COL: 256
Execution time is 4.734 seconds
GigaFlops = 1.814519
Sum of result = 1279224.000000

As you can see, Test 2 is ~2.7 times faster than Test 1.
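
The ~2.7x figure can be checked from the two runs above, both from the elapsed times and from the reported GigaFlops values. A small sketch (the variable names are mine, the numbers are copied from the output above):

```python
# Test 1: no vectorization, /O2
t_scalar = 12.750          # seconds
gflops_scalar = 0.673720

# Test 2: vectorization, alignment, inlining, IPO, /O3
t_vector = 4.734           # seconds
gflops_vector = 1.814519

# Both runs do the same amount of work, so the two ratios should agree.
speedup_from_time = t_scalar / t_vector
speedup_from_rate = gflops_vector / gflops_scalar

print(f"Speedup from time:     {speedup_from_time:.2f}x")
print(f"Speedup from GigaFlops: {speedup_from_rate:.2f}x")
```

Both ratios round to 2.69, which is where the ~2.7 comes from.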
0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
>>>>...i7-3630QM's GFlops is 76.8 (Base)...
>>
>>GigaFlops = 1.814519

By the way, the two numbers I gave you are for a Pentium 4, and you can see that the i7-3630QM is ~42x faster when processing is done using all cores. Let me know if you're interested in seeing numbers for an Ivy Bridge system.
0 Kudos
Bernard
Valued Contributor I
4,713 Views

>>>By the way, two numbers I gave you are for Pentium 4 and you can see that i7-3630QM is ~42x faster when processing is done using all cores.>>>

Are those results obtained from testing Vec_samples?

AFAIK the Pentium 4 cannot execute an fadd and an fmul at the same time. A Haswell core can schedule one FMA (two floating-point operations) per thread for execution, which is a tremendous improvement in raw processing power compared to the Pentium 4.

0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
>>Are those results obtained from testing Vec_samples?

Yes, and you can take a look at it because the project is in the Samples folder.
0 Kudos
Bernard
Valued Contributor I
4,713 Views

Thanks

0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
>>...From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base)...

Sun Cao, I couldn't find any information about GFlops on ark.intel.com. Where did you find that number?
0 Kudos
Bernard
Valued Contributor I
4,713 Views

Actually, on Ivy Bridge you have one vector fadd per cycle and one vector fmul per cycle; each can be either SP (8 flops) or DP (4 flops). For DP that is 8 flops per cycle, which multiplied by 4 cores and by the 2.4 GHz clock frequency gives 76.8 GFLOPS.

0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
>>>>...From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base)...
>>
>>Sun Cao,
>>
>>I couldn't find information about GFlops on ark.intel.com and my question is where did you find that number?

This is how it looks in reality:

[ Test 1 on a system with Pentium 4 ]
[ SSE2 - 32-bit Intel C++ compiler options - 1 CPU used ]
Note: For all test cases /O3 /QaxSSE2 /Qstd=c99 options are used

GigaFlops = 1.808407 -
GigaFlops = 1.814136 - /Qrestrict /Qansi-alias
GigaFlops = 1.844917 - /Qrestrict /Qansi-alias /Qipo
GigaFlops = 1.851279 - /Qrestrict /Qansi-alias /Qipo /Qunroll=4
GigaFlops = 1.889559 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8
GigaFlops = 2.147484 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 (*)
GigaFlops = 1.814519 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3
GigaFlops = 1.929022 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3 /Qopt-prefetch:4
GigaFlops = 0.628287 - /Qrestrict /Qansi-alias /Qparallel
GigaFlops = 0.628333 - /Qrestrict /Qansi-alias /Qipo /Qparallel

(*) - Best result
0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
[ Test 2 on a system with Ivy Bridge ]
[ AVX - 64-bit Intel C++ compiler options - 1 CPU used ]
Note: For all test cases /O3 /QaxAVX /Qstd=c99 options are used

GigaFlops = 11.228673 -
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias
GigaFlops = 11.243370 - /Qrestrict /Qansi-alias /Qipo
GigaFlops = 9.326748 - /Qrestrict /Qansi-alias /Qipo /Qunroll=4
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8
GigaFlops = 11.243370 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 (*)
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3 /Qopt-prefetch:4

[ AVX - 64-bit Intel C++ compiler options - 8 CPUs used ]

GigaFlops = 60.333168 - /Qrestrict /Qansi-alias /Qparallel (*)
GigaFlops = 60.333168 - /Qrestrict /Qansi-alias /Qipo /Qparallel (*)

Note: 60.333168 = 7.541646 * 8

(*) - Best result

As you can see, my number is ~21% lower than Intel's number, and this is because our test cases are different. I don't think we will know how the 76.8 number was measured unless Intel releases the source code or tells everybody that some open-source test was used.
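
To make the ~21% comparison explicit, here is a small sketch using the numbers above. It treats 76.8 as the figure to compare against, without knowing how Intel obtained it, and the variable names are only for illustration.

```python
intel_gflops = 76.8        # figure published by Intel for the i7-3630QM
measured_gflops = 60.333168  # best result with 8 CPUs (/Qparallel)
per_cpu_gflops = 7.541646    # per-CPU rate from the note above
cpus = 8

# The total is simply the per-CPU rate times the number of CPUs used.
total_from_per_cpu = per_cpu_gflops * cpus

# Fraction of Intel's figure that the test reaches, and the shortfall.
efficiency = measured_gflops / intel_gflops
shortfall = 1.0 - efficiency

print(f"Total from per-CPU rate: {total_from_per_cpu:.6f} GFLOPS")
print(f"Reached {efficiency:.1%} of Intel's figure, {shortfall:.1%} lower")
```

Note that the per-CPU rate in the 8-CPU run (7.54) is lower than the best single-CPU result (11.24), so the total does not scale linearly with the number of CPUs in this test.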
0 Kudos
caosun
New Contributor I
8,807 Views

Hi Sergey:

    You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm

0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
>>...You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm.

Hi, thank you. I'll take a look.
0 Kudos
Bernard
Valued Contributor I
4,713 Views

>>>As you can see my number is ~21% lower that Intel's number and this is because our test cases are different. I don't think we will know how 76.8 number was measured unless Intel releases source codes, or informs everybody that some Open Source test was used.>>>

It could be the theoretical peak performance. A real application can fall short of it because of memory stalls or instruction interdependencies.

0 Kudos
levicki
Valued Contributor I
4,713 Views

Here, a Haswell running at 4 GHz reaches ~116 GFLOPS in the Intel-optimized Linpack from MKL.

0 Kudos
SergeyKostrov
Valued Contributor II
4,713 Views
>>...Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL...

Thanks for the tip regarding Linpack. I did a verification using an older version of Linpack, and the numbers for the Pentium 4 are 4x (!) lower:

...
Mflops: 580.59 532.56 578.32 587.83 532.69
Average: 562.40
...

That is 0.562 GFLOPS, and it was just a quick verification of my numbers.
0 Kudos
Bernard
Valued Contributor I
4,713 Views

>>...Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL..>>>

Haswell can pose a challenge for low end GPUs in terms of DP Gflops.

0 Kudos
SergeyKostrov
Valued Contributor II
3,898 Views
>>...Haswell can pose a challenge for low end GPUs in terms of DP Gflops...

What challenge? And why should it be a concern regarding GPUs? I really didn't understand what you wanted to say. Personally, I'm trying to evaluate performance differences between 3 major lines of CPUs: Pentium 4, Ivy Bridge and Haswell.
0 Kudos