Intel® ISA Extensions

Haswell GFLOPS

caosun
New Contributor I
14,646 Views

Hi Intel Experts:

    I cannot find GFLOPS figures for the latest Intel Haswell CPUs. Could you please point me to them?

    I want to understand the performance difference between Haswell and Ivy Bridge, for example between the i7-4700HQ and the i7-3630QM. From the Intel website, I know that the i7-3630QM is rated at 76.8 GFLOPS (base). Could you please tell me the figure for the i7-4700HQ?

    I found the following information on the internet:

        Intel Sandy Bridge and Ivy Bridge have the following floating-point performance: 16 SP FLOPS/cycle --> one 8-wide AVX addition and one 8-wide AVX multiplication.

        Intel Haswell has the following floating-point performance: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions.

    I have two questions here:

    1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) x 4 (quad-core) x 2.4 GHz (clock) = 153.6 GFLOPS = 76.8 x 2. Does this mean that one operation counts as a combined addition and multiplication?

    2. Does Haswell have two FMA units?

    Thank you very much for any comments.

Best Regards,

Sun Cao

0 Kudos
72 Replies
SergeyKostrov
Valued Contributor II
6,607 Views
>>...Intel SandyBridge and Ivy-Bridge have the following floating-point performance: 16-SP FLOPS/cycle --> 8-wide AVX addition and 8-wide AVX multiplication...

If you have Haswell and Ivy Bridge systems, you can easily evaluate their real performance using the Vec_samples.zip sample from Intel Parallel Studio XE 2013.
0 Kudos
caosun
New Contributor I
6,607 Views

Hi Sergey:

    I do not have a Haswell system now.

    Even if I had one, it would be very helpful if Intel could provide more information.

Best Regards,

Sun Cao

0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
>>...Does Haswell have TWO FMA?..

There are 6 different groups of FMA instructions (60 instructions in total). Please take a look at: software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available
0 Kudos
Bernard
Valued Contributor I
6,607 Views

The Haswell execution engine has two ports (Port 0 and Port 1) that can each execute FMA instructions, one FMA per port per cycle, so the peak FLOPS per cycle is doubled.

On Haswell, one FMA instruction combines a multiplication and an addition. On the previous architecture, the same work needs a separate multiply and a separate add, which can tie up two ports when both execute at the same time.
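
A minimal sketch of the per-cycle arithmetic described above, assuming 256-bit vectors and both FMA ports busy every cycle. The core count and base clock used for the i7-4700HQ are assumptions (neither is stated in this thread), and the result is a theoretical peak, not a measured number.

```python
VECTOR_BITS = 256          # AVX/AVX2 register width
FMA_PORTS = 2              # Port 0 and Port 1, as described above
FLOPS_PER_FMA_ELEMENT = 2  # one multiply plus one add per element


def haswell_flops_per_cycle(element_bits: int) -> int:
    """Peak FLOPs per cycle per core when both FMA ports are busy."""
    elements_per_vector = VECTOR_BITS // element_bits
    return FMA_PORTS * elements_per_vector * FLOPS_PER_FMA_ELEMENT


# Assumed values for the i7-4700HQ; neither is given in this thread.
ASSUMED_CORES = 4
ASSUMED_BASE_GHZ = 2.4

sp_per_cycle = haswell_flops_per_cycle(32)  # single precision
dp_per_cycle = haswell_flops_per_cycle(64)  # double precision

for name, per_cycle in (("SP", sp_per_cycle), ("DP", dp_per_cycle)):
    peak = per_cycle * ASSUMED_CORES * ASSUMED_BASE_GHZ
    print(f"{name}: {per_cycle} FLOPs/cycle/core, {peak:.1f} GFLOPS peak")
```

Under these assumptions the sketch gives 32 SP and 16 DP FLOPs per cycle per core, which matches the 32 SP FLOPS/cycle figure quoted in the original question.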

0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
>>I do not have Haswell systems now.
>>
>>Even I have it, it will be very helpful if Intel could provide me more information...

I agree with that. As soon as you have a Haswell system, you can do a very quick performance evaluation with Vec_samples.zip from the ..\Composer XE\Samples\en_US\C++ folder (on a Windows platform).

Here are some additional technical details:

Compiler options: /O3 /Qstd=c99 /Qrestrict /Qipo
...
#define ALIGNED
#define NOALIAS
#define NOFUNCCALL // Note: Inlining
...

[ Test 1 - No Vectorization & No Inlining & No IPO & /O2 are used - Release ]
ROW:256 COL: 256
Execution time is 12.750 seconds
GigaFlops = 0.673720
Sum of result = 1279224.000000

[ Test 2 - Vectorization & Alignment & Inlining & IPO & /O3 are used - Release ]
ROW:256 COL: 256
Execution time is 4.734 seconds
GigaFlops = 1.814519
Sum of result = 1279224.000000

As you can see, Test 2 is ~2.7 times faster than Test 1.
0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
>>>>...i7-3630QM's GFlops is 76.8 (Base)...
>>
>>GigaFlops = 1.814519

By the way, the two numbers I gave you are for a Pentium 4, and you can see that the i7-3630QM is ~42x faster when processing is done using all cores. Let me know if you're interested in seeing numbers for an Ivy Bridge system.
0 Kudos
Bernard
Valued Contributor I
6,607 Views

>>>By the way, two numbers I gave you are for Pentium 4 and you can see that i7-3630QM is ~42x faster when processing is done using all cores.>>>

Are those results obtained from testing Vec_samples?

AFAIK, the Pentium 4 cannot execute an fadd and an fmul at the same time. A Haswell core can schedule one FMA (two floating-point operations) per thread for execution, which is a tremendous improvement in raw processing power compared to the Pentium 4.

0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
>>Are those results obtained from testing Vec_samples?

Yes, and you can take a look at it because the project is in the Samples folder.
0 Kudos
Bernard
Valued Contributor I
6,607 Views

Thanks

0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
>>...From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base)...

Sun Cao,

I couldn't find any information about GFLOPS on ark.intel.com. Where did you find that number?
0 Kudos
Bernard
Valued Contributor I
6,607 Views

Actually, on Ivy Bridge you have one 256-bit-wide fadd per cycle and one 256-bit-wide fmul per cycle, and each can be either SP (8 flops) or DP (4 flops). For DP that is 8 flops per cycle per core; multiplied by 4 cores and by the 2.4 GHz clock frequency, this gives 76.8 GFLOPS.
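
The calculation above can be sketched as follows. This is only an illustration of the theoretical-peak formula (FLOPs per cycle per core x cores x clock); the function and constant names are made up for this example. It shows that 76.8 corresponds to double precision and that the 153.6 from the original question is the single-precision figure.

```python
AVX_BITS = 256  # width of one AVX register


def peak_gflops(flops_per_cycle: int, cores: int, clock_ghz: float) -> float:
    """Theoretical peak: FLOPs per cycle per core x cores x clock in GHz."""
    return flops_per_cycle * cores * clock_ghz


def ivy_bridge_flops_per_cycle(element_bits: int) -> int:
    """One AVX add plus one AVX multiply can issue per cycle per core."""
    elements_per_vector = AVX_BITS // element_bits
    return 2 * elements_per_vector  # add unit + multiply unit


# i7-3630QM values used in this thread: 4 cores, 2.4 GHz base clock.
sp = peak_gflops(ivy_bridge_flops_per_cycle(32), cores=4, clock_ghz=2.4)
dp = peak_gflops(ivy_bridge_flops_per_cycle(64), cores=4, clock_ghz=2.4)

print(f"SP peak: {sp:.1f} GFLOPS")  # 16 FLOPs/cycle/core
print(f"DP peak: {dp:.1f} GFLOPS")  # 8 FLOPs/cycle/core
```

So the factor of 2 between 153.6 and 76.8 comes from SP versus DP element width, not from counting an add and a multiply as one operation.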

0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
>>>>...From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base)...
>>
>>Sun Cao,
>>
>>I couldn't find information about GFlops on ark.intel.com and my question is where did you find that number?

This is how it looks in reality:

[ Test 1 on a system with Pentium 4 ]

[ SSE2 - 32-bit Intel C++ compiler options - 1 CPU used ]

Note: For all test cases /O3 /QaxSSE2 /Qstd=c99 options are used

GigaFlops = 1.808407 -
GigaFlops = 1.814136 - /Qrestrict /Qansi-alias
GigaFlops = 1.844917 - /Qrestrict /Qansi-alias /Qipo
GigaFlops = 1.851279 - /Qrestrict /Qansi-alias /Qipo /Qunroll=4
GigaFlops = 1.889559 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8
GigaFlops = 2.147484 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 (*)
GigaFlops = 1.814519 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3
GigaFlops = 1.929022 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3 /Qopt-prefetch:4
GigaFlops = 0.628287 - /Qrestrict /Qansi-alias /Qparallel
GigaFlops = 0.628333 - /Qrestrict /Qansi-alias /Qipo /Qparallel

(*) - Best result
0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
[ Test 2 on a system with Ivy Bridge ]

[ AVX - 64-bit Intel C++ compiler options - 1 CPU used ]

Note: For all test cases /O3 /QaxAVX /Qstd=c99 options are used

GigaFlops = 11.228673 -
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias
GigaFlops = 11.243370 - /Qrestrict /Qansi-alias /Qipo
GigaFlops = 9.326748 - /Qrestrict /Qansi-alias /Qipo /Qunroll=4
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8
GigaFlops = 11.243370 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 (*)
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3 /Qopt-prefetch:4

[ AVX - 64-bit Intel C++ compiler options - 8 CPUs used ]

GigaFlops = 60.333168 - /Qrestrict /Qansi-alias /Qparallel (*)
GigaFlops = 60.333168 - /Qrestrict /Qansi-alias /Qipo /Qparallel (*)

Note: 60.333168 = 7.541646 * 8

(*) - Best result

As you can see, my number is ~21% lower than Intel's number, and this is because our test cases are different. I don't think we will know how the 76.8 number was measured unless Intel releases the source code or tells everybody that some open-source test was used.
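
For reference, the ~21% gap can be reproduced from the numbers in this post. This is just a sketch of the arithmetic; the variable names are illustrative, and 76.8 is taken as the figure quoted from Intel's page rather than something measured here.

```python
measured_per_cpu = 7.541646  # GFLOPS per logical CPU in the 8-CPU run
cpus_used = 8
quoted_gflops = 76.8         # figure quoted from Intel for the i7-3630QM

measured_total = measured_per_cpu * cpus_used
fraction_of_quoted = measured_total / quoted_gflops
shortfall_percent = (1.0 - fraction_of_quoted) * 100.0

print(f"Measured total: {measured_total:.6f} GFLOPS")
print(f"Fraction of quoted figure: {fraction_of_quoted:.3f}")
print(f"Shortfall: {shortfall_percent:.1f}%")
```

The measured total comes out at roughly 79% of the quoted figure, which is where the ~21% difference comes from.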
0 Kudos
caosun
New Contributor I
14,397 Views

Hi Sergey:

    You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm

0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
>>...You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm

Hi,

Thank you, I'll take a look.
0 Kudos
Bernard
Valued Contributor I
6,607 Views

>>>As you can see my number is ~21% lower that Intel's number and this is because our test cases are different. I don't think we will know how 76.8 number was measured unless Intel releases source codes, or informs everybody that some Open Source test was used.>>>

It could be the theoretical peak performance. A real application will fall short of this figure because of memory stalls and dependencies between instructions.

0 Kudos
levicki
Valued Contributor I
6,607 Views

A Haswell running at 4 GHz reaches ~116 GFLOPS here in the Intel-optimized LINPACK from MKL.

0 Kudos
SergeyKostrov
Valued Contributor II
6,607 Views
>>...Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL...

Thanks for the tip regarding Linpack. I did a verification using an older version of Linpack, and the numbers for the Pentium 4 are 4x (!) lower:

...
Mflops
580.59
532.56
578.32
587.83
532.69
Average 562.40
...

That is 0.562 GFLOPS, and it was just a quick verification of my numbers.
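
The average and the MFLOPS-to-GFLOPS conversion above can be checked with a few lines. This is a small sketch using only the five Linpack readings listed in this post; the variable names are illustrative.

```python
from statistics import mean

# The five Mflops readings reported by the older Linpack run on Pentium 4.
mflops_runs = [580.59, 532.56, 578.32, 587.83, 532.69]

average_mflops = mean(mflops_runs)
average_gflops = average_mflops / 1000.0  # 1 GFLOPS = 1000 MFLOPS

print(f"Average: {average_mflops:.2f} MFLOPS = {average_gflops:.3f} GFLOPS")
```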
0 Kudos
Bernard
Valued Contributor I
6,607 Views

>>...Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL..>>>

Haswell can pose a challenge for low end GPUs in terms of DP Gflops.

0 Kudos
SergeyKostrov
Valued Contributor II
5,792 Views
>>...Haswell can pose a challenge for low end GPUs in terms of DP Gflops...

What challenge? And why should it be a concern regarding GPUs? I really didn't understand what you wanted to say.

Personally, I'm trying to evaluate performance differences between 3 major lines of CPUs: Pentium 4, Ivy Bridge and Haswell.
0 Kudos