Solved: Haswell GFLOPS

caosun · ‎06-26-2013

Hi Intel Experts:

I cannot find GFLOPS figures for the latest Intel Haswell CPUs. Could you please let me know where to find them?

I want to understand the performance difference between Haswell and Ivy Bridge, for example between the i7-4700HQ and the i7-3630QM. From the Intel website, I know that the i7-3630QM's GFLOPS rating is 76.8 (Base). Could you please tell me the figure for the i7-4700HQ?

I found the following information on the internet:

Intel Sandy Bridge and Ivy Bridge have the following floating-point performance: 16 SP FLOPS/cycle --> one 8-wide AVX addition and one 8-wide AVX multiplication.

Intel Haswell has the following floating-point performance: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions.

I have two questions here:

1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) x 4 (quad-core) x 2.4 GHz (clock) = 153.6 GFLOPS = 76.8 x 2. Does that mean that one operation is counted as a combined addition and multiplication?

2. Does Haswell have TWO FMA units?

Thank you very much for any comments.

Best Regards,

Sun Cao
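
The arithmetic in question 1 can be sketched as follows. This is only a rough illustration of the peak formula (FLOPS per cycle per core x cores x clock); the function name `peak_gflops` is made up for this sketch, and the core count and 2.4 GHz base clock used for the i7-4700HQ are assumptions that are not stated in the thread.

```python
def peak_gflops(flops_per_cycle: float, cores: int, clock_ghz: float) -> float:
    """Theoretical peak: FLOPS per cycle per core * number of cores * clock in GHz."""
    return flops_per_cycle * cores * clock_ghz


# Ivy Bridge i7-3630QM: 16 SP FLOPS/cycle, 4 cores, 2.4 GHz (figures from the post).
ivy_sp = peak_gflops(16, 4, 2.4)

# Haswell: 32 SP FLOPS/cycle (from the post). The 4 cores and 2.4 GHz base clock
# for the i7-4700HQ are ASSUMPTIONS made for this sketch.
haswell_sp = peak_gflops(32, 4, 2.4)

print(f"Ivy Bridge SP peak: {ivy_sp:.1f} GFLOPS")
print(f"Haswell SP peak:    {haswell_sp:.1f} GFLOPS")
print(f"Ivy Bridge SP peak / published 76.8: {ivy_sp / 76.8:.1f}x")
```

Under these assumptions the Haswell peak is simply twice the Ivy Bridge peak at the same clock, because only the FLOPS-per-cycle term changes.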

SergeyKostrov · ‎06-27-2013

>>...Intel SandyBridge and Ivy-Bridge have the following floating-point performance: 16-SP FLOPS/cycle --> 8-wide AVX addition and 8-wide AVX multiplication...

If you have Haswell and Ivy Bridge systems, you can easily evaluate their real performance with the Vec_samples.zip sample from Intel Parallel Studio XE 2013.

caosun · ‎06-27-2013

Hi Sergey:

I do not have Haswell systems now.

Even if I had one, it would be very helpful if Intel could provide me with more information.

Best Regards,

Sun Cao

SergeyKostrov · ‎06-27-2013

>>...Does Haswell have TWO FMA?..

There are 6 different groups of FMA instructions (60 instructions in total). Please take a look at: software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available

Bernard · ‎06-28-2013

The Haswell execution engine has two ports (Port0 and Port1) that can each execute FMA instructions, one FMA per port, so the peak FLOPS per cycle is doubled.

On Haswell, one FMA operation combines a multiplication and an addition. On the previous architecture, the same pair of operations could tie up two ports when executing at the same time.
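
The port-based counting described above can be written down as a small model. This is a hedged sketch, not a description of the real scheduler: the `FpPort` class and its fields are invented for illustration, and the only inputs are the figures already given in this thread (8 SP lanes per AVX instruction, one add plus one multiply per cycle on Ivy Bridge, two FMAs per cycle on Haswell).

```python
from dataclasses import dataclass


@dataclass
class FpPort:
    """Hypothetical model of one floating-point execution port."""
    name: str
    lanes: int         # SP elements processed by one 256-bit vector instruction
    ops_per_lane: int  # 1 for a plain add or multiply, 2 for a fused multiply-add


def flops_per_cycle(ports: list[FpPort]) -> int:
    """Sum the floating-point operations all ports can retire in one cycle."""
    return sum(p.lanes * p.ops_per_lane for p in ports)


# Ivy Bridge: one vector multiply and one vector add per cycle.
ivy_bridge = [FpPort("multiply", 8, 1), FpPort("add", 8, 1)]

# Haswell: Port0 and Port1 can each issue one FMA per cycle.
haswell = [FpPort("Port0 FMA", 8, 2), FpPort("Port1 FMA", 8, 2)]

print("Ivy Bridge SP FLOPS/cycle:", flops_per_cycle(ivy_bridge))
print("Haswell SP FLOPS/cycle:   ", flops_per_cycle(haswell))
```

The model only reaches the Haswell figure when every instruction is an FMA; code made of separate adds and multiplies would count one operation per lane on each port.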

SergeyKostrov · ‎06-29-2013

>>I do not have Haswell systems now.
>>
>>Even I have it, it will be very helpful if Intel could provide me more information...

I agree with that. As soon as you have a Haswell system, you can do a very quick performance evaluation with Vec_samples.zip from the ..\Composer XE\Samples\en_US\C++ folder (on a Windows platform). Here are some additional technical details:

Compiler options: /O3 /Qstd=c99 /Qrestrict /Qipo
...
#define ALIGNED
#define NOALIAS
#define NOFUNCCALL // Note: Inlining
...

[ Test 1 - No Vectorization & No Inlining & No IPO & /O2 are used - Release ]
ROW:256 COL: 256
Execution time is 12.750 seconds
GigaFlops = 0.673720
Sum of result = 1279224.000000

[ Test 2 - Vectorization & Alignment & Inlining & IPO & /O3 are used - Release ]
ROW:256 COL: 256
Execution time is 4.734 seconds
GigaFlops = 1.814519
Sum of result = 1279224.000000

As you can see, Test 2 is ~2.7 times faster than Test 1.
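
The ~2.7x figure can be cross-checked from the two test logs. The sketch below only uses the times and GigaFlops values reported in this post; the dictionary layout and variable names are made up for illustration.

```python
# Values copied from the Test 1 and Test 2 logs in the post.
test1 = {"seconds": 12.750, "gflops": 0.673720}
test2 = {"seconds": 4.734, "gflops": 1.814519}

# The speedup can be computed from run time or from throughput.
speedup_by_time = test1["seconds"] / test2["seconds"]
speedup_by_rate = test2["gflops"] / test1["gflops"]

# Rate * time gives the amount of work (in billions of floating-point operations).
# Both tests should report the same work if they ran the same problem.
work1 = test1["gflops"] * test1["seconds"]
work2 = test2["gflops"] * test2["seconds"]

print(f"Speedup by time: {speedup_by_time:.2f}x")
print(f"Speedup by rate: {speedup_by_rate:.2f}x")
print(f"Work in Test 1: {work1:.3f} GFLOP, Test 2: {work2:.3f} GFLOP")
```

Both ratios agree, and the implied amount of work is the same in the two runs, which suggests the GigaFlops values were derived from a fixed operation count divided by the execution time.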

SergeyKostrov · ‎06-29-2013

>>>>...i7-3630QM's GFlops is 76.8 (Base)...
>>
>>GigaFlops = 1.814519

By the way, the two numbers I gave you are for a Pentium 4, and you can see that the i7-3630QM is ~42x faster when processing is done using all cores. Let me know if you're interested in seeing numbers for an Ivy Bridge system.

Bernard · ‎06-30-2013

>>>By the way, two numbers I gave you are for Pentium 4 and you can see that i7-3630QM is ~42x faster when processing is done using all cores.>>>

Are those results obtained from testing Vec_samples?

Afaik, the Pentium 4 cannot execute an fadd and an fmul at the same time. A Haswell core is able to schedule one FMA (two floating-point operations) for execution per thread, which is a tremendous improvement in raw processing power compared to the Pentium 4.

SergeyKostrov · ‎06-30-2013

>>Are those results obtained from testing Vec_samples?

Yes, and you can take a look at it yourself because the project is in the Samples folder.

Bernard · ‎07-01-2013

Thanks

SergeyKostrov · ‎07-01-2013

>>...From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base)...

Sun Cao, I couldn't find any information about GFlops on ark.intel.com. Where did you find that number?

Bernard · ‎07-01-2013

Actually, on Ivy Bridge you have one vector fadd per cycle and one vector fmul per cycle, and each can be either SP (8 flops) or DP (4 flops). For DP that is 8 flops per cycle, multiplied by 4 cores and by the 2.4 GHz clock frequency = 76.8 Gflops.
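
A minimal sketch of this explanation, assuming 256-bit AVX registers and exactly one vector add plus one vector multiply per cycle as stated above. The function name and defaults are invented for illustration; the 4 cores and 2.4 GHz come from the thread.

```python
VECTOR_BITS = 256  # width of an AVX register


def ivy_bridge_peak_gflops(element_bits: int, cores: int = 4, clock_ghz: float = 2.4) -> float:
    """Peak GFLOPS for one vector add + one vector multiply per cycle per core."""
    lanes = VECTOR_BITS // element_bits  # 8 lanes for SP (32-bit), 4 for DP (64-bit)
    flops_per_cycle = 2 * lanes          # add and multiply issue in the same cycle
    return flops_per_cycle * cores * clock_ghz


sp_peak = ivy_bridge_peak_gflops(32)
dp_peak = ivy_bridge_peak_gflops(64)

print(f"SP peak: {sp_peak:.1f} GFLOPS")
print(f"DP peak: {dp_peak:.1f} GFLOPS")
```

On this reading, the published 76.8 is the double-precision peak, and the 153.6 computed earlier in the thread is the single-precision peak of the same processor.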

SergeyKostrov · ‎07-01-2013

>>>>...From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base)...
>>
>>Sun Cao,
>>
>>I couldn't find information about GFlops on ark.intel.com and my question is where did you find that number?

This is what it looks like in reality:

[ Test 1 on a system with Pentium 4 ]
[ SSE2 - 32-bit Intel C++ compiler options - 1 CPU used ]
Note: For all test cases /O3 /QaxSSE2 /Qstd=c99 options are used

GigaFlops = 1.808407 -
GigaFlops = 1.814136 - /Qrestrict /Qansi-alias
GigaFlops = 1.844917 - /Qrestrict /Qansi-alias /Qipo
GigaFlops = 1.851279 - /Qrestrict /Qansi-alias /Qipo /Qunroll=4
GigaFlops = 1.889559 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8
GigaFlops = 2.147484 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 (*)
GigaFlops = 1.814519 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3
GigaFlops = 1.929022 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3 /Qopt-prefetch:4
GigaFlops = 0.628287 - /Qrestrict /Qansi-alias /Qparallel
GigaFlops = 0.628333 - /Qrestrict /Qansi-alias /Qipo /Qparallel

(*) - Best result

SergeyKostrov · ‎07-01-2013

[ Test 2 on a system with Ivy Bridge ]
[ AVX - 64-bit Intel C++ compiler options - 1 CPU used ]
Note: For all test cases /O3 /QaxAVX /Qstd=c99 options are used

GigaFlops = 11.228673 -
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias
GigaFlops = 11.243370 - /Qrestrict /Qansi-alias /Qipo
GigaFlops = 9.326748 - /Qrestrict /Qansi-alias /Qipo /Qunroll=4
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8
GigaFlops = 11.243370 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 (*)
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3 /Qopt-prefetch:4

[ AVX - 64-bit Intel C++ compiler options - 8 CPUs used ]

GigaFlops = 60.333168 - /Qrestrict /Qansi-alias /Qparallel (*)
GigaFlops = 60.333168 - /Qrestrict /Qansi-alias /Qipo /Qparallel (*)
Note: 60.333168 = 7.541646 * 8

(*) - Best result

As you can see, my number is ~21% lower than Intel's number, and this is because our test cases are different. I don't think we will know how the 76.8 number was measured unless Intel releases the source code or tells everybody that some open-source test was used.
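
The ~21% gap and the 8-CPU scaling can be checked with a few lines. This sketch only uses numbers from the post above and the published 76.8 figure; the variable names are made up for illustration.

```python
published_gflops = 76.8         # Intel's figure for the i7-3630QM
measured_gflops = 60.333168     # best 8-CPU result from the post
per_thread_gflops = 7.541646    # per-CPU rate given in the post's note
single_cpu_best = 11.243370     # best 1-CPU result from the post
logical_cpus = 8

# How far the measured rate falls below the published figure.
shortfall_pct = (published_gflops - measured_gflops) / published_gflops * 100

# Consistency of the note "60.333168 = 7.541646 * 8".
total_from_threads = per_thread_gflops * logical_cpus

# How much faster 8 CPUs are than the best single-CPU run.
scaling = measured_gflops / single_cpu_best

print(f"Shortfall vs published: {shortfall_pct:.1f}%")
print(f"Per-thread rate * 8:    {total_from_threads:.6f} GFLOPS")
print(f"8-CPU vs 1-CPU scaling: {scaling:.2f}x")
```

The scaling is well below 8x, which would be expected if the 8 CPUs are 4 physical cores with Hyper-Threading sharing the same floating-point units; the thread does not confirm this.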

caosun · ‎07-01-2013

Hi Sergey:

You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm

SergeyKostrov · ‎07-01-2013

>>...You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm.

Hi, thank you. I'll take a look.

Bernard · ‎07-01-2013

>>>As you can see my number is ~21% lower that Intel's number and this is because our test cases are different. I don't think we will know how 76.8 number was measured unless Intel releases source codes, or informs everybody that some Open Source test was used.>>>

It could be the theoretical peak performance. A real application can lower this result by introducing memory stalls or instruction interdependencies.

levicki · ‎07-02-2013

Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.
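
For context, the Linpack figure can be compared against a theoretical double-precision peak. This is only a rough sketch under loud assumptions: the post does not say how many cores the Haswell system has, so 4 cores is assumed, and 16 DP FLOPS/cycle per core is derived from the two 8-wide SP FMAs mentioned earlier in the thread (4 DP lanes x 2 operations x 2 ports).

```python
measured_linpack_gflops = 116.0  # "~116GFlops" from the post
clock_ghz = 4.0                  # from the post

# ASSUMPTIONS, not stated in the post:
cores = 4                        # quad-core Haswell
dp_flops_per_cycle = 4 * 2 * 2   # 4 DP lanes * 2 ops per FMA * 2 FMA ports

dp_peak_gflops = dp_flops_per_cycle * cores * clock_ghz
efficiency = measured_linpack_gflops / dp_peak_gflops

print(f"Assumed DP peak:    {dp_peak_gflops:.0f} GFLOPS")
print(f"Linpack efficiency: {efficiency:.1%}")
```

If the real core count or sustained clock differs, the efficiency figure changes accordingly, so it should be read as an estimate only.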

SergeyKostrov · ‎07-02-2013

>>...Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL...

Thanks for the tip regarding Linpack. I did a verification using an older version of Linpack, and the numbers for the Pentium 4 are 4x (!) lower:

...
Mflops
580.59
532.56
578.32
587.83
532.69
Average 562.40
...

That is 0.562 Gflops, and it was just a quick verification of my numbers.
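
The average and the "4x lower" comparison can be reproduced from the listed runs. The sketch assumes the comparison is against the best Pentium 4 Vec_samples result posted earlier (2.147484 GigaFlops); the post does not say which number was meant.

```python
from statistics import mean

# Mflops values from the five Linpack runs in the post.
mflops_runs = [580.59, 532.56, 578.32, 587.83, 532.69]

avg_mflops = mean(mflops_runs)
avg_gflops = avg_mflops / 1000.0

# ASSUMPTION: compare against the best Pentium 4 Vec_samples result from the thread.
best_vec_samples_gflops = 2.147484
ratio = best_vec_samples_gflops / avg_gflops

print(f"Average: {avg_mflops:.2f} Mflops = {avg_gflops:.3f} Gflops")
print(f"Vec_samples best / Linpack average: {ratio:.2f}x")
```

The ratio comes out a little under 4, consistent with the rounded "4x" in the post.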

Bernard · ‎07-02-2013

>>...Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL..>>>

Haswell can pose a challenge for low end GPUs in terms of DP Gflops.

SergeyKostrov · ‎07-02-2013

>>...Haswell can pose a challenge for low end GPUs in terms of DP Gflops...

What challenge? And why should GPUs be a concern here? I really didn't understand what you wanted to say. Personally, I'm trying to evaluate the performance differences between 3 major lines of CPUs: Pentium 4, Ivy Bridge and Haswell.