Intel® ISA Extensions

Haswell GFLOPS

caosun
New Contributor I
6,822 Views

Hi Intel Experts:

    I cannot find GFLOPS figures for the latest Intel Haswell CPUs. Could you please point me to them?

    I want to understand the performance difference between Haswell and Ivy Bridge, for example between the i7-4700HQ and the i7-3630QM. From the Intel website, I know that the i7-3630QM's GFLOPS figure is 76.8 (base clock). Could you please let me know the figure for the i7-4700HQ?

    I found the following information on the internet:

        Intel Sandy Bridge and Ivy Bridge have the following floating-point performance: 16 SP FLOPS/cycle --> one 8-wide AVX addition and one 8-wide AVX multiplication.

        Intel Haswell has the following floating-point performance: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions.

    I have two questions here:

    1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) X 4 (quad-core) X 2.4 GHz (clock) = 153.6 GFLOPS = 76.8 X 2. Does this mean that Intel counts a combined addition and multiplication as one operation?

    2. Does Haswell have TWO FMA units?
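
The arithmetic in question 1 can be checked with a small sketch. It uses the 16 SP FLOPS/cycle figure quoted above and assumes that the double-precision rate is half of that, since a 256-bit AVX register holds 4 doubles instead of 8 floats. Under that assumption, the 76.8 figure would be the double-precision peak rather than a count of combined add-multiply operations. The function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(flops_per_cycle, cores, clock_ghz):
    """Theoretical peak = FLOPS per cycle per core x cores x clock in GHz."""
    return flops_per_cycle * cores * clock_ghz

# i7-3630QM: 4 cores at 2.4 GHz base clock (figures from the post above).
sp_peak = peak_gflops(16, 4, 2.4)  # 8-wide add + 8-wide mul per cycle
dp_peak = peak_gflops(8, 4, 2.4)   # assumption: 4-wide add + 4-wide mul per cycle

print(f"SP peak: {sp_peak:.1f} GFLOPS")
print(f"DP peak: {dp_peak:.1f} GFLOPS")
```

The single-precision result reproduces the 153.6 GFLOPS from question 1, and the double-precision result reproduces the 76.8 figure.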

    Thank you very much for any comments.

Best Regards,

Sun Cao

0 Kudos
1 Solution
caosun
New Contributor I
6,572 Views

Hi Sergey:

    You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm


0 Kudos
72 Replies
Bernard
Valued Contributor I
1,007 Views

>>>What challenge? And why should it be a concern regarding GPUs? I really didn't understand what you wanted to say.>>>

It was only a general comment.

I meant that, in terms of raw DP GFLOPS processing power, the Haswell microarchitecture is closing the gap with lower-end GPUs, so in the foreseeable future it can be used to perform software rendering.

0 Kudos
Bernard
Valued Contributor I
1,007 Views

Actually, in Turbo Mode at 3.9 GHz the theoretical peak performance expressed in DP GFLOPS is ~249 GFLOPS.
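
As a rough check of that figure, the per-cycle rate can be derived from the two 256-bit FMA units mentioned in the first post. The sketch assumes a quad-core part and counts each FMA lane as two floating-point operations; `flops_per_cycle` is an illustrative name, not something from Intel's documentation.

```python
def flops_per_cycle(fma_units, vector_bits, element_bits):
    """FLOPS per cycle per core, counting each FMA lane as multiply + add."""
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2

# Two 256-bit FMA units working on 64-bit doubles.
dp_per_cycle = flops_per_cycle(fma_units=2, vector_bits=256, element_bits=64)

cores = 4          # assumption: quad-core Haswell
turbo_ghz = 3.9    # clock quoted in the post above
dp_peak = dp_per_cycle * cores * turbo_ghz

print(f"{dp_per_cycle} DP FLOPS/cycle/core, {dp_peak:.1f} GFLOPS peak")
```

With these inputs the result is 16 DP FLOPS per cycle per core and 249.6 GFLOPS in total.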

0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
>>It was only general comment.
>>
>>I meant in terms of raw DP Gflops processing power Haswell microarchitecture is closing gap with lower
>>end GPU's so in foreseeable future it can be used to perform software rendering...

That sounds really interesting, but who is defining that foreseeable future, and who is going to use lower-end GPUs with Haswell systems?
0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
>>Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.

Igor, I've used Linpack and these numbers are more consistent with Intel's numbers:

...
Ivy Bridge - Performance Summary (GFlops) Average = 71.9007
...
Pentium 4 - Performance Summary (GFlops) Average = 1.9561
...

Ivy Bridge performance also closely matches what Caosun posted, that is 76.8 GFlops ( as far as I understood, this is Intel's number ).
0 Kudos
Bernard
Valued Contributor I
1,007 Views

>>>That sounds really interesting and who is defining that foreseeable future and who is going to use lower end GPUs with Haswell systems?>>>

I am talking about a raw performance comparison between Haswell and some lower to mid-range GPUs. Using the CPU for software rendering is already a reality.

http://www.inartis.com/products/kribi%203D%20Engine/Default.aspx

>>>who is defining that foreseeable future>>>

Probably Intel, by releasing designs with architecturally wider execution engines.

0 Kudos
Bernard
Valued Contributor I
1,007 Views

>>>Ivy Bridge - Performance Summary (GFlops) Average = 71.9007>>>

Close to theoretical peak.
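
To put a number on "close", the measured average can be divided by the theoretical peak. The sketch assumes that the 76.8 GFLOPS figure Caosun quoted applies to the Ivy Bridge system that produced the Linpack average, which the thread does not confirm.

```python
measured_gflops = 71.9007  # Linpack average quoted above
peak_gflops = 76.8         # assumption: the i7-3630QM figure applies to this system

efficiency = measured_gflops / peak_gflops
print(f"Linpack efficiency: {efficiency:.1%}")
```

Under that assumption the Linpack run reaches roughly 93.6% of the theoretical peak.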


0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
>>>>Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.
>>
>>Igor, I've used Linpack and these numbers are more consistent with Intel's numbers

Igor, did you get the 116 GFlops number from some website ( 1st ) or from real testing on a Haswell system ( 2nd )? In the 2nd case, how many cores were used during the test?
0 Kudos
bronxzv
New Contributor II
1,007 Views

Sergey Kostrov wrote:
Igor, Did you get 116 GFlops number from some website ( 1st ) or after real testing on a Haswell system ( 2nd )? In the 2nd case How many cores were used during the test?

in case you are interested I published a result of mine here: http://www.realworldtech.com/forum/?threadid=134512&curpostid=134594
I measured better than 93% efficiency with a workset entirely in the L1D and a compute:load:store ratio of 11:1:1

 

0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
>>>>...Igor, Did you get 116 GFlops number..
>>
>>...I measured 104.407 Gflops ( 112 Gflops peak )...

Results are consistent and the difference is ~3.45% ( it is acceptable ). My question is the same: How many cores were used during the test?
0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
>>...I published a result of mine here: http://www.realworldtech.com/forum/?threadid=134512&curpostid=134594

One more thing regarding the thread "Rumor mill: 512-bit AVX3 in Skylake": I don't consider it a rumor. A header file with some 512-bit stuff can be found in Intel Parallel Studio XE 2013 ( ..\Compiler\Include folder ), and I have known about it since December 2012.
0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
>>...A header file with some 512-bit-stuff could be found in Intel Parallel Studio XE 2013...

zmmintrin.h
0 Kudos
bronxzv
New Contributor II
1,007 Views

Sergey Kostrov wrote:
Results are consistent and the difference is ~3.45% ( it is acceptable ).

it's neither the same test nor the same test platform (even CPU frequencies look different) so IMHO there is no point to compare the results

Sergey Kostrov wrote:
My question is the same: How many cores were used during the test?

the test I reported above was with a single thread on a single core and only vfmadd213ps as compute instructions, I can't comment on the other test though
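
The single-core numbers quoted earlier in the thread (104.407 GFLOPS measured, 112 GFLOPS peak) can be cross-checked against the 32 SP FLOPS/cycle figure from the first post. The sketch below only restates that arithmetic. The clock it prints is inferred from the peak and is not a value stated in this thread.

```python
SP_FLOPS_PER_CYCLE = 32    # Haswell figure quoted in the first post
peak_gflops = 112.0        # single-core peak quoted earlier in the thread
measured_gflops = 104.407  # single-core result quoted earlier in the thread

# Clock frequency implied by the stated peak (an inference, not a reported value).
implied_clock_ghz = peak_gflops / SP_FLOPS_PER_CYCLE
efficiency = measured_gflops / peak_gflops

print(f"implied clock: {implied_clock_ghz:.2f} GHz")
print(f"efficiency:    {efficiency:.1%}")
```

The stated peak corresponds to one core at 3.5 GHz, and the measured value is about 93.2% of it, in line with the "better than 93%" efficiency mentioned above.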

0 Kudos
bronxzv
New Contributor II
1,007 Views

Sergey Kostrov wrote:
I don't consider it as a rumor. A header file with some 512-bit-stuff could be found in Intel Parallel Studio XE 2013 ( ..\Compiler\Include folder ) and I know about it since December 2012.

this header is for Xeon Phi targets

 

0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
>>...it's neither the same test nor the same test platform (even CPU frequencies look different) so IMHO there is
>>no point to compare the results...

What was the point of mentioning or posting these results? A simple test based on just one instruction, vfmadd213ps, cannot be considered a valid one, or a test that really evaluates the performance of a system. Additions are always faster than multiplications and everybody knows that. If you have a Haswell system then, as Igor recommended, Linpack from MKL could be used ( I've verified it on P4 and IB systems and it gives the right numbers ).

>>...this header is for Xeon Phi targets

It doesn't say anything about that at the beginning, and some time ago I asked Intel software engineers what it is for. Unfortunately, my question was not answered.
0 Kudos
bronxzv
New Contributor II
1,007 Views

Sergey Kostrov wrote:
What was the point of mentioning or posting these results?

well, this thread is named "Haswell GFLOPS" and this test of mine measures Haswell GFLOPS so I suppose it is at least somewhat relevant

Sergey Kostrov wrote:
A simple test based on just one instruction vfmadd213ps can not be considered as a valid one

I don't get what you mean, any test wanting to max out GFLOPS on Haswell will use only FMA instructions for computations

0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
>>...I don't get what you mean, any test wanting to max out GFLOPS on Haswell...

Run the Linpack benchmark utility from the MKL installation to verify your numbers. Post results as soon as it is done.
0 Kudos
bronxzv
New Contributor II
1,007 Views

Sergey Kostrov wrote:

>>...I don't get what you mean, any test wanting to max out GFLOPS on Haswell...

Run Linpack benchmark utility from MKL installation to verify your numbers. Post results as soon as it is done.

as already explained, the tests aren't comparable, so one can't be used to verify the other: mine has a higher compute:load:store ratio than LINPACK. I use an unrealistically high compute:load:store ratio of 11:1:1, as mentioned in my post at RWT; the goal was to come close to the theoretical 2x speedup of FMA over ADD+MUL
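
Why such a ratio helps can be illustrated with a very simplified throughput model. It assumes that a Haswell core can issue at most 2 FMAs, 2 loads and 1 store per cycle, and it ignores instruction latencies, front-end width and cache misses. The function names and the 1:1:1 comparison case are made up for illustration.

```python
# Assumed per-core, per-cycle issue limits (simplified model of Haswell).
FMA_PER_CYCLE = 2
LOADS_PER_CYCLE = 2
STORES_PER_CYCLE = 1

def bound_cycles(fmas, loads, stores):
    """Minimum cycles for one loop iteration, set by the busiest resource."""
    return max(fmas / FMA_PER_CYCLE,
               loads / LOADS_PER_CYCLE,
               stores / STORES_PER_CYCLE)

def fma_utilization(fmas, loads, stores):
    """Fraction of peak FMA throughput reachable under this model."""
    return (fmas / FMA_PER_CYCLE) / bound_cycles(fmas, loads, stores)

print(fma_utilization(11, 1, 1))  # ratio used in the test above
print(fma_utilization(1, 1, 1))   # hypothetical store-heavy loop for contrast
```

In this model the 11:1:1 mix is limited only by the FMA units (utilization 1.0), while a 1:1:1 mix would be limited by the store rate and could reach at most half of the FMA peak.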

0 Kudos
SergeyKostrov
Valued Contributor II
1,007 Views
What Haswell system do you have?

>>...as already explained the tests aren't comparable so one can't be used to verify the other...

I understand that. I don't want to compare them; I would simply be glad to see some numbers from the Linpack utility. If I had a Haswell system I would do the test without any problems. When I tested my Ivy Bridge system with two different Linpack benchmark utilities ( one from Intel and one from a non-Intel source ), only Intel's utility gave very consistent results.

Once again, why wouldn't you try to run it? If you don't have MKL I could upload the whole content of the ..\mkl\benchmarks folder. Once again, I don't want to compare the Linpack number with a number from your own test. I want to compare your Haswell Linpack number with my Ivy Bridge Linpack number and with my Pentium 4 Linpack number.

This is the content of the ..\mkl\benchmarks folder:

help.lpk
lininput_xeon32
lininput_xeon64
linpack_xeon32.exe
linpack_xeon64.exe
runme_xeon32.bat
runme_xeon64.bat
xhelp.lpk

( all files are about 6MB in total )
0 Kudos
bronxzv
New Contributor II
1,007 Views

Sergey Kostrov wrote:
What Haswell system do you have?

4770K / 16 GB DDR3-2400 memory / Corsair H110 cooler / ASUS Z87-Pro mobo

Sergey Kostrov wrote:
I understand it and I don't want to compare and I simply would be glad to see some numbers from Linpack utility.

if these are easy to run I can have a try, I'm downloading Studio XE 2013 for Windows Update 4 right now (1.11 GB, ETA 1hr 43 min !) so I'll have the latest MKL (the one in C++ Composer XE 2013 Update 5), is it the same version you are interested in ?

0 Kudos
bronxzv
New Contributor II
969 Views

Sergey Kostrov wrote:

linpack_xeon32.exe
linpack_xeon64.exe
runme_xeon32.bat
runme_xeon64.bat

I've just finished running these two tests (MKL released with Composer XE 2013 Update 5 / default MKL bench .bat files / Windows 8 Pro 64-bit / CPU @ 4 GHz / realtime process priority). xeon64 takes incredibly long to run, which is pretty boring since there isn't any feedback about its progress. Anyway, you'll find the result files attached; I hope they will be helpful for your purpose

0 Kudos
SergeyKostrov
Valued Contributor II
969 Views
>>...I'm just finished running these two tests...

Thank you very much! I'll also post results for systems with Ivy Bridge and Pentium 4 for comparison.
0 Kudos