
Hi Intel Experts:

I cannot find the latest Intel Haswell CPU GFlops figures; could you please point me to them?

I want to understand the performance difference between Haswell and Ivy Bridge, for example, the i7-4700HQ and the i7-3630QM. From the Intel website, I can see that the i7-3630QM's GFlops figure is 76.8 (base). Could you please tell me the figure for the i7-4700HQ?

I get some information from internet that:

Intel Sandy Bridge and Ivy Bridge have the following floating-point throughput: 16 SP FLOPS/cycle --> one 8-wide AVX addition and one 8-wide AVX multiplication per cycle.

Intel Haswell has the following floating-point throughput: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions per cycle.

I have two questions here:

1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) x 4 (quad-core) x 2.4 GHz = 153.6 GFLOPS = 76.8 x 2. Does this mean that one "operation" is counted as a combined addition-and-multiplication operation?

2. Does Haswell have TWO FMA units?
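The arithmetic in question 1 can be written out as a short sketch (the helper name below is my own, for illustration only; the 16 SP FLOPS/cycle figure for Ivy Bridge comes from the quote above):

```python
def sp_peak_gflops(flops_per_cycle, cores, ghz):
    """Theoretical single-precision peak: per-core FLOPs/cycle x cores x clock (GHz)."""
    return flops_per_cycle * cores * ghz

# Ivy Bridge: 8-wide AVX add + 8-wide AVX mul = 16 SP FLOPs/cycle,
# quad core at a 2.4 GHz base clock:
print(round(sp_peak_gflops(16, 4, 2.4), 1))  # 153.6, i.e. 2 x 76.8
```

The factor of 2 over the 76.8 figure comes from counting the add port and the mul port separately, not from a fused operation.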

Thank you very much for any comments.

Best Regards,

Sun Cao


Hi Sergey:

You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm


**249 Gflops**. Iliya, where / how did you get that number? Please explain, because it is more than twice the best number in the 2nd test of **bronxzv** ( **104.2632 GFlops** ). Igor's number ( **116 GFlops** ) is very close to **bronxzv**'s number ( ~10% difference ).


Sergey Kostrov wrote:

[ Iliya Polak wrote ]

>>...Actually in Turbo Mode at 3.9 GHz theoretical peak performance expressed in DP GFlops is ~249 Gflops.

Iliya, where / how did you get that number?

He simply quotes the DP theoretical peak at 3.9 GHz and 4 cores (8 DP flops per FMA instruction, 2 FMA per clock), i.e. 3.9 * 4 * 8 * 2 = 249.6 GFlops.

Note that in my own report I quoted the SP theoretical peak at 3.5 GHz and 1 core (16 SP flops per FMA instruction, 2 FMA per clock), i.e. 3.5 * 1 * 16 * 2 = 112 GFlops.

With my configuration, MKL LINPACK efficiency = 104.3 / 249.6 = ~41.8 %.

My own FMA microbenchmark efficiency = 104.407 / 112 = ~93.2 %.

As explained, this is because my own test has very low load/store activity relative to FMA computations, and most of those loads/stores are from/to the L1D cache.
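The two peak figures and the efficiency ratios above can be reproduced with a small sketch (the helper below is my own naming, not code from the thread):

```python
def peak_gflops(ghz, cores, flops_per_fma, fma_per_clock=2):
    """Theoretical FMA peak: clock x cores x FLOPs per FMA x FMAs issued per clock."""
    return ghz * cores * flops_per_fma * fma_per_clock

dp_peak = peak_gflops(3.9, 4, 8)    # Haswell DP, 4 cores in Turbo Mode
sp_peak = peak_gflops(3.5, 1, 16)   # Haswell SP, 1 core
print(round(dp_peak, 1))                   # 249.6
print(round(sp_peak, 1))                   # 112.0
print(round(104.3 / dp_peak * 100, 1))     # 41.8  (MKL LINPACK efficiency, %)
print(round(104.407 / sp_peak * 100, 1))   # 93.2  (FMA microbenchmark, %)
```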


All,

I have not seen any results from MKL on Haswell. I've ported my dgemm to use FMA3, which Haswell supports. On 1 core, with the frequency fixed at 3.4 GHz, I am achieving 90% efficiency in DGEMM. On 4 cores, I'm achieving 82% efficiency, or 179 GFLOPs. Sergey, those HPL efficiencies are quite poor for HW. For HPL on SB/IB, I think 90% efficiency is a good number (my DGEMM on those architectures is 95-96% efficient, and a well-tuned HPL loses 5-7% relative to the DGEMM efficiency). Later I may post the DGEMM test in case anyone is interested in running it. Just thought I'd let others know you can get ~50 GFLOPs on 1 core at 3.4 GHz. I'm not yet observing the efficiency scaling with multiple cores, though. Lastly, these are preliminary numbers.

I thought I'd also post that I've not been able to get a read bandwidth from the L1 that saturates at 64 B per clk: at 3.4 GHz, I've achieved 58.5 B/clk of read bandwidth. Likewise, my efforts to maximize copy bandwidth haven't been successful; I've achieved 58.2 B/clk of copy bandwidth. L2 bandwidth is nowhere near 64 B per clk; around 24.6 B per clk is what I've achieved for read bandwidth. If you have any results on cache I/O on your hardware, let me know.

Perfwise


OK, I've got my dgemm to 91-92% efficiency on 1 core. That's ~50 GFLOPs on 1 HW core at 3.4 GHz. 4-core numbers on an 8000-cubed DGEMM are 186.6 GFLOPs, which is a hair over 85% efficiency. Power at idle is ~45 W; when running this code it's 140 W. Interesting. Might be able to get a bit more out of it. Just thought I'd post an update on what I've observed so far in terms of Haswell high-performance code efficiency.


>>...I have not seen any results from **MKL** on Haswell. I've ported my **dgemm**...

I could post performance results of **MKL**'s **sgemm** and **dgemm** functions on **Ivy Bridge** for 4Kx4K, 8Kx8K and 16Kx16K matrices ( in seconds, not in GFlops ).

>>...I've got **my dgemm** at 91-92% efficiency...

What algorithm do you use?


Sergey,

Building for IB is pointless: it doesn't use FMA3, which you need in order to max out the MFLOPs. Also, I focus on DGEMM and then HPL. HPL is limited by DGEMM performance, but on a particular set of problems: if you consider a DGEMM of an MxK matrix upon a KxN matrix, M and N are large and K is small. For the Boeing sparse solver, N can also be small. K is somewhat tunable and is a blocking parameter. I just ran my dgemm for SB/IB on an 8000 x 128 x 8192 [M x K x N] problem; it achieved 24.3 GFLOPs on 1 core at 3.4 GHz, which is ~90% efficiency for those two architectures. For an 8000 x 8000 x 8000 problem I get over 100 GFLOPs on 4 IB cores. For HW, running a similar problem, I'm getting 45.7 GFLOPs on 1 core at 3.4 GHz, which is 84% efficiency. Running with K=256 I get 46.5 GFLOPs (85.5% efficiency), and with K=384 I get 48.5 GFLOPs (89% efficiency). Asymptotic efficiency is 92.5%, about 3% below that of SB and IB, but that's somewhat expected: Amdahl's law is coming into play, and the overheads of doing this "real" computation chew a bit into the efficiency. I think I'll improve it as time goes on, but I just thought I'd throw out what I've measured/achieved, to see if anyone else has some real numbers. On a HW with 4 cores running at 3.4 GHz on a 16000 x 8192 x 8192 problem I just achieved 190.4 GFLOPs, or 87.5% efficiency. I'd expect Intel to do better than my 2 days' worth of tuning on a full DGEMM.
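These efficiency figures can be cross-checked against the theoretical DP peaks. The sketch below assumes 2 FMA ports x 8 DP FLOPs per FMA on Haswell, and 1 add + 1 mul AVX port x 4 DP FLOPs each on SB/IB (a sanity check, not Perfwise's code):

```python
def dp_peak_gflops(ghz, cores, flops_per_port, ports=2):
    """Theoretical DP peak: clock x cores x FLOPs per port per clock x ports."""
    return ghz * cores * flops_per_port * ports

hw_peak = dp_peak_gflops(3.4, 1, 8)   # Haswell: 2 FMA ports -> 54.4 GFLOPs/core
ib_peak = dp_peak_gflops(3.4, 1, 4)   # SB/IB: add + mul ports -> 27.2 GFLOPs/core
print(round(45.7 / hw_peak * 100, 1))   # 84.0 % (HW, 1 core, K=128)
print(round(24.3 / ib_peak * 100, 1))   # 89.3 % (SB/IB, 1 core)
print(round(190.4 / dp_peak_gflops(3.4, 4, 8) * 100, 1))  # 87.5 % (HW, 4 cores)
```

The numbers match the percentages quoted in the post, which confirms the FLOP accounting being used.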

As far as the algorithm goes, I'd rather not divulge my techniques, but there's lots of documentation on this subject, and a long history of a few good people doing it in the past, far fewer in the present. It's my own code and just a hobby, but you or others should try doing it yourselves. You'll learn a lot about performance that isn't documented or discussed, and you'll be a better tweaker for it.

Perfwise


**Time Complexity for Matrix Multiplication Algorithms:**

- Virginia Vassilevska Williams O( n^2.3727 )
- Coppersmith-Winograd O( n^2.3760 )
- Strassen O( n^2.8070 ) <= O( n^log2(7) )
- Strassen-Winograd O( n^2.8070 ) <= O( n^log2(7) )
- Classic O( n^3.0000 )

The fastest algorithm I've ever used / tested is a **Kronecker based** Matrix Multiplication implemented by one of the IDZ users in the Fortran language. Details can be found here: http://www.geosci-model-dev-discuss.net/5/3325/2012/gmdd-5-3325-2012.html

>>...As far as the algorithm goes, I'd rather not divulge my techniques, but there's lots of documentation on this subject,
>>and a long history of a few good people doing it in the past, far fewer in the present. **It's my own code and just a hobby**...

I'll just post results for a couple of cases in seconds, since that will be easier to compare. I will post results for an **Ivy Bridge** system for **4Kx4K**, **8Kx8K** and **16Kx16K** matrices ( in seconds ) using MKL's **dgemm**, **Kronecker based** and **Strassen HBC** Matrix Multiplication algorithms. The reason I post results in seconds is that I need to know exactly whether a product of two matrices can be calculated within some limited period of time. Results in GFlops are useless in many cases, because I hear all the time questions like: how long does it take to compute a product of two matrices with dimensions NxN?

**Note: Strassen HBC** stands for **Strassen Heap Based Complete**, and it is optimized for application in Embedded environments.


Sergey,

From my experience working with people in the industry, I define matrix-multiplication FLOPs as those from traditional linear algebra, which is 2 * N^3, or to be precise 2 * M * K * N. These other methods you mention entail lower numerical accuracy, greater memory usage, or difficulties in implementation; they perform fewer flops, but with lower IPC and lower performance. Maybe I'll try them someday, but in my experience, and that of the people I've worked with over the past 20 years, I've not found them widely applied. So now you know how I measure FLOP count, and you know I'm running at 3.4 GHz (which, btw, I've yet to mention in this topic, but I only post results with a frozen frequency rather than include Turbo Boost), so you can determine the number of clks or seconds it takes on HW to do a DGEMM computation. I measured the following today:

SIZE^3, 1-core GFLOPs, 1-core TIME(s), 4-core GFLOPs, 4-core TIME(s)

4000, 50.3, 2.54, 172.8, 0.74

8000, 50.4, 20.3, 186.6, 5.49

16000, 51.3, 159.7, 192.7, 42.5

I think it's important to note that square problems are not very useful in DGEMM; you need to focus on the other sizes I mentioned in the previous posts for "practical" solvers.
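With FLOPs counted as 2 * N^3, the GFLOPs and TIME columns in the table above are two views of the same measurement; a short sanity-check sketch (helper name is mine):

```python
def gemm_gflops(n, seconds):
    """GFLOPs for an NxNxN DGEMM, counting 2*N^3 floating-point operations."""
    return 2 * n**3 / seconds / 1e9

print(round(gemm_gflops(8000, 20.3), 1))    # 50.4 (1 core, matches the table)
print(round(gemm_gflops(16000, 159.7), 1))  # 51.3 (1 core, matches the table)
```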

Perfwise


Sergey Kostrov wrote:

[ Iliya Polak wrote ]

>>...Actually in Turbo Mode at 3.9 GHz theoretical peak performance expressed in DP GFlops is ~249 Gflops.

Iliya, where / how did you get that number? Please explain, because it is more than twice the best number in the 2nd test of bronxzv ( 104.2632 GFlops ). Igor's number ( 116 GFlops ) is very close to bronxzv's number ( ~10% difference ).

Sorry for the late answer (never-ending problems with my backup laptop).

Those numbers are the theoretical peak performance, as @bronxzv explained in his answer.


It could be interesting to run that benchmark under VTune. I am interested in seeing the clockticks-per-instructions-retired ratio.


**[ Tests Set #1 - Part A ]**

***** Ivy Bridge CPU 2.50 GHz 1-core *****

**[ 4096x4096 ]**

- Kronecker Based 1.93 seconds
- MKL 3.68 seconds ( cblas_sgemm )
- Strassen HBC 11.62 seconds
- Fortran 20.67 seconds ( MATMUL )
- Classic 31.36 seconds

**[ 8192x8192 ]**

- Kronecker Based 11.26 seconds
- MKL 29.34 seconds ( cblas_sgemm )
- Strassen HBC 82.03 seconds
- Fortran 138.57 seconds ( MATMUL )
- Classic 252.05 seconds

**[ 16384x16384 ]**

- Kronecker Based 81.52 seconds
- MKL 237.76 seconds ( cblas_sgemm )
- Strassen HBC 1160.80 seconds
- Fortran 1685.09 seconds ( MATMUL )
- Classic 2049.87 seconds

***** Haswell CPU 3.40 GHz 1-core *****

**[ 4000x4000 ]**

- Perfwise 2.54 seconds

**[ 8000x8000 ]**

- Perfwise 20.30 seconds

**[ 16000x16000 ]**

- Perfwise 159.70 seconds


**[ Tests Set #1 - Part B - All Results Combined ]**

**[ 4096x4096 ]**

- Kronecker Based 1.93 seconds (*)
- Perfwise 2.54 seconds ( 4000x4000 ) (**)
- MKL 3.68 seconds ( cblas_sgemm ) (*)
- Strassen HBC 11.62 seconds (*)
- Fortran 20.67 seconds ( MATMUL ) (*)
- Classic 31.36 seconds (*)

**[ 8192x8192 ]**

- Kronecker Based 11.26 seconds (*)
- Perfwise 20.30 seconds ( 8000x8000 ) (**)
- MKL 29.34 seconds ( cblas_sgemm ) (*)
- Strassen HBC 82.03 seconds (*)
- Fortran 138.57 seconds ( MATMUL ) (*)
- Classic 252.05 seconds (*)

**[ 16384x16384 ]**

- Kronecker Based 81.52 seconds (*)
- Perfwise 159.70 seconds ( 16000x16000 ) (**)
- MKL 237.76 seconds ( cblas_sgemm ) (*)
- Strassen HBC 1160.80 seconds (*)
- Fortran 1685.09 seconds ( MATMUL ) (*)
- Classic 2049.87 seconds (*)

**Note:**
(*) Ivy Bridge CPU 2.50 GHz 1-core
(**) Haswell CPU 3.40 GHz 1-core


**[ Tests Set #2 - Part A ]**

***** Ivy Bridge CPU 2.50 GHz 4-core *****

**[ 4096x4096 ]**

- Kronecker Based 0.41 seconds
- MKL 1.21 seconds ( cblas_sgemm )
- Fortran 3.95 seconds ( MATMUL )
- Classic 7.48 seconds
- Strassen HBC N/A seconds

**[ 8192x8192 ]**

- Kronecker Based 1.49 seconds ( 8100x8100 )
- MKL 8.34 seconds ( cblas_sgemm )
- Fortran 29.49 seconds ( MATMUL )
- Classic 60.73 seconds
- Strassen HBC N/A seconds

**[ 16384x16384 ]**

- Kronecker Based 10.27 seconds
- MKL 66.58 seconds ( cblas_sgemm )
- Fortran 246.28 seconds ( MATMUL )
- Classic 534.65 seconds
- Strassen HBC N/A seconds

***** Haswell CPU 3.40 GHz 4-core *****

**[ 4000x4000 ]**

- Perfwise 0.74 seconds

**[ 8000x8000 ]**

- Perfwise 5.49 seconds

**[ 16000x16000 ]**

- Perfwise 42.50 seconds


**[ Tests Set #2 - Part B - All Results Combined ]**

**[ 4096x4096 ]**

- Kronecker Based 0.41 seconds (*)
- Perfwise 0.74 seconds ( 4000x4000 ) (**)
- MKL 1.21 seconds ( cblas_sgemm ) (*)
- Fortran 3.95 seconds ( MATMUL ) (*)
- Classic 7.48 seconds (*)
- Strassen HBC N/A seconds (***)

**[ 8192x8192 ]**

- Kronecker Based 1.49 seconds ( 8100x8100 ) (*)
- Perfwise 5.49 seconds ( 8000x8000 ) (**)
- MKL 8.34 seconds ( cblas_sgemm ) (*)
- Fortran 29.49 seconds ( MATMUL ) (*)
- Classic 60.73 seconds (*)
- Strassen HBC N/A seconds (***)

**[ 16384x16384 ]**

- Kronecker Based 10.27 seconds (*)
- Perfwise 42.50 seconds ( 16000x16000 ) (**)
- MKL 66.58 seconds ( cblas_sgemm ) (*)
- Fortran 246.28 seconds ( MATMUL ) (*)
- Classic 534.65 seconds (*)
- Strassen HBC N/A seconds (***)

**Note:**
(*) Ivy Bridge CPU 2.50 GHz 4-core
(**) Haswell CPU 3.40 GHz 4-core
(***) There is no multi-threaded version


**[ Tests Set #3 ]**

***** Pentium 4 CPU 1.60 GHz 1-core - Windows XP Professional 32-bit *****

**[ 4096x4096 ]**

- MKL 31.23 seconds ( cblas_sgemm )
- Strassen HBC 143.69 seconds (*)
- Classic 183.66 seconds
- Fortran N/A seconds ( MATMUL )
- Kronecker Based N/A seconds

**[ 8192x8192 ]**

- MKL 254.54 seconds ( cblas_sgemm )
- Classic 1498.43 seconds
- Strassen HBC N/A seconds
- Fortran N/A seconds ( MATMUL )
- Kronecker Based N/A seconds

**[ 16384x16384 ]**

- Classic N/A seconds
- MKL N/A seconds ( cblas_sgemm )
- Strassen HBC N/A seconds
- Fortran N/A seconds ( MATMUL )
- Kronecker Based N/A seconds

**Note:**
(*) Excessive usage of Virtual Memory and significant negative performance impact


Sergey, my results are for double precision, while you appear to be running single precision, at least in MKL, since you are timing sgemm rather than dgemm. You should make an apples-to-apples comparison. If your timings are in SP, then you should halve my times, since my GFLOPs would double.

Perfwise


Also, freeze your frequency to 2.5 GHz to avoid including boosting. I always do that to discern real architectural IPC performance.


>>...**2 * M * K * N**...

Take an **A(4x4)** * **B(4x4)** case and then count on paper the number of additions and multiplications. You should get **112** Floating Point Operations ( FPO ). Then calculate using your formula and you will get 2 * 4 * 4 * 4 = **128**, and this doesn't look right. This is why: let's say we have two matrices A[ MxN ] and B[ RxK ], with N = R. The product is C[ MxK ]:

[ **M**xN ] * [ Rx**K** ] = [ **MxK** ]

If **M=N=R=K**, that is, both matrices are square, then the Total number of Floating Point Operations ( TFPO ) should be calculated as follows:

**TFPO = N^2 * ( 2*N - 1 )**

For example:

TFPO( 2x2 ) = 2^2 * ( 2*2 - 1 ) = 12
TFPO( 3x3 ) = 3^2 * ( 2*3 - 1 ) = 45
TFPO( 4x4 ) = 4^2 * ( 2*4 - 1 ) = 112
TFPO( 5x5 ) = 5^2 * ( 2*5 - 1 ) = 225

and so on. (The difference from 2 * N^3 is that each of the N^2 output elements needs N multiplications but only N - 1 additions; the 2 * M * K * N convention simply rounds the adds up to N.)
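The closed form can be checked by literally counting operations in a classic triple-loop product (a small sketch; the function names are mine):

```python
def tfpo_formula(n):
    """Closed form from the post: N^2 * (2N - 1) FLOPs for a classic NxN product."""
    return n * n * (2 * n - 1)

def tfpo_counted(n):
    """Count FLOPs by walking the classic triple loop."""
    ops = 0
    for i in range(n):
        for j in range(n):
            ops += n        # n multiplications for output element C[i][j]
            ops += n - 1    # n-1 additions to sum those products
    return ops

for n in (2, 3, 4, 5):
    assert tfpo_counted(n) == tfpo_formula(n)
print([tfpo_formula(n) for n in (2, 3, 4, 5)])  # [12, 45, 112, 225]
```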
