Solved: Haswell GFLOPS - Page 3

caosun · ‎06-26-2013

Hi Intel Experts:

I cannot find GFLOPS figures for the latest Intel Haswell CPUs. Could you please point me to them?

I want to understand the performance difference between Haswell and Ivy Bridge, for example between the i7-4700HQ and the i7-3630QM. From Intel's website I know that the i7-3630QM is rated at 76.8 GFLOPS (base). Could you please tell me the figure for the i7-4700HQ?

I found the following information on the internet:

Intel Sandy Bridge and Ivy Bridge have the following floating-point performance: 16 SP FLOPS/cycle --> 8-wide AVX addition and 8-wide AVX multiplication.

Intel Haswell has the following floating-point performance: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions.

I have two questions here:

1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) X 4 (quad-core) X 2.4G (clock) = 153.6 GFLOPS = 76.8 X 2. Does it mean that one operation is a combined addition and multiplication operation?

2. Does Haswell have TWO FMA?

Thank you very much for any comments.

Best Regards,

Sun Cao

caosun · ‎07-01-2013

Hi Sergey:

You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm


SergeyKostrov · ‎07-04-2013

[ Iliya Polak wrote ]
>>...Actually in Turbo Mode at 3.9 GHz theoretical peak performance expressed in DP GFlops is ~249 GFlops.

Iliya,

Where / how did you get that number?

Please explain, because it is more than twice the best number in the 2nd test of bronxzv ( 104.2632 GFlops ). Igor's number ( 116 GFlops ) is very close to bronxzv's number ( ~10% difference ).

bronxzv · ‎07-04-2013

Sergey Kostrov wrote:

[ Iliya Polak wrote ]
>>...Actually in Turbo Mode at 3.9 GHz theoretical peak performance expressed in DP GFlops is ~249 GFlops.

Iliya,

Where / how did you get that number?

He simply quotes the DP theoretical peak at 3.9 GHz with 4 cores (8 DP flops per FMA instruction, 2 FMA per clock), i.e. 3.9*4*8*2 = 249.6 GFlops.

Note that in my own report I mentioned the SP theoretical peak at 3.5 GHz with 1 core (16 SP flops per FMA instruction, 2 FMA per clock), i.e. 3.5*1*16*2 = 112 GFlops.

With my configuration, MKL LINPACK efficiency = 104.3/249.6 = ~41.8 %.

My own FMA microbenchmark efficiency = 104.407/112 = ~93.2 %.

As explained, this is because my own test has very low load/store activity relative to FMA computations, and most of these loads/stores go to or from the L1D cache.
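The peak arithmetic in this post can be restated as a small Python sketch. The function name `peak_gflops` and its parameter names are made up for illustration; the per-instruction figures (8 DP or 16 SP flops per 256-bit FMA, 2 FMA per clock) are the ones quoted in this thread, not values read from hardware.

```python
def peak_gflops(ghz, cores, flops_per_fma, fma_per_clock=2):
    """Theoretical peak = clock (GHz) * cores * flops per FMA * FMA issued per clock."""
    return ghz * cores * flops_per_fma * fma_per_clock

# DP peak quoted above: 3.9 GHz turbo, 4 cores, 8 DP flops per FMA instruction
dp_peak_4c = peak_gflops(3.9, 4, 8)

# SP peak quoted above: 3.5 GHz, 1 core, 16 SP flops per FMA instruction
sp_peak_1c = peak_gflops(3.5, 1, 16)

print(f"DP peak, 4 cores @ 3.9 GHz: {dp_peak_4c:.1f} GFlops")
print(f"SP peak, 1 core  @ 3.5 GHz: {sp_peak_1c:.1f} GFlops")
```

The two figures differ mostly because one is a 4-core number and the other a 1-core number, so comparing a single-core measurement against the 4-core peak would understate its efficiency.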

SergeyKostrov · ‎07-05-2013

Attached are several files with results of performance tests using Intel LINPACK Benchmark for 32-bit and 64-bit systems with Haswell, Ivy Bridge and Pentium 4 CPUs.

perfwise · ‎07-06-2013

All,

I have not seen any results from MKL on Haswell. I've ported my DGEMM to use FMA3, which Haswell supports. On 1 core, with a fixed frequency of 3.4 GHz, I am achieving 90% efficiency in DGEMM. On 4 cores, I'm achieving 82% efficiency, or 179 GFLOPs. Sergey, those HPL efficiencies are quite poor on HW. For HPL on SB/IB, I think 90% efficiency is a good number (my DGEMM on those architectures is 95-96% efficient, and you lose 5-7% of the DGEMM efficiency in a well-tuned HPL). Later I may post the DGEMM test in case anyone is interested in running it. Just thought I'd let others know you can get ~50 GFLOPs on 1 core at 3.4 GHz. I'm not observing the efficiency scaling with multiple cores, though.. yet. Lastly, these are preliminary numbers.

I thought I'd also post that I've not been able to get a read bandwidth from the L1 that saturates at 64 B per clk. At 3.4 GHz, I've achieved 58.5 B/clk of read bandwidth. Likewise, my efforts to maximize the copy bandwidth haven't been successful: I've achieved 58.2 B/clk of copy bandwidth. L2 bandwidth is nowhere near 64B per clk.. but around 246B per clk is what I've achieved for read bandwidth. If you have any results on cache I/O on your hardware, let me know.

Perfwise

perfwise · ‎07-06-2013

OK, I've got my DGEMM at 91-92% efficiency on 1 core. That's ~50 GFLOPs on 1 HW core at 3.4 GHz. 4-core numbers on an 8000-cubed DGEMM are 186.6 GFLOPs, which is a hair over 85% efficiency. Power at idle is ~45 W; when running this code it's 140 W. Interesting. I might be able to get a bit more out of it. Just thought I'd update with what I've observed so far in terms of Haswell high-performance code efficiency.

SergeyKostrov · ‎07-07-2013

>>... I have not seen any results from MKL on Haswell. I've ported my dgemm...

I could post performance results of MKL's sgemm and dgemm functions on Ivy Bridge for 4Kx4K, 8Kx8K and 16Kx16K matrices ( in seconds, not in GFlops ).

>>... I've got my dgemm at 91-92% efficiency...

What algorithm do you use?

perfwise · ‎07-07-2013

Sergey,

Building for IB is pointless.. it doesn't use FMA3, which you need in order to max out the MFLOPs. Also, I focus on DGEMM and then HPL. HPL is limited by DGEMM performance, but on a particular set of problems: if you consider a DGEMM of an MxK matrix times a KxN matrix, M and N are large and K is small. For the Boeing sparse solver, N can also be small. K is somewhat tunable and is a blocking parameter. I just ran my DGEMM for SB/IB on an 8000 x 128 x 8192 [M x K x N] problem, and it achieved 24.3 GFLOPs on 1 core at 3.4 GHz, which is ~90% efficiency for those 2 architectures. For an 8000 x 8000 x 8000 problem I get over 100 GFLOPs on 4 IB cores. For HW, running a similar problem, I'm getting 45.7 GFLOPs on 1 core at 3.4 GHz, which is 84% efficiency. Running with K=256 I get 46.5 GFLOPs (85.5% efficiency) and with K=384 I get 48.5 GFLOPs (89% efficiency). Asymptotic efficiency is 92.5%, about 3% below that of SB and IB, but that's somewhat expected. Amdahl's law is coming into play, and the overheads of doing this "real" computation are chewing a bit into the efficiency. I think I'll improve it as time goes on, but just thought I'd throw out what I have measured/achieved to see if anyone else has some real numbers. On a HW with 4 cores running at 3.4 GHz on a 16000 x 8192 x 8192 problem I just achieved 190.4 GFLOPs, or 87.5% efficiency. I'd expect Intel to do better than my 2 days' worth of tuning on a full DGEMM.
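The efficiency percentages in this post can be cross-checked with a short Python sketch. It assumes DP peaks of 8 flops/cycle/core on SB/IB and 16 flops/cycle/core on HW, which follow from the SP figures given at the top of the thread; the names `DP_FLOPS_PER_CYCLE` and `efficiency` are illustrative only.

```python
CLOCK_GHZ = 3.4  # frozen frequency stated in the post

# Assumed DP flops per cycle per core (half of the SP figures quoted in the thread)
DP_FLOPS_PER_CYCLE = {"SB/IB": 8, "HW": 16}

def efficiency(measured_gflops, arch, cores=1, ghz=CLOCK_GHZ):
    """Measured GFLOPs divided by the theoretical DP peak for that configuration."""
    peak = ghz * cores * DP_FLOPS_PER_CYCLE[arch]
    return measured_gflops / peak

# (architecture, cores, measured GFLOPs) taken from the post above
results = [
    ("SB/IB", 1, 24.3),
    ("HW", 1, 45.7),
    ("HW", 1, 46.5),
    ("HW", 1, 48.5),
    ("HW", 4, 190.4),
]

for arch, cores, gflops in results:
    pct = 100 * efficiency(gflops, arch, cores)
    print(f"{arch:5s} {cores}-core {gflops:6.1f} GFLOPs -> {pct:.1f}% of peak")
```

Under these assumptions the computed ratios agree with the percentages quoted in the post to within rounding.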

As far as the algorithm goes, I'd rather not divulge my techniques, but there's lots of documentation on this subject, and a long history of a few good people doing it in the past, with far fewer in the present. It's my own code and just a hobby, but you or others should try doing it yourself. You'll learn a lot about performance that isn't documented or discussed, and you'll be a better tweaker for it.

Perfwise

SergeyKostrov · ‎07-07-2013

>>...Building for IB is pointless.. it doesn't use FMA3...

I don't get it; Ivy Bridge systems will be around for a long time. There is nothing wrong with comparing results across major lines of Intel microarchitectures, even if a microarchitecture doesn't support some set of instructions!

>>...Asymptotic efficiency is 92.5%...

What time complexity do you assume as a base? Theory is very strict here, and there is nothing classified about the time complexity of matrix multiplication algorithms. For example, here is a table:

Time Complexity for Matrix Multiplication Algorithms:

Virginia Vassilevska Williams    O( n^2.3727 )
Coppersmith-Winograd             O( n^2.3760 )
Strassen                         O( n^2.8070 ) <= O( n^log2(7) )
Strassen-Winograd                O( n^2.8070 ) <= O( n^log2(7) )
Classic                          O( n^3.0000 )

The fastest algorithm I've ever used / tested is the Kronecker based Matrix Multiplication implemented by one of the IDZ users in Fortran. Details can be found here: http://www.geosci-model-dev-discuss.net/5/3325/2012/gmdd-5-3325-2012.html

>>...As far as algorithm.. I'd rather not divulge my techniques but there's lots of documentation on this subject.. and a
>>long history of a few but good people doing it in the past, much fewer in the present. It's my own code and just a hobby...

Just post results for a couple of cases in seconds, since that will be easier to compare. I will post results for an Ivy Bridge system for 4Kx4K, 8Kx8K and 16Kx16K matrices ( in seconds ) using MKL's dgemm, Kronecker based and Strassen HBC Matrix Multiplication algorithms.

The reason I post results in seconds is that I need to know exactly whether the product of two matrices can be calculated within some limited period of time. Results in GFlops are useless in many cases because I hear questions all the time like: How long does it take to compute a product of two matrices with dimensions NxN?

Note: Strassen HBC stands for Strassen Heap Based Complete, and it is optimized for use in embedded environments.

perfwise · ‎07-08-2013

Sergey,

From my experience working with people in the industry, I define matrix multiplication FLOPs as in traditional linear algebra, which is 2 * N^3, or to be precise 2 * M * K * N. These other methods you mention entail lower numerical accuracy, greater memory usage or difficulties in implementation; they give rise to fewer flops, but also lower IPC and lower performance. Maybe I'll try them someday, but in my experience, and that of the people I've worked with over the past 20 years, I've not found them widely applied. So now that you know how I'm measuring the FLOP count and you know I'm running at 3.4 GHz (which, by the way, I've yet to mention on this topic: I only post results with a frozen frequency rather than with turbo boost), you can determine the number of clks or seconds it takes on HW to do a DGEMM computation. I measured the following today:

SIZE^3, 1-core GFLOPs, 1-core TIME(s), 4-core GFLOPs, 4-core TIME(s)

4000, 50.3, 2.54, 172.8, 0.74

8000, 50.4, 20.3, 186.6, 5.49

16000, 51.3, 159.7, 192.7, 42.5
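As a cross-check, the GFLOPs and time columns can be tied together through the 2 * M * K * N convention stated above. This is a minimal Python sketch; the helper names `gemm_flops` and `gflops_from_time` are illustrative and not taken from any library.

```python
def gemm_flops(m, k, n):
    """FLOP count of an (m x k) times (k x n) product under the 2*M*K*N convention."""
    return 2 * m * k * n

def gflops_from_time(m, k, n, seconds):
    """Rate in GFLOPs implied by a measured wall-clock time."""
    return gemm_flops(m, k, n) / seconds / 1e9

# (size, 1-core time in s, 4-core time in s) copied from the table above
table = [(4000, 2.54, 0.74), (8000, 20.3, 5.49), (16000, 159.7, 42.5)]

for size, t1, t4 in table:
    g1 = gflops_from_time(size, size, size, t1)
    g4 = gflops_from_time(size, size, size, t4)
    print(f"{size:6d}  1-core {g1:6.1f} GFLOPs   4-core {g4:6.1f} GFLOPs")
```

The recomputed rates land within a few tenths of a GFLOP of the posted columns; the small gaps are consistent with the times having been rounded before posting.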

I think it's important to note.. that square problems are not very useful in DGEMM.. you need to focus on the other sizes I mentioned in the previous posts.. for "practical" solvers

Perfwise

SergeyKostrov · ‎07-08-2013

>>...I measured the following today:
>>
>>SIZE^3, 1-core GFLOPs, 1-core TIME(s), 4-core GFLOPs, 4-core TIME(s)
>>
>>4000, 50.3, 2.54, 172.8, 0.74
>>
>>8000, 50.4, 20.3, 186.6, 5.49
>>
>>16000, 51.3, 159.7, 192.7, 42.5

Exactly what I wanted to see, and I understand that the tests were done on a Haswell system. I will post results for Ivy Bridge later today.

>>...These other methods you mention, entail lowered numerical accuracies, greater memory usage or difficulties in implementation
>>which give rise to fewer flops, but lower ipc and lower performance...

There is a lot of speculative talk on different Internet forums about some matrix multiplication algorithms, like Coppersmith-Winograd and Strassen, especially by people who have never tried to implement these two algorithms. I've implemented four different versions of the Strassen algorithm, and the additional memory usage is by design of that recursive algorithm, because it needs to partition the source matrices down to some threshold limit. In theory this is 2x2 and in practice this is N / 8. For example, in the case of 4096x4096 matrices this is 512x512 ( 4096 / 8 = 512 ).
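To make the partitioning and the threshold concrete, here is a minimal textbook Strassen recursion in pure Python. It is not the Strassen HBC implementation mentioned in this thread, whose code was not posted; the function names and the `threshold` parameter are illustrative, and the sketch assumes square matrices stored as lists of lists.

```python
def add(x, y):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]

def sub(x, y):
    return [[p - q for p, q in zip(r, s)] for r, s in zip(x, y)]

def classic(x, y):
    """Plain row-by-column product, used once blocks reach the threshold."""
    cols = list(zip(*y))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in x]

def split(x):
    """Return the four quadrants x11, x12, x21, x22 as new lists."""
    h = len(x) // 2
    return ([r[:h] for r in x[:h]], [r[h:] for r in x[:h]],
            [r[:h] for r in x[h:]], [r[h:] for r in x[h:]])

def strassen(a, b, threshold):
    n = len(a)
    if n <= threshold or n % 2:
        return classic(a, b)          # stop partitioning at the threshold
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of eight; every add/sub allocates a temporary
    m1 = strassen(add(a11, a22), add(b11, b22), threshold)
    m2 = strassen(add(a21, a22), b11, threshold)
    m3 = strassen(a11, sub(b12, b22), threshold)
    m4 = strassen(a22, sub(b21, b11), threshold)
    m5 = strassen(add(a11, a12), b22, threshold)
    m6 = strassen(sub(a21, a11), add(b11, b12), threshold)
    m7 = strassen(sub(a12, a22), add(b21, b22), threshold)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(sub(add(m1, m3), m2), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

# Small integer example with the N / 8 threshold described above (16 / 8 = 2)
n = 16
a = [[(i * n + j) % 7 for j in range(n)] for i in range(n)]
b = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
assert strassen(a, b, n // 8) == classic(a, b)
```

The quadrant copies and the `m1` to `m7` temporaries are where the extra memory goes at each level of recursion, which is why a larger threshold trades some of the arithmetic savings for a smaller footprint.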

Bernard · ‎07-08-2013

Sergey Kostrov wrote:

[ Iliya Polak wrote ]
>>...Actually in Turbo Mode at 3.9 GHz theoretical peak performance expressed in DP GFlops is ~249 GFlops.

Iliya,

Where / how did you get that number?

Please explain because it is more than twice greater than the best number in the 2nd test of bronxzv ( 104.2632 GFlops ). Igor's number ( 116 GFlops ) is very close to bronxzv's number ( ~10% difference ).

Sorry for the late answer (never-ending problems with my backup laptop).

Those numbers are the theoretical peak performance, as @bronxzv explained in his answer.

Bernard · ‎07-08-2013

It could be interesting to run that benchmark under VTune. I am interested in seeing the clockticks per instruction retired ratio.

SergeyKostrov · ‎07-09-2013

>>...It could be interesting to run that benchmark under VTune...

Good luck with that.

SergeyKostrov · ‎07-09-2013

[ Tests Set #1 - Part A ]

*** Ivy Bridge CPU 2.50 GHz 1-core ***

[ 4096x4096 ]
Kroneker Based      1.93 seconds
MKL                 3.68 seconds ( cblas_sgemm )
Strassen HBC       11.62 seconds
Fortran            20.67 seconds ( MATMUL )
Classic            31.36 seconds

[ 8192x8192 ]
Kroneker Based     11.26 seconds
MKL                29.34 seconds ( cblas_sgemm )
Strassen HBC       82.03 seconds
Fortran           138.57 seconds ( MATMUL )
Classic           252.05 seconds

[ 16384x16384 ]
Kroneker Based     81.52 seconds
MKL               237.76 seconds ( cblas_sgemm )
Strassen HBC     1160.80 seconds
Fortran          1685.09 seconds ( MATMUL )
Classic          2049.87 seconds

*** Haswell CPU 3.50 GHz 1-core ***

[ 4000x4000 ]
Perfwise            2.54 seconds

[ 8000x8000 ]
Perfwise           20.30 seconds

[ 16000x16000 ]
Perfwise          159.70 seconds

SergeyKostrov · ‎07-09-2013

[ Tests Set #1 - Part B - All Results Combined ]

[ 4096x4096 ]
Kroneker Based      1.93 seconds (*)
Perfwise            2.54 seconds ( 4000x4000 ) (**)
MKL                 3.68 seconds ( cblas_sgemm ) (*)
Strassen HBC       11.62 seconds (*)
Fortran            20.67 seconds ( MATMUL ) (*)
Classic            31.36 seconds (*)

[ 8192x8192 ]
Kroneker Based     11.26 seconds (*)
Perfwise           20.30 seconds ( 8000x8000 ) (**)
MKL                29.34 seconds ( cblas_sgemm ) (*)
Strassen HBC       82.03 seconds (*)
Fortran           138.57 seconds ( MATMUL ) (*)
Classic           252.05 seconds

[ 16384x16384 ]
Kroneker Based     81.52 seconds (*)
Perfwise          159.70 seconds ( 16000x16000 ) (**)
MKL               237.76 seconds ( cblas_sgemm ) (*)
Strassen HBC     1160.80 seconds (*)
Fortran          1685.09 seconds ( MATMUL ) (*)
Classic          2049.87 seconds (*)

Note:
(*) Ivy Bridge CPU 2.50 GHz 1-core
(**) Haswell CPU 3.50 GHz 1-core

SergeyKostrov · ‎07-09-2013

[ Tests Set #2 - Part A ]

*** Ivy Bridge CPU 2.50 GHz 4-core ***

[ 4096x4096 ]
Kroneker Based      0.41 seconds
MKL                 1.21 seconds ( cblas_sgemm )
Fortran             3.95 seconds ( MATMUL )
Classic             7.48 seconds
Strassen HBC         N/A seconds

[ 8192x8192 ]
Kroneker Based      1.49 seconds ( 8100x8100 )
MKL                 8.34 seconds ( cblas_sgemm )
Fortran            29.49 seconds ( MATMUL )
Classic            60.73 seconds
Strassen HBC         N/A seconds

[ 16384x16384 ]
Kroneker Based     10.27 seconds
MKL                66.58 seconds ( cblas_sgemm )
Fortran           246.28 seconds ( MATMUL )
Classic           534.65 seconds
Strassen HBC         N/A seconds

*** Haswell CPU 3.50 GHz 4-core ***

[ 4000x4000 ]
Perfwise            0.74 seconds

[ 8000x8000 ]
Perfwise            5.49 seconds

[ 16000x16000 ]
Perfwise           42.50 seconds

SergeyKostrov · ‎07-09-2013

[ Tests Set #2 - Part B - All Results Combined ]

[ 4096x4096 ]
Kroneker Based      0.41 seconds (*)
Perfwise            0.74 seconds ( 4000x4000 ) (**)
MKL                 1.21 seconds ( cblas_sgemm ) (*)
Fortran             3.95 seconds ( MATMUL ) (*)
Classic             7.48 seconds (*)
Strassen HBC         N/A seconds (***)

[ 8192x8192 ]
Kroneker Based      1.49 seconds ( 8100x8100 ) (*)
Perfwise            5.49 seconds ( 8000x8000 ) (**)
MKL                 8.34 seconds ( cblas_sgemm ) (*)
Fortran            29.49 seconds ( MATMUL ) (*)
Classic            60.73 seconds (*)
Strassen HBC         N/A seconds (***)

[ 16384x16384 ]
Kroneker Based     10.27 seconds (*)
Perfwise           42.50 seconds ( 16000x16000 ) (**)
MKL                66.58 seconds ( cblas_sgemm ) (*)
Fortran           246.28 seconds ( MATMUL ) (*)
Classic           534.65 seconds (*)
Strassen HBC         N/A seconds (***)

Note:
(*) Ivy Bridge CPU 2.50 GHz 4-core
(**) Haswell CPU 3.50 GHz 4-core
(***) There is no Multi-threaded version

SergeyKostrov · ‎07-09-2013

Just for comparison these are results for Pentium 4...

[ Tests Set #3 ]

*** Pentium 4 CPU 1.60 GHz 1-core - Windows XP Professional 32-bit ***

[ 4096x4096 ]
MKL                31.23 seconds ( cblas_sgemm )
Strassen HBC      143.69 seconds (*)
Classic           183.66 seconds
Fortran              N/A seconds ( MATMUL )
Kroneker Based       N/A seconds

[ 8192x8192 ]
MKL               254.54 seconds ( cblas_sgemm )
Classic          1498.43 seconds
Strassen HBC         N/A seconds
Fortran              N/A seconds ( MATMUL )
Kroneker Based       N/A seconds

[ 16384x16384 ]
Classic              N/A seconds
MKL                  N/A seconds ( cblas_sgemm )
Strassen HBC         N/A seconds
Fortran              N/A seconds ( MATMUL )
Kroneker Based       N/A seconds

Note:
(*) Excessive usage of Virtual Memory and significant negative performance impact

perfwise · ‎07-09-2013

Sergey... my results are for double precision, while you appear to be running single precision, at least in MKL, since you are timing sgemm rather than dgemm. You should make an apples-to-apples comparison. If your timings are in SP, then you should halve my times, since my GFLOPs would double.

Perfwise

perfwise · ‎07-09-2013

Also... freeze your freq to 2.5 GHz... to avoid including boosting. I always do that to discern real arch ipc performance.

SergeyKostrov · ‎07-09-2013

>>...
>>From my experience working with people in the industry, I define matrix multiplication FLOPs as that from traditional Linear Algebra,
>>which is 2 * N^3, or to be precise 2 * M * K * N...
>>...

Take the A(4x4) * B(4x4) case and count on paper the number of additions and multiplications. You should get 112 Floating Point Operations ( FPO ). Then calculate using your formula and you will get 2 * 4 * 4 * 4 = 128, and this doesn't look right. This is why:

Let's say we have two matrices A[ MxN ] and B[ RxK ]. The product is C[ MxK ].

[ MxN ] * [ RxK ] = [ MxK ]

If M=N=R=K, that is, both matrices are square, then the Total number of Floating Point Operations ( TFPO ) should be calculated as follows:

TFPO = N^2 * ( 2*N - 1 )

For example,

TFPO( 2x2 ) = 2^2 * ( 2*2 - 1 ) = 12
TFPO( 3x3 ) = 3^2 * ( 3*2 - 1 ) = 45
TFPO( 4x4 ) = 4^2 * ( 4*2 - 1 ) = 112
TFPO( 5x5 ) = 5^2 * ( 5*2 - 1 ) = 225

and so on.
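The two counts can be compared directly by tallying operations in the classic triple loop. This is an illustrative Python sketch; `count_classic_ops` is a made-up name. The gap between the two formulas is N^2: the 2 * N^3 convention charges one multiply and one add per term, as when each product is accumulated into a zero-initialized C or issued as one FMA.

```python
def count_classic_ops(n):
    """Count multiplications and additions for an n x n by n x n classic product."""
    mults = adds = 0
    for i in range(n):
        for j in range(n):
            # c[i][j] needs n products, joined by n - 1 additions
            for k in range(n):
                mults += 1
                if k > 0:
                    adds += 1
    return mults + adds

for n in (2, 3, 4, 5, 4000):
    exact = n * n * (2 * n - 1)      # TFPO formula from this post
    convention = 2 * n ** 3          # 2 * N^3 convention
    if n <= 5:
        assert count_classic_ops(n) == exact
    rel = (convention - exact) / convention
    print(f"N={n:5d}  exact={exact}  2N^3={convention}  relative gap={rel:.5f}")
```

The relative gap is 1 / (2N), so it is 12.5% at N = 4 but about 0.0125% at N = 4000; at the sizes benchmarked in this thread the choice of formula changes the reported GFLOPs by far less than the run-to-run rounding.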