Hi Intel Experts:
I cannot find GFLOPS figures for the latest Intel Haswell CPUs. Could you please point me to them?
I want to understand the performance difference between Haswell and Ivy Bridge, for example between the i7-4700HQ and the i7-3630QM. From the Intel website I know that the i7-3630QM is rated at 76.8 GFLOPS (base). Could you please tell me the figure for the i7-4700HQ?
I found the following information on the internet:
Intel Sandy Bridge and Ivy Bridge: 16 SP FLOPS/cycle --> one 8-wide AVX addition and one 8-wide AVX multiplication.
Intel Haswell: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions.
I have two questions:
1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) x 4 (quad-core) x 2.4 GHz (clock) = 153.6 GFLOPS = 76.8 x 2. Does this mean that one operation is a combined addition and multiplication?
2. Does Haswell have TWO FMA units?
Thank you very much for any comments.
Best Regards,
Sun Cao
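The arithmetic in the question above can be sketched in a few lines of Python. This is only an illustration: `peak_gflops` is a hypothetical helper, and the double-precision line assumes that a DP vector holds half as many lanes as an SP vector. Under that assumption the 76.8 figure matches a double-precision peak, though the post itself does not confirm how the 76.8 figure was derived.

```python
def peak_gflops(flops_per_cycle, cores, ghz):
    """Theoretical peak = FLOPs per cycle per core x cores x clock in GHz."""
    return flops_per_cycle * cores * ghz

# Ivy Bridge i7-3630QM: 4 cores at 2.4 GHz base (figures from the post above).
ivb_sp = peak_gflops(16, 4, 2.4)  # one 8-wide AVX add + one 8-wide AVX mul per cycle
ivb_dp = peak_gflops(8, 4, 2.4)   # assumption: DP has half the lanes of SP

# Haswell: two 8-wide FMAs per cycle, each lane counted as 2 FLOPs (mul + add).
hsw_sp_per_cycle = 2 * 8 * 2

print(f"Ivy Bridge SP peak: {ivb_sp:.1f} GFLOPS")
print(f"Ivy Bridge DP peak: {ivb_dp:.1f} GFLOPS")
print(f"Haswell SP FLOPs/cycle/core: {hsw_sp_per_cycle}")
```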
Hi Sergey:
You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm
Sergey Kostrov wrote:
[ Iliya Polak wrote ]
>>...Actually in Turbo Mode at 3.9 GHz theoretical peak performance expressed in DP GFlops is ~249 GFlops.
Iliya,
Where / how did you get that number?
He simply means the DP theoretical peak at 3.9 GHz with 4 cores (8 DP flops per FMA instruction, 2 FMA per clock), i.e. 3.9*4*8*2 = 249.6 GFlops.
Note that in my own report I mentioned the SP theoretical peak at 3.5 GHz with 1 core (16 SP flops per FMA instruction, 2 FMA per clock), i.e. 3.5*1*16*2 = 112 GFlops.
With my configuration, MKL LINPACK efficiency = 104.3/249.6 = ~41.8 %
My own FMA microbenchmark efficiency = 104.407/112 = ~93.2 %
As explained, this is because my own test has very low load/store activity relative to FMA computations, and most of these loads/stores are from/to the L1D cache.
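For readers who want to reproduce the two efficiency figures above, here is a minimal sketch. The helper name `efficiency` is made up for illustration; the clock speeds, core counts, and measured GFlops are the ones quoted in the post.

```python
def efficiency(measured_gflops, peak_gflops):
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / peak_gflops

# GHz x cores x flops per FMA instruction x FMA instructions per clock
dp_peak = 3.9 * 4 * 8 * 2    # DP peak, 4 cores in turbo mode
sp_peak = 3.5 * 1 * 16 * 2   # SP peak, 1 core

linpack_eff = efficiency(104.3, dp_peak)   # MKL LINPACK result
fma_eff = efficiency(104.407, sp_peak)     # FMA microbenchmark result

print(f"DP peak {dp_peak:.1f} GFlops, LINPACK efficiency {linpack_eff:.1%}")
print(f"SP peak {sp_peak:.1f} GFlops, FMA microbenchmark efficiency {fma_eff:.1%}")
```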
All,
I have not seen any results from MKL on Haswell. I've ported my DGEMM to use FMA3, which Haswell supports. On 1 core, with the frequency fixed at 3.4 GHz, I am achieving 90% efficiency in DGEMM. On 4 cores, I'm achieving 82% efficiency, or 179 GFLOPs. Sergey, those HPL efficiencies are quite poor on HW. For HPL on SB/IB, I think 90% efficiency is a good number (my DGEMM on those architectures is 95-96% efficient, and a well-tuned HPL loses 5-7% relative to the DGEMM efficiency). Later I may post the DGEMM test in case anyone is interested in running it. Just thought I'd let others know you can get ~50 GFLOPs on 1 core at 3.4 GHz. I'm not seeing the efficiency scale with multiple cores, though... yet. Lastly, these are preliminary numbers.
I thought I'd also mention that I've not been able to get a read bandwidth from the L1 that saturates at 64 B per clk. At 3.4 GHz, I've achieved 58.5 B/clk of read bandwidth. Likewise, my efforts to maximize the copy bandwidth haven't been successful: I've achieved 58.2 B/clk of copy bandwidth. L2 bandwidth is nowhere near 64B per clk.. but around 246B per clk is what I've achieved for read bandwidth. If you have any results on cache I/O on your hardware, let me know.
Perfwise
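To put the L1 figures above into more familiar units, the sketch below converts bytes per clock into GB/s at the stated 3.4 GHz. The helper name is hypothetical, and 1 GB is taken as 10^9 bytes; only the L1 read and copy numbers from the post are used.

```python
def bytes_per_clk_to_gb_per_s(bytes_per_clk, ghz):
    """bytes/clock x clocks/ns = bytes/ns, which equals GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_clk * ghz

CLOCK_GHZ = 3.4
L1_TARGET_B_PER_CLK = 64  # the saturation figure the post compares against

for label, measured in [("L1 read", 58.5), ("L1 copy", 58.2)]:
    gb_s = bytes_per_clk_to_gb_per_s(measured, CLOCK_GHZ)
    fraction = measured / L1_TARGET_B_PER_CLK
    print(f"{label}: {gb_s:.1f} GB/s, {fraction:.1%} of 64 B/clk")
```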
OK, I've got my DGEMM at 91-92% efficiency on 1 core. That's ~50 GFLOPs on 1 HW core at 3.4 GHz. The 4-core number on an 8000-cubed DGEMM is 186.6 GFLOPs, which is a hair over 85% efficiency. Power at idle is ~45 W; when running this code it's 140 W. Interesting. I might be able to get a bit more out of it. Just thought I'd post an update on what I've observed so far in terms of Haswell high-performance code efficiency.
Sergey,
Building for IB is pointless: it doesn't use FMA3, which you need in order to max out the FLOPs. Also, I focus on DGEMM and then HPL. HPL is limited by DGEMM performance, but on a particular class of problems: if you consider a DGEMM of an MxK matrix times a KxN matrix, M and N are large and K is small. For the Boeing sparse solver, N can also be small. K is somewhat tunable, and is a blocking parameter.
I just ran my DGEMM for SB/IB on an 8000 x 128 x 8192 [M x K x N] problem, and it achieved 24.3 GFLOPs on 1 core at 3.4 GHz, which is ~90% efficiency for those 2 architectures. For an 8000 x 8000 x 8000 problem I get over 100 GFLOPs on 4 IB cores. For HW, running a similar problem I'm getting 45.7 GFLOPs on 1 core at 3.4 GHz, which is 84% efficiency. Running with K=256 I get 46.5 GFLOPs (85.5% efficiency) and with K=384 I get 48.5 GFLOPs (89% efficiency). Asymptotic efficiency is 92.5%, about 3% below that of SB and IB, but that's somewhat expected. Amdahl's law is coming into play, and the overheads of doing this "real" computation are chewing a bit into the efficiency. I think I'll improve it as time goes on, but I just thought I'd throw out what I have measured/achieved to see if anyone else has some real numbers. On a HW with 4 cores running at 3.4 GHz on a 16000 x 8192 x 8192 problem I just achieved 190.4 GFLOPs, or 87.5% efficiency. I'd expect Intel to do better than my 2 days' worth of tuning on a full DGEMM.
As far as the algorithm goes, I'd rather not divulge my techniques, but there's lots of documentation on this subject, and a long history of a few but good people doing it in the past, far fewer in the present. It's my own code and just a hobby, but you or others should try doing it yourself. You'll learn a lot about performance that isn't documented or discussed, and you'll be a better tweaker for it.
Perfwise
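The efficiency percentages quoted above can be checked with a short calculation. This is a sketch under stated assumptions: the Haswell peak of 16 DP FLOPs per cycle per core follows from the "8 DP flops per FMA instruction, 2 FMA per clock" figure given earlier in the thread, while the 8 DP FLOPs per cycle for SB/IB is assumed to be half of the 16 SP FLOPs per cycle from the original question. The function and constant names are illustrative only.

```python
CLOCK_GHZ = 3.4
HSW_DP_PER_CYCLE = 16  # 2 FMA per clock x 8 DP flops per FMA (from the thread)
IVB_DP_PER_CYCLE = 8   # assumption: half of the 16 SP flops/cycle for SB/IB

def dgemm_efficiency(measured_gflops, flops_per_cycle, cores, ghz=CLOCK_GHZ):
    """Measured GFLOPs divided by the theoretical DP peak."""
    return measured_gflops / (flops_per_cycle * cores * ghz)

# (label, measured GFLOPs, FLOPs/cycle/core, cores), all taken from the post.
runs = [
    ("SB/IB 1 core, K=128", 24.3, IVB_DP_PER_CYCLE, 1),
    ("HW 1 core, K=128", 45.7, HSW_DP_PER_CYCLE, 1),
    ("HW 1 core, K=256", 46.5, HSW_DP_PER_CYCLE, 1),
    ("HW 1 core, K=384", 48.5, HSW_DP_PER_CYCLE, 1),
    ("HW 4 cores, 16000 x 8192 x 8192", 190.4, HSW_DP_PER_CYCLE, 4),
]

for label, gflops, per_cycle, cores in runs:
    print(f"{label}: {dgemm_efficiency(gflops, per_cycle, cores):.1%}")
```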
Sergey,
From my experience working with people in the industry, I define matrix multiplication FLOPs as in traditional linear algebra, which is 2 * N^3, or to be precise 2 * M * K * N. The other methods you mention entail lower numerical accuracy, greater memory usage, or implementation difficulties; they give rise to fewer FLOPs, but also lower IPC and lower performance. Maybe I'll try them someday, but in my experience, and that of the people I've worked with over the past 20 years, I've not found them widely applied. One thing I've yet to mention on this topic: I only post results with a frozen frequency, never ones that include turbo boost. So now that you know how I'm measuring the FLOP count and that I'm running at 3.4 GHz, you can determine the number of clocks or seconds it takes on HW to do a DGEMM computation. I measured the following today:
SIZE^3, 1-core GFLOPs, 1-core TIME(s), 4-core GFLOPs, 4-core TIME(s)
4000, 50.3, 2.54, 172.8, 0.74
8000, 50.4, 20.3, 186.6, 5.49
16000, 51.3, 159.7, 192.7, 42.5
I think it's important to note that square problems are not very useful in DGEMM. For "practical" solvers you need to focus on the other sizes I mentioned in the previous posts.
Perfwise
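As a cross-check on the table above, the times can be recovered from the 2 * M * K * N FLOP count and the reported GFLOPs. The function name below is hypothetical; the sizes and rates are the ones in the table.

```python
def dgemm_seconds(m, k, n, gflops):
    """Run time implied by the conventional 2*M*K*N FLOP count and a GFLOPs rate."""
    return 2 * m * k * n / (gflops * 1e9)

# (size, 1-core GFLOPs, 4-core GFLOPs) for square SIZE x SIZE x SIZE problems.
table = [
    (4000, 50.3, 172.8),
    (8000, 50.4, 186.6),
    (16000, 51.3, 192.7),
]

for size, one_core, four_core in table:
    t1 = dgemm_seconds(size, size, size, one_core)
    t4 = dgemm_seconds(size, size, size, four_core)
    print(f"{size}: 1-core {t1:.2f} s, 4-core {t4:.2f} s")
```

The computed values agree with the posted times to within rounding.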
Sergey Kostrov wrote:
[ Iliya Polak wrote ]
>>...Actually in Turbo Mode at 3.9 GHz theoretical peak performance expressed in DP GFlops is ~249 GFlops.
Iliya,
Where / how did you get that number?
Please explain, because it is more than twice the best number in the 2nd test of bronxzv ( 104.2632 GFlops ). Igor's number ( 116 GFlops ) is very close to bronxzv's number ( ~10% difference ).
Sorry for the late answer (never-ending problems with my backup laptop).
Those numbers are theoretical peak performance, as @bronxzv explained in his answer.
It could be interesting to run that benchmark under VTune. I am interested in seeing the ratio of clockticks per instruction retired (CPI).
Sergey... my results are for double precision, while you appear to be running single precision, at least in MKL, since you are timing sgemm rather than dgemm. You should make an apples-to-apples comparison. If your timings are in SP, then you should halve my times, since my GFLOPs would double.
Perfwise
Also, freeze your frequency at 2.5 GHz to avoid including turbo boost. I always do that to discern the real architectural IPC performance.