Judging by raw performance

sagrailo · ‎01-30-2010

Are GFLOPS numbers for Intel processors reported here for single or double precision floating point operations? Roughly, I'd expect peak GFLOPS to be reported as the product of the number of processor cores, the number of SSE units, and the clock frequency, multiplied by the number of SSE operands (2 for double precision, 4 for single precision). So I'd say the numbers reported above are for single precision operations; however, I'm not sure whether it is possible to execute several floating point operations (for example, multiply and add) during a single clock cycle...
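The estimate described in this post can be written out as a short sketch. The function name and the example inputs (a hypothetical 4-core part at 3.0 GHz with one SSE unit per core) are illustrative assumptions, not figures from this thread.

```python
def peak_gflops(cores, simd_units_per_core, clock_ghz, lanes):
    """Peak GFLOP/s = cores x SIMD units per core x clock (GHz) x operands per register."""
    return cores * simd_units_per_core * clock_ghz * lanes

SP_LANES = 4  # a 128-bit SSE register holds 4 single-precision values
DP_LANES = 2  # or 2 double-precision values

# Hypothetical example part: 4 cores, 1 SSE unit per core, 3.0 GHz
print("SP peak:", peak_gflops(4, 1, 3.0, SP_LANES), "GFLOP/s")
print("DP peak:", peak_gflops(4, 1, 3.0, DP_LANES), "GFLOP/s")
```

Under this simple model, a core that can issue a multiply and an add in the same cycle would be described by passing 2 for simd_units_per_core.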

TimP · ‎01-31-2010

I did attempt to post the comment that current Intel CPUs do permit both a parallel multiply and a parallel add to issue on the same cycle. I believe there are CPUs still in production where this is not so.

As Max said, there is a plan to increase the rate by use of wider registers in a future product; however, there will be stronger requirements on data alignment, and even with those observed, many applications will be limited by a less than doubling of data transfer rates.

Max_L · ‎02-05-2010

Hi, the intent of the documentation you referred to is export compliance. The disclaimer here http://www.intel.com/support/processors/sb/cs-017346.htm says THESE CALCULATIONS ARE PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, and I'll refrain from commenting on those numbers.

But if you are interested in actual peak theoretical (and in fact achievable, unlike many of those cited by our GPU-producing friends) numbers, you can take a look at this older post of mine (http://software.intel.com/en-us/forums/showpost.php?p=60696). To give you a quick answer, for the last couple of generations of Intel CPUs it is 8 SIMD SP FP operations/cycle (4 SIMD SP ADD + 4 SIMD SP MUL) and 4 SIMD DP FP ops/cycle (2 SIMD DP ADD + 2 SIMD DP MUL), and both will double with the next-generation 'Sandy Bridge' microarchitecture of Intel processors.

-Max

sagrailo · ‎02-10-2010

Thanks for your reply. The numbers you mentioned (for example: 8 SIMD SP FP operations/cycle) are per core, right? So Core 2 Duo is actually capable of 16 SIMD SP FP operations/cycle, and Core 2 Quad is capable of 32 SIMD SP FP operations/cycle?

TimP · ‎02-10-2010

I believe this is true of the Penryn peak performance, as well as Core i7. My Core 2 Duo isn't capable of issuing multiply and add on the same cycle.

Max_L · ‎02-11-2010

sagrailo, you are right, it is per core.

tim18, I'm not sure why you think it cannot. It is the Pentium 4 which could not issue MUL and ADD in the same cycle, as both are on the same exec port. The numbers above are true for all Core 2 Duo CPUs (including 65-nm CPUs, aka core codenames 'Merom'/'Conroe', as well as 45-nm, aka core codename 'Penryn'), for Core i7, and for Xeon models based on the same u-arch too.

-Max

mr_nuke · ‎02-12-2010

The numbers cited by our GPU friends are actually quite achievable. Also, those from our green friends don't include FLOPs from the special function units (those that do sqrt, ln, etc.), and as a result, a carefully written algorithm can in fact achieve a higher performance than that stated by our GPU friends.

I am working with both GPU and CPU code, and I can tell you that many sub-$100 cards will beat Intel's $3200 Dual Quad Core Xeon top of the line system. Going back to the peak GFLOP/s issue, it's actually much harder to achieve good performance with SSE.

First of all, you have no direct control of the cache, and you have to rely on speculation: "dear CPU, please keep this data in the cache and don't go to memory every time I need it". That's, at best, speculative cache control, which changes between architectures, is only partially reliable, and so on. On the GPU, I have full control of what I want in the cache, and of memory operations, and as an added bonus, those optimizations are entirely valid between architectures.

Second, the GPU can (and in fact will) run more than the "2 threads per core" akin to Hyper-Threading. That means it easily has enough instructions to keep the ALUs occupied at almost all times. The CPU, on the other hand, often says "oh*, I need to fetch some more data, and the other thread is waiting on some too", leaving the SSE units or FPU idle.

Also, I can use several more registers on the GPU, and many times avoid going to memory, or even cache. Using 32 registers is not uncommon, which means I can keep operands ready when needed in more complex algorithms. Compare that to the 16 SSE registers, and you'll know what I mean. I should also mention that GPU instructions do not overwrite their input operands, so there is no need for movaps/movapd-equivalent instructions between registers, which do slightly impact performance on the CPU side.

In real-life situations, it's easy for me to get 65% of the quoted theoretical GPU GFLOP/s, whereas 25% of the CPU theoretical GFLOP/s is an achievement in itself. Of course, I can write, and have written, idiotic calculations that get within 1 GFLOP/s of the theoretical GPU GFLOP/s... umh, not as much on the CPU.

So, IMHO, the numbers our GPU friends give us, are far more believable than what our CPU buddies want us to think.

Alex

Max_L · ‎02-14-2010

Hello Alex,

You have a few controversial statements in the post; I will try to address a couple. If I understand you right, you are saying that graphics cards beat CPUs in FLOPS use efficiency. You will certainly have a hard time finding any results to confirm this.

Although the architecture provides only 8 or 16 general-purpose + 8 or 16 vector registers _architecturally_, the underlying microarchitecture in fact has more than a hundred of them. x86 assembler currently serves perfectly well as a higher-level language of sorts: it is the advanced out-of-order microarchitecture that takes care of renaming/allocating hardware registers, hiding latency, and pipelining execution of dependency chains. It more than compensates for the relatively modest number of registers provided by the ISA.

Best regards,

-Max

mr_nuke · ‎02-14-2010

Hello Max,

I don't need to look for results to confirm my statements; I already have the results at hand.

FLOP use efficiency

Two 9800GTs @ 1.62 GHz:

Theoretical performance: 816.48 GFLOP/s (including FLOPs from the special function units (SFU), which are not included in the numbers stated by NVIDIA)
Theoretical performance as calculated by NVIDIA: 725.76 GFLOP/s
Peak sustained performance: 464 GFLOP/s
FLOP use efficiency: 56.8% (including SFU FLOPs), 63.9% (excluding SFU FLOPs)

Core i7-920 @2.8GHz, x3 channel DDR3-1600:

Theoretical performance: 89.6 GFLOP/s (according to your statements about add and mul in 1 clock cycle)
Peak sustained performance 30 GFLOP/s (after many sleepless nights of optimizations)
FLOP use efficiency: 33.5%
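The efficiency figures listed above are simply sustained throughput divided by theoretical peak. A minimal sketch that reproduces them from the numbers in this post (the function and variable names are only illustrative):

```python
def flop_use_efficiency(sustained_gflops, peak_gflops):
    """Sustained throughput as a percentage of theoretical peak, to one decimal."""
    return round(100.0 * sustained_gflops / peak_gflops, 1)

# Two 9800GTs: 464 GFLOP/s sustained
gpu_with_sfu = flop_use_efficiency(464, 816.48)     # peak including SFU FLOPs
gpu_without_sfu = flop_use_efficiency(464, 725.76)  # peak as calculated by NVIDIA

# Core i7-920: 30 GFLOP/s sustained against an 89.6 GFLOP/s peak
cpu = flop_use_efficiency(30, 89.6)

print(gpu_with_sfu, gpu_without_sfu, cpu)
```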

I used an electrostatics simulation for this test, which is a real-life problem. Source code available via svn at http://code.google.com/p/electromag-with-cuda should you wish to verify my claims.

Clock cycle use efficiency

To get a little more technical, I checked the so-called assembly code for the GPU-side calculations. I have 14 instructions that execute 18 FLOPs. Five of them are MADs, which, due to data dependencies, the CPU cannot simulate by executing the corresponding mul and add in the same clock cycle. Two of them are a reciprocal and a multiplication, which together represent a division (and thus only count as 1 FLOP). There are also four other instructions that perform reads from shared memory (AKA cache on the CPU), but, as we will see, their execution is completely hidden by the pipeline.

Here's the math: using 2 GPUs at 1.62 GHz, each with 112 add-mul-mad units, we execute

2 * 1.62 * 112 = 362.88 billion instructions (which I shall refer to as Gcycles)

Assuming 18 FLOPs for each 14 instructions, we would get a maximum achievable performance of

18 FLOPs / 14 cycles * 362.88 Gcycles = 466.56 GFLOP/s

Dividing 464 by 466.56, we get 99.5% clock cycle usage efficiency.

On the CPU side we have 2.8 * 4 = 11.2 Gcycles

I have exactly 18 SSE instructions that do meaningful FLOPs, so that is 72 FLOPs per 18 cycles

Therefore, the maximum we can get (being lenient, and forgetting about the ability to do an add and a mul in the same cycle) is 72 FLOPs / 18 cycles * 11.2 Gcycles = 44.8 GFLOP/s.

That comes to 30/44.8 = 67.0% clock cycle usage efficiency.

Now let's look from your perspective. We have 8 adds or subs, which should execute on the add unit, 8 muls, which should execute on the mul unit, 1 div, and 1 sqrt. Suppose the adds and muls were to magically execute at the same time, all in only 8 clock cycles, and we add two more cycles for the div and sqrt (we are interested in efficiency, so we don't care if they actually take 200 clock cycles each). Then we have 72 FLOPs in only 10 cycles (WOW, sounds great). Doing the math, we come to a mere 37.2% efficiency.
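The clock-cycle arithmetic for both devices can be checked with a short sketch. All inputs are the figures quoted in this post; the function names are only illustrative.

```python
def cycle_bound_gflops(flops, cycles, gcycles_per_second):
    """Upper bound on GFLOP/s when `flops` FLOPs are issued over `cycles` cycles."""
    return flops / cycles * gcycles_per_second

def usage_pct(sustained_gflops, bound_gflops):
    """Sustained throughput as a percentage of the bound, to one decimal."""
    return round(100.0 * sustained_gflops / bound_gflops, 1)

# GPU: 2 cards x 1.62 GHz x 112 units; 18 FLOPs per 14 instructions
gpu_gcycles = 2 * 1.62 * 112
gpu_bound = cycle_bound_gflops(18, 14, gpu_gcycles)

# CPU: 4 cores x 2.8 GHz; 72 FLOPs per 18 SSE instructions, one per cycle
cpu_gcycles = 2.8 * 4
cpu_single_issue = cycle_bound_gflops(72, 18, cpu_gcycles)

# CPU with add and mul paired: 8 cycles, plus 2 for the div and sqrt
cpu_dual_issue = cycle_bound_gflops(72, 10, cpu_gcycles)

print(usage_pct(464, gpu_bound))        # GPU clock cycle usage
print(usage_pct(30, cpu_single_issue))  # CPU, one SSE instruction per cycle
print(usage_pct(30, cpu_dual_issue))    # CPU, add and mul in the same cycle
```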

Also note that my results indicate that the GPU is so efficient at hiding pipeline and execution latency that sqrt appears as if it executed in only 1 clock cycle.

CPU architecture

I did figure out that Hyperthreading needed a second set of registers to become feasible. Still, you quote over a hundred registers, which are not available to me as a programmer, and neither are they directly available to a compiler. Thus, in order to take advantage of those registers, I have to speculate that doing things a certain way will use those registers to achieve what I wish. This "speculative optimization" is akin to driving blindfolded on the freeway, until you hear the least number of crashes.

Also, I do not wish to bring the "cache pollution" argument, as I have no conclusive data that GPUs don't suffer from that. I just wish to note that while processing an 820MB dataset in 90 seconds at 464GFLOP/s, the GPUs still had to handle my displays.

Conclusion

Unless I've completely missed something, or I have written the most horrible CPU-side code, my problem is handled much more efficiently by NVIDIA's architecture. This is not Fermi, nor GT200, but a slightly improved version of the old 8800GT, which in terms of computing is much older than the state-of-the-art Nehalem, yet still obscenely more efficient, and my data proves it. I cannot speak in regards to ATI hardware, as I have not had a chance to test it.

Of course, there are problems which are more efficiently handled by the CPU, but not many of them are fundamentally parallel.

I started this discussion in response to what I saw as a sarcastic comment, saying that CPU GFLOP/s are reasonable, "and in fact achievable, unlike many of those cited by our GPU producing friends". Those numbers are the absolute maximum achievable in absolutely perfect conditions, and in fact, repulsive asymptotes for real-life problems. I have given you a real case that is better handled by the GPU. You may show me a contrary case, or even point out some optimizations that I might have missed on the CPU side.

The reality is that the CPU and GPU are the most efficient devices at what they were designed to do. On my Core i7, I can run 8 threads of Prime95, run a webserver, several download managers, two to three virtual machines, a software RAID5, and game, all at the same time. I couldn't do this without an efficient CPU. I have shelled out four years' worth of savings to get my i7 system, because I needed something efficient at that. I got my GPUs because I needed something efficient at a different task.

I don't like accusations flying around that GPUs are not efficient at what they do, or that Intel is dragging itself on its knees to catch up with the GPU crowd. If we keep launching such unreasonable remarks, then we may just install our operating systems on the GPU, and use the CPU to play Crysis.

And to end with a rhetorical question, if Intel is so keen on tightly hugging the High-FLOP/s market, why doesn't it develop a device specific to that, rather than shoving an x86 (or IA64) down our throats? A year and a half ago, when I first heard of the Terascale project, it seemed like the biggest "Ha! I told you so!!!" to NVIDIA. 1 TFLOP/s at only 62W is something that the HPC crowd will appreciate much more than 50 GFLOP/s at 130W.

==Alex

nikolai_ · ‎02-14-2010

Wait a second, go to NVIDIA's website for tesla and you'll see 933 Gflops SP FP for their Tesla card

can you achieve 933Gflops? ABSOLUTELY NOT!

their ridiculous claim of being able to achieve MAD + MUL on every clock can at best reach 75%

Max's initial statement is well received (especially when you factor in the ludicrous speedups of multiple orders of magnitude running rampant over NVIDIA's website)

Furthermore, of course there will be programs better suited for GPU over CPU, and yours seems to fit well with the GPU. If we look at a much more common operation, let's say SGEMM, well, the CPU can reach up to 92% of theoretical and the GPU only 60% (and this is assuming just MAD per clock (i.e. not 933 Gflops))

mr_nuke · ‎02-15-2010

Hi Nikolai,

I will not comment on the results you quote without being able to verify them. Perhaps you could provide a link to the source code or paper where you found them.

I will, however, try to address your statement about the 933 GFLOP/s. No, you will never reach that in real-life applications, unless your problem involves computing MAD after MUL after MAD after MUL, where the result of one operation does not depend on the previous one. I have spent quite some time on the NVIDIA forums, heard from other GPU and HPC programmers and NVIDIA engineers, and I believe I know the architecture slightly better than most people. GT200 (AKA compute 1.3) has extra units for the mul, and thus is capable of issuing a MAD and a MUL in the same clock cycle. There just aren't enough real-life problems that do that.

Give me a GT200, and I will give you conclusive results. Also please cite the numbers you quote. I am researching performance vs. precision vs. efficiency on CPU and GPU, and I am interested in getting a broader look at the issue.

==Alex

nikolai_ · ‎02-15-2010

Again, I'd like to point out that Max's statement of NVIDIA listing unachievable theoretical flops is 100% correct

You cannot achieve 933 Gflops in an "ideal" situation, let alone a real-life application. The dual issue of MAD+MUL is not achievable, not only on the GT200 cards but on all of them. And by ideal situations, I mean you literally just do computation on garbage values in the registers in a massively unrolled loop. If it's just MAD, then you can get 99%, but dual issue is ~70-75%. NVIDIA quotes theoretical flops that are NOT achievable.
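To see where a figure like 933 Gflops comes from, here is a rough sketch that counts 3 FLOPs per core per clock for MAD+MUL versus 2 for MAD alone. The core count and shader clock are assumptions (the commonly published 240 cores at 1.296 GHz for the GT200-based Tesla card); neither number is stated in this thread.

```python
CORES = 240        # assumed: streaming processor count, not given in the thread
CLOCK_GHZ = 1.296  # assumed: shader clock, not given in the thread

def peak_gflops(flops_per_core_per_clock):
    """Theoretical peak for a given number of FLOPs issued per core per clock."""
    return CORES * CLOCK_GHZ * flops_per_core_per_clock

mad_only = peak_gflops(2)     # MAD = multiply + add = 2 FLOPs
mad_plus_mul = peak_gflops(3) # dual-issued MAD + MUL = 3 FLOPs

print(round(mad_plus_mul, 2), round(mad_only, 2))
print(round(100 * mad_only / mad_plus_mul, 1), "% of advertised peak with MAD alone")
```

Under these assumptions, MAD-only code tops out at two thirds of the advertised number, which gives some context for the ~70-75% observed with dual issue.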

here's some quick code (and results for 8800 and gtx260) you can try on any nvidia cuda capable card

http://forums.nvidia.com/index.php?showtopic=104498&st=20&p=581105&#entry581105

here's a possible explanation for the dual issue problem from Sylvain Collange:

Peak throughput from registers should be around 128B/clock on G80, and probably higher on GT200.
Note that dual issuing MAD+MUL with all inputs coming from different registers requires 160B/clock (5 registers read at the same time). This is my personal explanation for the difficulty to reach peak flops on G80...

so basically he's speculating that, yes there are SFU units that can do the MUL, but NVIDIA are unable to feed the FPUs with enough data on each clock, rendering them nearly useless
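The register-bandwidth argument in the quote can be sketched numerically. The 32 bytes per register read is only what the quoted figures imply (160 B/clock for 5 registers), and the resulting ceiling is a back-of-the-envelope estimate, not a measured result.

```python
PEAK_REGISTER_BYTES_PER_CLOCK = 128  # quoted estimate for G80
BYTES_PER_REGISTER_READ = 160 / 5    # implied by "160B/clock (5 registers)"

# Source operands read per instruction: MAD is a*b+c, MUL is a*b
SOURCE_OPERANDS = {"MAD": 3, "MUL": 2}

def bytes_per_clock(instructions):
    """Register bytes needed to issue the given instructions in one clock."""
    reads = sum(SOURCE_OPERANDS[name] for name in instructions)
    return reads * BYTES_PER_REGISTER_READ

mad_alone = bytes_per_clock(["MAD"])
dual_issue = bytes_per_clock(["MAD", "MUL"])

print(mad_alone, mad_alone <= PEAK_REGISTER_BYTES_PER_CLOCK)
print(dual_issue, PEAK_REGISTER_BYTES_PER_CLOCK / dual_issue)
```

With these numbers, MAD alone fits within the register bandwidth, while MAD+MUL with all-distinct inputs can be fed only about 80% of the time.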

As for the SGEMM results, it's from a paper which i'm sure you're familiar with:

http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf

in general, gpus aren't that much better than cpus as nvidia would like you to think

mr_nuke · ‎02-15-2010

Thanks for the references. I will study those in more detail when I have the chance, and abstain from commenting on them until I come to my own conclusion.

==Alex

tthsqe · ‎02-16-2010

mr_nuke, if you're only getting 30% of the peak flops, your code might need some adjusting - on a mandelbrot set calculation I've seen the core i7 get 84% efficiency (3.3 flop / clock). To get this, you have to increase the parallelism in your code.

mr_nuke · ‎02-16-2010

I agree there may be some problems with caching/memory. Otherwise, it's eight OpenMP threads, a 16-byte memory read for 288 FLOPs (72 SSE instructions doing meaningful math), and the core is fully vertically vectorized. I doubt the problem is parallelism.

Both intrinsics and inline assembly (here it's one 16-byte read per 72 FLOPs, but everything fits snugly into registers) yield about the same performance (which, surprisingly, is close to what Intel's compiler will yield for non-vectorized C++ code). I would like to get a peek at possible extra optimization, but it's probably better I start a new thread for that (and I have spent way too much time optimizing the CPU code instead of doing meaningful work).

==Alex

tthsqe · ‎02-19-2010

Could you post some of the problem code? (the link you gave above does not work for me)

mr_nuke · ‎02-19-2010

Sure. Just give me a couple of hours to isolate the code with a working test case. As it is right now it's intricately tied to almost everything else.

==Alex

mr_nuke · ‎02-19-2010

Ok, I've created a new thread here:

http://software.intel.com/en-us/forums/showthread.php?t=72115

I feel I would be hijacking this one to post the code here.

Bernard · ‎01-04-2013

Judging by raw performance measured in GFLOPS of a single hyperthreaded core: when one logical core handles integer data (for example, a loop counter with fused cmp/jmp and dec instructions) executed on Port 5, and the second logical core handles double-precision floating-point addition of 4D vectors (four scalar components) using AVX 256-bit vector instructions, then, when not saturated, it is possible to achieve a theoretical throughput of 8 DP flops per cycle on Port 1, i.e. ~24 GFLOPS on a single core. Multiplied by four physical cores, you can reach almost 96 GFLOPS. Here is a good article: http://software.intel.com/en-us/forums/topic/291765

GFLOPS numbers advertised by Intel