Peak on XEON and Opteron

Alfredo · ‎04-11-2007

Hi,
I may be slightly OT but I couldn't imagine any better place to ask this. I want to compute the theoretical peak performance of both a dual-core 3.0 GHz Woodcrest and a dual-core 2.2 GHz Opteron processor.
Googling around, it looks like, besides the difference in clock speed, there is a factor of two that I cannot understand.
For the Woodcrest I assume that it is possible to do vector fused multiply-add fp operations. That would make 2*2=4 operations/cycle (2 because of vector and 2 because of fused madd) and thus (there's another factor of 2 coming from the dual-core) 3.0*4*2=24 Gflop/s theoretical peak. Is there anything wrong with this?

Now, for the Opteron I would do the same, just replacing 3.0 with 2.2, but that turns out to be wrong because Google says that the peak is exactly half as much (I didn't find any explanation for this). Aren't Opterons capable of doing vector fused madd? They have SSE3 so...

Is there anybody here who can help me with this?
Also, if you know any place/forum where I can redirect my question, please let me know.

Regards

Alfredo Buttari

jimdempseyatthecove · ‎04-11-2007

Alfredo,

Theoretical peak performance is seldom seen except in a carefully crafted benchmark written by the party wishing to amplify the peak performance. Many issues in a real application cause it to deliver (much) less than peak performance.

Of particular interest to you is the nature of _your_ application and how it relates to memory access patterns. Intel FP calculations are supposedly faster in the core, but Opteron, in NUMA configurations, can deliver data faster to the pipeline. As to which system is faster... this depends on your application.

Jim Dempsey

BTW I have a 4-core Opteron system (2 x 270 dual-core processors).
I would be willing to accept a donation of a 2 x 4 Core Xeon 3.0 GHz system...

TimP · ‎04-11-2007

There is no fused multiply-add in any currently available Xeon or Opteron product. Peak performance for Woodcrest would be achieved when both a parallel multiply and a parallel add can be retired for each clock cycle. For double precision, that would reach your total of 4 operations per cycle per core.

As I understand it, current Opteron requires 2 cycles to issue the 4 operations, as there is no full parallelism. Parallel SSE2 operations are split between 2 fp units, and peak double precision rate is the same as for serial SSE2 operations.
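To put numbers on that, here is a minimal sketch of the peak arithmetic, assuming the per-cycle figures described above (4 double precision operations per cycle per core on Woodcrest, 4 operations per 2 cycles on Opteron). The function name `peak_gflops` is just an illustrative helper, not anything from a library.

```python
def peak_gflops(clock_ghz, flops_per_cycle_per_core, cores):
    """Theoretical peak = clock (GHz) x flops per cycle per core x cores."""
    return clock_ghz * flops_per_cycle_per_core * cores

# Woodcrest: one packed multiply + one packed add per cycle,
# each on 2 doubles, so 4 flops per cycle per core.
woodcrest = peak_gflops(3.0, 4, 2)

# Opteron (as described above): 4 operations take 2 cycles,
# so 2 flops per cycle per core.
opteron = peak_gflops(2.2, 2, 2)

print(f"Woodcrest: {woodcrest:.1f} Gflop/s")
print(f"Opteron:   {opteron:.1f} Gflop/s")
```

Under these assumptions the Opteron figure comes out at half of what the "same formula with 2.2 instead of 3.0" would give, which matches the factor of two in the original question.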

SSE3 has nothing to do with this, except that all SSE3 machines of course include SSE2. The first Intel SSE3 CPUs could issue a parallel multiply only every other clock cycle, and add only on cycles not taken by multiply, as I understood it. Peak fp rate per clock cycle was the same as Opteron, but the clock rates generally were significantly higher.

It is more difficult in practice to approach the peak fp rate on Woodcrest than on Opteron. The peak rate on Woodcrest can't be sustained unless half the operands are loop-invariant register operands and all operations are parallel SSE. Practical problems are also affected by bus data transfer and similar issues, where there are differences between the brands.

MKL DGEMM should be capable of reaching 90% of peak for certain sizes of problems. In my experience, that would be the only semi-practical situation where peak fp rate has relevance. So, if you are interested in demonstrated MKL performance, you could go to the MKL forum, after searching for posted articles on MKL.
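If you want to check a DGEMM run against the peak yourself, a rough sketch follows. It assumes the conventional count of 2*n^3 floating point operations for an n x n matrix multiply; the function name and the idea of passing in your own measured wall-clock time are illustrative, not part of MKL.

```python
def dgemm_fraction_of_peak(n, seconds, peak_gflops):
    """Fraction of theoretical peak reached by an n x n DGEMM.

    Assumes the usual 2*n^3 flop count (one multiply and one add
    per inner-product term); `seconds` is your measured run time.
    """
    flops = 2.0 * n ** 3
    achieved_gflops = flops / seconds / 1e9
    return achieved_gflops / peak_gflops

def seconds_at_fraction(n, fraction, peak_gflops):
    """Run time an n x n DGEMM would need to hit a given fraction of peak."""
    return 2.0 * n ** 3 / (fraction * peak_gflops * 1e9)

# Time budget for n = 4000 at 90% of a 24 Gflop/s peak.
print(f"{seconds_at_fraction(4000, 0.9, 24.0):.2f} s")
```

Time the DGEMM call on your own machine and feed the result to `dgemm_fraction_of_peak`; anything much below the figure you expect usually points at problem size or memory effects rather than the fp units.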

Alfredo · ‎04-11-2007

tim18:
There is no fused multiply-add in any currently available Xeon or Opteron product. Peak performance for Woodcrest would be achieved when both a parallel multiply and a parallel add can be retired for each clock cycle. For double precision, that would reach your total of 4 operations per cycle per core.

As I understand it, current Opteron requires 2 cycles to issue the 4 operations, as there is no full parallelism. Parallel SSE2 operations are split between 2 fp units, and peak double precision rate is the same as for serial SSE2 operations.

Tim,
so, if I got it right, you are saying that even if there's no fused madd, it is still possible to execute one mul and one add every clock cycle on Xeon processors. Where does this parallelism come from? Are there multiple ALUs?
Thanks a lot

Alfredo

PS
GotoBLAS is usually capable of delivering higher performance than MKL and thus gets closer to the theoretical peak on DGEMM.