throughput for FMA instructions

MarkC_Intel · ‎12-22-2010

Hello,

We just released Intel Software Development Emulator version 3.88:

http://www.intel.com/software/sde

The release notes summarize the many changes that have been implemented.

It supports the "Post-32nm processor instructions" added in revision 008 of the AVX programmers reference on this page:

http://www.intel.com/software/avx

In particular, it supports RDRAND, VCVTPS2PH and VCVTPH2PS. The instructions {RD,WR}{FS,GS}BASE are of limited utility in the emulator as they require more changes/support from the operating system.
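For readers unfamiliar with the half-precision conversions, here is a minimal sketch of what an FP32 -> FP16 -> FP32 round trip does to a value. It uses Python's `struct` module (format code `e`, IEEE 754 binary16) rather than the actual VCVTPS2PH/VCVTPH2PS instructions, so it only illustrates the numeric effect. The helper name `fp16_round_trip` is made up for this example.

```python
import struct

def fp16_round_trip(x: float) -> float:
    """Pack x as IEEE 754 binary16 (half precision) and unpack it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# 1.0 is exactly representable in half precision, 0.1 is not.
print(fp16_round_trip(1.0))
print(fp16_round_trip(0.1))
```

Values that fit in the 10-bit half-precision mantissa survive unchanged; everything else is rounded to the nearest representable half.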

bronxzv · ‎02-17-2011

Is the throughput reported by SDE for FMA instructions correct (for the forthcoming first implementation in Ivy Bridge or Haswell)?

capens__nicolas · ‎06-08-2011

Quoting bronxzv

Is the throughput reported by SDE for FMA instructions correct (for the forthcoming first implementation in Ivy Bridge or Haswell)?

What throughput does it report?

bronxzv · ‎06-08-2011

No idea. I asked this question because I thought that I would soon be able to compile FMA3 code with the Intel compiler (the FMA intrinsics were already well documented in the C++ XE 12.0 documentation back in February). When I tried to compile a test FMA3 path, it turned out that it's not supported (there is no header with the intrinsics). It's rather odd, but the FMA intrinsics are still not available in XE 12.0 update 4, i.e. the documentation (*1) has been wrong for several months now.

*1: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/win/intref_cls/common/intref_bk_avx_fma.htm

capens__nicolas · ‎06-09-2011

Are just the C++ intrinsics missing, or does SDE not support FMA instructions yet at all (using binary code)?

Anyway, I really don't think Ivy Bridge will feature FMA support. The problem is that with two FMA units per core, the register file would require six read ports to sustain maximum throughput. According to Agner Fog's tests, they've only just increased it to four ports. Increasing it to six likely requires changes which are too significant to include in Ivy Bridge (although I'd love to be wrong).

Another solution would be to keep the current register file and fetch the third operand in another cycle. That would limit the peak sustainable performance, but most code has plenty of instructions with fewer input operands and integer instructions mixed in anyway. Also perhaps the bypass network can provide the extra operands most of the time. In any case, I fear we'll have to wait till Haswell for FMA support.
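The read-port argument above can be put into numbers with a small sketch. It assumes every FMA source operand is read from the register file (no help from the bypass network) and that each FMA needs three source operands; the function names are made up for this illustration and do not describe any actual Intel design.

```python
def read_ports_needed(fma_units: int, operands_per_op: int = 3) -> int:
    """Read ports required to feed every FMA unit on every cycle."""
    return fma_units * operands_per_op

def sustained_fma_per_cycle(fma_units: int, read_ports: int,
                            operands_per_op: int = 3) -> float:
    """Sustained FMAs per cycle when limited by register file read ports."""
    return min(float(fma_units), read_ports / operands_per_op)

print(read_ports_needed(2))            # ports needed for two FMA units
print(sustained_fma_per_cycle(2, 4))   # two units fed by four read ports
print(sustained_fma_per_cycle(2, 6))   # two units fed by six read ports
```

Under these assumptions, four read ports would cap a pure FMA stream at 4/3 FMAs per cycle, which is the kind of reduced peak that fetching the third operand in another cycle implies.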

Note that although vblendvps (the variable-mask form of the blend) takes three register input operands, it actually appears to be split into two uops: one for creating the mask, and one for the actual blend. So it's only a narrow third operand, and it doesn't require a register file access. This also explains why this variant of the instruction has half the throughput and twice the latency.

bronxzv · ‎06-09-2011

>Are just the C++ intrinsics missing, or does SDE not support FMA instructions yet at all (using binary code)?

I'm not sure because I was too lazy to test it without a supporting compiler. AFAIK the SDE documentation refers to the AVX specs including FMA, so FMA3 must be supported, but since the compiler documentation is wrong (and unfixed 4 months after complaining), the SDE documentation may be wrong as well.

>Anyway, I really don't think Ivy Bridge will feature FMA support

Indeed, given that the compiler is not yet released, FMA3 is probably not for Ivy Bridge, though there are slides where Ivy Bridge is qualified as a "Tick+" and that talk about "enhanced AVX support". Maybe the enhancements to AVX are only the FP16 <-> FP32 conversions and the RND generator, but that looks very slim. I hope that at least the L1/L2 cache bandwidth was increased (doubled) so that we get some good speedups with AVX-256 code, at last. Increasing the cache bandwidth is more urgent than FMA IMHO, and a prerequisite for a full-fledged 2 FMA per clock implementation.

TimP · ‎06-09-2011

Ivy Bridge is primarily a shrink to 22nm, not instruction set enhancement. Possibilities include larger L2 per core, or more cores, better thermal vs. clock speed characteristics. No hope has been extended for significant change in L2 bandwidth at this stage. As you say, more L2 bandwidth would help many applications see a gain for AVX.

bronxzv · ‎06-09-2011

>Ivy Bridge is primarily a shrink to 22nm, not instruction set enhancement.

My understanding is that at least the few "Post-32nm processor instructions" will be included in Ivy Bridge, much like every "Tick" in the past has benefited from some ISA enhancements.

capens__nicolas · ‎06-10-2011

Quoting bronxzv

Indeed, given that the compiler is not yet released, FMA3 is probably not for Ivy Bridge, though there are slides where Ivy Bridge is qualified as a "Tick+" and that talk about "enhanced AVX support". Maybe the enhancements to AVX are only the FP16 <-> FP32 conversions and the RND generator, but that looks very slim. I hope that at least the L1/L2 cache bandwidth was increased (doubled) so that we get some good speedups with AVX-256 code, at last. Increasing the cache bandwidth is more urgent than FMA IMHO, and a prerequisite for a full-fledged 2 FMA per clock implementation.

I believe the Tick+ refers to simultaneously introducing 22 nm and FinFET technology, not any kind of architectural change. FP16 and RND instructions are pretty minor extensions that only affect a few components. I'm not expecting much else, since the move to 22 nm + FinFET is a major leap by itself and the whole Tick-Tock idea is to spread the risks. FMA support not only requires the extra operand but also higher cache bandwidth. Ivy Bridge will be a great refresh of Sandy Bridge, but I'm looking forward to what Haswell will bring.

If it includes significant changes to the cache hierarchy, there's actually a spark of hope that it includes gather/scatter support as well, making Haswell the first feature-complete throughput-oriented CPU architecture...

bronxzv · ‎06-10-2011

>I believe the Tick+ refers to simultaneously introducing 22 nm and FinFET technology

I don't think so; the new fab process accounts for the "Tick", not for the "+". New process technologies are always introduced (at Intel) at a new node, for example copper interconnects at 0.13um, strained silicon at 90nm, high-k + metal gates at 45nm, etc.

From what we know, the "+" may be for:
- DX11 and OpenCL support in the iGPU + increased EU count ("next Gen Intel HD Graphics")
- "Next Gen Quick Sync"
- "Ultra-Performance Configurable TDP" with the new "docked mode"
- "Post-32nm processor instructions" (not officially announced for Ivy Bridge but obvious from the name)
- Better performance for AVX-256 code thanks to uarch improvements (my wild guess after seeing a slide mentioning "enhanced AVX acceleration")
- A marketing gimmick; after all, Penryn was qualified as a simple "Tick" with 47 new instructions in the ISA and other uarch changes like the radix-16 divider and the vastly improved shuffle engine

capens__nicolas · ‎06-12-2011

Quoting bronxzv

I don't think so; the new fab process accounts for the "Tick", not for the "+". New process technologies are always introduced (at Intel) at a new node, for example copper interconnects at 0.13um, strained silicon at 90nm, high-k + metal gates at 45nm, etc.

From what we know, the "+" may be for:
- DX11 and OpenCL support in the iGPU + increased EU count ("next Gen Intel HD Graphics")
- "Next Gen Quick Sync"
- "Ultra-Performance Configurable TDP" with the new "docked mode"
- "Post-32nm processor instructions" (not officially announced for Ivy Bridge but obvious from the name)
- Better performance for AVX-256 code thanks to uarch improvements (my wild guess after seeing a slide mentioning "enhanced AVX acceleration")
- A marketing gimmick; after all, Penryn was qualified as a simple "Tick" with 47 new instructions in the ISA and other uarch changes like the radix-16 divider and the vastly improved shuffle engine

FinFET really is a major leap. They could have gone to 22 nm completely without it, but after ten years of R&D decided to introduce it simultaneously with this new node. That's definitely a Tick+ to me. Note that the competition isn't expected to use non-planar transistors till around the 14 nm node. So it's not something to make 22 nm feasible, it's something extra. And it's no small feature. It cuts power consumption in half, or offers 30% higher performance. It practically combines the advantages of two process generations into one; hence Tick+ is a fitting name.

I still seriously doubt it indicates any other change:

- The IGP has evolved independently from the Tick-Tock model before.
- Next gen Quick Sync probably adds support for WebM. A nice addition but hardly major in the bigger picture.
- Configurable TDP is sort of a consequence of FinFET. You can choose between much lower power consumption while on the road, or a nice speed boost while docked.
- There's only talk of "enhanced AVX support", which is likely merely the FP16 and RND instructions.
- While Penryn indeed added 47 new instructions, supporting these merely required changes to the ALUs and decoder. The architecture itself is unaffected. Likewise, super shuffle was a welcome addition, but these sorts of things just required the transistor budget to become feasible.

So Tick+ really seems to indicate an extra large Tick, not a Tick with architectural changes.

bronxzv · ‎06-12-2011

>There's only talk of "enhanced AVX support"

FYI the original slide says "enhanced AVX *acceleration*"

http://www.google.com/#sclient=psy&hl=en&safe=off&source=hp&q=%22enhanced+AVX+acceleration%22

capens__nicolas · ‎06-13-2011

This blog post confirms that Ivy Bridge will merely add FP16 and RND instructions, while Haswell will be a major new architecture with gather and FMA support: http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available/

This also confirms the Tick+ really just refers to FinFET. Which in turn means it's a major technological breakthrough on its own...

bronxzv · ‎06-13-2011

>This also confirms the Tick+ really just refers to FinFET.

I can't see how this very cool Haswell disclosure confirms anything about what "Tick+" is or is not for Ivy Bridge. Now I will be interested to hear what people in the know have to say about this "+", if they are allowed.

I personally hope for some IPC increase for AVX code in Ivy Bridge thanks to uarch improvements. I don't think that increased clock speed alone will explain the 20% performance boost reported by a lot of sites such as this one:
http://www.xbitlabs.com/news/cpu/display/20110203150914_Intel_s_Next_Gen_Ivy_Bridge_to_Offer_20_30_Performance_Boost_Over_Sandy_Bridge_Report.html

capens__nicolas · ‎06-14-2011

Quoting bronxzv

I can't see how this very cool Haswell disclosure confirms anything about what "Tick+" is or is not for Ivy Bridge. Now I will be interested to hear what people in the know have to say about this "+", if they are allowed.

I personally hope for some IPC increase for AVX code in Ivy Bridge thanks to uarch improvements. I don't think that increased clock speed alone will explain the 20% performance boost reported by a lot of sites such as this one:
http://www.xbitlabs.com/news/cpu/display/20110203150914_Intel_s_Next_Gen_Ivy_Bridge_to_Offer_20_30_Performance_Boost_Over_Sandy_Bridge_Report.html

FinFET is a perfectly good explanation for the Tick+ designation. It's by far the biggest novelty for Ivy Bridge we know about. I see little point in looking any further with something like that fully confirmed and detailed. Non-planar technology radically changes semiconductor scaling behavior.

20% performance increase can easily be achieved with a higher Turbo Boost frequency. Note once again that FinFET allows significantly higher switching speeds while keeping power consumption in check. The process shrink should also allow for bigger caches. Since that has happened with every shrink, it's barely noteworthy, but it does help explain how a 20% performance increase is entirely feasible without micro-architecture changes.

By the way, the blog post about Haswell New Instructions pretty much answers your question about FMA throughput: "our floating-point multiply accumulate significantly increases peak flops". In particular it means Haswell will feature two 256-bit FMA units per core.
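To make the "peak flops" claim concrete, here is a back-of-the-envelope sketch. It assumes two 256-bit FMA units per core issuing one FMA each per cycle, counts an FMA as two floating-point operations per lane, and compares against a design that issues one 256-bit multiply plus one 256-bit add per cycle (the Sandy Bridge arrangement, as far as I know). The function names are made up for this illustration.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements per vector register."""
    return vector_bits // element_bits

def peak_flops_fma(fma_units: int, vector_bits: int = 256,
                   element_bits: int = 32) -> int:
    """Peak flops per cycle per core; each FMA lane counts as 2 flops."""
    return fma_units * lanes(vector_bits, element_bits) * 2

def peak_flops_mul_plus_add(vector_bits: int = 256,
                            element_bits: int = 32) -> int:
    """Peak flops per cycle with one multiply unit and one add unit."""
    return lanes(vector_bits, element_bits) * 2

print(peak_flops_fma(2))                    # single precision, two FMA units
print(peak_flops_fma(2, element_bits=64))   # double precision, two FMA units
print(peak_flops_mul_plus_add())            # single precision, mul + add
```

Under these assumptions the peak doubles relative to separate multiply and add units, and it is only reached by code that is almost entirely FMAs.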

bronxzv · ‎06-14-2011

>FinFET is a perfectly good explanation for the Tick+ designation

I think you made your point very clear already. From the link I just provided:

"It is expected that Ivy Bridge CPUs, which will be made using 22nm process technology, will have certain micro-architecture level enhancements along with clock-speed and some other methods to boost performance."

So please understand that some other people have another opinion. "Other methods", for example, may refer to the rumored stacked DRAM; also, we still don't know ATM what is referred to as "enhanced AVX acceleration".

capens__nicolas · ‎06-14-2011

Quoting bronxzv

"It is expected that Ivy Bridge CPUs, which will be made using 22nm process technology, will have certain micro-architecture level enhancements along with clock-speed and some other methods to boost performance."

So please understand that some other people have another opinion. "Other methods", for example, may refer to the rumored stacked DRAM; also, we still don't know ATM what is referred to as "enhanced AVX acceleration".

The stacked DRAM rumor has absolutely no credibility. 30% higher IGP performance can simply be achieved by using 16 EUs instead of 12, and using DDR3-1600 instead of 1333. Stacked DRAM on the other hand would be used to provide a massive increase in bandwidth, and we would have gotten some official confirmation about the use of such technology and its far-reaching consequences by now. The silicon and packaging cost would be substantial. Seriously, the numbers just don't add up. Other technologies offer bandwidth scaling at a lower cost: DDR3 will continue to scale up for a few more years, after which DDR4 will take over. Point-to-point memory topologies and Through-Silicon Via (TSV) technology have been confirmed to be in active development. That's for the 2015 timeframe though, and the need for DRAM-based L4 caches is even further out. For the short term, Ivy Bridge, there's no reason to expect anything radical since we're not running into big issues yet.
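The scaling numbers in the paragraph above are easy to check. The sketch below just computes the two ratios mentioned (EU count and memory data rate); it assumes performance scales at best linearly with each, which is an idealization, and the helper name `scaling` is made up for this example.

```python
def scaling(new: float, old: float) -> float:
    """Ratio of new to old, i.e. the ideal linear speedup."""
    return new / old

eu_ratio = scaling(16, 12)        # execution units: 16 instead of 12
mem_ratio = scaling(1600, 1333)   # DDR3-1600 instead of DDR3-1333

print(f"EU count:     +{(eu_ratio - 1) * 100:.0f}%")
print(f"Memory speed: +{(mem_ratio - 1) * 100:.0f}%")
```

Roughly 33% more EUs fed by roughly 20% more memory bandwidth makes a 30% IGP gain plausible without any exotic memory technology.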

I suspect someone heard about TSV and simply started to fantasize out loud.

And it only takes a single reporter to jot down "enhanced AVX acceleration" when hearing about accelerated half-float support, to make some people think it's something more substantial. Please read Mark Buxton's blog post again, it clearly indicates Ivy Bridge will merely add support for what is called post-32nm instructions in the Programming Reference.

bronxzv · ‎06-14-2011

>And it only takes a single reporter to jot down "enhanced AVX acceleration"

Nope, you got it wrong: as already explained "Enhanced AVX acceleration" is in the original Intel slide:
http://overclockingevent.com/index.php?option=com_content&task=view&id=2071&Itemid=65

>That's for the 2015 timeframe though

huh?

>Please read Mark Buxton's blog post again, it clearly indicates Ivy Bridge will merely add support for what is called post-32nm instructions in the Programming Reference.

His very cool post talks only of the ISA; nothing is said about the implementation or uarch enhancements (or lack thereof) in Ivy Bridge. Since neither you nor I have anything really conclusive so far, I suggest we stop the speculation and wait for upcoming official information.

capens__nicolas · ‎01-23-2012

Greased graphics turn a tick into a tock

I stand corrected. The tick+ refers to the graphics after all. They should have called it a tick++ for the Tri-Gate though. ;-)

I'm still not expecting AVX enhancements beyond the few new instructions, but again I wouldn't mind being wrong.

bronxzv · ‎01-23-2012

I'm still expecting some solid L1D and/or L2 $ bandwidth increase that will speed up AVX-256 code

Hint: see the comment for the 1.25 x Excel 2010 speedup here :
http://wccftech.com/intel-slides-officially-detail-3rd-generation-ivy-bridge-processors-architecture-launch-dates-performance-estimates-i7-2600k/

Intel(R) SDE (emulator) release 3.88 supporting POST-32nm instructions