I have heard that Sandy Bridge won't have FMA implementation.
If that rumor is true, I would really like to know who decided that x86 developers should wait even longer to finally get a fused multiply-add instruction. Is it so useless in real code, or has the marketing department again started doing the engineers' job?
I hereby publicly voice my displeasure at that poor decision.
Below is a response from the Engineering team:
Hi Igor,
Sandy Bridge will not have FMA; it's targeted for a future processor. I apologize for any confusion I (or Intel) caused. In our defense, we did discuss feature timing at the last two Intel Developer Forums (and now, to my embarrassment, I see that the presentation has been removed from the IDF content catalog at http://www.intel.com/idf ; we'll have it up in time for the upcoming IDF on Oct 20). And it's on a separate CPUID feature flag (and in a separate section of the document) in the programming reference.
Anyway, enough of my justifications. There is no intent to 'market' here; we're just engineers. Our strategy going forward is to disclose our directions to the industry early, first to get feedback on the value (and definition) of features like wider vectors, FMA, and new instructions, and second to get software ready as early as possible. From your perspective, is this the right strategy, or are we just confusing people? (And for anyone else reading this: while I appreciate the private mails, I especially like feedback discussions to happen in public forums...) So far I have collected a lot of feedback on the definition and direction, and we hope to provide a public response to it shortly.
It sounds like you are an FMA supporter - beyond the raw FLOPS improvement, do you have any sensitivity to the numerical advantages FMA can provide? There are obviously a lot of tradeoffs in the implementations we can provide, and having some data to understand how you would use it would be very helpful.
Regards,
Mark Buxton
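To make Mark's question about the numerical advantages concrete, here is a minimal sketch in plain C99 (using the standard fma() from math.h; nothing here is Intel-specific): because the product inside a fused multiply-add is never rounded before the addition, fma can recover the exact rounding error of a multiplication, which is the building block of compensated summation and double-double arithmetic.

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + 0x1.0p-27;
    double b = 1.0 + 0x1.0p-27;
    double p   = a * b;          /* product rounded to nearest double       */
    double err = fma(a, b, -p);  /* exact residual a*b - p, single rounding */
    /* err comes out as 2^-54: exactly the bits an unfused multiply drops.  */
    printf("p   = %.17g\nerr = %.17g\n", p, err);
    return 0;
}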
On Itanium, for applications I support, we saw about a 5% difference in performance between FMA disabled (using separate instructions for multiply and for add) and enabled. I don't know how such figures from other platforms would translate to a Sandy Bridge successor, where instruction issue rate is not the limiter it was in past architectures. The big gains for FMA occur in serial code, where the full latency of add and multiply is exposed. We do everything we can to avoid such situations; maybe someone thinks those are the only cases which count. I expect the gain from initial AVX to be much larger than the subsequent gain from FMA, but I don't put FMA in the category of instruction additions which caused more noise than benefit.
On MIPS R8000, there were disastrous situations where applications broke with FMA, for example sqrt(a*a - b*b) producing a NaN or a run-time abort when a == b, because one product is rounded and the other is not. Other than that, FMA usually gives more accurate results, the major objection being these small inconsistencies.
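The sqrt(a*a - b*b) failure mode above is easy to reproduce in C99 on any machine with a correctly rounded fma() (a sketch; build with contraction disabled, e.g. -ffp-contract=off, so the plain expression is not itself fused):

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + 3.0 * 0x1.0p-28;
    double b = a;
    /* Both products rounded identically: the difference is exactly zero. */
    double plain = a * a - b * b;
    /* One product fused (kept exact), the other rounded first: the rounded
       b*b can exceed the exact a*a, so the "zero" comes out negative.     */
    double mixed = fma(a, a, -(b * b));
    printf("sqrt(plain) = %g\n", sqrt(plain));  /* prints 0   */
    printf("sqrt(mixed) = %g\n", sqrt(mixed));  /* prints nan */
    return 0;
}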
I second all of these points. FMA has been implemented in GPUs for years now as a very effective way to double the raw FLOPS performance. In graphics and multimedia code, the occurrence of a multiply followed by an addition is so common that effective performance increases of over 50% are no exception.
It seems to me that the transistor budget required to widen datapaths to 256-bit for AVX is far greater than that for adding FMA support. So I don't fully understand why it has been postponed.
If a compromise was really necessary, I believe supporting the instructions without actual FMA execution units (by splitting them into two operations) would have been a better option. This way software developers can use it early, and when their binaries get run on a future CPU with actual FMA units it would result in a performance boost without code changes.
I'm convinced this applies to additional instructions as well. For instance, scatter/gather operations are still sorely missing, so they should be added as soon as reasonably possible, even if early implementations are not optimal. Developers need functional instructions, not specifications on paper, for fast adoption of new ISA extensions. Popular instructions are then automatically put in the spotlight, so you know what deserves a faster implementation in later processors...
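For what it's worth, the multiply-followed-by-add pattern in question is as simple as this (a generic sketch, not any particular codec's inner loop):

/* The ubiquitous multiply-then-add kernel (a saxpy). With an FMA unit each
   iteration is one fused operation with a single rounding; with separate
   mul and add units it is two operations, so peak FLOPS are halved.      */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}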
If a compromise was really necessary, I believe supporting the instructions without actual FMA execution units (by splitting them into two operations) would have been a better option. This way software developers can use it early, and when their binaries get run on a future CPU with actual FMA units it would result in a performance boost without code changes.
(snip)
Doing a full FMA (one that doubles FLOPS) will indeed be very expensive for us; unfortunately, I can't discuss all the reasons. When we looked at the performance benefit vs. cost for a wide variety of workloads, 256-bit vectors came out on top - at least when the user is able to put the effort into vectorizing their code :). That's why we did wider vectors first.
You have an interesting suggestion about deploying a 2-uop FMA. Would you still support it if the performance were not equal to or better than the alternative mul+add in all cases - i.e., if the additional latency of putting the multiply on the critical path were not compensated by higher throughput in such an architecture? Some codes are sensitive to this effect (or you would have to be really smart about where you could deploy such an FMA).
Regards,
Mark
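The latency effect Mark is asking about can be pinned down with a serial reduction (the cycle counts below are hypothetical, purely for illustration):

/* The accumulator is a loop-carried dependence, so serial code pays the
   latency of whatever sits on that chain, not its throughput.            */
double dot(int n, const double *x, const double *y) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i) {
        /* Separate units: x[i]*y[i] is independent of acc, so only the
           add's latency (say 3 cycles) lands on the chain per iteration.
           A 2-uop FMA puts the multiply on the chain too (say 5 + 3),
           making this loop slower despite the higher peak throughput.    */
        acc += x[i] * y[i];
    }
    return acc;
}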
What? Are you 100% positive about that? Have you run a simulation?
The old code I posted can be a bit shorter now that we have INSERTPS, but still, take a look at this mess again:
; SSE4.1 gather emulation:
    mov      esi, dword ptr [data]
    mov      edx, dword ptr [index]
loop:
    ...
    mov      eax, dword ptr [edx]
    movss    xmm0, [esi + eax]
    mov      eax, dword ptr [edx + 4]
    insertps xmm0, [esi + eax], 10h    ; imm8 bits 5:4 select the destination lane
    mov      eax, dword ptr [edx + 8]
    insertps xmm0, [esi + eax], 20h
    mov      eax, dword ptr [edx + 12]
    insertps xmm0, [esi + eax], 30h
    ...
    jnz      loop

; HYPOTHETICAL gather instruction:
    mov      esi, dword ptr [data]
    mov      edx, dword ptr [index]
loop:
    ...
    gmovps   xmm0, xmmword ptr [edx]
    ...
    jnz      loop
So, is it really faster to fetch, decode, and execute eight instructions taking 37 bytes (more in 64-bit mode) on a critical code path than a single, possibly less than optimally implemented, instruction?
I would really like to understand why a single instruction might be slower.
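For reference, the same SSE4.1 emulation with compiler intrinsics (a sketch equivalent to the assembly above; the imm8 of _mm_insert_ps selects the destination lane in bits 5:4):

#include <smmintrin.h>  /* SSE4.1 */

/* Gather four floats through an index table: one scalar load plus three
   insertps per vector -- the same work as the eight-instruction sequence
   counted above, with the address arithmetic left to the compiler.       */
static __m128 gather4(const float *data, const int *index) {
    __m128 v = _mm_load_ss(&data[index[0]]);
    v = _mm_insert_ps(v, _mm_load_ss(&data[index[1]]), 0x10);  /* lane 1 */
    v = _mm_insert_ps(v, _mm_load_ss(&data[index[2]]), 0x20);  /* lane 2 */
    v = _mm_insert_ps(v, _mm_load_ss(&data[index[3]]), 0x30);  /* lane 3 */
    return v;
}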
Personally I think a strided load would be a waste in the long term. Sooner or later true scatter/gather will be added (*) and the strided load becomes another superseded legacy instruction that you have to drag with you till the end of days.
If that's not a concern, fine, but please consider adding the gather instruction as soon as possible. An early implementation could work just like in Larrabee: using multiple wide loads until all the elements have been 'gathered'. It would definitely be faster than using individual insertps instructions, with a minimal latency equal to that of a movups (for sequential indexes, or indexes all in the same vector).
And it would be useful for a lot more than just matrix transposition. It opens the door for things that aren't even conceivable today. Truly any loop that involves independent iterations could be (automatically) parallelized when we have scatter/gather instructions, no matter how the data is organized, or even in the presence of pointer chasing. So it's not just for HPC or multimedia (although those would benefit massively as well).
If you think that's radical, please realise that the rules for writing high-performance software already changed dramatically when we went multi-core. So you might as well finish what you started and add scatter/gather support, or the CPU will keep losing ground to the GPU. You're nearing the point where people just buy the cheapest CPU available and rather invest in a more powerful GPU to do the 'real work'. The competition (both AMD and NVIDIA) is in a rather sweet spot to take the biggest pieces of the pie in this scenario. So you'd better give people good reasons to keep buying the latest CPUs, by adding instructions to support algorithms that would otherwise run more optimally outside of the CPU. The only reason I care is because I believe it's better for the end user.
Anyhow, I like Igor's suggested syntax for a gather instruction, but I believe the following would be even more powerful:
movups ymm0, [r0+ymm1*4]
Note that I'm using the same mnemonic as a regular load. And in fact I believe it could use the same encoding except for one bit to indicate the use of a vector register as index(es). Also note how r0 is used as a base pointer instead of requiring the implicit use of rsi, and I can scale the indices (all using regular SIB byte encoding).
(*) P.S.: It's really not a question of whether scatter/gather will be necessary. As you continue to widen the vectors, accessing data at different locations becomes a massive bottleneck. AVX can scale up to 1024-bit (32 dwords), so you'd better have flexible and fast ways to get data in and out of such vectors. Neither insertps nor a strided load helps much when an arithmetic operation on up to 32 elements costs one cycle (throughput) while the load costs 32 cycles or more! So it seems obvious to me to architect the scatter/gather instructions sooner rather than later and make them as future-proof as possible.
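To be precise about what the proposed movups ymm0, [r0+ymm1*4] would mean, here are its scalar reference semantics in C, plus the kind of loop it unlocks (the instruction itself is hypothetical; this just pins down the behavior):

/* Reference semantics: eight independent dword loads, one per lane, each
   from base + idx[lane]*4. Any addresses, any order, any duplicates.     */
void gather8_ref(float dst[8], const float *base, const int idx[8]) {
    for (int lane = 0; lane < 8; ++lane)
        dst[lane] = base[idx[lane]];
}

/* The class of loop it vectorizes: independent iterations over data in
   any layout, exactly the case that stays scalar today.                  */
void remap(int n, float *out, const float *table, const int *idx) {
    for (int i = 0; i < n; ++i)
        out[i] = table[idx[i]];
}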
I almost thought I was exaggerating, and then I found this: NVIDIA Introduces NVIDIA Quadro CX...
That's not exactly what I said. It's definitely somewhat useful in certain cases. But it's just going to be entirely superseded by a gather instruction sooner or later, so why bother? It would be yet another stopgap that makes x86 look even messier in the long run. I'd much rather have a gather instruction that in its first implementation doesn't provide much, if any, benefit over insertps, but is entirely flexible and holds the promise of faster implementations over time with no code change required.
That's not exactly what I said. It's definitely somewhat useful in certain cases. But it's just going to be entirely superseded by a gather instruction sooner or later, so why bother? It would be yet another stopgap that makes x86 look even messier in the long run. I'd much rather have a gather instruction that in its first implementation doesn't provide much, if any, benefit over insertps, but is entirely flexible and holds the promise of faster implementations over time with no code change required.
I was simply trying to amplify the point that a strided load instruction (which can be emulated using INSERTPS today) will be possible to emulate via a gather instruction in the future, but not vice versa -- i.e., you cannot emulate gather with a strided load, and most likely a strided load instruction would suffer from the same cache-line-split penalty as all current implementations of unaligned load.
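The asymmetry is easy to state in code: a strided load is just a gather whose index vector happens to be an arithmetic progression, so gather subsumes it, while no fixed stride can reproduce arbitrary indices (reference semantics as in the earlier sketch):

/* A strided load expressed as a degenerate gather: idx[lane] = lane*s.   */
void strided_load8(float dst[8], const float *base, int stride) {
    for (int lane = 0; lane < 8; ++lane)
        dst[lane] = base[lane * stride];
}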