Hello,
Do the Intel Xeon 5345 (Clovertown) and 5560 (Nehalem) support FMA (fused multiply-add) instructions? I am not able to locate this in the Intel Xeon documents.
Many HPC scientific applications have compute algorithms built around "D = A + (B x C)" operations.
On Itanium systems, FMA appears to be supported.
With AVX, the "Intel Advanced Vector Extensions Programming Reference" (document 319433-003), Chapter 6, speaks about FMA instructions being supported.
Why does FMA support exist on Itanium systems but not on the Xeon 5345 and 5560 processors?
Presumably, using FMA instructions for an operation like "D = A + (B x C)" has a significant benefit.
Note: With a decoding rate of one instruction per clock cycle, the peak throughput is two floating-point operations per cycle for FMA instructions, versus only one floating-point operation per clock cycle for individual add or multiply instructions. An FMA for "D = A + (B x C)" combines the floating-point adder (FPA) and the floating-point multiplier (FPM) in a single hardware block for increased performance.
~BR
2 Replies
Quoting - srimks
It will take a lot more data than your "probably" to make a decision on it. The greatest benefits of fma have come on CPUs with a limited FP instruction issue rate relative to memory bandwidth, or with serial dependencies, a limited register set, and high-latency add instructions.
On your Itanium, PPC, or MIPS box, you can build your favorite application with and without fma and report your performance comparisons to us on those machines, which were designed to depend on fma for performance. If you are still convinced that your code can run twice as fast with fma on those machines, I'll accept that it may run 50% faster on a future Xeon with fma.
Speaking of MIPS: they dropped the true fma from standard compilation in favor of an instruction which simply executed the multiply and add in series, with IEEE intermediate rounding, so as to avoid the numerical-consistency issue. It's not easy to get code bases adjusted with appropriate choices between, for example, sqrt(a*a-b*b) and sqrt((a+b)*(a-b)). Note that fma shouldn't speed this one up on a Xeon; it only gives scope for numerical problems.
I do have some code examples where there is a significant advantage to performing operations in the order implied by fma, but that ordering can't be expressed with icc or gcc except by disabling important optimizations. I don't see this as an inherent advantage of fma hardware; it would only increase the importance of making the compiler adhere to sane associativity rules.
Quoting - tim18
Thanks for your insights. A couple of processors, like PowerPC, do support FMA.
