Hello,
Do the Intel Xeon 5345 (Clovertown) and 5560 (Nehalem) support FMA (fused multiply-add) instructions? I am not able to locate this in the Intel Xeon documents.
Many HPC scientific applications have compute algorithms built around "D = A + (B x C)" operations.
On Itanium systems, FMA appears to be supported.
With AVX, the "Intel Advanced Vector Extensions Programming Reference" (document 319433-003), Chapter 6, speaks about FMA instructions being supported.
Why does FMA support exist on Itanium systems but not on the Xeon 5345 and 5560 processors?
Presumably, using FMA instructions for an operation like "D = A + (B x C)" has a significant benefit.
Note: With a decoding rate of one instruction per clock cycle, the peak throughput is two floating-point operations per cycle for FMA instructions, versus only one floating-point operation per clock cycle for individual add or multiply instructions. An FMA for "D = A + (B x C)" combines the floating-point adder (FPA) and the floating-point multiplier (FPM) in a single hardware block for increased performance.
~BR
2 Replies
Quoting - srimks
It will take a lot more data than your "probably" to make a decision on it. The greatest benefits of fma have come on CPUs with a limited FP instruction issue rate relative to memory bandwidth, or with serial dependencies, a limited register set, and high-latency add instructions.
On your Itanium, PPC, or MIPS box, you can build your favorite application with and without fma and report your performance comparisons to us on those machines, which were designed to depend on fma for performance. If you are still convinced that your code can run twice as fast with fma on those machines, I'll accept that it may run 50% faster on a future Xeon with fma.
Speaking of MIPS: they dropped the true fma from standard compilation in favor of an instruction which simply executed the multiply and add in series, with IEEE intermediate rounding, so as to avoid the numerical-consistency issue. It's not easy to get code bases adjusted with appropriate choices between, for example, sqrt(a*a-b*b) and sqrt((a+b)*(a-b)). Note that fma shouldn't speed this one up on a Xeon; it only gives scope for numerical problems.
I do have some code examples where there is a significant advantage to performing operations in the order implied by fma, but that ordering can't be expressed with icc or gcc except by disabling important optimizations. I don't see this as an inherent advantage of fma hardware; it would only increase the importance of making the compiler adhere to sane associativity rules.
Quoting - tim18
Thanks for your insights. A couple of processors, like PowerPC, do support FMA.
