Solved: Integer Unit vs Floating Point Unit

Max_Rafiandy · ‎12-17-2015

I have written a matrix multiplication program using gcc on an Intel Core i7 4790. The program has two variants: the first is an integer-intensive matrix multiplication, the second is a floating-point-intensive matrix multiplication. Each variant has its own sequential and parallel version. My questions are: why does the floating-point version get a better speed-up than the integer version? How many FPUs and integer units does the Intel Core i7 4790 have? Thanks in advance.

McCalpinJohn · ‎12-18-2015

The Core i7 4790 processor uses the Haswell core and supports the AVX2 instruction set.

For floating-point matrix multiplication, all of the operations can be implemented with the Fused Multiply-Add (FMA) instructions. There are two 256-bit FMA units in each core, so for 64-bit floating-point data the processor can perform the equivalent of 16 floating-point operations per cycle (2 functional units * 4 elements per vector * 2 FP operations per instruction), and for 32-bit floating-point data the processor can perform the equivalent of 32 floating-point operations per cycle (2 functional units * 8 elements per vector * 2 FP operations per instruction).
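
The arithmetic above can be written out as a small C helper. This is only a sketch of the counting argument: the function name and parameter names are made up for illustration and do not correspond to any Intel or gcc interface.

```c
#include <assert.h>

/* Peak floating-point operations per cycle for one core.
 * Hypothetical helper: all names here are illustrative only. */
int peak_flops_per_cycle(int fma_units, int vector_bits,
                         int element_bits, int ops_per_instr)
{
    /* The vector width must be a whole number of elements. */
    assert(element_bits > 0 && vector_bits % element_bits == 0);

    int lanes = vector_bits / element_bits;   /* elements per vector */
    return fma_units * lanes * ops_per_instr; /* units * lanes * ops */
}
```

With the numbers quoted above, `peak_flops_per_cycle(2, 256, 64, 2)` gives 16 for 64-bit data and `peak_flops_per_cycle(2, 256, 32, 2)` gives 32 for 32-bit data.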

The Haswell core has no combined multiply/add instructions for integer data types, so the peak performance for packed integers is exactly 1/2 of the peak performance for packed floating-point values. This assumes that you can ignore the top half of each of the multiply results -- under general conditions multiplying two 32-bit integers results in a 64-bit value. Intel processors support a wide variety of instructions for performing multiplication operations on integer data (including packed integer data), but if you need all of the bits of the product then there is (at least) another factor of two reduction in the peak performance.
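
The two points in this paragraph (multiply and add are separate integer steps, and a 32-bit by 32-bit product needs 64 bits) can be illustrated in plain scalar C. This is a sketch only: the function names are hypothetical, and it models the arithmetic per element rather than the packed instructions themselves.

```c
#include <stdint.h>

/* Low 32 bits of a 32x32-bit product: the part that survives when
 * the top half of each multiply result is ignored. */
uint32_t mul_low32(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Full 64-bit product: needs twice the storage per element. */
uint64_t mul_full64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Integer multiply-accumulate written as two separate operations,
 * mirroring the absence of a fused integer multiply/add. */
uint32_t mul_add_low32(uint32_t acc, uint32_t a, uint32_t b)
{
    uint32_t product = mul_low32(a, b); /* step 1: multiply */
    return acc + product;               /* step 2: add (wraps mod 2^32) */
}
```

For example, 100000 * 100000 is 10000000000 as a full 64-bit product, but only 1410065408 survives in the low 32 bits, so the truncated form is safe only when the inputs are known to be small enough.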

It is extremely difficult to implement a floating-point matrix multiplication code that approaches the core's peak performance, but the Intel MKL DGEMM function delivers ~92% of the peak performance for large square matrices. On the integer side the lower pipeline latencies reduce the number of accumulators required, but the complexity of dealing with the various types of multiplication more than outweighs this benefit.
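
The remark about accumulators refers to a dependency chain: each multiply-add into a running sum has to wait for the previous one to finish, so a single sum is limited by instruction latency rather than by the number of functional units. A minimal scalar sketch in C, assuming four independent accumulators (the count is chosen for illustration, and the function name is hypothetical):

```c
#include <stddef.h>

/* Dot product with four independent partial sums, so consecutive
 * multiply-adds do not all depend on the same accumulator. */
double dot_multi_acc(const double *x, const double *y, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;

    /* Main loop: four independent dependency chains per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i)
        acc0 += x[i] * y[i];

    /* Combine the partial sums once at the end. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The Haswell FMA latency is commonly cited as 5 cycles, which with two units suggests roughly 10 independent vector accumulators to keep both busy; treat that figure as a rule of thumb rather than something measured here. Splitting the sum also changes the floating-point rounding order, which is why a compiler will not normally make this transformation on its own without a relaxed floating-point option.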

TimP · ‎12-18-2015

I suppose that, because integer operations have a shallower pipeline (and there is no integer FMA, as John pointed out), they wouldn't benefit as much from vectorization and unrolling, even if the number of parallel resources is the same. It may take intensive study of your code and of the comparisons you make to comment further.

I suppose you could get the FMA acceleration for floating point without having coded it explicitly, but your compiler reports would show its use. I would have thought that, if you're talking about vector or parallel speedup, you would use FMA in the base case as well, so it wouldn't inflate your quoted speedup.

Max_Rafiandy · ‎12-24-2015

Very clear explanation. These answers solved my problem. Thanks, Tim. Thanks, John.