Showing results for

- Intel Community
- Software
- Software Development Topics
- Software Tuning, Performance Optimization & Platform Monitoring
- Integer Unit vs Floating Point Unit

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Max_Rafiandy

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-17-2015
08:38 PM

829 Views

Integer Unit vs Floating Point Unit

1 Solution

McCalpinJohn

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-18-2015
06:08 AM

829 Views

The Core i7 4790 processor uses the Haswell core and supports the AVX2 instruction set.

For floating-point matrix multiplication, all of the operations can be implemented with the Fused Multiply-Add (FMA) instructions. There are two 256-bit FMA units, so for 64-bit floating-point data the processor can perform the equivalent of 16 floating-point operations per cycle (2 functional units * 4 elements per vector * 2 FP operations per instruction), and for 32-bit floating-point data the processor can perform the equivalent of 32 floating-point operations per cycle (2 functional units * 8 elements per vector * 2 FP operations per instruction.

The Haswell core has no combined multiply/add instructions for integer data types, so the peak performance for packed integers is exactly 1/2 of the peak performance for packed floating-point values. This assumes that you can ignore the top half of each of the multiply results -- under general conditions multiplying two 32-bit integers results in a 64-bit value. Intel processors support a wide variety of instructions for performing multiplication operations on integer data (including packed integer data), but if you need all of the bits of the product then there is (at least) another factor of two reduction in the peak performance.

It is extremely difficult to implement a floating-point matrix multiplication code that approaches the core's peak performance, but the Intel MKL DGEMM function delivers ~92% of the peak performance for large square matrices. On the integer side the lower pipeline latencies reduce the number of accumulators required, but the complexity of dealing with the various types of multiplication more than outweighs this benefit.

Link Copied

3 Replies

TimP

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-18-2015
04:08 AM

829 Views

I suppose, due to less pipeline depth of integer operations (and lack of integer fma, as John pointed out below), they wouldn't benefit as much from vectorization and unrolling, even if the number of parallel resources are the same. It may take intensive study of your code and the comparisons you make to comment further.

I suppose you could get the fma acceleration for floating point without having coded it explicitly, but your compiler reports would show its use. I would have thought, if you're talking about vector or parallel speedup, you would use fma in the base case as well, so it wouldn't augment your quoted speedup.

McCalpinJohn

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-18-2015
06:08 AM

830 Views

The Core i7 4790 processor uses the Haswell core and supports the AVX2 instruction set.

For floating-point matrix multiplication, all of the operations can be implemented with the Fused Multiply-Add (FMA) instructions. There are two 256-bit FMA units, so for 64-bit floating-point data the processor can perform the equivalent of 16 floating-point operations per cycle (2 functional units * 4 elements per vector * 2 FP operations per instruction), and for 32-bit floating-point data the processor can perform the equivalent of 32 floating-point operations per cycle (2 functional units * 8 elements per vector * 2 FP operations per instruction.

The Haswell core has no combined multiply/add instructions for integer data types, so the peak performance for packed integers is exactly 1/2 of the peak performance for packed floating-point values. This assumes that you can ignore the top half of each of the multiply results -- under general conditions multiplying two 32-bit integers results in a 64-bit value. Intel processors support a wide variety of instructions for performing multiplication operations on integer data (including packed integer data), but if you need all of the bits of the product then there is (at least) another factor of two reduction in the peak performance.

It is extremely difficult to implement a floating-point matrix multiplication code that approaches the core's peak performance, but the Intel MKL DGEMM function delivers ~92% of the peak performance for large square matrices. On the integer side the lower pipeline latencies reduce the number of accumulators required, but the complexity of dealing with the various types of multiplication more than outweighs this benefit.

Max_Rafiandy

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-24-2015
07:54 PM

829 Views

Very clear explanation. these answers solved my problem. Thanks, Tim. Thanks, John.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.