
Raymond_S_

Beginner


04-24-2016
08:45 PM


Why some scalar intrinsics are faster than packed intrinsics?

Dear all:

I have found that some scalar intrinsics are faster than packed intrinsics. Why is that?

Also, can anyone tell me what the difference is between "scalar", "packed", and "extended packed"?

1 Solution

Kittur_G_Intel

Employee


04-29-2016
06:51 AM


Hi Raymond,

You'll need to attach a reproducer for the case where a scalar intrinsic operation performed better than the vector operation. Are you referring to a masked context where all but the lowest element of the packed operand are masked off? There is no inherent reason for a scalar instruction and the corresponding packed instruction to differ in performance: these intrinsics map 1:1 to instructions, so the two should perform the same.

That said, a scalar instruction processes only one data element at a time (an integer, float, double, etc.), so a scalar processor executes a single instruction on a single datum. A vector instruction, on the other hand, operates on a one-dimensional array of data called a "vector", using the SIMD concept: single instruction, multiple data. A single instruction operates on the entire vector at once. The operands to the instruction are complete vectors rather than single elements, which reduces fetch and decode bandwidth because fewer instructions need to be fetched.

See the attached foil showing scalar and vector operation. If, say, there were 8 elements in the vector, you could expect up to an 8x speedup over the scalar version, since the operation that the scalar for loop performs once per element is applied to all 8 elements by a single instruction (the entire vector is operated on in parallel; the SIMD approach).

You can think of "packed" as a vector data type: a vector is a collection of n-bit words, where the n-bit word is the basic data type (the scalar type of a traditional scalar processor), and present-day processors provide vector operations on it. SSE registers are 128 bits wide, so "packed" means several values of the same data type stored in one vector, such as packed single-precision floating point (4 x 32-bit floating-point numbers stored as one 128-bit value). A scalar operation acts only on the least-significant data element (bits 0-31), while a packed operation computes all four elements in parallel.

Hope the above helps.

Kittur


3 Replies


Kittur_G_Intel

Employee


05-03-2016
09:16 AM


Did my previous message answer your question? Let me know if you need any further clarification; much appreciated.

Regards,

Kittur

alvaro__laurent

Beginner


04-08-2020
09:43 AM


Hi Kittur,

Actually, you don't explain what "extended packed" means.

I cannot find this information anywhere.

Thanks in advance !


For more complete information about compiler optimizations, see our Optimization Notice.