Beginner
438 Views

## Why are some scalar intrinsics faster than packed intrinsics?

Dear all:

I found that some scalar intrinsics are faster than packed intrinsics. Why?

Can anyone tell me what the difference is between "scalar", "packed" and "extended packed"?

1 Solution Employee
438 Views

Hi Raymond,
You'll need to attach a reproducer if you are saying that a scalar intrinsic operation performed better than the vector operation. Are you referring to a packed operation in a masked context where all but the lowest element are masked off? There is no good reason for a scalar instruction and the corresponding packed instruction to differ in speed: the intrinsics map 1:1 to instructions, so the performance should be the same.

That said, a scalar instruction processes only one data element at a time (for example an integer, float or double). A scalar processor therefore follows the single instruction, single data model. A vector instruction, on the other hand, can operate on one-dimensional arrays of data called "vectors". It uses the concept of SIMD (single instruction, multiple data), meaning a single instruction operates on the entire vector at once. The operands of the instruction are complete vectors instead of single elements. This reduces the fetch and decode bandwidth needed, because fewer instructions are fetched.

See the attached foil showing scalar and vector operation. If, say, there are 8 elements in the vector, you can think of a speedup of up to 8x compared to the scalar operation, since the same operation (assignment) that the for loop performs on each element in turn is applied to all 8 elements by one single instruction (the entire vector is operated on in parallel using a vector instruction, the SIMD approach).
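As a rough sketch of the loop on the foil, here is the same element-wise addition written once as a scalar loop and once with packed SSE intrinsics. The function names `add_scalar` and `add_packed` are made up for illustration, and the sketch assumes an x86 compiler that provides `<xmmintrin.h>`. SSE holds 4 floats per register, so 8 elements take two packed additions here; a wider vector unit would cover all 8 at once.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar version: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* Packed version: each _mm_add_ps adds four floats at once. */
void add_packed(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    /* Scalar tail for any leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce the same results; the packed one simply issues fewer add instructions. Whether that turns into a real speedup depends on the compiler and the hardware, so it is worth measuring.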

You can think of "packed" as a vector data type: a vector is a collection of n-bit words, as many as the vector length, where the n-bit word is the basic data type (the scalar type in a traditional scalar processor). Present-day processors support such vector operations. SSE registers are 128 bits wide, so "packed" means several values of the same data type put into one vector, for example packed single-precision floating point (4 x 32-bit floating-point numbers stored as one 128-bit value). A scalar operation only operates on the least-significant data element (bits 0 to 31), while a packed operation computes all four elements in parallel.
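To make the scalar/packed distinction concrete, here is a minimal sketch using the SSE intrinsics `_mm_add_ps` (packed) and `_mm_add_ss` (scalar). The wrapper names `add_ps_lanes` and `add_ss_lanes` are hypothetical helpers for illustration only, and an x86 compiler with `<xmmintrin.h>` is assumed.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed add: all four 32-bit lanes of a and b are summed. */
void add_ps_lanes(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar add: only lane 0 (bits 0 to 31) is summed;
   lanes 1 to 3 of the result are copied from a. */
void add_ss_lanes(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version gives {11, 22, 33, 44}, while the scalar version gives {11, 2, 3, 4}.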

Hope the above helps.

Kittur

3 Replies Employee
438 Views

Hi Raymond,
Did my previous reply answer your question? Let me know if you need any further clarification. Much appreciated.
Regards,
Kittur

Beginner
438 Views

Hi Kittur,

Actually, you didn't explain what "extended packed" means.

I cannot find any information about it.