Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Single load operation

unrue
Beginner
433 Views

Dear Intel developers,

Using the __m128 type, what is the best and fastest way to fill it one float at a time, starting from an array of floats? Thanks.

5 Replies
TimP
Honored Contributor III
433 Views

Did you consult the Intrinsics Guide, e.g. https://software.intel.com/sites/landingpage/IntrinsicsGuide/ ?

If you don't want _mm_set_ps or _mm_setr_ps, you will need to explain your requirements.  Depending on what you have in mind, the C++ or possibly the ISA forum may be appropriate.

These intrinsics will choose appropriate instructions according to your compiler architecture switch setting.  Supposing that you do want to change just one 32-bit field, you can set the other fields to the current values, and check whether the compiler optimizes away redundant operations.
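As a minimal sketch of the suggestion above, assuming the four floats are still available in an ordinary array: _mm_setr_ps takes its arguments lowest lane first, _mm_set_ps takes them highest lane first, and a single lane can be changed by passing the current values for the other three. The helper names (from_array_setr, from_array_set, from_array_with) are hypothetical and only serve the illustration.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_set_ps, _mm_setr_ps */

/* Hypothetical helper: build a vector with lane 0 = a[0] ... lane 3 = a[3]. */
static inline __m128 from_array_setr(const float *a)
{
    return _mm_setr_ps(a[0], a[1], a[2], a[3]);
}

/* Same result with _mm_set_ps, whose arguments run from lane 3 down to lane 0. */
static inline __m128 from_array_set(const float *a)
{
    return _mm_set_ps(a[3], a[2], a[1], a[0]);
}

/* Hypothetical helper: like from_array_setr, but lane idx (0..3) receives x
   instead of a[idx]; the other lanes are set to their current values. */
static inline __m128 from_array_with(const float *a, int idx, float x)
{
    return _mm_setr_ps(idx == 0 ? x : a[0],
                       idx == 1 ? x : a[1],
                       idx == 2 ? x : a[2],
                       idx == 3 ? x : a[3]);
}
```

Whether the compiler turns this into a single shuffle or insert, or into something slower, depends on the compiler and the architecture switch, so it is worth checking the generated assembly.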

unrue
Beginner
433 Views

Hi Tim,

Yes, I use the Intel Intrinsics Guide frequently, but so far I haven't found a solution. Starting from a __m128 holding 4 floats, e.g. [1, 2, 3, 4], I would like to set a single float at a time without modifying the others. With a maskload, for example, I can set a single element, but that instruction sets the other elements to zero. It seems, also from your reply, that a partial solution is to rewrite all the elements with their current values, except for the one to modify.

McCalpinJohn
Honored Contributor III
433 Views

It is easy enough to write the four values to consecutive memory locations using a 4-element dummy array, then perform a 128-bit load to get them all back into a vector register.   I find this more convenient than figuring out some of the more obscure intrinsic functions.
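A minimal sketch of this approach, assuming the starting point is an existing __m128 and one lane has to change. The helper name replace_lane_via_array is hypothetical; the unaligned store and load are used so that the scratch array needs no special alignment.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_storeu_ps, _mm_loadu_ps */

/* Hypothetical helper: return v with lane idx (0..3) replaced by x,
   going through a 4-element scratch array in memory. */
static inline __m128 replace_lane_via_array(__m128 v, int idx, float x)
{
    float tmp[4];              /* 4-element dummy array */
    _mm_storeu_ps(tmp, v);     /* write all four lanes to consecutive memory */
    tmp[idx] = x;              /* overwrite only the lane of interest */
    return _mm_loadu_ps(tmp);  /* one 128-bit load brings all four back */
}
```

For example, replace_lane_via_array(v, 2, 30.0f) leaves lanes 0, 1 and 3 of v unchanged and puts 30.0f in lane 2.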
 

jimdempseyatthecove
Honored Contributor III
433 Views

Try

masked load of target (same mask as for source)
masked load of source
XOR the masked load of target with target (zeroing out the field of interest)
OR the masked load of source into the target (which now has the field of interest zeroed out)

Jim Dempsey
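The steps above can be sketched as follows. This is an illustration under assumptions, not necessarily the exact code intended: both target and source are taken to be in registers already, and the masked loads (which as intrinsics require AVX, e.g. _mm_maskload_ps) are emulated with _mm_and_ps so that the sketch needs only SSE2. The helper names lane_mask and replace_lane_masked are hypothetical.

```c
#include <emmintrin.h>  /* SSE2: _mm_setr_epi32, _mm_castsi128_ps (includes SSE) */

/* Hypothetical helper: all bits set in lane idx (0..3), zero in the other lanes. */
static inline __m128 lane_mask(int idx)
{
    return _mm_castsi128_ps(_mm_setr_epi32(idx == 0 ? -1 : 0,
                                           idx == 1 ? -1 : 0,
                                           idx == 2 ? -1 : 0,
                                           idx == 3 ? -1 : 0));
}

/* Hypothetical helper: copy lane idx of source into target, keep the other lanes. */
static inline __m128 replace_lane_masked(__m128 target, __m128 source, int idx)
{
    __m128 mask    = lane_mask(idx);
    __m128 t_field = _mm_and_ps(target, mask);    /* stands in for the masked load of target */
    __m128 s_field = _mm_and_ps(source, mask);    /* stands in for the masked load of source */
    __m128 cleared = _mm_xor_ps(target, t_field); /* zero the field of interest in target */
    return _mm_or_ps(cleared, s_field);           /* merge the source field into the gap */
}
```

The XOR works because a lane XORed with itself is all zero bits, while the lanes where the mask is zero are XORed with zero and stay unchanged.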
 

Bernard
Valued Contributor I
433 Views

I usually use what John described in his response. As additional advice, you may want to align your float array on a 16-byte boundary before loading it into an XMM register.
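A minimal sketch of the alignment advice, assuming a C11 compiler so that alignas from <stdalign.h> is available. _mm_load_ps requires a 16-byte-aligned address, while _mm_loadu_ps accepts any address. The helper names pack4_aligned, is_aligned16 and load4 are hypothetical.

```c
#include <stdalign.h>   /* C11 alignas */
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps */

/* Hypothetical helper: write four floats to a 16-byte-aligned scratch array,
   then bring them into an XMM register with one aligned 128-bit load. */
static inline __m128 pack4_aligned(float a, float b, float c, float d)
{
    alignas(16) float tmp[4] = {a, b, c, d};
    return _mm_load_ps(tmp);
}

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static inline int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: aligned load when the address allows it,
   unaligned load otherwise. */
static inline __m128 load4(const float *p)
{
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}
```

The runtime check in load4 is only there to make the requirement visible; when the array is declared with alignas(16), calling _mm_load_ps directly is enough.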

 
