Single load operation

unrue · ‎02-03-2016

Dear Intel developers,

Using the __m128 type, what is the best and fastest way to fill it one float at a time, starting from an array of floats? Thanks.

TimP · ‎02-03-2016

Did you consult the intrinsics guide, e.g. https://software.intel.com/sites/landingpage/IntrinsicsGuide/ ?

If you don't want _mm_set_ps or _mm_setr_ps, you will need to explain your requirements. Depending on what you have in mind, the C++ or possibly the ISA forum may be appropriate.

The compiler will choose appropriate instructions for these intrinsics according to your architecture switch setting. Supposing that you do want to change just one 32-bit field, you can set the other fields to their current values and check whether the compiler optimizes away the redundant operations.
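As a rough illustration of the suggestion above, here is a minimal sketch that uses only baseline SSE intrinsics. The helper names load4_from_array and replace_lane2 are hypothetical, chosen for this example; the choice of lane 2 is arbitrary.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_setr_ps, _mm_shuffle_ps, _mm_cvtss_f32 */

/* Hypothetical helper: build a vector from four array elements.
   _mm_setr_ps takes its arguments in memory order, so src[0] lands in lane 0. */
static __m128 load4_from_array(const float *src)
{
    return _mm_setr_ps(src[0], src[1], src[2], src[3]);
}

/* Hypothetical helper: change lane 2 only by re-setting lanes 0, 1 and 3
   to their current values. */
static __m128 replace_lane2(__m128 v, float x)
{
    /* _mm_cvtss_f32 reads lane 0; the shuffles broadcast lane 1 or 3 first. */
    float l0 = _mm_cvtss_f32(v);
    float l1 = _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)));
    float l3 = _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)));
    return _mm_setr_ps(l0, l1, x, l3);
}
```

Whether the compiler collapses the extract-and-rebuild sequence into a single insert or shuffle depends on the compiler and the architecture switch, so it is worth inspecting the generated assembly.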

unrue · ‎02-03-2016

Hi Tim,

Yes, I use the Intel Intrinsics Guide frequently, but so far I haven't found a solution. Starting from a __m128 holding 4 floats [1, 2, 3, 4], I would like to set a single float at a time without modifying the others. With a maskload, for example, I can set a single element, but that instruction sets the others to zero. It seems, also from your reply, that the partial solution is to rewrite all the elements with the same values except the one to modify.

McCalpinJohn · ‎02-03-2016

It is easy enough to write the four values to consecutive memory locations using a 4-element dummy array, then perform a 128-bit load to get them all back into a vector register. I find this more convenient than figuring out some of the more obscure intrinsic functions.
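A minimal sketch of this round trip through memory, assuming plain SSE and unaligned load/store so the temporary array needs no special alignment. The helper names pack4_via_memory and set_lane_via_memory are hypothetical.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_storeu_ps */

/* Hypothetical helper: write four scalars to consecutive memory,
   then bring them back with one 128-bit load. */
static __m128 pack4_via_memory(float a, float b, float c, float d)
{
    float tmp[4];
    tmp[0] = a;
    tmp[1] = b;
    tmp[2] = c;
    tmp[3] = d;
    return _mm_loadu_ps(tmp);
}

/* Hypothetical helper: change a single lane by spilling the vector,
   patching one element, and reloading. */
static __m128 set_lane_via_memory(__m128 v, int lane, float x)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);   /* spill all four lanes */
    tmp[lane & 3] = x;       /* overwrite one element; index kept in 0..3 */
    return _mm_loadu_ps(tmp);
}
```

For example, set_lane_via_memory(v, 2, 9.0f) replaces only lane 2 and leaves the other three lanes as they were.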

jimdempseyatthecove · ‎02-03-2016

Try

masked load of the target (same mask as for the source)
masked load of the source
xor the masked load of the target with the target (zeroing out the field of interest)
or the masked load of the source into the target (which now has the field of interest zeroed out)
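The steps above can be sketched as follows. Note the substitution: instead of a masked-load instruction, this sketch applies the mask to values already in registers with _mm_and_ps, which needs only SSE/SSE2. The helper name replace_lane_masked is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: _mm_setr_epi32, _mm_castsi128_ps (pulls in SSE too) */

/* Hypothetical helper: copy one lane from source into target,
   leaving the other three lanes of target unchanged. */
static __m128 replace_lane_masked(__m128 target, __m128 source, int lane)
{
    /* All-ones bits in the selected lane, zero elsewhere. */
    __m128 mask = _mm_castsi128_ps(_mm_setr_epi32(
        -(lane == 0), -(lane == 1), -(lane == 2), -(lane == 3)));

    __m128 t_field = _mm_and_ps(target, mask);   /* "masked load" of target */
    __m128 s_field = _mm_and_ps(source, mask);   /* "masked load" of source */
    __m128 cleared = _mm_xor_ps(target, t_field); /* zero the field of interest */
    return _mm_or_ps(cleared, s_field);           /* drop the new value in */
}
```

To write a scalar x into lane 1, one option is replace_lane_masked(v, _mm_set1_ps(x), 1).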

Jim Dempsey

Bernard · ‎03-05-2016

I usually use what John described in his response. As additional advice, you may want to align your float array on a 16-byte boundary before loading it into an XMM register.
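A minimal sketch of the aligned variant, assuming a C11 compiler for _Alignas. The helper name fill_and_load_aligned is hypothetical. With a 16-byte-aligned buffer, the aligned load _mm_load_ps can be used in place of _mm_loadu_ps.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps */

/* Hypothetical helper: copy four floats, one at a time, into a
   16-byte-aligned buffer and load them with a single aligned load. */
static __m128 fill_and_load_aligned(const float *src)
{
    _Alignas(16) float buf[4];  /* C11 alignment specifier */
    for (int i = 0; i < 4; ++i)
        buf[i] = src[i];        /* src itself may be unaligned */
    return _mm_load_ps(buf);    /* requires a 16-byte-aligned address */
}
```

Passing an address that is not 16-byte aligned to _mm_load_ps can fault, which is why the alignment is declared on the buffer itself.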