Dear Intel developers,
given a __m128 variable, what is the best and fastest way to fill it one float at a time, starting from an array of floats? Thanks.
- Tags:
- Parallel Computing
Did you consult the Intrinsics Guide, e.g. https://software.intel.com/sites/landingpage/IntrinsicsGuide/ ?
If you don't want _mm_set_ps or _mm_setr_ps, you will need to explain your requirements. Depending on what you have in mind, the C++ or possibly the ISA forum may be appropriate.
These intrinsics will choose appropriate instructions according to your compiler's architecture switch setting. Supposing that you do want to change just one 32-bit field, you can set the other fields to their current values and check whether the compiler optimizes away the redundant operations.
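As a rough sketch of the two suggestions above (the helper names from_array_setr, from_array_set and replace_field are made up for illustration, and the current values are assumed to be available in a plain float array), something like this could be tried in C:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_set_ps, _mm_setr_ps */

/* _mm_setr_ps takes its arguments lowest lane first,
   so lane 0 receives a[0]. */
static __m128 from_array_setr(const float a[4])
{
    return _mm_setr_ps(a[0], a[1], a[2], a[3]);
}

/* _mm_set_ps takes its arguments highest lane first,
   so the array is passed in reverse to get the same vector. */
static __m128 from_array_set(const float a[4])
{
    return _mm_set_ps(a[3], a[2], a[1], a[0]);
}

/* Change one 32-bit field: every other field is set to its
   current value (taken from cur), only the chosen lane gets x. */
static __m128 replace_field(const float cur[4], int lane, float x)
{
    assert(lane >= 0 && lane < 4);
    return _mm_setr_ps(lane == 0 ? x : cur[0],
                       lane == 1 ? x : cur[1],
                       lane == 2 ? x : cur[2],
                       lane == 3 ? x : cur[3]);
}
```

Whether the compiler turns replace_field into a single insert or shuffle depends on the compiler and the architecture switch, so it is worth looking at the generated assembly.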
In reply to Tim P.:
Hi Tim,
Yes, I use the Intel Intrinsics Guide frequently, but so far I have not found a solution. Starting from a __m128 holding four floats, [1, 2, 3, 4], I would like to set one float at a time without modifying the others. With a maskload, for example, I can set a single element, but that instruction sets the other elements to zero. It seems, also from your reply, that the partial solution is to rewrite all the elements with their current values, except for the one to modify.
It is easy enough to write the four values to consecutive memory locations using a 4-element dummy array, then perform a 128-bit load to get them all back into a vector register. I find this more convenient than figuring out some of the more obscure intrinsic functions.
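A minimal sketch of that idea in C, assuming plain SSE and unaligned loads and stores; the helper names pack4 and set_lane_via_memory are hypothetical:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_storeu_ps */

/* Write four values one at a time into a dummy array,
   then bring them back with a single 128-bit load. */
static __m128 pack4(float a, float b, float c, float d)
{
    float tmp[4];
    tmp[0] = a;
    tmp[1] = b;
    tmp[2] = c;
    tmp[3] = d;
    return _mm_loadu_ps(tmp);
}

/* Same trick for changing one element of an existing vector:
   spill it to the dummy array, overwrite one slot, reload. */
static __m128 set_lane_via_memory(__m128 v, int lane, float x)
{
    float tmp[4];
    assert(lane >= 0 && lane < 4);
    _mm_storeu_ps(tmp, v);
    tmp[lane] = x;
    return _mm_loadu_ps(tmp);
}
```

With optimization on, the compiler may or may not keep the round trip through memory, so timing it against the intrinsic-only versions is the only way to know which is faster in a given case.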
Try:
masked load of the target (same mask as for the source)
masked load of the source
XOR the masked load of the target with the target (zeroing out the field of interest)
OR the masked load of the source into the target (which now has the field of interest zeroed out)
Jim Dempsey
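The four steps above could be sketched in C roughly as follows. This is an assumption-laden illustration, not the poster's code: both operands are taken to be in registers already, and an AND with a lane mask stands in for the masked load so that the sketch builds with baseline SSE2 (a real masked load such as _mm_maskload_ps needs AVX). The names lane_mask and replace_lane_masked are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also provides the SSE float intrinsics */

/* All bits set in the chosen 32-bit lane, zero in the other three. */
static __m128 lane_mask(int lane)
{
    assert(lane >= 0 && lane < 4);
    return _mm_castsi128_ps(_mm_setr_epi32(lane == 0 ? -1 : 0,
                                           lane == 1 ? -1 : 0,
                                           lane == 2 ? -1 : 0,
                                           lane == 3 ? -1 : 0));
}

/* Copy one lane of source into target, leaving the other lanes alone. */
static __m128 replace_lane_masked(__m128 target, __m128 source, int lane)
{
    __m128 mask    = lane_mask(lane);
    __m128 t_field = _mm_and_ps(target, mask);    /* masked "load" of target */
    __m128 s_field = _mm_and_ps(source, mask);    /* masked "load" of source */
    __m128 cleared = _mm_xor_ps(target, t_field); /* zero the field of interest */
    return _mm_or_ps(cleared, s_field);           /* drop the new value in */
}
```

For example, with target [1, 2, 3, 4], source [9, 9, 9, 9] and lane 2, the XOR leaves [1, 2, 0, 4] and the final OR gives [1, 2, 9, 4].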
I usually do what John described in his response. As additional advice, you may want to align your float array on a 16-byte boundary before loading it into an XMM register.
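A small sketch of the aligned variant, assuming a C11 compiler (for _Alignas); the helper names pack4_aligned and load_aligned4 are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps requires a 16-byte aligned address */

/* Fill a 16-byte aligned dummy array one float at a time,
   then fetch it with the aligned 128-bit load. */
static __m128 pack4_aligned(float a, float b, float c, float d)
{
    _Alignas(16) float buf[4];
    buf[0] = a;
    buf[1] = b;
    buf[2] = c;
    buf[3] = d;
    return _mm_load_ps(buf);
}

/* Aligned load from caller-owned memory; the assert catches
   a misaligned pointer in debug builds. */
static __m128 load_aligned4(const float *p)
{
    assert(((uintptr_t)p & 15u) == 0);
    return _mm_load_ps(p);
}
```

If the alignment of the array cannot be guaranteed, _mm_loadu_ps is the safe choice, since _mm_load_ps on a misaligned address can fault.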