Hi there,
After reading a lot of material, I still cannot find out how to broadcast 4 float variables into the 4 lanes of a vector register on MIC.
e.g. float array[4]={a,b,c,d};
How can I load it into a vector register as {aaaa,bbbb,cccc,dddd} using one intrinsic?
If I use _mm512_mask_blend_ps, it takes 4 intrinsics:
__forceinline __m512 gather16float_4float(const float a, const float b, const float c, const float d)
{
    __m512 v = _mm512_set1_ps(a);                           // all 16 elements = a
    v = _mm512_mask_blend_ps(0x00f0, v, _mm512_set1_ps(b)); // elements 4-7   = b
    v = _mm512_mask_blend_ps(0x0f00, v, _mm512_set1_ps(c)); // elements 8-11  = c
    v = _mm512_mask_blend_ps(0xf000, v, _mm512_set1_ps(d)); // elements 12-15 = d
    return v;
}
Is there a faster method?
All the broadcast intrinsics I found work in 128-bit units. Is there any intrinsic that works across the 4 lanes?
Could you please help me with this?
Thanks.
Another question:
How can I swizzle a register like this:
from: {abcd,efgh,ijkl,mnop}
to: {abcd,abcd,abcd,abcd}
Once again, I have read the "User and Reference Guide for the Intel® C++ Compiler 15.0", but found nothing.
Either of these can be done with the VPERMD instruction, accessed by the _mm512_permutevar_epi32 and/or _mm512_mask_permutevar_epi32 intrinsics. The name "permute" is slightly misleading -- indices are allowed to be repeated, so any of the 16 32-bit fields of the input can be copied to any number of output fields.
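To make the index patterns concrete, here is a minimal scalar sketch of what a 16 x 32-bit permute computes. It is a portable model, not the intrinsic itself: the names permute_model, indices_broadcast_each and indices_repeat_first_lane are hypothetical, and the model assumes each output field is selected by the low 4 bits of its index.

```cpp
#include <array>
#include <cstdint>

using Vec16 = std::array<float, 16>;
using Idx16 = std::array<std::uint32_t, 16>;

// Scalar model (assumption, not the real intrinsic): output field i copies
// the source field selected by the low 4 bits of idx[i]. Indices may repeat.
Vec16 permute_model(const Idx16& idx, const Vec16& src) {
    Vec16 out{};
    for (int i = 0; i < 16; ++i) {
        out[i] = src[idx[i] & 15u];
    }
    return out;
}

// Indices 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3:
// {a,b,c,d,...} becomes {aaaa,bbbb,cccc,dddd}.
Idx16 indices_broadcast_each() {
    Idx16 idx{};
    for (std::uint32_t i = 0; i < 16; ++i) {
        idx[i] = i / 4;
    }
    return idx;
}

// Indices 0,1,2,3,0,1,2,3,...:
// {abcd,efgh,ijkl,mnop} becomes {abcd,abcd,abcd,abcd}.
Idx16 indices_repeat_first_lane() {
    Idx16 idx{};
    for (std::uint32_t i = 0; i < 16; ++i) {
        idx[i] = i % 4;
    }
    return idx;
}
```

On the hardware, the same index vectors would presumably be built once as an integer register and passed to the permute intrinsic named above; only source fields 0-3 are ever selected by either pattern.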
The total number of instructions required depends on where the original four floats are located and whether you have to count the instruction that sets up the register of indices. If the 4 32-bit input values are in the first 128 bits of a cache-line-aligned memory location, then the VPERMD instruction can read them directly from memory. The contents of the upper 12 32-bit fields would be irrelevant, since you would not use indices 4-15 to perform either of these swizzles.
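The point about the upper 12 fields can be checked with a small scalar stand-in. This is only a sketch under stated assumptions: CacheLine, swizzle_from_memory and upper_fields_irrelevant are hypothetical names, and the loop models the permute reading its source from a 64-byte-aligned location rather than calling any real intrinsic.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 64-byte-aligned source buffer; only f[0..3] hold real inputs.
struct alignas(64) CacheLine {
    float f[16];
};

// Scalar stand-in for a permute whose source operand comes from memory.
void swizzle_from_memory(const CacheLine& src,
                         const std::uint32_t (&idx)[16],
                         float (&out)[16]) {
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = src.f[idx[i] & 15u];
    }
}

// Returns true when two buffers that differ only in fields 4-15 give the
// same result, because the index vector never selects those fields.
bool upper_fields_irrelevant() {
    const std::uint32_t idx[16] = {0, 0, 0, 0, 1, 1, 1, 1,
                                   2, 2, 2, 2, 3, 3, 3, 3};
    CacheLine clean{{1.0f, 2.0f, 3.0f, 4.0f}};  // fields 4-15 are zero
    CacheLine dirty{{1.0f, 2.0f, 3.0f, 4.0f}};
    for (std::size_t i = 4; i < 16; ++i) {
        dirty.f[i] = -1.0f;                     // arbitrary junk in upper fields
    }
    float out_clean[16];
    float out_dirty[16];
    swizzle_from_memory(clean, idx, out_clean);
    swizzle_from_memory(dirty, idx, out_dirty);
    for (std::size_t i = 0; i < 16; ++i) {
        if (out_clean[i] != out_dirty[i]) {
            return false;
        }
    }
    return true;
}
```

Whether the index register setup counts toward the total depends on whether it can be hoisted out of the loop, as noted above.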
