How to broadcast 4 floats into 4 lanes?

wang_p_1 · ‎03-07-2016

Hi there,

After reading a lot of material, I still cannot figure out how to broadcast 4 float variables into the 4 lanes of a vector register on MIC.

e.g. float array[4]={a,b,c,d};

How can I load this into a vector register as {aaaa,bbbb,cccc,dddd} using one intrinsic?

If I use _mm512_mask_blend_ps, it takes 4 intrinsics.

__forceinline __m512 gather16float_4float(const float a, const float b, const float c, const float d)
{
    __m512 v = _mm512_set1_ps(a);                            // all 16 elements = a
    v = _mm512_mask_blend_ps(0x00f0, v, _mm512_set1_ps(b));  // elements 4-7    = b
    v = _mm512_mask_blend_ps(0x0f00, v, _mm512_set1_ps(c));  // elements 8-11   = c
    v = _mm512_mask_blend_ps(0xf000, v, _mm512_set1_ps(d));  // elements 12-15  = d
    return v;
}

Is there a faster method?

All of the intrinsics I found deal with 128-bit broadcasts. Are there any intrinsics that work across the 4 lanes?

Could you please help me with this?

Thanks.

wang_p_1 · ‎03-08-2016

Another question:

How to swizzle a register like this:

from: {abcd,efgh,ijkl,mnop}

to: {abcd,abcd,abcd,abcd}

Once again, I have read the "User and Reference Guide for the Intel® C++ Compiler 15.0", but found nothing.

McCalpinJohn · ‎03-08-2016

Either of these can be done with the VPERMD instruction, accessed by the _mm512_permutevar_epi32 and/or _mm512_mask_permutevar_epi32 intrinsics. The name "permute" is slightly misleading -- indices are allowed to be repeated, so any of the 16 32-bit fields of the input can be copied to any number of output fields.
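To make the point about repeated indices concrete, here is a minimal scalar sketch of what a 16 x 32-bit permute does with the two patterns asked about above. The names permute16_ref, IDX_AAAA_BBBB and IDX_ABCD_ABCD are made up for this illustration, and the loop only models the data movement; it is not the intrinsic itself.

```c
#include <assert.h>
#include <stdint.h>

/* Output element i is copied from input element idx[i]; indices may repeat. */
static const int32_t IDX_AAAA_BBBB[16] = {0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3};
static const int32_t IDX_ABCD_ABCD[16] = {0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3};

/* Scalar model of a full 16-element permute (hypothetical helper). */
static void permute16_ref(const int32_t idx[16], const float in[16], float out[16])
{
    for (int i = 0; i < 16; ++i) {
        assert(idx[i] >= 0 && idx[i] < 16);  /* this model only accepts indices 0..15 */
        out[i] = in[idx[i]];
    }
}
```

With in = {a,b,c,d,...}, the first index table yields {aaaa,bbbb,cccc,dddd} and the second yields {abcd,abcd,abcd,abcd}. Neither table refers to input elements 4-15.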

The total number of instructions required depends on where the original four floats are located and whether you have to count the instruction that sets up the register of indices. If the 4 32-bit input values are in the first 128 bits of a cache-line-aligned memory location, then the VPERMD instruction can read them directly from memory. The contents of the upper 12 32-bit fields would be irrelevant, since you would not use indices 4-15 to perform either of these swizzles.
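A rough sketch of the memory-source case, under some assumptions: it uses the AVX-512F spelling _mm512_permutexvar_epi32, which is what current compilers expose, whereas on the MIC toolchain the intrinsic name given above would presumably be used instead. The helper name broadcast4_to_lanes is hypothetical, and a plain loop stands in when the compiler does not define __AVX512F__. Whether the load is actually folded into the permute as a memory operand is up to the compiler.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper. src must be 64-byte aligned and point to 16 readable
   floats; only src[0..3] (a, b, c, d) influence the result. */
static void broadcast4_to_lanes(const float *src, float out[16])
{
    static const int32_t idx[16] = {0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3};
    assert(((uintptr_t)src & 63u) == 0);  /* cache-line alignment precondition */
#if defined(__AVX512F__)
    __m512i vidx = _mm512_loadu_si512(idx);                    /* index register setup */
    __m512i v    = _mm512_castps_si512(_mm512_load_ps(src));   /* aligned full-line load */
    __m512i r    = _mm512_permutexvar_epi32(vidx, v);          /* r[i] = v[idx[i]] */
    _mm512_storeu_ps(out, _mm512_castsi512_ps(r));
#else
    for (int i = 0; i < 16; ++i) {   /* portable fallback producing the same result */
        out[i] = src[idx[i]];
    }
#endif
}
```

A caller would declare the source as alignas(64) float src[16] = {a, b, c, d}; so that the full 64-byte load stays inside one cache line. In a hot loop the index register would be set up once outside the loop, leaving a single permute per broadcast.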