Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

How to broadcast 4 floats into 4 lanes?

wang_p_1

Hi there,

After reading a lot of material, I still cannot find out how to broadcast 4 float variables into the 4 lanes of a vector register on MIC.

e.g. float array[4]={a,b,c,d};

How can I load this into a vector register as {aaaa,bbbb,cccc,dddd} using one intrinsic?

If I use _mm512_mask_blend_ps, it takes 4 intrinsics.

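// Builds {aaaa,bbbb,cccc,dddd}: one broadcast per value, then three masked blends.
__forceinline __m512 gather16float_4float(const float a, const float b, const float c, const float d)
{
        __m512 v = _mm512_set1_ps(a);                          // all 16 elements = a
        v = _mm512_mask_blend_ps(0x00f0,v,_mm512_set1_ps(b));  // elements 4-7    = b
        v = _mm512_mask_blend_ps(0x0f00,v,_mm512_set1_ps(c));  // elements 8-11   = c
        v = _mm512_mask_blend_ps(0xf000,v,_mm512_set1_ps(d));  // elements 12-15  = d
        return v;
}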

Is there a faster method?

All of the broadcast intrinsics I found operate on 128-bit blocks. Is there any intrinsic that works across the 4 lanes?

Could you please help me with this?

Thanks.

2 Replies
wang_p_1

Another question:

How can I swizzle a register like this:

from: {abcd,efgh,ijkl,mnop}

to: {abcd,abcd,abcd,abcd}

Once again, I have read the "User and Reference Guide for the Intel® C++ Compiler 15.0", but found nothing.

McCalpinJohn

Either of these can be done with the VPERMD instruction, accessed through the _mm512_permutevar_epi32 and/or _mm512_mask_permutevar_epi32 intrinsics. The name "permute" is slightly misleading: indices are allowed to repeat, so any of the 16 32-bit fields of the input can be copied to any number of output fields.
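To make the index patterns concrete, here is a minimal scalar sketch of the permute semantics described above. It is a plain C++ model and not the intrinsic itself; the names Vec16f, Vec16i, permute_model, broadcast_each_index, replicate_group_index and self_check are hypothetical helpers invented for this illustration. The model assumes only the low 4 bits of each index are used.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16f = std::array<float, 16>;         // stands in for one 512-bit register of floats
using Vec16i = std::array<std::int32_t, 16>;  // stands in for the register of indices

// Scalar model of a full-width 32-bit permute: out[i] = src[idx[i]].
// Indices may repeat, so one source element can feed many outputs.
// Assumption: only the low 4 bits of each index are significant.
Vec16f permute_model(const Vec16i& idx, const Vec16f& src) {
    Vec16f out{};
    for (int i = 0; i < 16; ++i) {
        out[i] = src[idx[i] & 15];
    }
    return out;
}

// Indices {0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3} give {aaaa,bbbb,cccc,dddd}.
Vec16i broadcast_each_index() {
    Vec16i idx{};
    for (int i = 0; i < 16; ++i) idx[i] = i / 4;
    return idx;
}

// Indices {0,1,2,3,0,1,2,3,...} give {abcd,abcd,abcd,abcd}.
Vec16i replicate_group_index() {
    Vec16i idx{};
    for (int i = 0; i < 16; ++i) idx[i] = i % 4;
    return idx;
}

// Small check of both patterns; elements 4..15 of the source are never selected.
void self_check() {
    Vec16f src{};
    for (int i = 0; i < 16; ++i) src[i] = static_cast<float>(i + 1);
    const Vec16f x = permute_model(broadcast_each_index(), src);
    const Vec16f y = permute_model(replicate_group_index(), src);
    for (int i = 0; i < 16; ++i) {
        assert(x[i] == src[i / 4]);
        assert(y[i] == src[i % 4]);
    }
}
```

With the real intrinsic, the same two index vectors would be passed as the index operand; only the index register differs between the two swizzles.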

The total number of instructions required depends on where the original four floats are located and on whether you have to count the instruction that sets up the register of indices. If the 4 32-bit input values are in the first 128 bits of a cache-line-aligned memory location, then the VPERMD instruction can read them directly from memory. The contents of the upper 12 32-bit fields are irrelevant, since neither of these swizzles uses indices 4-15.
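As a rough sketch of the memory-source case, the function below loads a full 64-byte block whose first four floats are the inputs and permutes it with a fixed index vector. It assumes an AVX-512F target and uses the float form _mm512_permutexvar_ps rather than the KNC-era intrinsic named above, whose exact signature on MIC should be checked in the compiler reference. The name broadcast4_to_16 is hypothetical, and a scalar fallback is included so the sketch also builds where AVX-512F is not enabled. Whether the compiler folds the load into the permute's memory operand is up to the compiler.

```cpp
#include <cassert>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Writes {a,a,a,a,b,b,b,b,c,c,c,c,d,d,d,d} to out[0..15], with a..d = in[0..3].
// `in` must point at 16 readable floats (for example one cache line);
// elements 4..15 are loaded but never selected by the indices.
void broadcast4_to_16(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
#if defined(__AVX512F__)
    // Index register: set up once, reusable for every call.
    const __m512i idx = _mm512_setr_epi32(0, 0, 0, 0, 1, 1, 1, 1,
                                          2, 2, 2, 2, 3, 3, 3, 3);
    const __m512 src = _mm512_loadu_ps(in);                   // full 64-byte load
    _mm512_storeu_ps(out, _mm512_permutexvar_ps(idx, src));   // cross-lane permute
#else
    // Scalar fallback with the same result (assumption: AVX-512F unavailable).
    for (int i = 0; i < 16; ++i) {
        out[i] = in[i / 4];
    }
#endif
}
```

Changing the index vector to 0,1,2,3 repeated four times would give the {abcd,abcd,abcd,abcd} swizzle from the second question with the same single permute.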
 
