Tayler,

Georg_V_ · ‎03-07-2013

Using Xeon Phi intrinsics in C++, I would like to interleave the float values of 2 registers. It is basically a vector of structures (vector<complex<float> >) to structure of vectors thing. I guess it is somehow related to swizzle and shuffle, but looking at the compiler and instruction set manuals I dont see how to do it. Here are the "formal" specs.

Given v1, v2 of type __m512, both containing 16 floats, transform v1=(x7 y7 ... x0 y0), v2=(x15 y15 ... x8 y8) into v1=(y15 y14 ... y1 y0) v2=(x15 x14 ... x1 x0)
Given v1, v2 of type __m512, both containing 16 floats, transform v1=(y15 y14 ... y1 y0), v2=(x15 x14... x1 x0) into v1=(x7 y7... x0 y0), v2=(x15 y15... x8 y8) (basically the reverse operation of the first)

With SSE, I do it with _mm_shuffle_ps() and _mm_unpackhi/lo_ps(), but how to (efficiently) do it for Xeon Phi?

Georg

Georg_V_ · ‎03-08-2013

No answer yet. So it is either a difficult or a stupid question ;-)

Evgueni_P_Intel · ‎03-10-2013

Hi georgv,

You may do it as follows.

[cpp]

static void

interleave(__m512 re, __m512 im, __m512 *u, __m512 *v)

{

__m512i interleave_lo_hi = _mm512_set_16to16_epi32(15,7,14,6,13,5,12,4,11,3,10,2,9,1,8,0);

__m512 tmp_im = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)im);

__m512 tmp_re = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)re);

*u = _mm512_mask_blend_ps(0xAAAA, tmp_re, _mm512_swizzle_ps(tmp_im, _MM_SWIZ_REG_CDAB));

*v = _mm512_mask_blend_ps(0x5555, tmp_im, _mm512_swizzle_ps(tmp_re, _MM_SWIZ_REG_CDAB));

}

[/cpp]

Georg_V_ · ‎03-13-2013

Hello,

thanks for providing this piece of code. It covers case 2. I guess that finding suitable code sequences for re-ordering register content is one of the more advanced jobs in Xeon Phi programming ;-) . Or is there some guidance on how to do i?

After learning about _mm512_mask_blend_ps() (it is not in the compiler documentation, but in the header files), I also found a solution for case 1. Before this I had a solution using _m512_i32_scatter_ps() and _m512_i32_gather_ps(), but in both cases the new implementation is about twice as fast (code see below).

I was wondering if _m512_i32_gather_ps() is so much slower because the compiler fails to generate vprefetch* instructions for scatter/gather. However, an experiment (commented out section in code below) did not show the performance boost I hoped for. Is there any guideline on the distances for prefetch commands, or is there any way to convince the compiler to generate suitable prefetches? Or is it just that _m512_i32_scatter_ps() and _m512_i32_gather_ps() are slow compared to the register fiddling I am doing now?

Thanks for your help,

Georg

[cpp]

const size_type nFloats=16;

typedef _m512 vec_t;

/// load 32 float values from float aligned adress into v1,v2 and interleave values from v1 and v2 such
/// that every second value is place in the other one. p_p is 4 byte aligned.
///
/// v1=(x7 y7 ... x0 y0), v2=(x15 y15 ... x8 y8) becomes v1=(y15 y14 ... y1 y0) v2=(x15 x14 ... x1 x0)
inline void vec_gather2_ps(const float *p_p, vec_t &rv1_p,vec_t &rv2_p)
{
#if 0
// version with gather
// get every second
__m512i index=_mm512_set_epi32(30,28,26,24,22,20,18,16,14,12,10,8,6,4,2,0); // step 2
rv1_p=_mm512_i32gather_ps(index,p_p,4);
rv2_p=_mm512_i32gather_ps(index,p_p+1,4);
// prefetch values as generated by code in other alternative. However,
// no significant impact and still considerably slower
//_mm_prefetch(reinterpret_cast<const char *>(p_p+63*nFloats),_MM_HINT_T0);
//_mm_prefetch(reinterpret_cast<const char *>(p_p+64*nFloats),_MM_HINT_T0);
//_mm_prefetch(reinterpret_cast<const char *>(p_p+495*nFloats),_MM_HINT_T2);
//_mm_prefetch(reinterpret_cast<const char *>(p_p+496*nFloats),_MM_HINT_T2);
#else
// version with permutevar/mask_blend
__m512 v1=_mm512_setzero_ps();
v1=_mm512_loadunpacklo_ps(v1,p_p);
v1=_mm512_loadunpackhi_ps(v1,p_p+1*nFloats);
__m512 v2=_mm512_setzero_ps();
v2=_mm512_loadunpacklo_ps(v2,p_p+1*nFloats);
v2=_mm512_loadunpackhi_ps(v2,p_p+2*nFloats);
// interleave such that odd elements are collected in high word
const __m512i gather_lo_hi = _mm512_set_16to16_epi32(15,13,11,9,7,5,3,1,14,12,10,8,6,4,2,0);
const __m512i gather_hi_lo = _mm512_set_16to16_epi32(14,12,10,8,6,4,2,0,15,13,11,9,7,5,3,1);
const __m512 split_v1 = (__m512)_mm512_permutevar_epi32(gather_lo_hi, (__m512i)v1); //even elements are now in lower half
const __m512 split_v2 = (__m512)_mm512_permutevar_epi32(gather_hi_lo, (__m512i)v2); //even elements are in upper half
rv1_p=_mm512_mask_blend_ps(0xFF00,split_v1,split_v2);
rv2_p=_mm512_permute4f128_ps(_mm512_mask_blend_ps(0x00FF,split_v1,split_v2),_MM_PERM_BADC);
#endif

}

/// distribute values for v1 and v2 such that every second value is placed in the other one, and store them at p_p.

/// p_p is 4 byte aligned.
///
/// v1=(y15 y14 ... y1 y0), v2=(x15 x14 ... x1 x0) becomes v1=(x7 y7 ... x0 y0), v2=(x15 y15 ... x7 y7)
inline void vec_scatter2_ps(float *p_p, vec_t const &rv1_p,vec_t const &rv2_p)
{
#if 0
// version with scatter
const __m512i index=_mm512_set_epi32(30,28,26,24,22,20,18,16,14,12,10,8,6,4,2,0); //step 2
_mm512_i32scatter_ps(p_p,index,rv1_p,4);
_mm512_i32scatter_ps(p_p+1,index,rv2_p,4);
#else
// version with code from forum
const __m512i interleave_lo_hi = _mm512_set_16to16_epi32(15,7,14,6,13,5,12,4,11,3,10,2,9,1,8,0);
const __m512 tmp_v1 = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)rv1_p);
const __m512 tmp_v2 = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)rv2_p);
const __m512 v1_new=_mm512_mask_blend_ps(0xAAAA, tmp_v1, _mm512_swizzle_ps(tmp_v2, _MM_SWIZ_REG_CDAB));
const __m512 v2_new=_mm512_mask_blend_ps(0x5555, tmp_v2, _mm512_swizzle_ps(tmp_v1, _MM_SWIZ_REG_CDAB));
_mm512_packstorelo_ps(p_p,v1_new);
_mm512_packstorehi_ps(p_p+nFloats,v1_new);
_mm512_packstorelo_ps(p_p+nFloats,v2_new);
_mm512_packstorehi_ps(p_p+2*nFloats,v2_new);
#endif
}

[/cpp]

TaylorIoTKidd · ‎08-26-2013

Georg,

Do you still want an answer to the last few questions you posed?

Regards
--
Taylor

PS We're going through and catching any accidentally dropped posts.

Georg_V_ · ‎08-27-2013

Tayler,

thanks for coming back to this topic. I am currently busy with different projects, but medium term the answer to the questions below would still be interesting. I also noticed that there are other posts in the forum about scatter/gather/prefetch and its performance, So mayb this is of general interest.

The remaining questions are:

I was wondering if _m512_i32_gather_ps() is so much slower because the compiler fails to generate vprefetch* instructions for scatter/gather. However, an experiment (commented out section in code below) did not show the performance boost I hoped for.
Is there any guideline on the distances for prefetch commands, or is there any way to convince the compiler to generate suitable prefetches?
Or is it just that _m512_i32_scatter_ps() and _m512_i32_gather_ps() are slow compared to the register fiddling I am doing now?

Thanks,

Georg

Evgueni_P_Intel · ‎08-27-2013

Dear georgv,

The gather/scatter instructions are used for indirect access to arrays when the access pattern is unknown. In your case it is better to use blend/permute.

Regarding vec_gather2_ps, it is possible to use 2 blend's followed by 2 permutevar's -- the idea is shown below.

[cpp]

static void
split(__m512 tmp_re, __m512 tmp_im, __m512 *re, __m512 *im)
{
    __m512i interleave_lo_hi = _mm512_set_16to16_epi32(15,7,14,6,13,5,12,4,11,3,10,2,9,1,8,0);
    __m512 u = _mm512_mask_blend_ps(0xAAAA, tmp_re, _mm512_swizzle_ps(tmp_im, _MM_SWIZ_REG_CDAB));
    __m512 v = _mm512_mask_blend_ps(0x5555, tmp_im, _mm512_swizzle_ps(tmp_re, _MM_SWIZ_REG_CDAB));
    *im = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)tmp_im);
    *re = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)tmp_re);
}

[/cpp]

Interleaving values of 2 vector registers