How to implement unpacklo and unpackhi with __m512i?

Hien_P_1 · ‎01-11-2015

Hi guys,

I am porting source code from SSE intrinsics to KNC (Intel Xeon Phi). The issue I am facing now is that I can't find a way to implement SSE's unpacklo and unpackhi on KNC.

Can anyone help me with this issue?

Thanks in advance.

Kevin_D_Intel · ‎01-12-2015

I can inquire with others better versed in the intrinsics if you can provide some additional details and/or source code showing exactly what the problem or interest is.

Hien_P_1 · ‎01-12-2015

Hi Kevin,

My SSE source code is as simple as this:

__m128i bm3, bm2, bm1, bm0;

//calculate values and store in bm3, bm2

bm1 = _mm_unpacklo_epi32(bm3,bm2);

bm0 = _mm_unpackhi_epi32(bm3,bm2);

When porting to the KNC instruction set, I pack 16 integer elements into an __m512i vector. I am looking for a way to implement unpacklo and unpackhi on KNC, something like this:

__m512i bm3, bm2, bm1, bm0;

//setting values from memory to bm3, bm2

bm1 = _mm512_unpacklo_epi32(bm3,bm2);

bm0 = _mm512_unpackhi_epi32(bm3,bm2);

Thanks in advance.

Evgueni_P_Intel · ‎01-12-2015

Hi Hien P.

You can do it as follows.

Evgueni.

[cpp]

__m512i CcAa = _mm512_mask_blend_epi32(0xaaaa, dcba, _mm512_swizzle_epi32(DCBA, _MM_SWIZ_REG_CDAB));

__m512i DdBb = _mm512_mask_blend_epi32(0x5555, DCBA, _mm512_swizzle_epi32(dcba, _MM_SWIZ_REG_CDAB));

[/cpp]
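For readers without a coprocessor at hand, the effect of these two blends can be modelled in scalar C++. This is only a sketch under stated assumptions: `Vec16`, `swizzle_cdab`, `mask_blend`, `pair_even` and `pair_odd` are hypothetical names, element 0 is taken to be the lowest lane, and `_MM_SWIZ_REG_CDAB` is modelled as swapping neighbouring elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::int32_t, 16>;  // hypothetical scalar stand-in for __m512i

// Models _MM_SWIZ_REG_CDAB: swap neighbouring elements (0<->1, 2<->3, ...).
Vec16 swizzle_cdab(const Vec16& a) {
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = a[i ^ 1];
    return r;
}

// Models _mm512_mask_blend_epi32(k, a, b): b[i] where bit i of k is set, else a[i].
Vec16 mask_blend(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = ((k >> i) & 1u) ? b[i] : a[i];
    return r;
}

// "CcAa": even-indexed elements of both inputs, interleaved.
Vec16 pair_even(const Vec16& dcba, const Vec16& DCBA) {
    return mask_blend(0xAAAA, dcba, swizzle_cdab(DCBA));
}

// "DdBb": odd-indexed elements of both inputs, interleaved.
Vec16 pair_odd(const Vec16& dcba, const Vec16& DCBA) {
    return mask_blend(0x5555, DCBA, swizzle_cdab(dcba));
}

// Sanity check with dcba = 1..16 and DCBA = -1..-16.
void check_pairing() {
    Vec16 lower{}, upper{};
    for (std::size_t i = 0; i < 16; ++i) {
        lower[i] = static_cast<std::int32_t>(i) + 1;
        upper[i] = -lower[i];
    }
    const Vec16 even = pair_even(lower, upper);
    const Vec16 odd = pair_odd(lower, upper);
    assert(even[0] == 1 && even[1] == -1 && even[2] == 3 && even[3] == -3);
    assert(odd[0] == 2 && odd[1] == -2 && odd[2] == 4 && odd[3] == -4);
}
```

In this model the two blends pair up neighbouring elements (even-indexed with even-indexed, odd-indexed with odd-indexed). They do not interleave the lower halves and the upper halves of the two vectors.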

Hien_P_1 · ‎01-12-2015

Hi @Evgueni Petrov,

Thanks for your reply. It seems to me that it doesn't work the way I expected.

For example, in my case:

If

bm3 = [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ]
bm2 = [-1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16]

then

bm1 = [1 -1 2 -2 3 -3 4 -4 5 -5 6 -6 7 -7 8 -8]
bm0 = [9 -9 10 -10 11 -11 12 -12 13 -13 14 -14 15 -15 16 -16]
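The expected result above can be written down as a plain scalar reference, which is handy for checking any intrinsic version. This is a minimal sketch: `Vec16`, `interleave_lo` and `interleave_hi` are hypothetical names, and element 0 is assumed to be the first value listed in each vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::int32_t, 16>;  // hypothetical scalar stand-in for __m512i

// Interleave the lower halves: [a0, b0, a1, b1, ..., a7, b7].
Vec16 interleave_lo(const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (std::size_t i = 0; i < 8; ++i) {
        r[2 * i] = a[i];
        r[2 * i + 1] = b[i];
    }
    return r;
}

// Interleave the upper halves: [a8, b8, a9, b9, ..., a15, b15].
Vec16 interleave_hi(const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (std::size_t i = 0; i < 8; ++i) {
        r[2 * i] = a[8 + i];
        r[2 * i + 1] = b[8 + i];
    }
    return r;
}

// Checks the reference against the example vectors quoted in this post.
void check_thread_example() {
    Vec16 bm3{}, bm2{};
    for (std::size_t i = 0; i < 16; ++i) {
        bm3[i] = static_cast<std::int32_t>(i) + 1;
        bm2[i] = -bm3[i];
    }
    const Vec16 bm1 = interleave_lo(bm3, bm2);
    const Vec16 bm0 = interleave_hi(bm3, bm2);
    assert(bm1[0] == 1 && bm1[1] == -1 && bm1[14] == 8 && bm1[15] == -8);
    assert(bm0[0] == 9 && bm0[1] == -9 && bm0[14] == 16 && bm0[15] == -16);
}
```

Note that this interleaves across the whole 16-element vector, not separately within each group of four elements.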

Actually, I have found a solution for this case. The code looks like this:

__m512i idx1 = _mm512_setr_epi32(0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15);
__m512i idx2 = _mm512_setr_epi32(8,0,9,1,10,2,11,3,12,4,13,5,14,6,15,7);

__m512i d, e;

d = _mm512_permutevar_epi32(idx1, bm3);
e = _mm512_permutevar_epi32(idx2, bm2);

bm1 = _mm512_mask_blend_epi32(0xAAAA, d, e);
bm0 = _mm512_mask_blend_epi32(0x5555, d, e);

bm0 = _mm512_shuffle_epi32(bm0, _MM_PERM_CDAB);

However, I'm not sure my solution is the best one for this case. It is also a big surprise to me that KNC doesn't provide unpacklo and unpackhi instructions.
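To see why the permute, blend and shuffle sequence above produces the wanted vectors, it can be traced with a scalar model. The sketch below is an approximation, not KNC code: `permutevar`, `mask_blend` and `swap_pairs` are hypothetical helpers that model `_mm512_permutevar_epi32`, `_mm512_mask_blend_epi32` and `_mm512_shuffle_epi32(..., _MM_PERM_CDAB)`, with element 0 assumed to be the lowest lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::int32_t, 16>;  // hypothetical scalar stand-in for __m512i

// Models _mm512_permutevar_epi32(idx, src): r[i] = src[idx[i]] (low 4 index bits).
Vec16 permutevar(const Vec16& idx, const Vec16& src) {
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = src[static_cast<std::size_t>(idx[i]) & 15u];
    return r;
}

// Models _mm512_mask_blend_epi32(k, a, b): b[i] where bit i of k is set, else a[i].
Vec16 mask_blend(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = ((k >> i) & 1u) ? b[i] : a[i];
    return r;
}

// Models the CDAB pattern: swap neighbouring elements (0<->1, 2<->3, ...).
Vec16 swap_pairs(const Vec16& a) {
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = a[i ^ 1];
    return r;
}

struct Unpacked { Vec16 lo, hi; };

// Scalar trace of the two-index sequence from this post.
Unpacked unpack_two_indices(const Vec16& bm3, const Vec16& bm2) {
    const Vec16 idx1 = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};
    const Vec16 idx2 = {8, 0, 9, 1, 10, 2, 11, 3, 12, 4, 13, 5, 14, 6, 15, 7};
    const Vec16 d = permutevar(idx1, bm3);  // [b3_0, b3_8, b3_1, b3_9, ...]
    const Vec16 e = permutevar(idx2, bm2);  // [b2_8, b2_0, b2_9, b2_1, ...]
    Unpacked out;
    out.lo = mask_blend(0xAAAA, d, e);              // even lanes from d, odd from e
    out.hi = swap_pairs(mask_blend(0x5555, d, e));  // pairs come out reversed, so swap
    return out;
}

// Checks the trace against the example vectors quoted in this post.
void check_two_index_version() {
    Vec16 bm3{}, bm2{};
    for (std::size_t i = 0; i < 16; ++i) {
        bm3[i] = static_cast<std::int32_t>(i) + 1;
        bm2[i] = -bm3[i];
    }
    const Unpacked r = unpack_two_indices(bm3, bm2);
    assert(r.lo[0] == 1 && r.lo[1] == -1 && r.lo[15] == -8);
    assert(r.hi[0] == 9 && r.hi[1] == -9 && r.hi[15] == -16);
}
```

The final swap is needed only for the upper-half result, because the second blend picks each pair in reversed order.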

Evgueni_P_Intel · ‎01-12-2015

We can save one instruction and one index.

 
const __m512i interleave_lo_hi = _mm512_set_16to16_epi32(15, 7, 14, 6, 13, 5, 12, 4, 11, 3, 10, 2, 9, 1, 8, 0);
__m512i tmp_im = _mm512_permutevar_epi32(interleave_lo_hi, im);
__m512i tmp_re = _mm512_permutevar_epi32(interleave_lo_hi, re);
u = _mm512_mask_blend_epi32(0xAAAA, tmp_re, _mm512_swizzle_epi32(tmp_im, _MM_SWIZ_REG_CDAB));
v = _mm512_mask_blend_epi32(0x5555, tmp_im, _mm512_swizzle_epi32(tmp_re, _MM_SWIZ_REG_CDAB));

Hien_P_1 · ‎01-12-2015

It's still not correct, @Evgueni. The results are:

bm1 = [-1 1 -2 2 -3 3 -4 4 -5 5 -6 6 -7 7 -8 8]
bm0 = [-9 9 -10 10 -11 11 -12 12 -13 13 -14 14 -15 15 -16 16]

whilst what I need is:

bm1 = [1 -1 2 -2 3 -3 4 -4 5 -5 6 -6 7 -7 8 -8]
bm0 = [9 -9 10 -10 11 -11 12 -12 13 -13 14 -14 15 -15 16 -16]

Of course, we could swizzle bm1 and bm0 after your code, but then your code would be no better than mine.

Is there another possible solution?

Evgueni_P_Intel · ‎01-12-2015

Just swap bm2 and bm3 in the intrinsics :)
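The effect of the operand order can be checked with a scalar trace of the single-index sequence. This is a sketch under assumptions: the helper names are hypothetical, element 0 is the lowest lane, and `_mm512_set_16to16_epi32` is assumed to list the highest element first, so the index vector reads 0, 8, 1, 9, ... from lane 0 upward.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::int32_t, 16>;  // hypothetical scalar stand-in for __m512i

// Models _mm512_permutevar_epi32(idx, src): r[i] = src[idx[i]] (low 4 index bits).
Vec16 permutevar(const Vec16& idx, const Vec16& src) {
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = src[static_cast<std::size_t>(idx[i]) & 15u];
    return r;
}

// Models _mm512_mask_blend_epi32(k, a, b): b[i] where bit i of k is set, else a[i].
Vec16 mask_blend(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = ((k >> i) & 1u) ? b[i] : a[i];
    return r;
}

// Models _MM_SWIZ_REG_CDAB: swap neighbouring elements (0<->1, 2<->3, ...).
Vec16 swizzle_cdab(const Vec16& a) {
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = a[i ^ 1];
    return r;
}

struct Unpacked { Vec16 u, v; };

// Scalar trace of the single-index sequence: u = [re0, im0, re1, im1, ...],
// v = [re8, im8, re9, im9, ...].
Unpacked unpack_one_index(const Vec16& re, const Vec16& im) {
    const Vec16 idx = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};
    const Vec16 tmp_im = permutevar(idx, im);
    const Vec16 tmp_re = permutevar(idx, re);
    Unpacked out;
    out.u = mask_blend(0xAAAA, tmp_re, swizzle_cdab(tmp_im));
    out.v = mask_blend(0x5555, tmp_im, swizzle_cdab(tmp_re));
    return out;
}

// Shows how the operand order decides which input lands in the even lanes.
void check_operand_order() {
    Vec16 bm3{}, bm2{};
    for (std::size_t i = 0; i < 16; ++i) {
        bm3[i] = static_cast<std::int32_t>(i) + 1;
        bm2[i] = -bm3[i];
    }
    const Unpacked wanted = unpack_one_index(bm3, bm2);   // re = bm3, im = bm2
    const Unpacked swapped = unpack_one_index(bm2, bm3);  // re = bm2, im = bm3
    assert(wanted.u[0] == 1 && wanted.u[1] == -1 && wanted.v[0] == 9 && wanted.v[1] == -9);
    assert(swapped.u[0] == -1 && swapped.u[1] == 1 && swapped.v[0] == -9 && swapped.v[1] == 9);
}
```

In this model, passing bm3 as `re` and bm2 as `im` gives the required order, while the opposite assignment reproduces the reversed pairs reported above.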

Hien_P_1 · ‎01-13-2015

Yes, it is correct, @Evgueni. Thanks for your suggestion.

I still reckon that the next Xeon Phi instruction set (Knights Landing) should provide unpacklo and unpackhi instructions. They are needed to implement a large number of algorithms.

Cheers,
Hien Phan.

Hien_P_1 · ‎01-13-2015

Hi @Evgueni,

I read your code again.

u = _mm512_mask_blend_epi32(0xAAAA, tmp_re, _mm512_swizzle_epi32(tmp_im, _MM_SWIZ_REG_CDAB));

v = _mm512_mask_blend_epi32(0x5555, tmp_im, _mm512_swizzle_epi32(tmp_re, _MM_SWIZ_REG_CDAB));

		
These two statements amount to 4 instructions (not 2), so your code still uses 7 instructions, the same as mine. Am I correct?

Evgueni_P_Intel · ‎01-13-2015

Since the blend instruction can incorporate a mask and a swizzle, we can compute u and v using only 2 blend instructions given tmp_im and tmp_re.

If this code is located inside a loop and the compiler finds a free zmm, then set_16to16 (a load) is moved out of the loop and the loop contains only 4 instructions (2 permutes, 2 blends.)

Hien_P_1 · ‎01-13-2015

Could you please share some links about how instructions can be combined like this on KNC, @Evgueni?

Evgueni_P_Intel · ‎01-13-2015

This information is contained in the "Intel Xeon Phi Coprocessor Instruction Set Reference Manual".

You can reach it from https://software.intel.com/en-us/forums/topic/278102 -- please look at the downloads at the bottom of the page.

Hien_P_1 · ‎01-14-2015

Thanks a lot, @Evgueni.