
How to implement unpacklo, unpackhi with __m512i?

Hien_P_1
Beginner
1,867 Views

Hi guys,

I am porting source code from SSE intrinsics to KNC (Intel Xeon Phi). The issue I am facing now is that I can't find a way to implement the SSE unpacklo and unpackhi operations on KNC.

Can anyone help me with this issue?

Thanks in advance.

 

0 Kudos
13 Replies
Kevin_D_Intel
Employee
1,867 Views

I can ask others who are better versed in the intrinsics if you can provide some additional details and/or source code showing exactly what the problem or interest is.

0 Kudos
Hien_P_1
Beginner
1,867 Views

Hi Kevin,

My SSE source code is as simple as this:

__m128i bm3, bm2, bm1, bm0;

//calculate values and store in bm3, bm2

bm1 = _mm_unpacklo_epi32(bm3,bm2);

bm0 = _mm_unpackhi_epi32(bm3,bm2);

When porting to the KNC instruction set, I pack 16 integer elements into an __m512i vector. I am looking for a way to implement unpacklo and unpackhi on KNC, so that I could write something like this:

__m512i bm3, bm2, bm1, bm0;

//setting values from memory to bm3, bm2

bm1 = _mm512_unpacklo_epi32(bm3,bm2);

bm0 = _mm512_unpackhi_epi32(bm3,bm2);

Thanks in advance.
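
For reference, here is a minimal scalar sketch of the operation being asked for. It assumes the goal is a full-width interleave (lanes 0..7 for "lo", lanes 8..15 for "hi"), which is what the numeric example later in this thread shows. `Vec16`, `unpacklo16`, `unpackhi16` and `check_reference` are hypothetical names used only for illustration; they are not KNC intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Sixteen 32-bit lanes, lane 0 first: a plain stand-in for __m512i.
using Vec16 = std::array<std::int32_t, 16>;

// Hypothetical helper: interleave lanes 0..7 of a and b (a first).
Vec16 unpacklo16(const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (int i = 0; i < 8; ++i) {
        r[2 * i]     = a[i];
        r[2 * i + 1] = b[i];
    }
    return r;
}

// Hypothetical helper: interleave lanes 8..15 of a and b (a first).
Vec16 unpackhi16(const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (int i = 0; i < 8; ++i) {
        r[2 * i]     = a[8 + i];
        r[2 * i + 1] = b[8 + i];
    }
    return r;
}

// Usage check with bm3 = 1..16 and bm2 = -1..-16.
void check_reference() {
    Vec16 bm3{}, bm2{};
    for (int i = 0; i < 16; ++i) { bm3[i] = i + 1; bm2[i] = -(i + 1); }
    const Vec16 lo = unpacklo16(bm3, bm2);
    const Vec16 hi = unpackhi16(bm3, bm2);
    assert(lo[0] == 1 && lo[1] == -1 && lo[15] == -8);
    assert(hi[0] == 9 && hi[1] == -9 && hi[15] == -16);
}
```

Note that this differs from the 128-bit SSE unpack, which only sees four lanes per register; the sketch treats all 16 lanes as one group.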

0 Kudos
Evgueni_P_Intel
Employee
1,867 Views

Hi Hien,

You can do it as follows.

Evgueni.

[cpp]

// dcba and DCBA are the two source vectors; result names list lanes from high to low.
__m512i CcAa = _mm512_mask_blend_epi32(0xaaaa, dcba, _mm512_swizzle_epi32(DCBA, _MM_SWIZ_REG_CDAB));

__m512i DdBb = _mm512_mask_blend_epi32(0x5555, DCBA, _mm512_swizzle_epi32(dcba, _MM_SWIZ_REG_CDAB));

[/cpp]
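
To see what these two lines produce without a KNC toolchain, here is a scalar sketch of the swizzle and the masked blend. It assumes `_MM_SWIZ_REG_CDAB` swaps the two lanes of every adjacent pair and that the blend takes lane i from its second operand when mask bit i is set. `Vec16`, `swizzle_cdab`, `mask_blend`, `pair_even`, `pair_odd` and `check_pairing` are hypothetical names for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<std::int32_t, 16>;  // lane 0 first

// Models the CDAB swizzle: swap the two lanes of every adjacent pair.
Vec16 swizzle_cdab(const Vec16& v) {
    Vec16 r{};
    for (int i = 0; i < 16; ++i) r[i] = v[i ^ 1];
    return r;
}

// Models the masked blend: lane i comes from b if bit i of k is set, else from a.
Vec16 mask_blend(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (int i = 0; i < 16; ++i) r[i] = ((k >> i) & 1) ? b[i] : a[i];
    return r;
}

// "CcAa": even-numbered lanes of both inputs, paired in place.
Vec16 pair_even(const Vec16& lower, const Vec16& upper) {
    return mask_blend(0xAAAA, lower, swizzle_cdab(upper));
}

// "DdBb": odd-numbered lanes of both inputs, paired in place.
Vec16 pair_odd(const Vec16& lower, const Vec16& upper) {
    return mask_blend(0x5555, upper, swizzle_cdab(lower));
}

// Usage check with lower = 1..16 and upper = -1..-16.
void check_pairing() {
    Vec16 lower{}, upper{};
    for (int i = 0; i < 16; ++i) { lower[i] = i + 1; upper[i] = -(i + 1); }
    const Vec16 even = pair_even(lower, upper);  // 1 -1 3 -3 5 -5 ...
    const Vec16 odd  = pair_odd(lower, upper);   // 2 -2 4 -4 6 -6 ...
    assert(even[0] == 1 && even[1] == -1 && even[2] == 3 && even[3] == -3);
    assert(odd[0] == 2 && odd[1] == -2 && odd[2] == 4 && odd[3] == -4);
}
```

Under these assumptions the two lines pair up lanes inside each adjacent pair (a with A, c with C, and so on). That is not the same as interleaving lanes 0..7 and 8..15 across the whole register.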

 

0 Kudos
Hien_P_1
Beginner
1,867 Views

Hi @Evgueni Petrov, 

Thanks for your reply. It doesn't seem to work the way I expected.

For example, in my case:

If

bm3 = [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
bm2 = [-1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16]

then 

bm1 = [1 -1 2 -2 3 -3 4 -4 5 -5 6 -6 7 -7 8 -8]
bm0 = [9 -9 10 -10 11 -11 12 -12 13 -13 14 -14 15 -15 16 -16] 

Actually, I have found a solution for this case. The code looks like this:

__m512i idx1 = _mm512_setr_epi32(0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15);
__m512i idx2 = _mm512_setr_epi32(8,0,9,1,10,2,11,3,12,4,13,5,14,6,15,7);

__m512i d, e;

d = _mm512_permutevar_epi32(idx1, bm3);
e = _mm512_permutevar_epi32(idx2, bm2);

bm1 = _mm512_mask_blend_epi32(0xAAAA, d, e);
bm0 = _mm512_mask_blend_epi32(0x5555, d, e);

bm0 = _mm512_shuffle_epi32(bm0, _MM_PERM_CDAB);

However, I'm not sure my solution is the best one for this case. It is also a big surprise to me that KNC doesn't provide unpacklo and unpackhi instructions.
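
The sequence above can be checked with a scalar sketch. It assumes the permute picks `v[idx[i]]` for lane i, the blend takes its second operand where the mask bit is set, and `_MM_PERM_CDAB` swaps adjacent lanes. `Vec16`, `permutevar`, `mask_blend`, `swap_pairs` and `interleave_two_indices` are hypothetical stand-ins, not the intrinsics themselves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

using Vec16 = std::array<std::int32_t, 16>;  // lane 0 first

// Models the variable permute: lane i of the result is v[idx[i]].
Vec16 permutevar(const Vec16& idx, const Vec16& v) {
    Vec16 r{};
    for (int i = 0; i < 16; ++i) r[i] = v[static_cast<std::size_t>(idx[i] & 15)];
    return r;
}

// Models the masked blend: lane i comes from b if bit i of k is set, else from a.
Vec16 mask_blend(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (int i = 0; i < 16; ++i) r[i] = ((k >> i) & 1) ? b[i] : a[i];
    return r;
}

// Models the CDAB shuffle: swap the two lanes of every adjacent pair.
Vec16 swap_pairs(const Vec16& v) {
    Vec16 r{};
    for (int i = 0; i < 16; ++i) r[i] = v[i ^ 1];
    return r;
}

// Returns {bm1, bm0}: two permutes, two blends, one shuffle, two index vectors.
std::pair<Vec16, Vec16> interleave_two_indices(const Vec16& bm3, const Vec16& bm2) {
    const Vec16 idx1 = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};
    const Vec16 idx2 = {8, 0, 9, 1, 10, 2, 11, 3, 12, 4, 13, 5, 14, 6, 15, 7};
    const Vec16 d = permutevar(idx1, bm3);
    const Vec16 e = permutevar(idx2, bm2);
    const Vec16 bm1 = mask_blend(0xAAAA, d, e);
    const Vec16 bm0 = swap_pairs(mask_blend(0x5555, d, e));
    assert(bm1[0] == bm3[0] && bm1[1] == bm2[0]);  // lo half starts with lane 0 of each input
    return {bm1, bm0};
}
```

With bm3 = 1..16 and bm2 = -1..-16, this model gives the bm1 and bm0 vectors listed in the example above.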

0 Kudos
Evgueni_P_Intel
Employee
1,867 Views

We can save one instruction and one index.

 
const __m512i interleave_lo_hi = _mm512_set_16to16_epi32(15, 7, 14, 6, 13, 5, 12, 4, 11, 3, 10, 2, 9, 1, 8, 0);
__m512i tmp_im = _mm512_permutevar_epi32(interleave_lo_hi, im);
__m512i tmp_re = _mm512_permutevar_epi32(interleave_lo_hi, re);
u = _mm512_mask_blend_epi32(0xAAAA, tmp_re, _mm512_swizzle_epi32(tmp_im, _MM_SWIZ_REG_CDAB));
v = _mm512_mask_blend_epi32(0x5555, tmp_im, _mm512_swizzle_epi32(tmp_re, _MM_SWIZ_REG_CDAB));
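
A scalar sketch of this one-index variant makes the output order easy to check. It assumes `_mm512_set_16to16_epi32` lists its arguments highest lane first, so the index is written lane 0 first here, and that the swizzle swaps adjacent lanes. `Vec16`, `permutevar`, `mask_blend`, `swizzle_cdab` and `interleave_one_index` are hypothetical names for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

using Vec16 = std::array<std::int32_t, 16>;  // lane 0 first

// Models the variable permute: lane i of the result is v[idx[i]].
Vec16 permutevar(const Vec16& idx, const Vec16& v) {
    Vec16 r{};
    for (int i = 0; i < 16; ++i) r[i] = v[static_cast<std::size_t>(idx[i] & 15)];
    return r;
}

// Models the masked blend: lane i comes from b if bit i of k is set, else from a.
Vec16 mask_blend(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (int i = 0; i < 16; ++i) r[i] = ((k >> i) & 1) ? b[i] : a[i];
    return r;
}

// Models the CDAB swizzle: swap the two lanes of every adjacent pair.
Vec16 swizzle_cdab(const Vec16& v) {
    Vec16 r{};
    for (int i = 0; i < 16; ++i) r[i] = v[i ^ 1];
    return r;
}

// Returns {u, v}. The input passed as re lands in the even lanes of both outputs.
std::pair<Vec16, Vec16> interleave_one_index(const Vec16& re, const Vec16& im) {
    const Vec16 idx = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};
    const Vec16 tmp_im = permutevar(idx, im);
    const Vec16 tmp_re = permutevar(idx, re);
    const Vec16 u = mask_blend(0xAAAA, tmp_re, swizzle_cdab(tmp_im));
    const Vec16 v = mask_blend(0x5555, tmp_im, swizzle_cdab(tmp_re));
    assert(u[0] == re[0] && u[1] == im[0]);  // lo half: re first, then im
    assert(v[0] == re[8] && v[1] == im[8]);  // hi half: re first, then im
    return {u, v};
}
```

In this model, calling `interleave_one_index(bm3, bm2)` puts the bm3 values first in each pair, while `interleave_one_index(bm2, bm3)` puts the bm2 values first, so the argument order decides which input leads.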

 

0 Kudos
Hien_P_1
Beginner
1,867 Views

It's still not correct, @Evgueni. The results are:

bm1 = [-1 1 -2 2 -3 3 -4 4 -5 5 -6 6 -7 7 -8 8] 
bm0 = [-9 9 -10 10 -11 11 -12 12 -13 13 -14 14 -15 15 -16 16] 

whilst what I need is:

bm1 = [1 -1 2 -2 3 -3 4 -4 5 -5 6 -6 7 -7 8 -8]
bm0 = [9 -9 10 -10 11 -11 12 -12 13 -13 14 -14 15 -15 16 -16] 

Of course, we could swizzle bm1 and bm0 after your code, but in that case your code is no better than mine.

Is there another possible solution?

 

0 Kudos
Evgueni_P_Intel
Employee
1,867 Views

Just swap bm2 and bm3 in the intrinsics :)

0 Kudos
Hien_P_1
Beginner
1,867 Views

Yes, it is correct, @Evgueni. Thanks for your suggestion. 

I still reckon that the next Xeon Phi instruction set (Knights Landing) should provide unpacklo and unpackhi instructions. They are needed to implement a large number of algorithms.

Cheers, 
Hien Phan. 

 

0 Kudos
Hien_P_1
Beginner
1,867 Views

Hi @Evgueni, 

I read your code again.

 u = _mm512_mask_blend_epi32(0xAAAA, tmp_re, _mm512_swizzle_epi32(tmp_im, _MM_SWIZ_REG_CDAB)); 

 v = _mm512_mask_blend_epi32(0x5555, tmp_im, _mm512_swizzle_epi32(tmp_re, _MM_SWIZ_REG_CDAB));

   

These two lines amount to 4 instructions (not 2), so your code still uses 7 instructions, the same as mine. Am I correct?

0 Kudos
Evgueni_P_Intel
Employee
1,867 Views

Since the blend instruction can incorporate both a mask and a swizzle, we can compute u and v with only 2 blend instructions, given tmp_im and tmp_re.

If this code is located inside a loop and the compiler finds a free zmm register, then set_16to16 (a load) is moved out of the loop and the loop contains only 4 instructions (2 permutes and 2 blends).

0 Kudos
Hien_P_1
Beginner
1,867 Views

Could you please point me to some documentation on how KNC instructions can incorporate a mask and a swizzle, @Evgueni?

0 Kudos
Evgueni_P_Intel
Employee
1,867 Views

This information is in the "Intel Xeon Phi Coprocessor Instruction Set Reference Manual".

You can reach it from https://software.intel.com/en-us/forums/topic/278102 -- please look at the downloads at the bottom of the page.

0 Kudos
Hien_P_1
Beginner
1,867 Views

Thanks a lot, @Evgueni.

0 Kudos