Hi guys,
I am porting source code from SSE intrinsics to KNC (Intel Xeon Phi). The issue I am dealing with now is that I can't find a way to implement SSE's unpacklo and unpackhi on KNC.
Can anyone help me with this issue?
Thanks in advance.
I can inquire with others better versed in the intrinsics if you can provide some additional details and/or source code showing what the exact problem or interest is.
Hi Kevin,
My SSE source code is as simple as this:
__m128i bm3, bm2, bm1, bm0;
// calculate values and store them in bm3, bm2
bm1 = _mm_unpacklo_epi32(bm3,bm2);
bm0 = _mm_unpackhi_epi32(bm3,bm2);
When porting to the KNC instruction set, I pack 16 integer elements into one __m512i vector. I am looking for a way to implement unpacklo and unpackhi on KNC, i.e. the equivalent of:
__m512i bm3, bm2, bm1, bm0;
// load values from memory into bm3, bm2
bm1 = _mm512_unpacklo_epi32(bm3,bm2);
bm0 = _mm512_unpackhi_epi32(bm3,bm2);
Thanks in advance.
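For reference, the lane order produced by the two SSE intrinsics above can be written as a small scalar model. This is only a sketch: `Vec4`, `unpacklo_model`, and `unpackhi_model` are hypothetical names introduced here, and element 0 is assumed to be the lowest 32-bit lane.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<std::int32_t, 4>;  // element 0 models the lowest lane

// Scalar model of _mm_unpacklo_epi32(a, b): interleave the two low lanes.
Vec4 unpacklo_model(const Vec4& a, const Vec4& b) {
    return Vec4{{a[0], b[0], a[1], b[1]}};
}

// Scalar model of _mm_unpackhi_epi32(a, b): interleave the two high lanes.
Vec4 unpackhi_model(const Vec4& a, const Vec4& b) {
    return Vec4{{a[2], b[2], a[3], b[3]}};
}
```

With a = [1 2 3 4] and b = [-1 -2 -3 -4], the low result is [1 -1 2 -2] and the high result is [3 -3 4 -4].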
Hi Hien P.,
You can do it as follows.
Evgueni.
[cpp]
__m512i CcAa = _mm512_mask_blend_epi32(0xaaaa, dcba, _mm512_swizzle_epi32(DCBA, _MM_SWIZ_REG_CDAB));
__m512i DdBb = _mm512_mask_blend_epi32(0x5555, DCBA, _mm512_swizzle_epi32(dcba, _MM_SWIZ_REG_CDAB));
[/cpp]
Hi @Evgueni Petrov,
Thanks for your reply. It doesn't seem to work the way I expected.
For example, in my case:
If
bm3 = [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
bm2 = [-1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16]
then
bm1 = [1 -1 2 -2 3 -3 4 -4 5 -5 6 -6 7 -7 8 -8]
bm0 = [9 -9 10 -10 11 -11 12 -12 13 -13 14 -14 15 -15 16 -16]
Actually, I have found a solution for this case. The code looks like this:
__m512i idx1 = _mm512_setr_epi32(0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15);
__m512i idx2 = _mm512_setr_epi32(8,0,9,1,10,2,11,3,12,4,13,5,14,6,15,7);
__m512i d, e;
d = _mm512_permutevar_epi32(idx1, bm3);
e = _mm512_permutevar_epi32(idx2, bm2);
bm1 = _mm512_mask_blend_epi32(0xAAAA, d, e);
bm0 = _mm512_mask_blend_epi32(0x5555, d, e);
bm0 = _mm512_shuffle_epi32(bm0, _MM_PERM_CDAB);
However, I'm not sure my solution is the best one for this case. It is also a big surprise to me that KNC doesn't provide unpacklo and unpackhi instructions.
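The sequence above can be checked without KNC hardware by modelling each intrinsic on a plain array. The following is a minimal sketch under stated assumptions, not Intel's definition of the instructions: `Vec16`, `permutevar`, `mask_blend`, `swap_pairs`, `Interleaved`, and `interleave_posted` are hypothetical names, element 0 is taken to be the first value passed to `_mm512_setr_epi32`, and the blend is assumed to pick its second operand where the mask bit is set.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::int32_t, 16>;  // element 0 models the lowest lane

// Model of _mm512_permutevar_epi32(idx, src): out[i] = src[idx[i]].
Vec16 permutevar(const Vec16& idx, const Vec16& src) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        assert(idx[i] >= 0 && idx[i] < 16);  // the model only accepts in-range indices
        out[i] = src[static_cast<std::size_t>(idx[i])];
    }
    return out;
}

// Model of _mm512_mask_blend_epi32(k, a, b): lane i is b[i] if bit i of k is set, else a[i].
Vec16 mask_blend(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) out[i] = ((k >> i) & 1u) ? b[i] : a[i];
    return out;
}

// Model of the CDAB pattern (_MM_PERM_CDAB): swap the two lanes of every adjacent pair.
Vec16 swap_pairs(const Vec16& v) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) out[i] = v[i ^ 1u];
    return out;
}

struct Interleaved { Vec16 lo, hi; };  // lo plays the role of bm1, hi of bm0

// Scalar replay of the posted sequence: two permutes, two blends, one pair swap.
Interleaved interleave_posted(const Vec16& bm3, const Vec16& bm2) {
    const Vec16 idx1{{0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15}};
    const Vec16 idx2{{8, 0, 9, 1, 10, 2, 11, 3, 12, 4, 13, 5, 14, 6, 15, 7}};
    const Vec16 d = permutevar(idx1, bm3);
    const Vec16 e = permutevar(idx2, bm2);
    const Vec16 bm1 = mask_blend(0xAAAA, d, e);
    const Vec16 bm0 = swap_pairs(mask_blend(0x5555, d, e));
    return Interleaved{bm1, bm0};
}
```

Under these assumptions, feeding in bm3 = [1 .. 16] and bm2 = [-1 .. -16] reproduces the bm1 and bm0 values listed in this post.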
We can save one instruction and one index.
const __m512i interleave_lo_hi = _mm512_set_16to16_epi32(15, 7, 14, 6, 13, 5, 12, 4, 11, 3, 10, 2, 9, 1, 8, 0);
__m512i tmp_im = _mm512_permutevar_epi32(interleave_lo_hi, im);
__m512i tmp_re = _mm512_permutevar_epi32(interleave_lo_hi, re);
u = _mm512_mask_blend_epi32(0xAAAA, tmp_re, _mm512_swizzle_epi32(tmp_im, _MM_SWIZ_REG_CDAB));
v = _mm512_mask_blend_epi32(0x5555, tmp_im, _mm512_swizzle_epi32(tmp_re, _MM_SWIZ_REG_CDAB));
It's still not correct, @Evgueni. The results are:
bm1 = [-1 1 -2 2 -3 3 -4 4 -5 5 -6 6 -7 7 -8 8]
bm0 = [-9 9 -10 10 -11 11 -12 12 -13 13 -14 14 -15 15 -16 16]
whilst what I need is:
bm1 = [1 -1 2 -2 3 -3 4 -4 5 -5 6 -6 7 -7 8 -8]
bm0 = [9 -9 10 -10 11 -11 12 -12 13 -13 14 -14 15 -15 16 -16]
Of course, we could swizzle bm1 and bm0 in your code, but then your code would be no better than mine.
Is there another possible solution?
Just swap bm2 and bm3 in the intrinsics :)
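Why the operand order matters in the single-index version can be seen in a scalar model. The sketch below is an illustration under assumptions, not a statement about the hardware: `Vec16`, `permutevar`, `blend_swapped`, `UV`, and `interleave_single_index` are hypothetical names, and `blend_swapped` models a blend whose second operand has had the CDAB pair swap applied.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::int32_t, 16>;  // element 0 models the lowest lane

// Model of _mm512_permutevar_epi32(idx, src): out[i] = src[idx[i]].
Vec16 permutevar(const Vec16& idx, const Vec16& src) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        assert(idx[i] >= 0 && idx[i] < 16);  // the model only accepts in-range indices
        out[i] = src[static_cast<std::size_t>(idx[i])];
    }
    return out;
}

// Model of mask_blend(k, a, swizzle_CDAB(b)): where bit i of k is set, take the
// pair-swapped lane of b, otherwise keep a[i].
Vec16 blend_swapped(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) out[i] = ((k >> i) & 1u) ? b[i ^ 1u] : a[i];
    return out;
}

struct UV { Vec16 u, v; };

// Scalar replay of the single-index sequence: re supplies the even lanes of the
// results and im the odd lanes.
UV interleave_single_index(const Vec16& re, const Vec16& im) {
    // Same lane order as _mm512_set_16to16_epi32(15, 7, ..., 8, 0), listed from lane 0 up.
    const Vec16 idx{{0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15}};
    const Vec16 tmp_re = permutevar(idx, re);
    const Vec16 tmp_im = permutevar(idx, im);
    return UV{blend_swapped(0xAAAA, tmp_re, tmp_im),
              blend_swapped(0x5555, tmp_im, tmp_re)};
}
```

In this model, `interleave_single_index(bm3, bm2)` yields u = [1 -1 2 -2 ...] and v = [9 -9 10 -10 ...], while passing the operands the other way round yields the reversed pairs reported earlier in the thread.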
Yes, it is correct, @Evgueni. Thanks for your suggestion.
I still reckon that the upcoming Knights Landing Xeon Phi instruction set should provide unpacklo and unpackhi instructions. They are necessary for implementing a large number of algorithms.
Cheers,
Hien Phan.
Hi @Evgueni,
I read your code again. These two lines:
u = _mm512_mask_blend_epi32(0xAAAA, tmp_re, _mm512_swizzle_epi32(tmp_im, _MM_SWIZ_REG_CDAB));
v = _mm512_mask_blend_epi32(0x5555, tmp_im, _mm512_swizzle_epi32(tmp_re, _MM_SWIZ_REG_CDAB));
amount to 4 instructions (not 2). So your code still uses 7 instructions, the same as mine. Am I correct?
Since the blend instruction can incorporate a mask and a swizzle, we can compute u and v using only 2 blend instructions given tmp_im and tmp_re.
If this code is located inside a loop and the compiler finds a free zmm, then set_16to16 (a load) is moved out of the loop and the loop contains only 4 instructions (2 permutes, 2 blends.)
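The loop structure described here can be sketched in a scalar model: the index vector is built once before the loop, and each iteration performs two permutes and two blends on a 16-element chunk. This is an assumption-laden illustration only. `Vec16`, `permutevar`, `blend_swapped`, and `interleave_arrays` are hypothetical names, the input length is assumed to be a multiple of 16, and the model says nothing about what a particular compiler will actually hoist.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec16 = std::array<std::int32_t, 16>;  // element 0 models the lowest lane

// Model of a permute: out[i] = src[idx[i]].
Vec16 permutevar(const Vec16& idx, const Vec16& src) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        assert(idx[i] >= 0 && idx[i] < 16);  // the model only accepts in-range indices
        out[i] = src[static_cast<std::size_t>(idx[i])];
    }
    return out;
}

// Model of one blend that carries both a mask and a CDAB swizzle on its second operand.
Vec16 blend_swapped(std::uint16_t k, const Vec16& a, const Vec16& b) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) out[i] = ((k >> i) & 1u) ? b[i ^ 1u] : a[i];
    return out;
}

// Interleave two equally long arrays: out = re[0], im[0], re[1], im[1], ...
std::vector<std::int32_t> interleave_arrays(const std::vector<std::int32_t>& re,
                                            const std::vector<std::int32_t>& im) {
    assert(re.size() == im.size() && re.size() % 16 == 0);
    // Loop-invariant index: built once, outside the loop.
    const Vec16 idx{{0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15}};
    std::vector<std::int32_t> out(2 * re.size());
    for (std::size_t base = 0; base < re.size(); base += 16) {
        Vec16 r{}, m{};
        std::copy_n(re.data() + base, 16, r.begin());
        std::copy_n(im.data() + base, 16, m.begin());
        // Loop body: 2 permutes and 2 blends.
        const Vec16 tmp_re = permutevar(idx, r);
        const Vec16 tmp_im = permutevar(idx, m);
        const Vec16 u = blend_swapped(0xAAAA, tmp_re, tmp_im);
        const Vec16 v = blend_swapped(0x5555, tmp_im, tmp_re);
        std::copy_n(u.begin(), 16, out.data() + 2 * base);
        std::copy_n(v.begin(), 16, out.data() + 2 * base + 16);
    }
    return out;
}
```

Calling `interleave_arrays` on two 32-element arrays returns 64 elements in which every even position comes from the first array and every odd position from the second.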
Could you please share some links on how KNC instructions can incorporate a mask and a swizzle, @Evgueni?
This information is contained in "Intel Xeon Phi Coprocessor Instruction Set Reference Manual".
You can reach it from https://software.intel.com/en-us/forums/topic/278102 -- please look at the downloads at the bottom of the page.
Thanks a lot, @Evgueni.
