While implementing ZGEMM on Intel's MIC, I have run into the following problem and would appreciate any help. Say I need to compute OUT = M*IN, where OUT, M, and IN are matrices of complex doubles. I have multiplied 4 rows of 'M' with a single column of IN and obtained the following 4 vectors:
O1-> |a8|a7|a6|a5|a4|a3|a2|a1|//M1*IN0
O2-> |b8|b7|b6|b5|b4|b3|b2|b1|//M2*IN0
O3-> |c8|c7|c6|c5|c4|c3|c2|c1|//M3*IN0
O4-> |d8|d7|d6|d5|d4|d3|d2|d1|//M4*IN0
I have to rearrange them into:
O1_new ->|d2|d1|c2|c1|b2|b1|a2|a1|
O2_new ->|d4|d3|c4|c3|b4|b3|a4|a3|
O3_new ->|d6|d5|c6|c5|b6|b5|a6|a5|
O4_new ->|d8|d7|c8|c7|b8|b7|a8|a7|
which amounts to a transpose of the complex (two-double) elements. How can I achieve this using the C intrinsics for Larrabee in the fewest cycles?
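For reference, here is the rearrangement written as a plain scalar loop (the array names are purely illustrative; a vectorized version should reproduce this output):
[cpp]/* O[r][i]  : element i of the input vector O(r+1), e.g. O[0] = |a8|...|a1|.
   On[r][i] : element i of O(r+1)_new. Each index pair (2k, 2k+1) holds one
   complex double, so this is a 4x4 transpose of complex elements. */
double O[4][8], On[4][8];
for (int r = 0; r < 4; ++r)          /* output vector O(r+1)_new */
    for (int c = 0; c < 4; ++c) {    /* source vector O(c+1)     */
        On[r][2*c]     = O[c][2*r];      /* real part      */
        On[r][2*c + 1] = O[c][2*r + 1];  /* imaginary part */
    }[/cpp]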
Hi Evgueni,
I can't use Intel's MKL since I'm trying to code this on my own :)
Regarding the method that you have outlined, two queries:
1. Is _mm512_mask_blend_pd() actually available? It isn't mentioned in the Intel manual.
2. Won't this be extremely slow, or result in a loss of precision, since we are casting to an integer vector and then recasting back?
While the compiler doc team answers your question regarding _mm512_mask_blend_pd, you may use the following equivalent sequence.
The type cast from __m512i to __m512d does not convert from int to double. It only tells the compiler how to interpret the 512 bits under the cast.
[cpp]// Step 1: interleave the 128-bit complex elements of each row pair.
// _MM_SWIZ_REG_BADC swaps adjacent pairs of doubles within each 256-bit
// half; masks 0x33/0xcc pick which pairs come from the swizzled neighbor.
__m512d a1 = _mm512_mask_mov_pd(w1, 0x33, _mm512_swizzle_pd(w0, _MM_SWIZ_REG_BADC));
__m512d a0 = _mm512_mask_mov_pd(w0, 0xcc, _mm512_swizzle_pd(w1, _MM_SWIZ_REG_BADC));
__m512d a3 = _mm512_mask_mov_pd(w3, 0x33, _mm512_swizzle_pd(w2, _MM_SWIZ_REG_BADC));
__m512d a2 = _mm512_mask_mov_pd(w2, 0xcc, _mm512_swizzle_pd(w3, _MM_SWIZ_REG_BADC));
// Step 2: exchange the 256-bit halves (alignr by 8 dwords = 256 bits)
// between a0/a2 and a1/a3. The casts only reinterpret the 512 bits;
// no int<->double conversion takes place.
__m512d y2 = _mm512_castsi512_pd(_mm512_mask_alignr_epi32(_mm512_castpd_si512(a2), 0x00ff, _mm512_castpd_si512(a0), _mm512_castpd_si512(a0), 8));
__m512d y0 = _mm512_castsi512_pd(_mm512_mask_alignr_epi32(_mm512_castpd_si512(a0), 0xff00, _mm512_castpd_si512(a2), _mm512_castpd_si512(a2), 8));
__m512d y3 = _mm512_castsi512_pd(_mm512_mask_alignr_epi32(_mm512_castpd_si512(a3), 0x00ff, _mm512_castpd_si512(a1), _mm512_castpd_si512(a1), 8));
__m512d y1 = _mm512_castsi512_pd(_mm512_mask_alignr_epi32(_mm512_castpd_si512(a1), 0xff00, _mm512_castpd_si512(a3), _mm512_castpd_si512(a3), 8));[/cpp]
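To make the data flow concrete, here is a self-contained usage sketch of the sequence above (the wrapper function, its name, and the aligned load/store calls are my own assumptions, not part of the reply itself):
[cpp]#include <immintrin.h>  /* with icc -mmic this provides the KNC intrinsics */

/* Hypothetical wrapper: all pointers must be 64-byte aligned.
   O1..O4 hold M1*IN0..M4*IN0; O1n..O4n receive O1_new..O4_new. */
void zgemm_4x1_transpose(const double *O1, const double *O2,
                         const double *O3, const double *O4,
                         double *O1n, double *O2n,
                         double *O3n, double *O4n)
{
    __m512d w0 = _mm512_load_pd(O1), w1 = _mm512_load_pd(O2);
    __m512d w2 = _mm512_load_pd(O3), w3 = _mm512_load_pd(O4);

    /* the sequence from the post above */
    __m512d a1 = _mm512_mask_mov_pd(w1, 0x33, _mm512_swizzle_pd(w0, _MM_SWIZ_REG_BADC));
    __m512d a0 = _mm512_mask_mov_pd(w0, 0xcc, _mm512_swizzle_pd(w1, _MM_SWIZ_REG_BADC));
    __m512d a3 = _mm512_mask_mov_pd(w3, 0x33, _mm512_swizzle_pd(w2, _MM_SWIZ_REG_BADC));
    __m512d a2 = _mm512_mask_mov_pd(w2, 0xcc, _mm512_swizzle_pd(w3, _MM_SWIZ_REG_BADC));
    __m512d y2 = _mm512_castsi512_pd(_mm512_mask_alignr_epi32(_mm512_castpd_si512(a2), 0x00ff, _mm512_castpd_si512(a0), _mm512_castpd_si512(a0), 8));
    __m512d y0 = _mm512_castsi512_pd(_mm512_mask_alignr_epi32(_mm512_castpd_si512(a0), 0xff00, _mm512_castpd_si512(a2), _mm512_castpd_si512(a2), 8));
    __m512d y3 = _mm512_castsi512_pd(_mm512_mask_alignr_epi32(_mm512_castpd_si512(a3), 0x00ff, _mm512_castpd_si512(a1), _mm512_castpd_si512(a1), 8));
    __m512d y1 = _mm512_castsi512_pd(_mm512_mask_alignr_epi32(_mm512_castpd_si512(a1), 0xff00, _mm512_castpd_si512(a3), _mm512_castpd_si512(a3), 8));

    _mm512_store_pd(O1n, y0);  /* |d2|d1|c2|c1|b2|b1|a2|a1| */
    _mm512_store_pd(O2n, y1);  /* |d4|d3|c4|c3|b4|b3|a4|a3| */
    _mm512_store_pd(O3n, y2);  /* |d6|d5|c6|c5|b6|b5|a6|a5| */
    _mm512_store_pd(O4n, y3);  /* |d8|d7|c8|c7|b8|b7|a8|a7| */
}[/cpp]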
The intrinsics _mm512_mask_blend_epi32, _mm512_mask_blend_epi64, _mm512_mask_blend_ps, and _mm512_mask_blend_pd (present in zmmintrin.h) are missing from the C++ User Guide. I reported this to the Documentation team under the internal tracking id noted below.
(Internal tracking id: DPD200242070)
(Resolution Update on 11/27/2013): This defect is fixed in the Intel C++ Composer XE 2013 SP1 Initial Release (2013.1.0.080 - Linux)
Hi Evgueni,
Thanks for the help! It works correctly; now I have to see how it affects performance.
Hi Frank,
Thanks for looking into this. It is odd that there is such a discrepancy between the number of intrinsics available for integers and for doubles. Are there really no permute or shuffle intrinsics for doubles that work on two vectors? Or is it fine to use the corresponding integer ones, such as _mm512_shuffle_epi32/_mm512_permute4f128_epi32?
Thanks,
Bharat.
In regard to your comment/questions (to Frank), our Development team replied:
On KNC, only 32-bit versions of the shuffle and permute are available, with the semantics defined by the KNC EAS. The shuffle/permute instructions on KNC are untyped, and there are no shuffle/permute instructions that operate on 64-bit elements. Here is the list of shuffle/permute intrinsics for KNC:
[cpp]extern __m512i __ICL_INTRINCC _mm512_shuffle_epi32(__m512i, _MM_PERM_ENUM);
extern __m512i __ICL_INTRINCC _mm512_mask_shuffle_epi32(__m512i, __mmask16, __m512i, _MM_PERM_ENUM);
extern __m512i __ICL_INTRINCC _mm512_permutevar_epi32(__m512i, __m512i);
extern __m512i __ICL_INTRINCC _mm512_mask_permutevar_epi32(__m512i, __mmask16, __m512i, __m512i);
extern __m512i __ICL_INTRINCC _mm512_permute4f128_epi32(__m512i, _MM_PERM_ENUM);
extern __m512i __ICL_INTRINCC _mm512_mask_permute4f128_epi32(__m512i, __mmask16, __m512i, _MM_PERM_ENUM);
extern __m512 __ICL_INTRINCC _mm512_permute4f128_ps(__m512, _MM_PERM_ENUM);
extern __m512 __ICL_INTRINCC _mm512_mask_permute4f128_ps(__m512, __mmask16, __m512, _MM_PERM_ENUM);[/cpp]
If you want to shuffle or permute vectors of double elements with the KNC semantics, the cast intrinsics may be used, for example:
[cpp]__m512d x = _mm512_castsi512_pd(_mm512_shuffle_epi32(_mm512_castpd_si512(y), _MM_PERM_CDAB));[/cpp]
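Along the same lines, a 64-bit-element permute on doubles can be emulated with the 32-bit _mm512_permutevar_epi32 by addressing each double through a pair of dword indices. A sketch (the index vector is my own example, chosen to swap adjacent doubles):
[cpp]/* Swap 64-bit lanes 0<->1, 2<->3, ... of y: each double is moved by
   permuting its two 32-bit halves together. */
__m512i idx = _mm512_set_epi32(13, 12, 15, 14, 9, 8, 11, 10,
                               5, 4, 7, 6, 1, 0, 3, 2);
__m512d swapped = _mm512_castsi512_pd(
    _mm512_permutevar_epi32(idx, _mm512_castpd_si512(y)));[/cpp]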
Hope that helps.
