Hi Everyone,
I need to shift a vector register by whole 64-bit double-precision elements. The register contents are as follows:
V: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
I want to perform a logical left or right shift of the float64 vector V by whole elements. For example, after shifting by 2 float64 elements, the result would be:
V: | 0 | 0 | 5 | 4 | 3 | 2 | 1 | 0 |
But I can't find an instruction like that. Is there an instruction that does this?
(By the way, I saw that there are instructions that perform an element-by-element logical shift of an int32 vector, for example _mm512_sllv_epi32.)
Thanks!
I think that's what the "swizzle" and "permute" instructions are for. One of them moves around 4 blocks of 4 floats inside a register, and the other moves floats within each block. In the Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual, they are described in Section 2.2. In the Intel C++ Compiler XE User and Reference Guide, the corresponding intrinsics are described in "Compiler Reference -> Intrinsics -> Intrinsics for the Intel MIC Architecture -> Shuffle Intrinsics" (and maybe this link will work: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/hh_goto.htm#GUID-E903C1C4-A361-4D12-9A3A-DD1047B4A2A3.htm )
I have seen these instructions. But the permute intrinsics are only provided for 32-bit element vectors, not for float64 vectors. So I implemented the logical shift of a float64 vector with the "swizzle" instruction. It worked. But the problem is that I need 6 instructions in total to perform one logical left or right shift. For example,
#define F64_SL1(_sl1_arg_zmm, _sl1_arg_pach, _sl1_arg_val4)\
(\
_sl1_d1=_mm512_swizzle_pd((_sl1_arg_zmm), _MM_SWIZ_REG_CDAB),\
_sl1_d2=_mm512_mask_swizzle_pd(_sl1_d1, _MASK_44, (_sl1_arg_zmm), _MM_SWIZ_REG_BBBB),\
_sl1_d3=_mm512_set1_pd((_sl1_arg_val4)),\
_sl1_d4=_mm512_mask_swizzle_pd(_sl1_d2, _MASK_10, _sl1_d3, _MM_SWIZ_REG_NONE),\
_sl1_d5=_mm512_set1_pd((_sl1_arg_pach)),\
_mm512_mask_swizzle_pd(_sl1_d4, _MASK_01, _sl1_d5, _MM_SWIZ_REG_NONE) \
)
It is so expensive!
Hello Zhang,
I confess I was not able to fully follow the example above: what are the definitions of _MASK_*? Are all the opening and closing parentheses matched in the definition of the C macro F64_SL1?
Although the intrinsics API for permute is targeted for i32, I _wonder_ if one can just apply two i32 permutations to get a 64bit permutation. Use the masked version to fill the shifted portion with the new value(s) you want. My thoughts:
#define rotate_mask_d 0xfffc
__m512i permut_idx_d = _mm512_set_epi32(13,12,11,10,9,8,7,6,5,4,3,2,1,0,15,14);
__m512d v_fill_value = _mm512_set1_pd(-10.0);
v_target = (__m512d) _mm512_mask_permutevar_epi32((__m512i)v_fill_value, rotate_mask_d, permut_idx_d, (__m512i)v_target);
So v_target = 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8 should be rotated to -10, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7
And opposite masks should rotate to the other direction:
#define rotate_mask_d 0x3fff
__m512i permut_idx_d = _mm512_set_epi32(1,0,15,14,13,12,11,10,9,8,7,6,5,4,3,2);
This might be worth a try...
Leo.
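To check the mask and index values above without coprocessor hardware, the masked permute can be modelled in plain C. This is only a sketch: mask_permute32_model and shift_up_one_f64 are hypothetical names, not intrinsics, and the model assumes that lane i receives src[idx[i]] where mask bit i is set and keeps the old value otherwise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed lane semantics of a masked 32-bit permute:
   lane i takes src[idx[i]] where mask bit i is set, else keeps old[i]. */
void mask_permute32_model(const uint32_t old[16], uint16_t k,
                          const uint32_t idx[16], const uint32_t src[16],
                          uint32_t out[16])
{
    for (int i = 0; i < 16; ++i)
        out[i] = ((k >> i) & 1u) ? src[idx[i] & 15u] : old[i];
}

/* Move every double up one position and put `fill` in position 0. */
void shift_up_one_f64(const double v[8], double fill, double out[8])
{
    /* Listed from lane 0 upward; the same values as
       _mm512_set_epi32(13,...,0,15,14), whose arguments run from
       lane 15 down to lane 0. */
    static const uint32_t idx[16] =
        {14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13};
    double fill8[8];
    uint32_t old[16], src[16], res[16];

    /* Each double must span exactly two 32-bit lanes. */
    assert(sizeof(double) == 2 * sizeof(uint32_t));

    for (int i = 0; i < 8; ++i)
        fill8[i] = fill;
    memcpy(old, fill8, sizeof old);
    memcpy(src, v, sizeof src);
    /* Mask 0xfffc leaves lanes 0 and 1 (double 0) holding the fill value. */
    mask_permute32_model(old, 0xfffc, idx, src, res);
    memcpy(out, res, sizeof res);
}
```

Calling shift_up_one_f64 on 1.1 ... 8.8 with a fill of -10.0 gives -10.0 in position 0 and the original elements 0 to 6 in positions 1 to 7, because each pair of 32-bit lanes moves together.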
Leo, thanks very much for your help!
That was careless of me. _MASK_10 is defined as 0x10 ("#define _MASK_10 0x10"), and the other _MASK_* macros follow the same pattern. By the way, my snippet works, but it performs badly.
I have tested a method similar to yours, but using _mm512_alignr_epi32. It works better than the original version! Here is my approach:
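__m512d _sl1_vec_pach;
/* Shift left by one float64: concatenate zmm:pach and shift right by 14 int32 lanes (_sl1_arg_val4 is unused here). */
#define F64_SL1(_sl1_arg_zmm, _sl1_f64_pach, _sl1_arg_val4)\
(\
_sl1_vec_pach=_mm512_set1_pd((_sl1_f64_pach)),\
(__m512d)_mm512_alignr_epi32((__m512i)(_sl1_arg_zmm),(__m512i)_sl1_vec_pach,14)\
)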
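__m512d _sr1_vec_pach;
/* Shift right by one float64: concatenate pach:zmm and shift right by 2 int32 lanes (_sr1_arg_val4 is unused here). */
#define F64_SR1(_sr1_arg_zmm, _sr1_f64_pach, _sr1_arg_val4)\
(\
_sr1_vec_pach=_mm512_set1_pd((_sr1_f64_pach)),\
(__m512d)_mm512_alignr_epi32((__m512i)_sr1_vec_pach,(__m512i)(_sr1_arg_zmm),2)\
)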
I am still not satisfied with this, because I think the "_mm512_set1_pd((_s*1_f64_pach))" broadcasts waste a lot of bandwidth. So I would still like to know whether there is a vector shift instruction that takes a vector register and a single scalar, where the scalar is patched into the space vacated by the shift.
Thank you very much!
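The lane arithmetic behind the two macros can be checked with a scalar model. The sketch below is plain C rather than the intrinsic itself; alignr32_model, f64_sl1_model and f64_sr1_model are hypothetical names, and the model assumes that the align operation concatenates its first operand above its second, shifts the 32-lane result right by the given number of int32 lanes, and keeps the low 16 lanes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed semantics: concatenate hi:lo into 32 lanes, shift right by
   `count` lanes, keep the low 16 lanes. */
void alignr32_model(const uint32_t hi[16], const uint32_t lo[16],
                    int count, uint32_t out[16])
{
    assert(count >= 0 && count <= 16);
    for (int i = 0; i < 16; ++i) {
        int j = i + count;
        out[i] = (j < 16) ? lo[j] : hi[j - 16];
    }
}

/* Apply the model to two vectors of eight doubles (two lanes per double). */
static void alignr_f64(const double hi[8], const double lo[8],
                       int count32, double out[8])
{
    uint32_t h[16], l[16], r[16];
    memcpy(h, hi, sizeof h);
    memcpy(l, lo, sizeof l);
    alignr32_model(h, l, count32, r);
    memcpy(out, r, sizeof r);
}

/* Shift left by one double: `pach` enters at element 0 (count 14). */
void f64_sl1_model(const double v[8], double pach, double out[8])
{
    double fill[8];
    for (int i = 0; i < 8; ++i)
        fill[i] = pach;
    alignr_f64(v, fill, 14, out);
}

/* Shift right by one double: `pach` enters at element 7 (count 2). */
void f64_sr1_model(const double v[8], double pach, double out[8])
{
    double fill[8];
    for (int i = 0; i < 8; ++i)
        fill[i] = pach;
    alignr_f64(fill, v, 2, out);
}
```

With v holding 0 to 7 and a patch value of 9, the left-shift model yields 9, 0, 1, ..., 6 and the right-shift model yields 1, 2, ..., 7, 9 (listed from element 0 upward).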
>>I need to shift vector register...
>>V: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
>>V: | 0 | 0 | 5 | 4 | 3 | 2 | 1 | 0 |
The above is not a shift, but it could be done with a mask.
A shift right by 2 would result in:
V: | 0 | 0 | 7 | 6 | 5 | 4 | 3 | 2 |
Try using the int32 instruction and shifting 2x the distance.
Jim Dempsey
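The difference between the two results can be written out as a scalar reference. This is a plain C sketch with hypothetical function names, where element 0 is the rightmost entry in the diagrams above:

```c
#include <assert.h>

#define LANES 8  /* eight float64 elements in a 512-bit register */

/* Masking: keep the low `keep` elements in place and zero the rest. */
void mask_keep_low(const double in[LANES], int keep, double out[LANES])
{
    assert(keep >= 0 && keep <= LANES);
    for (int i = 0; i < LANES; ++i)
        out[i] = (i < keep) ? in[i] : 0.0;
}

/* Shift right by n elements: element i+n moves down to position i,
   and zeros enter at the top. */
void shift_right_elems(const double in[LANES], int n, double out[LANES])
{
    assert(n >= 0 && n <= LANES);
    for (int i = 0; i < LANES; ++i)
        out[i] = (i + n < LANES) ? in[i + n] : 0.0;
}
```

For V holding 0 to 7, mask_keep_low with keep = 6 reproduces the first diagram (elements 6 and 7 zeroed, the rest unmoved), while shift_right_elems with n = 2 reproduces the second one.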
If "pach" is constant, then broadcast it once with "p8=_mm512_set1_pd(pach)" outside the loop and use p8 from then on; it will stay in a register.
By the way, small functions declared as __forceinline can replace macro substitutions in many cases.
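Both suggestions can be sketched in portable C. Here vec8d is a hypothetical stand-in for __m512d, the function names are made up for illustration, and standard "static inline" is used in place of the compiler-specific __forceinline:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 512-bit register of eight doubles. */
typedef struct { double e[8]; } vec8d;

/* Broadcast one scalar to all eight elements. */
static inline vec8d vec8d_broadcast(double x)
{
    vec8d r;
    for (int i = 0; i < 8; ++i)
        r.e[i] = x;
    return r;
}

/* Inline function instead of a macro: shift left by one element and
   take the new element 0 from `fill`. */
static inline vec8d vec8d_shift_left_one(vec8d v, vec8d fill)
{
    vec8d r;
    r.e[0] = fill.e[7];
    for (int i = 1; i < 8; ++i)
        r.e[i] = v.e[i - 1];
    return r;
}

/* The broadcast is done once, outside the loop, and reused. */
void shift_all(vec8d *data, size_t n, double pach)
{
    assert(data != NULL || n == 0);
    const vec8d p8 = vec8d_broadcast(pach);
    for (size_t i = 0; i < n; ++i)
        data[i] = vec8d_shift_left_one(data[i], p8);
}
```

Unlike the macro versions, the inline functions type-check their arguments and need no global temporaries; whether the compiler keeps p8 in a register depends on the surrounding code.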