MIC doesn't have float64 vector shift instructions

zhang_y_1 · ‎04-16-2013

Hi Everyone,

I need to shift a vector register that holds 64-bit doubles. The values in the register are as follows:

V: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

I want to perform an element-by-element logical left or right shift of the float64 vector V. For example, after shifting by 2 float64 elements, we get the following result:

V: | 0 | 0 | 5 | 4 | 3 | 2 | 1 | 0 |

But I can't find an instruction like that. Are there any instructions that do what I need?

(By the way, I saw instructions that perform an element-by-element logical shift of an int32 vector, for example _mm512_sllv_epi32.)

Thanks!

Andrey_Vladimirov · ‎04-16-2013

I think that's what the "swizzle" and "permute" instructions are for. One of them moves around 4 blocks of 4 floats inside a register, and the other moves floats within each block. In the Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual, they are described in Section 2.2. In the Intel C++ Compiler XE User and Reference Guide, the corresponding intrinsics are described in "Compiler Reference -> Intrinsics -> Intrinsics for the Intel MIC Architecture -> Shuffle Intrinsics" (and maybe this link will work: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/hh_goto.htm#GUID-E903C1C4-A361-4D12-9A3A-DD1047B4A2A3.htm )

zhang_y_1 · ‎04-16-2013

I have seen these instructions. But the permute instructions can only be used on int32 vectors. So I implemented the logical shift of a float64 vector with the "swizzle" instruction. It worked, but the problem is that I need 6 instructions in total to perform one logical left or right shift. For example:

#define F64_SL1(_sl1_arg_zmm, _sl1_arg_pach, _sl1_arg_val4)\
(\
_sl1_d1=_mm512_swizzle_pd((_sl1_arg_zmm), _MM_SWIZ_REG_CDAB),\
_sl1_d2=_mm512_mask_swizzle_pd(_sl1_d1, _MASK_44, (_sl1_arg_zmm), _MM_SWIZ_REG_BBBB),\
_sl1_d3=_mm512_set1_pd((_sl1_arg_val4)),\
_sl1_d4=_mm512_mask_swizzle_pd(_sl1_d2, _MASK_10, _sl1_d3, _MM_SWIZ_REG_NONE),\
_sl1_d5=_mm512_set1_pd((_sl1_arg_pach)),\
_mm512_mask_swizzle_pd(_sl1_d4, _MASK_01, _sl1_d5, _MM_SWIZ_REG_NONE) \
)

It is so expensive!

Leonardo_B_Intel · ‎04-17-2013

Hello Zhang,

I confess I was not able to fully follow the example above: what are the definitions of _MASK_*? Are all open/close parentheses matched in the definition of the C macro F64_SL1?

Although the intrinsics API for permute is targeted at i32, I _wonder_ if one can just permute pairs of i32 elements to get a 64-bit permutation. Use the masked version to fill the shifted portion with the new value(s) you want. My thoughts:

#define rotate_mask_d 0xfffc

__m512i permut_idx_d = _mm512_set_epi32(13,12,11,10,9,8,7,6,5,4,3,2,1,0,15,14);

__m512d v_fill_value = _mm512_set1_pd(-10.0);

v_target = (__m512d) _mm512_mask_permutevar_epi32((__m512i)v_fill_value, rotate_mask_d, permut_idx_d, (__m512i)v_target);

So v_target = 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8 should be rotated to -10, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7

And opposite masks should rotate to the other direction:

#define rotate_mask_d 0x3fff

__m512i permut_idx_d = _mm512_set_epi32(1,0,15,14,13,12,11,10,9,8,7,6,5,4,3,2);

This might be worth a try...

Leo.
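
The masked-permute idea above can be checked without MIC hardware using a plain C model. The sketch below is only a scalar emulation under assumed semantics (lane i of the result takes a[idx[i]] when mask bit i is set, otherwise src[i]); the names mask_permute32 and shift_up_one_f64 are made up for illustration and are not Intel intrinsics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANES32 = 16, LANES64 = 8 };

/* Scalar model (assumed semantics) of a masked 32-bit permute:
 * lane i takes a[idx[i]] if bit i of mask is set, else keeps src[i]. */
static void mask_permute32(uint32_t dst[LANES32], const uint32_t src[LANES32],
                           uint16_t mask, const uint32_t idx[LANES32],
                           const uint32_t a[LANES32])
{
    uint32_t tmp[LANES32];
    for (int i = 0; i < LANES32; ++i) {
        assert(idx[i] < LANES32);
        tmp[i] = ((mask >> i) & 1u) ? a[idx[i]] : src[i];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Hypothetical helper: move every double one lane up and put `fill`
 * into lane 0, by permuting pairs of 32-bit lanes. */
static void shift_up_one_f64(double out[LANES64], const double in[LANES64],
                             double fill)
{
    uint32_t a[LANES32], src[LANES32], idx[LANES32], r[LANES32];
    double fillv[LANES64];

    for (int i = 0; i < LANES64; ++i)
        fillv[i] = fill;                     /* broadcast of the fill value */
    memcpy(a, in, sizeof a);                 /* reinterpret doubles as int32 lanes */
    memcpy(src, fillv, sizeof src);

    /* Same index pattern as set_epi32(13,...,0,15,14): lane i reads lane i-2. */
    for (int i = 0; i < LANES32; ++i)
        idx[i] = (uint32_t)((i + 14) % LANES32);

    mask_permute32(r, src, 0xfffc, idx, a);  /* lanes 0 and 1 come from the fill */
    memcpy(out, r, sizeof r);
}
```

Calling shift_up_one_f64 on 1.1 ... 8.8 with a fill of -10.0 should reproduce the sequence given in the post. The opposite direction would use idx[i] = (i + 2) % 16 with mask 0x3fff.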

zhang_y_1 · ‎04-18-2013

Leo, thanks very much for your help!

That was careless of me. _MASK_10 is defined as 0x10 ("#define _MASK_10 0x10"), and the other _MASK_* macros are defined the same way. By the way, my snippet works, but it performs badly.

I have tested a method like yours, but using _mm512_alignr_epi32. It works better than the original version! Here is my approach:

__m512d _sl1_vec_pach;
#define F64_SL1(_sl1_arg_zmm, _sl1_f64_pach, _sl1_arg_val4)\
(\
_sl1_vec_pach=_mm512_set1_pd((_sl1_f64_pach)),\
(__m512d)_mm512_alignr_epi32((__m512i)_sl1_arg_zmm,(__m512i)_sl1_vec_pach,14)\
)

__m512d _sr1_vec_pach;
#define F64_SR1(_sr1_arg_zmm, _sr1_f64_pach, _sr1_arg_val4)\
(\
_sr1_vec_pach=_mm512_set1_pd((_sr1_f64_pach)),\
(__m512d)_mm512_alignr_epi32((__m512i)_sr1_vec_pach,(__m512i)_sr1_arg_zmm,2)\
)

I am still not satisfied, because I think the "_mm512_set1_pd((_s*1_f64_pach))" instructions waste a lot of bandwidth. So I would still like to know whether there is a vector shift instruction that works on a vector register and one scalar, where the scalar is patched into the space vacated by the shift.

Thank you very much!
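
For readers following along without a coprocessor, here is a rough scalar model of what the two alignr-based macros compute. It assumes that the align-right operation concatenates its first operand (high half) with its second (low half), shifts the 32-lane result right by the given count, and keeps the low 16 lanes. The function names alignr32, f64_sl1 and f64_sr1 are hypothetical stand-ins, not the intrinsics themselves.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model (assumed semantics) of a 16-lane align-right:
 * treat hi:lo as 32 lanes with lo in lanes 0..15, drop the lowest
 * `count` lanes, and keep the next 16. */
static void alignr32(uint32_t dst[16], const uint32_t hi[16],
                     const uint32_t lo[16], int count)
{
    uint32_t cat[32];
    assert(count >= 0 && count <= 16);
    memcpy(cat, lo, 16 * sizeof(uint32_t));
    memcpy(cat + 16, hi, 16 * sizeof(uint32_t));
    memcpy(dst, cat + count, 16 * sizeof(uint32_t));
}

/* Shift doubles one lane up; lane 0 receives `patch` (models F64_SL1). */
static void f64_sl1(double out[8], const double v[8], double patch)
{
    uint32_t hi[16], lo[16], r[16];
    double fill[8];
    for (int i = 0; i < 8; ++i)
        fill[i] = patch;
    memcpy(hi, v, sizeof hi);
    memcpy(lo, fill, sizeof lo);
    alignr32(r, hi, lo, 14);   /* 2 lanes of patch, then 14 lanes of v */
    memcpy(out, r, sizeof r);
}

/* Shift doubles one lane down; lane 7 receives `patch` (models F64_SR1). */
static void f64_sr1(double out[8], const double v[8], double patch)
{
    uint32_t hi[16], lo[16], r[16];
    double fill[8];
    for (int i = 0; i < 8; ++i)
        fill[i] = patch;
    memcpy(hi, fill, sizeof hi);
    memcpy(lo, v, sizeof lo);
    alignr32(r, hi, lo, 2);    /* 14 lanes of v, then 2 lanes of patch */
    memcpy(out, r, sizeof r);
}
```

The counts 14 and 2 are in int32 lanes, so each call moves the data by exactly one double.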

jimdempseyatthecove · ‎04-18-2013

>>I need to shift vector register...
>>V: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
>>V: | 0 | 0 | 5 | 4 | 3 | 2 | 1 | 0 |

The above is not a shift, but it could be done with a mask.

Shift 2 right would result in:
V: | 0 | 0 | 7 | 6 | 5 | 4 | 3 | 2 |

Try using the int32 instruction and shifting 2x the distance.

Jim Dempsey
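
The difference between the two operations can be spelled out in plain C. This is only an illustrative scalar sketch; mask_zero_f64 and shift_right_f64 are invented names, and lane 0 is the rightmost element in the diagrams above.

```c
#include <assert.h>

enum { F64_LANES = 8 };

/* Masking: keep the lanes whose mask bit is set, zero the others.
 * No element changes position. */
static void mask_zero_f64(double out[F64_LANES], const double in[F64_LANES],
                          unsigned mask)
{
    for (int i = 0; i < F64_LANES; ++i)
        out[i] = ((mask >> i) & 1u) ? in[i] : 0.0;
}

/* Shifting right by n lanes: every element moves n lanes toward lane 0
 * and the vacated top lanes are filled with zero. */
static void shift_right_f64(double out[F64_LANES], const double in[F64_LANES],
                            int n)
{
    assert(n >= 0 && n <= F64_LANES);
    for (int i = 0; i < F64_LANES; ++i)
        out[i] = (i + n < F64_LANES) ? in[i + n] : 0.0;
}
```

With lanes 0..7 holding 0..7, mask_zero_f64 with mask 0x3f gives the first picture (| 0 | 0 | 5 | 4 | 3 | 2 | 1 | 0 |) and shift_right_f64 with n = 2 gives the second (| 0 | 0 | 7 | 6 | 5 | 4 | 3 | 2 |). On int32 lanes the same shift would be a distance of 2n = 4.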

Evgueni_P_Intel · ‎04-18-2013

If "pach" is constant, then load it once with "p8=_mm512_set1_pd(pach)" and use p8 -- it will stay in a register.

By the way, small functions declared as __forceinline can replace macro substitutions in many cases.
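
Both suggestions can be sketched together in portable C. In this sketch, vec8 is a hypothetical stand-in for __m512d, static inline stands in for __forceinline, and vec8_set1, f64_sl1 and shift_up_n are illustrative names rather than real intrinsics. The point is only the shape: the broadcast happens once, outside the loop, and the shift is a small inline function instead of a macro.

```c
#include <assert.h>

/* Hypothetical stand-in for an 8 x double vector register. */
typedef struct { double lane[8]; } vec8;

/* Broadcast one scalar to all lanes (models a set1-style intrinsic). */
static inline vec8 vec8_set1(double x)
{
    vec8 r;
    for (int i = 0; i < 8; ++i)
        r.lane[i] = x;
    return r;
}

/* Inline-function replacement for a shift macro: move every lane one
 * position up and fill lane 0 from an already broadcast patch vector. */
static inline vec8 f64_sl1(vec8 v, vec8 patch)
{
    vec8 r;
    r.lane[0] = patch.lane[7];   /* all patch lanes are equal after a broadcast */
    for (int i = 1; i < 8; ++i)
        r.lane[i] = v.lane[i - 1];
    return r;
}

/* Shift up by n lanes, broadcasting the patch value only once. */
static vec8 shift_up_n(vec8 v, double patch, int n)
{
    const vec8 p8 = vec8_set1(patch);   /* hoisted out of the loop */
    assert(n >= 0);
    for (int k = 0; k < n; ++k)
        v = f64_sl1(v, p8);
    return v;
}
```

Unlike the macro version, the function needs no file-scope temporary, and the type of each argument is checked by the compiler.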