perfwise · ‎08-27-2013

While testing the VGATHER* instructions, a couple of questions came up. The indexes have to go into an {X|Y}MM register for VSIB addressing, and I imagine it's advantageous to move them there straight from GPRs. The most efficient way would be to put each GPR value directly into the proper slot of an XMM or YMM register. VPINSRD and VPINSRQ do this, but they can't write into the upper 128 bits of a YMM. Was there some rationale as to why this wasn't considered important? Sure, you can do it in 5 instructions with 2 VMOVQ, 2 VPINSRQ and then a VINSERTI128, but it seems extending the VPINSR* instructions was simply overlooked. Any chance of getting them extended some day?
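
For reference, the 5-instruction workaround can be sketched with intrinsics roughly as follows. This is only an illustration: the function name `pack_indices_4x64` is made up for this example, the instruction named in each comment is what the intrinsic usually maps to (the compiler is free to pick something else), and a scalar fallback is included so the code still builds where AVX2 is not enabled.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__AVX2__) && (defined(__x86_64__) || defined(_M_X64))
#include <immintrin.h>
#define HAVE_AVX2_X64 1
#endif

/* Hypothetical helper: pack four 64-bit index values (as they would sit in
   GPRs) into out[0..3], laid out like the four qword lanes of a YMM. */
void pack_indices_4x64(int64_t i0, int64_t i1, int64_t i2, int64_t i3,
                       int64_t out[4])
{
    assert(out != 0);
#ifdef HAVE_AVX2_X64
    __m128i lo = _mm_cvtsi64_si128(i0);      /* typically VMOVQ   xmm, r64    */
    lo = _mm_insert_epi64(lo, i1, 1);        /* typically VPINSRQ xmm, r64, 1 */
    __m128i hi = _mm_cvtsi64_si128(i2);      /* typically VMOVQ   xmm, r64    */
    hi = _mm_insert_epi64(hi, i3, 1);        /* typically VPINSRQ xmm, r64, 1 */
    /* Join the two 128-bit halves; this is the step a 256-bit VPINSRQ
       would make unnecessary. Typically VINSERTI128. */
    __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
    _mm256_storeu_si256((__m256i *)out, v);
#else
    /* Scalar fallback (assumption: AVX2 not enabled at compile time). */
    out[0] = i0; out[1] = i1; out[2] = i2; out[3] = i3;
#endif
}
```

The upper two lanes have to be assembled in a second XMM register and merged afterwards, which is where the extra instructions come from.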

Lastly, on the VGATHER instructions: the mask is zeroed out, correct? The pseudo-code in the documentation shows the mask being zeroed irrespective of whether the upper bit of each mask element is set. Is this the intended behavior? I'll likely determine this myself, but wanted to verify on the forum. Thanks.
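
To make the question concrete, here is a scalar model of the behavior as the documented pseudo-code reads to me, for an 8-lane dword gather. It is a sketch, not a statement of what the hardware does: the name `gather_model_8x32` is made up, and the partial-completion case (a fault in the middle of the gather) is deliberately not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of an 8-lane dword gather with a mask operand.
   dest and mask are both read and written, as in the instruction. */
void gather_model_8x32(const int32_t *base, const int32_t idx[8],
                       int32_t dest[8], uint32_t mask[8])
{
    assert(base && idx && dest && mask);
    for (int lane = 0; lane < 8; ++lane) {
        /* Only the most significant bit of each mask element is consulted. */
        if (mask[lane] & 0x80000000u) {
            dest[lane] = base[idx[lane]];
        }
        /* The mask element is cleared whether or not the load happened,
           so the whole mask ends up zero on normal completion. */
        mask[lane] = 0;
    }
}
```

One practical consequence of this reading is that the mask register has to be regenerated before every gather; it cannot be set up once outside the loop.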

Perfwise

Elmar · ‎08-29-2013

Yep, it's a big shame that VPINSR* and VPEXTR* only work on the lower 128 bits; it clutters my code in many places. For VINSERTPS the reason is obvious, since the immediate byte is fully used, but for the others...?

But I still got hope: since there is no YMM version documented, Intel can still fix this for AVX-512. Let's start a petition..?

CU,

Elmar

capens__nicolas · ‎08-29-2013

perfwise wrote:
In the process of testing VGATHER* instructions, a couple questions arose. One needs to put the indexes into a {X|Y}MM register for the VSIB addressing. To do so I imagine it's advantageous to put those indexes from GPRs to XMM. To do this most efficiently I'd imagine you would put directly the GPR value into the proper location of a XMM or YMM.

You shouldn't have to move between GPR and YMM registers much at all. AVX2 is intended to parallelize loops in an SPMD fashion: each YMM register holds just one logical scalar from your loop body, but for multiple iterations. In fact gather should help eliminate cases where you previously still needed GPRs. It's the parallel version of an indirect load operation.
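
As a rough sketch of what that looks like in practice, take the loop `out[i] = table[idx[i]]`. The index vector is loaded straight from memory into a YMM register, so no GPR-to-YMM inserts are needed. The function name `indirect_load` is made up for this example, and the scalar loop doubles as the remainder handler and as the fallback when AVX2 is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out[i] = table[idx[i]] for i in [0, n). */
void indirect_load(const int32_t *table, const int32_t *idx,
                   int32_t *out, size_t n)
{
    assert(table && idx && out);
    size_t i = 0;
#if defined(__AVX2__)
    /* Eight loop iterations per step: one YMM holds idx[i..i+7]. */
    for (; i + 8 <= n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i *)(idx + i));
        /* Scale 4: indexes count 32-bit elements, not bytes. */
        __m256i v = _mm256_i32gather_epi32((const int *)table, vidx, 4);
        _mm256_storeu_si256((__m256i *)(out + i), v);
    }
#endif
    /* Remainder, or the whole loop when AVX2 is not available. */
    for (; i < n; ++i) {
        out[i] = table[idx[i]];
    }
}
```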

perfwise · ‎08-30-2013

I've been coding these up, and in some cases the indexes held in the X/YMM register can be reused; in others they can't. For a TRANSPOSE, yes: you can reuse the indexes and just advance the base GPR via LEA or some other method. But if the accesses are not regular and matrix-like, say in a molecular dynamics package where each particle index must first be looked up and is then used to gather many particles (the loop iterates over the particle count, not the index), then you most definitely have to build a new index vector in every unrolled (vectorized) loop iteration. That is true of GROMACS and NAMD, both of which I have experience with. So the lack of a 256-bit PINSR* may not hinder you in the cases I pointed out, which are regular strided accesses at a multiple of an offset (matrix-like), but there are many others where you will want PINSR* extended to 256 bits. Unfortunate that this wasn't done.

More generally, it would be nice if you could take 2 GPRs (dword or qword sized) and move them into an {X|Y}MM register with something of the sort:

VPINSR{D|Q}X {x|y}mm, gpr1, gpr2, imm8

where imm8 bits 3:0 select the slot of {x|y}mm that receives gpr1 and bits 7:4 select the slot that receives gpr2.
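
A scalar model of one possible reading of this proposal, for the dword form on a 256-bit register viewed as 8 lanes. The instruction does not exist; the name `vpinsrdx_model` and the rule that gpr2 wins when both slots coincide are assumptions made only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the proposed dword insert on a YMM seen as 8 lanes. */
void vpinsrdx_model(uint32_t lanes[8], uint32_t gpr1, uint32_t gpr2,
                    uint8_t imm8)
{
    unsigned slot1 = imm8 & 0x0Fu;         /* low nibble: slot for gpr1  */
    unsigned slot2 = (imm8 >> 4) & 0x0Fu;  /* high nibble: slot for gpr2 */
    /* Only 8 dword slots exist in 256 bits; larger values are rejected here. */
    assert(slot1 < 8 && slot2 < 8);
    lanes[slot1] = gpr1;
    lanes[slot2] = gpr2;  /* assumption: gpr2 wins if slot1 == slot2 */
}
```

With this encoding, four such instructions would fill all eight dword lanes of a YMM from GPRs.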

Food for thought..

perfwise

Why weren't PINSR* instructions extended to 256-bits in AVX2