Costs of scatter store

Johannes_P_ · ‎09-04-2017

I'm playing around with the new AVX512 instruction sets and I try to understand how they work and how one can use them.

What I try is to interleave specific data, selected by a mask. My little benchmark loads x*32 byte of aligned data from memory into two vector registers and compresses them using a dynamic mask (fig. 1). The resulting vector registers are scattered into the memory, so that the two vector registers are interleaved (fig. 2).

Figure 1: Compressing the two data vector registers using the same dynamically created mask.

Figure 2: Scatter store to interleave the compressed data.

My code looks like the following:

void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
   /* Initialize a vector register containing zeroes to get the store mask */
   __m512i zeroVec     = _mm512_setzero_epi32();
   /* Load data */
   __m512i dataVec_1   = _mm512_conflict_epi32( data );
   __m512i dataVec_2   = _mm512_conflict_epi32( data + 16 );
   /* Compress the data */
   __m512i compVec_1   = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
   __m512i compVec_2   = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );

   /* Get the store mask by compare the compressed register with the zero-register (4 means !=) */
   __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, 4 );

   /* Interleave the selected data */
   _mm512_mask_i32scatter_epi32(
      result,
      maskStore,
      vindex,
      compVec_1,
      1
   );
   _mm512_mask_i32scatter_epi32(
      result + 1,
      maskStore,
      vindex,
      compVec_2,
      1
   );
}

I compiled everything with

-O3 -march=knl -lmemkind -mavx512f -mavx512pf

I call the method for 100'000'000 elements. To actually get an overview of the behaviour of the scatter store I repeated this measurement with different values for maskCompress. I expected some kind of dependence between the time needed for execution and the number of set bits within the maskCompress. But I observed, that the tests needed roughly the same time for execution.

I did a little bit of research and came up to this: Instruction latency of avx512. Following the given link, the latency of the used instructions are constant. But to be honest, I am a little bit confused about this behaviour.

I know that this is not the usual way, but I have 3 questions, related to this topic and I am hopefull that one can help me out.

Why should a masked store with only one set bit needs the same time as a masked store where all bits are set?
Does anyone has some experience or is there a good documentation to understand the behaviour of the AVX512 scatter store?
Is there a more easy or more performant way to interleave two vector registers?

Thanks for your help!

Sincerely

andysem · ‎09-05-2017

Is there a more easy or more performant way to interleave two vector registers?

The easiest and fastest way I know is to use one of the unpack instructions (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_unpack). The result will be in a vector register, so if you also want to preserve the data in memory that is not supposed to be overwritten by the "unused" elements in the vector you will have to perform a blend or masked move first. Or better yet, you can perform a masked unpack. Then, when you have the final resulting vector of data, write it to memory with a regular store.

Anil_M_ · ‎09-30-2017

Hi Johannes,

It is unlikely to be the case but still, did you by any chance put garbage value in the vindex, or leave it uninitialized for the places where the mask is zero ? If that is the case even if the mask bits are not set, illegal addresses will trigger CPU assists where AFIK op codes will be fetched from the MSROM and cause long delays thus making any latency difference caused by the masks to be negligible.

-Anil