Solved: What is the fastest intrinsic to load constant-stride data into __m128 and __m256 vectors?

mikeitexpert · ‎06-20-2021

I have a rollup / cumulative summation loop which I am vectorizing.

// T can be either float or double
// SFX is a macro that suffixes the function name with _d or _f
// for float/double types of images

VL_EXPORT void
VL_XCAT(vl_imconvcoltri_integrate_bwd_, SFX)
(T* buffer, vl_index y, T* imagei, vl_size imageStride, vl_size imageHeight)
{
  for (; y >=  0 ; --y) {
      imagei -= imageStride ;
      buffer[y] = buffer[y + 1] + *imagei ;
    }
}

imageStride doesn't change throughout the loop. The Intel Advisor Refinement Report for the serial code shows the different stride values under the Stride column:

[Screenshot: Refinement Report, serial code]

After vectorization, the Refinement Report looks like this:

[Screenshot: Refinement Report, vectorized code]

For one thing, I used gather intrinsics (_mm_i32gather_p[ds] and _mm256_i32gather_p[ds]) to load the image data into a vector data type.

Given the stride values (imageStride) in the Refinement Report for the serial code, I am wondering whether there is a faster or better replacement for the gather instructions on the SSE, AVX, and AVX-512 architectures.

Please let me know your valuable comments.

Regards
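
One way to keep the strided loads out of the dependency chain without any gather instruction is to load four strided rows with `_mm_setr_ps` and resolve the running sum inside the register. The sketch below is an illustration under assumptions, not the VLFeat implementation: `integrate_bwd_sse2` is a hypothetical name, T is fixed to float, and the stride is taken as a signed `ptrdiff_t`. Because the in-register scan reassociates the additions, results may differ from the scalar loop in the last bits for general floating-point data.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_srli_si128 and the cast intrinsics */
#endif

/* Scalar reference: the same recurrence as the loop in the question (T = float). */
static void integrate_bwd_scalar(float *buffer, ptrdiff_t y,
                                 const float *imagei, ptrdiff_t stride)
{
    for (; y >= 0; --y) {
        imagei -= stride;
        buffer[y] = buffer[y + 1] + *imagei;
    }
}

/* Hypothetical SSE2 variant: four rows per iteration, scalar loop for the rest. */
static void integrate_bwd_sse2(float *buffer, ptrdiff_t y,
                               const float *imagei, ptrdiff_t stride)
{
    assert(buffer != NULL && imagei != NULL);
#if defined(__SSE2__)
    for (; y >= 3; y -= 4) {
        /* Lane 3 holds the row that feeds buffer[y], lane 0 the row for buffer[y-3]. */
        __m128 v = _mm_setr_ps(imagei[-4 * stride], imagei[-3 * stride],
                               imagei[-2 * stride], imagei[-1 * stride]);
        /* In-register suffix sum: afterwards lane i holds the sum of lanes i..3. */
        v = _mm_add_ps(v, _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(v), 4)));
        v = _mm_add_ps(v, _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(v), 8)));
        /* Add the running total carried in from buffer[y + 1]. */
        v = _mm_add_ps(v, _mm_set1_ps(buffer[y + 1]));
        _mm_storeu_ps(&buffer[y - 3], v);
        imagei -= 4 * stride;
    }
#endif
    /* Remaining rows (fewer than four, or all of them without SSE2). */
    integrate_bwd_scalar(buffer, y, imagei, stride);
}
```

Both functions take the same arguments as the original loop, so `integrate_bwd_sse2(buffer, y, imagei, imageStride)` can be compared directly against the scalar version on the same data.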

ShivaniK_Intel · ‎06-21-2021

Hi,

Thanks for reaching out to us.

We are working on it and will get back to you soon.

Thanks & Regards

Shivani

ShivaniK_Intel · ‎06-24-2021

Hi,

Could you please provide the versions of Intel Advisor and the compiler you have been using, along with any other relevant compiler details?

Could you also please provide the compilation steps and a sample reproducer so that we can investigate your issue further?

Thanks & Regards

Shivani

mikeitexpert · ‎06-25-2021

Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623

jimdempseyatthecove · ‎06-25-2021

Gather instructions are only available with AVX2 and AVX512 (and KNC).

For gather to work efficiently, do not update the index vector. Instead, advance the base address to the first element of the next group of strided loads (this conserves a vector register).

Jim Dempsey
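
The advice above can be sketched with the AVX2 gather intrinsic `_mm256_i32gather_ps`: the index vector is built once and only the base pointer moves. This is a minimal sketch under assumptions: the function names are hypothetical, and the `target("avx2")` attribute and `__builtin_cpu_supports` are GCC/Clang features used here so the file builds without an AVX2 switch (with icl one would compile with an AVX2-capable switch instead). It packs strided elements into a dense array; AVX2 has no scatter, so a strided store would still be done lane by lane.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1
#endif

/* Portable fallback: dst[i] = src[i * stride]. */
static void pack_strided_scalar(float *dst, const float *src, size_t n, int stride)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i * (size_t)stride];
}

#ifdef HAVE_AVX2_PATH
__attribute__((target("avx2")))
static void pack_strided_avx2(float *dst, const float *src, size_t n, int stride)
{
    /* Element offsets 0, s, 2s, ..., 7s: computed once and never updated. */
    const __m256i idx = _mm256_mullo_epi32(_mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
                                           _mm256_set1_epi32(stride));
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Only the base address changes between gathers. */
        const float *base = src + i * (size_t)stride;
        _mm256_storeu_ps(dst + i, _mm256_i32gather_ps(base, idx, 4));
    }
    for (; i < n; ++i)                     /* scalar remainder */
        dst[i] = src[i * (size_t)stride];
}
#endif

/* Dispatch at run time so the AVX2 path only runs on CPUs that report AVX2. */
static void pack_strided(float *dst, const float *src, size_t n, int stride)
{
    assert(dst != NULL && src != NULL && stride > 0);
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        pack_strided_avx2(dst, src, n, stride);
        return;
    }
#endif
    pack_strided_scalar(dst, src, n, stride);
}
```

Whether the gather beats scalar loads depends on the microarchitecture, so it is worth timing both paths on the target machine.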

mikeitexpert · ‎06-25-2021

Thanks Jim

I am curious how icc handles that, so I prepared a simple array-add example, shown below:


float add(float c[], float a[], float b[], int numel){
    #pragma ivdep
    #pragma vector always
    for(int i = 0; i < numel;  i+=373 ){
        c[i] = a[i] + b[i];
    }
    float d = a[0] + b[0];
    return d;
}

I forced the compiler to vectorize it ...

C:\tmp\gathertest>make
icl -arch:avx -Fo:test_gather.obj -Fa:test_gather.asm -c test_gather.cpp
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

test_gather.cpp

C:\tmp\gathertest>

Then I checked the generated asm file ...

; mark_description "Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623";
; mark_description "";
; mark_description "/arch:avx /Fo:test_gather.obj /Fa:test_gather.asm /c";
	OPTION DOTNAME
_TEXT	SEGMENT      'CODE'
;	COMDAT ?add@@YAMQEAM00H@Z
TXTST0:
; -- Begin  ?add@@YAMQEAM00H@Z
;?add@@YAMQEAM00H@Z	ENDS
_TEXT	ENDS
_TEXT	SEGMENT      'CODE'
;	COMDAT ?add@@YAMQEAM00H@Z
; mark_begin;
       ALIGN     16
	PUBLIC ?add@@YAMQEAM00H@Z
; --- add(float *, float *, float *, int)
?add@@YAMQEAM00H@Z	PROC 
; parameter 1: rcx
; parameter 2: rdx
; parameter 3: r8
; parameter 4: r9d
.B1.1::                         ; Preds .B1.0
                                ; Execution count [1.00e+00]
L1::
                                                           ;2.54
        mov       r10, r8                                       ;2.54
        mov       r8, rdx                                       ;2.54
        test      r9d, r9d                                      ;5.24
        jle       .B1.10        ; Prob 50%                      ;5.24
                                ; LOE rcx rbx rbp rsi rdi r8 r10 r12 r14 r15 r9d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.2::                         ; Preds .B1.1
                                ; Execution count [9.00e-01]
        add       r9d, 372                                      ;2.54
        mov       eax, 1473876177                               ;2.54
        imul      r9d                                           ;2.54
        sar       r9d, 31                                       ;2.54
        sar       edx, 7                                        ;2.54
        sub       edx, r9d                                      ;2.54
        movsxd    rax, edx                                      ;5.5
        cmp       rax, 16                                       ;5.5
        jl        .B1.11        ; Prob 10%                      ;5.5
                                ; LOE rax rcx rbx rbp rsi rdi r8 r10 r12 r14 r15 edx xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.3::                         ; Preds .B1.2
                                ; Execution count [9.00e-01]
        and       edx, -16                                      ;5.5
        xor       r11d, r11d                                    ;5.5
        movsxd    rdx, edx                                      ;5.5
        xor       r9d, r9d                                      ;5.5
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.4::                         ; Preds .B1.4 .B1.3
                                ; Execution count [5.00e+00]
        vmovss    xmm4, DWORD PTR [1492+r9+r8]                  ;6.16
        add       r11, 16                                       ;5.5
        vmovss    xmm5, DWORD PTR [r9+r8]                       ;6.16
        vmovss    xmm3, DWORD PTR [7460+r9+r8]                  ;6.16
        vmovss    xmm2, DWORD PTR [5968+r9+r8]                  ;6.16
        vinsertps xmm0, xmm4, DWORD PTR [4476+r9+r8], 16        ;6.16
        vinsertps xmm1, xmm5, DWORD PTR [2984+r9+r8], 16        ;6.16
        vinsertps xmm5, xmm3, DWORD PTR [10444+r9+r8], 16       ;6.16
        vinsertps xmm4, xmm2, DWORD PTR [8952+r9+r8], 16        ;6.16
        vunpcklps xmm1, xmm1, xmm0                              ;6.16
        vunpcklps xmm0, xmm4, xmm5                              ;6.16
        vmovss    xmm4, DWORD PTR [r9+r10]                      ;6.23
        vmovss    xmm2, DWORD PTR [1492+r9+r10]                 ;6.23
        vinsertps xmm5, xmm4, DWORD PTR [2984+r9+r10], 16       ;6.23
        vmovss    xmm4, DWORD PTR [5968+r9+r10]                 ;6.23
        vinsertf128 ymm3, ymm1, xmm0, 1                         ;6.16
        vinsertps xmm1, xmm2, DWORD PTR [4476+r9+r10], 16       ;6.23
        vmovss    xmm0, DWORD PTR [7460+r9+r10]                 ;6.23
        vunpcklps xmm2, xmm5, xmm1                              ;6.23
        vinsertps xmm1, xmm0, DWORD PTR [10444+r9+r10], 16      ;6.23
        vinsertps xmm5, xmm4, DWORD PTR [8952+r9+r10], 16       ;6.23
        vunpcklps xmm0, xmm5, xmm1                              ;6.23
        vinsertf128 ymm1, ymm2, xmm0, 1                         ;6.23
        vaddps    ymm3, ymm3, ymm1                              ;6.23
        vextractf128 xmm1, ymm3, 1                              ;6.9
        vmovss    DWORD PTR [r9+rcx], xmm3                      ;6.9
        vmovss    DWORD PTR [5968+r9+rcx], xmm1                 ;6.9
        vextractps DWORD PTR [1492+r9+rcx], xmm3, 1             ;6.9
        vextractps DWORD PTR [2984+r9+rcx], xmm3, 2             ;6.9
        vextractps DWORD PTR [4476+r9+rcx], xmm3, 3             ;6.9
        vextractps DWORD PTR [7460+r9+rcx], xmm1, 1             ;6.9
        vextractps DWORD PTR [8952+r9+rcx], xmm1, 2             ;6.9
        vextractps DWORD PTR [10444+r9+rcx], xmm1, 3            ;6.9
        vmovss    xmm0, DWORD PTR [13428+r9+r8]                 ;6.16
        vinsertps xmm5, xmm0, DWORD PTR [16412+r9+r8], 16       ;6.16
        vmovss    xmm2, DWORD PTR [11936+r9+r8]                 ;6.16
        vmovss    xmm1, DWORD PTR [19396+r9+r8]                 ;6.16
        vmovss    xmm0, DWORD PTR [17904+r9+r8]                 ;6.16
        vinsertps xmm4, xmm2, DWORD PTR [14920+r9+r8], 16       ;6.16
        vinsertps xmm3, xmm1, DWORD PTR [22380+r9+r8], 16       ;6.16
        vinsertps xmm2, xmm0, DWORD PTR [20888+r9+r8], 16       ;6.16
        vunpcklps xmm4, xmm4, xmm5                              ;6.16
        vunpcklps xmm5, xmm2, xmm3                              ;6.16
        vmovss    xmm2, DWORD PTR [11936+r9+r10]                ;6.23
        vmovss    xmm0, DWORD PTR [13428+r9+r10]                ;6.23
        vinsertps xmm3, xmm2, DWORD PTR [14920+r9+r10], 16      ;6.23
        vmovss    xmm2, DWORD PTR [17904+r9+r10]                ;6.23
        vinsertf128 ymm1, ymm4, xmm5, 1                         ;6.16
        vinsertps xmm4, xmm0, DWORD PTR [16412+r9+r10], 16      ;6.23
        vmovss    xmm5, DWORD PTR [19396+r9+r10]                ;6.23
        vunpcklps xmm0, xmm3, xmm4                              ;6.23
        vinsertps xmm4, xmm5, DWORD PTR [22380+r9+r10], 16      ;6.23
        vinsertps xmm3, xmm2, DWORD PTR [20888+r9+r10], 16      ;6.23
        vunpcklps xmm5, xmm3, xmm4                              ;6.23
        vinsertf128 ymm0, ymm0, xmm5, 1                         ;6.23
        vaddps    ymm1, ymm1, ymm0                              ;6.23
        vextractf128 xmm0, ymm1, 1                              ;6.9
        vmovss    DWORD PTR [11936+r9+rcx], xmm1                ;6.9
        vmovss    DWORD PTR [17904+r9+rcx], xmm0                ;6.9
        vextractps DWORD PTR [13428+r9+rcx], xmm1, 1            ;6.9
        vextractps DWORD PTR [14920+r9+rcx], xmm1, 2            ;6.9
        vextractps DWORD PTR [16412+r9+rcx], xmm1, 3            ;6.9
        vextractps DWORD PTR [19396+r9+rcx], xmm0, 1            ;6.9
        vextractps DWORD PTR [20888+r9+rcx], xmm0, 2            ;6.9
        vextractps DWORD PTR [22380+r9+rcx], xmm0, 3            ;6.9
        add       r9, 23872                                     ;5.5
        cmp       r11, rdx                                      ;5.5
        jb        .B1.4         ; Prob 82%                      ;5.5
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.6::                         ; Preds .B1.4 .B1.11
                                ; Execution count [1.00e+00]
        imul      r9, rdx, 1492                                 ;5.5
        cmp       rdx, rax                                      ;5.5
        jae       .B1.10        ; Prob 9%                       ;5.5
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.8::                         ; Preds .B1.6 .B1.8
                                ; Execution count [5.00e+00]
        vmovss    xmm0, DWORD PTR [r9+r8]                       ;6.16
        inc       rdx                                           ;5.5
        vaddss    xmm1, xmm0, DWORD PTR [r9+r10]                ;6.23
        vmovss    DWORD PTR [r9+rcx], xmm1                      ;6.9
        add       r9, 1492                                      ;5.5
        cmp       rdx, rax                                      ;5.5
        jb        .B1.8         ; Prob 82%                      ;5.5
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.10::                        ; Preds .B1.8 .B1.1 .B1.6
                                ; Execution count [1.00e+00]
        vmovss    xmm0, DWORD PTR [r8]                          ;8.15
        vaddss    xmm0, xmm0, DWORD PTR [r10]                   ;8.22
        vzeroupper                                              ;9.12
        ret                                                     ;9.12
                                ; LOE
.B1.11::                        ; Preds .B1.2
                                ; Execution count [9.00e-02]: Infreq
        xor       edx, edx                                      ;5.5
        jmp       .B1.6         ; Prob 100%                     ;5.5
        ALIGN     16
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r10 r12 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.12::
; mark_end;
?add@@YAMQEAM00H@Z ENDP
;?add@@YAMQEAM00H@Z	ENDS
_TEXT	ENDS
_DATA	SEGMENT      'DATA'
_DATA	ENDS
; -- End  ?add@@YAMQEAM00H@Z
_DATA	SEGMENT      'DATA'
_DATA	ENDS
EXTRN	__ImageBase:PROC
EXTRN	_fltused:BYTE
	INCLUDELIB <libmmt>
	INCLUDELIB <LIBCMT>
	INCLUDELIB <libirc>
	INCLUDELIB <svml_dispmt>
	INCLUDELIB <OLDNAMES>
	INCLUDELIB <libdecimal>
	END

I see only the following (though I am not an assembly expert):

vextractps
vinsertps
vunpcklps

Do you think reading the compiler's assembly like this is a good way to find better intrinsics? Or would I be better off learning the intrinsics from a reference? (I guess I am just trying to avoid asking too many stupid questions on the Intel forum.)

Please let me know. Thank you, I appreciate it.
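
At the intrinsics level, the closest match to the insert/unpack pattern in the listing is probably building the vector from four scalars with `_mm_setr_ps`, which compilers typically lower to scalar loads plus insert or unpack instructions. The sketch below applies that to the add() test above; `add_strided` is a hypothetical name and the stride is passed as a parameter instead of the hard-coded 373. It needs only SSE, so it applies to `__m128` code on any x86-64 target.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE: _mm_setr_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* c[i] = a[i] + b[i] for i = 0, stride, 2*stride, ... while i < numel. */
static void add_strided(float *c, const float *a, const float *b,
                        ptrdiff_t numel, ptrdiff_t stride)
{
    assert(stride > 0);
    ptrdiff_t i = 0;
#if defined(__SSE__)
    /* Four strided elements per iteration; the last one is at i + 3*stride. */
    for (; i + 3 * stride < numel; i += 4 * stride) {
        __m128 va = _mm_setr_ps(a[i], a[i + stride], a[i + 2 * stride], a[i + 3 * stride]);
        __m128 vb = _mm_setr_ps(b[i], b[i + stride], b[i + 2 * stride], b[i + 3 * stride]);
        float lanes[4];
        _mm_storeu_ps(lanes, _mm_add_ps(va, vb));
        /* No scatter before AVX-512: write the lanes back one by one. */
        c[i]              = lanes[0];
        c[i + stride]     = lanes[1];
        c[i + 2 * stride] = lanes[2];
        c[i + 3 * stride] = lanes[3];
    }
#endif
    for (; i < numel; i += stride)         /* scalar remainder */
        c[i] = a[i] + b[i];
}
```

With a single add per element, the loads and stores dominate, so this form is mainly useful when more arithmetic is done on the vector between the load and the store.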

jimdempseyatthecove · ‎06-26-2021

The AVX instruction set does not have scatter/gather instructions. AVX2 adds gather instructions, while scatter instructions only arrive with AVX-512. The .B1.4 loop is the "vector always" implementation of the loop, which was unrolled 2x (count the vaddps instructions). IOW, "vector always" instructs the compiler to perform the computation portion with vector instructions if possible. In this case the computation is a single ADD.

The loop .B1.8 is the implementation of the for loop as scalar. This was inserted here to take care of potential non-vector-wide remainder. The scalar loop was not unrolled 2x as it likely would not be productive to do so.

The following is the AVX2 implementation (note, while it is different, the compiler did not make use of Gather and Scatter):

.B1.4::                         ; Preds .B1.4 .B1.3
                                ; Execution count [5.00e+00]
        vmovss    xmm0, DWORD PTR [1492+rdx+r8]                 ;8.16
        lea       rbx, QWORD PTR [rdx+rcx]                      ;8.9
        vmovss    xmm1, DWORD PTR [rdx+r8]                      ;8.16
        add       rax, 8                                        ;7.5
        vmovss    xmm4, DWORD PTR [1492+rdx+r10]                ;8.23
        vmovss    xmm5, DWORD PTR [rdx+r10]                     ;8.23
        vinsertps xmm3, xmm0, DWORD PTR [4476+rdx+r8], 16       ;8.16
        vinsertps xmm2, xmm1, DWORD PTR [2984+rdx+r8], 16       ;8.16
        vinsertps xmm4, xmm4, DWORD PTR [4476+rdx+r10], 16      ;8.23
        vinsertps xmm5, xmm5, DWORD PTR [2984+rdx+r10], 16      ;8.23
        vunpcklps xmm0, xmm2, xmm3                              ;8.16
        vunpcklps xmm1, xmm5, xmm4                              ;8.23
        vaddps    xmm2, xmm0, xmm1                              ;8.23
        vmovss    DWORD PTR [rbx], xmm2                         ;8.9
        vextractps DWORD PTR [1492+rbx], xmm2, 1                ;8.9
        vextractps DWORD PTR [2984+rbx], xmm2, 2                ;8.9
        vextractps DWORD PTR [4476+rbx], xmm2, 3                ;8.9
        vmovss    xmm3, DWORD PTR [7460+rdx+r8]                 ;8.16
        vinsertps xmm2, xmm3, DWORD PTR [10444+rdx+r8], 16      ;8.16
        vmovss    xmm0, DWORD PTR [5968+rdx+r8]                 ;8.16
        vmovss    xmm3, DWORD PTR [7460+rdx+r10]                ;8.23
        vmovss    xmm4, DWORD PTR [5968+rdx+r10]                ;8.23
        vinsertps xmm1, xmm0, DWORD PTR [8952+rdx+r8], 16       ;8.16
        vinsertps xmm3, xmm3, DWORD PTR [10444+rdx+r10], 16     ;8.23
        vinsertps xmm5, xmm4, DWORD PTR [8952+rdx+r10], 16      ;8.23
        add       rdx, 11936                                    ;7.5
        vunpcklps xmm0, xmm1, xmm2                              ;8.16
        cmp       rax, r11                                      ;7.5
        vunpcklps xmm1, xmm5, xmm3                              ;8.23
        vaddps    xmm2, xmm0, xmm1                              ;8.23
        vmovss    DWORD PTR [5968+rbx], xmm2                    ;8.9
        vextractps DWORD PTR [7460+rbx], xmm2, 1                ;8.9
        vextractps DWORD PTR [8952+rbx], xmm2, 2                ;8.9
        vextractps DWORD PTR [10444+rbx], xmm2, 3               ;8.9
        jb        .B1.4         ; Prob 82%                      ;7.5

I am not sure why the AVX2 build chose not to use gather (AVX2 has no scatter instruction); however, the instruction sequence is different from the AVX one you presented.

And next, AVX512 (using Gather and Scatter):

.B1.4::                         ; Preds .B1.4 .B1.3
                                ; Execution count [5.00e+00]
        vpxord    zmm1, zmm1, zmm1                              ;8.16 c1
        kxnorw    k1, k0, k0                                    ;8.16 c1
        lea       rsi, QWORD PTR [rbx+r11]                      ;8.16 c1
        vpxord    zmm2, zmm2, zmm2                              ;8.23 c1
        kxnorw    k2, k0, k0                                    ;8.23 c1
        kxnorw    k3, k0, k0                                    ;8.9 c3
        vpxord    zmm4, zmm4, zmm4                              ;8.16 c3
        kxnorw    k4, k0, k0                                    ;8.16 c3
        vpxord    zmm5, zmm5, zmm5                              ;8.23 c3
        vgatherdps zmm1{k1}, DWORD PTR [rsi+zmm0*4]             ;8.16 c3
        lea       rsi, QWORD PTR [r8+r11]                       ;8.23 c3
        kxnorw    k5, k0, k0                                    ;8.23 c5
        add       r9, 32                                        ;7.5 c5
        vgatherdps zmm2{k2}, DWORD PTR [rsi+zmm0*4]             ;8.23 c5
        kxnorw    k6, k0, k0                                    ;8.9 c7
        lea       rsi, QWORD PTR [rcx+r11]                      ;8.9 c9
        vaddps    zmm3, zmm1, zmm2                              ;8.23 c11
        vscatterdps DWORD PTR [rsi+zmm0*4]{k3}, zmm3            ;8.9 c17 stall 2
        lea       rsi, QWORD PTR [23872+r11+rbx]                ;8.16 c17
        vgatherdps zmm4{k4}, DWORD PTR [rsi+zmm0*4]             ;8.16 c19
        lea       rsi, QWORD PTR [23872+r11+r8]                 ;8.16 c23 stall 1
        vgatherdps zmm5{k5}, DWORD PTR [rsi+zmm0*4]             ;8.23 c25
        lea       rsi, QWORD PTR [23872+r11+rcx]                ;8.16 c25
        add       r11, 47744                                    ;7.5 c25
        vaddps    zmm16, zmm4, zmm5                             ;8.23 c31 stall 2
        vscatterdps DWORD PTR [rsi+zmm0*4]{k6}, zmm16           ;8.9 c37 stall 2
        cmp       r9, r10                                       ;7.5 c37
        jb        .B1.4         ; Prob 82%                      ;7.5 c39

Jim Dempsey

jimdempseyatthecove · ‎06-26-2021

By the way.

Prior to the gather/scatter loop zmm0 was loaded with the offsets:

vmovdqu32 zmm0, ZMMWORD PTR [_2il0floatpacket.0] ;

Where:

_2il0floatpacket.0 DD 000000000H,000000175H,0000002eaH,00000045fH,0000005d4H,000000749H,0000008beH,000000a33H,000000ba8H,000000d1dH,000000e92H,000001007H,00000117cH,0000012f1H,000001466H,0000015dbH

The AVX2 build could have generated similar gather code using 256-bit vectors.

Jim Dempsey
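
The constants in that table and in the listings all follow from the stride of 373 floats: lane k of the index vector holds k * 373 (0x175 = 373), the `*4` scale in the gather address converts elements to bytes, and one 16-lane gather covers 16 * 373 * 4 = 23872 bytes, which is why the 2x-unrolled loop advances by 47744. A small sketch that recomputes them (the function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum { STRIDE = 373, LANES = 16 };

/* idx[k] = k * stride: the element offsets a 16-lane dword gather needs. */
static void make_gather_offsets(uint32_t idx[LANES], uint32_t stride)
{
    assert(stride <= UINT32_MAX / (LANES - 1));   /* offsets must fit in 32 bits */
    for (uint32_t k = 0; k < (uint32_t)LANES; ++k)
        idx[k] = k * stride;
}

/* Bytes the base address advances after one full-width gather of floats. */
static size_t bytes_per_gather(uint32_t stride)
{
    return (size_t)LANES * stride * sizeof(float);
}

/* Print the table in the same hex form as the compiler's data packet. */
static void print_gather_offsets(void)
{
    uint32_t idx[LANES];
    make_gather_offsets(idx, STRIDE);
    for (int k = 0; k < LANES; ++k)
        printf("%09xH%s", (unsigned)idx[k], k + 1 < LANES ? "," : "\n");
    printf("bytes per gather: %zu\n", bytes_per_gather(STRIDE));
}
```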

jimdempseyatthecove · ‎06-26-2021

Also (if Intel is watching this thread) the above AVX512 loop is unnecessarily creating a 1's mask (kxnorw's) which would only be required (to one mask) on the last (remainder) iteration.

IOW, two additional optimizations:

1) AVX2 to make use of gather (AVX2 has no scatter instruction)

2) Better optimization of mask registers

Jim Dempsey

mikeitexpert · ‎06-26-2021

Thanks so much for the valuable details. I much appreciate your comments.

Interestingly enough, -arch:core-avx2 did not generate any gather/scatter either!

If I use both -arch:skylake-avx512 and -arch:core-avx2, the earlier one is overridden by the later one (very interesting; perhaps there is a rationale behind it). However, with -arch:skylake-avx512 alone the compiler did generate the gather/scatter instructions (below).

C:\temp\simdtests\gathertest>make
icl -arch:skylake-avx512 -arch:core-avx2 -Fo:test_gather.obj -Fa:test_gather.asm -c test_gather.cpp
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.
icl: command line warning #10121: overriding '/archskylake-avx512' with '/archcore-avx2'

test_gather.cpp

C:\temp\simdtests\gathertest>make
icl -arch:skylake-avx512 -Fo:test_gather.obj -Fa:test_gather.asm -c test_gather.cpp
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

test_gather.cpp

C:\temp\simdtests\gathertest>

.B1.4::                         ; Preds .B1.4 .B1.3
                                ; Execution count [5.00e+00]
        vpcmpeqb  k1, xmm0, xmm0                                ;6.16
        lea       rsi, QWORD PTR [rbx+r11]                      ;6.16
        vpcmpeqb  k2, xmm0, xmm0                                ;6.16
        vpcmpeqb  k3, xmm0, xmm0                                ;6.23
        vpcmpeqb  k4, xmm0, xmm0                                ;6.23
        vpcmpeqb  k5, xmm0, xmm0                                ;6.9
        vpcmpeqb  k6, xmm0, xmm0                                ;6.9
        vxorps    ymm2, ymm2, ymm2                              ;6.16
        add       r9, 16                                        ;5.5
        vxorps    ymm4, ymm4, ymm4                              ;6.16
        vxorps    ymm3, ymm3, ymm3                              ;6.23
        vxorps    ymm5, ymm5, ymm5                              ;6.23
        vgatherdps ymm2{k1}, DWORD PTR [rsi+ymm1*4]             ;6.16
        vgatherdps ymm4{k2}, DWORD PTR [11936+rsi+ymm1*4]       ;6.16
        lea       rsi, QWORD PTR [r8+r11]                       ;6.23
        vgatherdps ymm5{k4}, DWORD PTR [11936+rsi+ymm1*4]       ;6.23
        vgatherdps ymm3{k3}, DWORD PTR [rsi+ymm1*4]             ;6.23
        vaddps    ymm17, ymm4, ymm5                             ;6.23
        vaddps    ymm16, ymm2, ymm3                             ;6.23
        lea       rsi, QWORD PTR [rcx+r11]                      ;6.9
        add       r11, 23872                                    ;5.5
        vscatterdps DWORD PTR [rsi+ymm0*4]{k5}, ymm16           ;6.9
        vscatterdps DWORD PTR [11936+rsi+ymm0*4]{k6}, ymm17     ;6.9
        cmp       r9, r10                                       ;5.5
        jb        .B1.4         ; Prob 82%                      ;5.5

I am wondering which compiler switch I should use when my processor is not a Skylake but does support AVX-512.

Thanks

Mike

jimdempseyatthecove · ‎06-27-2021

Have you considered using the intrinsic functions?

It is a little more work, but can yield productive results.

Be careful to stick with the instruction set(s) supported by your machine(s).

Jim Dempsey

mikeitexpert · ‎06-28-2021

Using intrinsic functions makes more sense. I am just trying to cheat by looking at the compiler's instruction selection in the assembly to make sure I am using the right intrinsics. (That may sound weird. Perhaps I should start reading more about intrinsics and how to use them for different tasks to get better performance.)

Please let me know if there is a book or article on the subject.

Regards

jimdempseyatthecove · ‎06-29-2021

In the Intel C++ documentation you might find the following informative:

Processor Targeting (intel.com)

cpu_dispatch, cpu_specific (intel.com)

_allow_cpu_features (intel.com)

_may_i_use_cpu_feature (intel.com)

and the Intrinsics guide.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#=undefined

In the Intrinsics Guide, you can enter "gather_pd" into the search, then individually select (check) the Technologies check boxes:

SSE4.1 (doesn't have gather)

AVX (doesn't have gather)

AVX2 (has gather)

AVX512(F) (has gather)

Once the intrinsic is found, clicking on it exposes a Synopsis, Description, and Operation pseudocode.

(Unfortunately, no programming example is provided.)

To find a programming example, try a Google search for: _mm256_i32gather_pd

2nd link: https://stackoverflow.com/questions/50883785/how-to-use-vindex-and-scale-with-mm-i32gather-epi32-to-gather-elements

Jim Dempsey

jimdempseyatthecove · ‎06-27-2021

Use -QxHost

Jim Dempsey

mikeitexpert · ‎06-28-2021

No luck with QxHost either. Perhaps the processor does not support AVX2, which is weird because CPU-Z says it does.
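
One way to double-check what the CPU itself reports, independent of the compiler switch, is a run-time feature query. The sketch below uses the GCC/Clang builtin `__builtin_cpu_supports` as a stand-in; it is not the Intel `_may_i_use_cpu_feature` intrinsic from the documentation, and the function names are hypothetical. On other compilers or non-x86 targets it returns -1 to say it cannot tell.

```c
#include <assert.h>
#include <stdio.h>

/* 1 if the running CPU reports AVX2, 0 if not, -1 if this build cannot tell. */
static int cpu_reports_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return -1;
#endif
}

/* Same convention for the AVX-512 Foundation subset. */
static int cpu_reports_avx512f(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return -1;
#endif
}

/* Print both answers; -1 means "unknown with this compiler/target". */
static void print_cpu_report(void)
{
    int avx2 = cpu_reports_avx2();
    int avx512f = cpu_reports_avx512f();
    assert(avx2 >= -1 && avx2 <= 1);
    assert(avx512f >= -1 && avx512f <= 1);
    printf("AVX2: %d, AVX-512F: %d\n", avx2, avx512f);
}
```

If this reports AVX2 while the compiler still emits no gather, the choice is more likely the compiler's cost model than a missing CPU feature.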

mikeitexpert · ‎06-28-2021

Feel free to close this thread.

Regards

ShivaniK_Intel · ‎06-28-2021

Hi,

Thanks for accepting as a solution!

Glad to know that your issue is resolved. As this issue has been resolved, we will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Have a Good day.

Thanks & Regards

Shivani