Why does compiler generate VGATHER instructions?

McCalpinJohn · ‎12-08-2015

Working with the Intel compiler (icc (ICC) 15.0.3 20150407), I have a simple loop for which the compiler is generating what looks like terrible code.

The loop is a trivial calculation over a contiguous range of indices of some float arrays:

    for (i=Nwin+Ngap; i<N-Nwin-Ngap; i++) {
        w_avg = avgscale2*(slidingsum[i+Ngap+1]+slidingsum[i-Ngap-Nwin]);
    }

The assembler output shows a completely reasonable piece of scalar code for the short vector case:

                                # LOE rdx r8 r9 r12 r13 r14 eax ecx ebx esi edi r10d r11d xmm0 xmm1
..B10.90:                       # Preds ..B10.90 ..B10.89
        incl      %ecx                                          #1344.2
        vmovss    (%r8,%rdx,4), %xmm2                           #1345.25
        incl      %edx                                          #1344.2
        vaddss    (%r8,%r13,4), %xmm2, %xmm3                    #1345.46
        incl      %r13d                                         #1344.2
        vmulss    %xmm3, %xmm0, %xmm4                           #1345.46
        vmovss    %xmm4, (%r9,%r12,4)                           #1345.3
        incq      %r12                                          #1344.2
        cmpl      %eax, %ecx                                    #1344.2
        jb        ..B10.90      # Prob 82%                      #1344.2

But the vectorized version of the code is bizarre. It is hard to tell exactly what is happening, but the code has something like 10 vector instructions to manipulate the indices used by the two gather instructions, even though the memory references are contiguous!

                                # LOE r8 r9 r12 r15 eax edx ecx ebx esi edi r10d r11d r13d xmm0 xmm1 ymm3 ymm4 ymm5 ymm6 ymm7 ymm8 ymm9
..B10.86:                       # Preds ..B10.86 ..B10.85
        vpaddd    %ymm6, %ymm7, %ymm10                          #1345.38
        lea       (%r12,%r8), %r14                              #1345.25
        vpsubd    %ymm6, %ymm7, %ymm14                          #1345.57
        vpaddd    %ymm3, %ymm7, %ymm7                           #1345.3
        vpaddd    %ymm4, %ymm10, %ymm11                         #1345.43
        vpsubd    %ymm5, %ymm14, %ymm15                         #1345.57
        vpaddd    .L_2il0floatpacket.3(%rip), %ymm11, %ymm12    #1345.25
        vpaddd    .L_2il0floatpacket.3(%rip), %ymm15, %ymm10    #1345.46
        vxorps    %ymm2, %ymm2, %ymm2                           #1345.25
        addl      $8, %r13d                                     #1344.2
        vmovdqa   %ymm9, %ymm13                                 #1345.25
        vgatherdps %ymm13, (%r14,%ymm12,4), %ymm2               #1345.25
        vxorps    %ymm12, %ymm12, %ymm12                        #1345.46
        vmovdqa   %ymm9, %ymm11                                 #1345.46
        vgatherdps %ymm11, (%r14,%ymm10,4), %ymm12              #1345.46
        vaddps    %ymm12, %ymm2, %ymm2                          #1345.46
        vmulps    %ymm2, %ymm8, %ymm13                          #1345.46
        vmovups   %ymm13, (%r9,%r15,4)                          #1345.3
        addq      $8, %r15                                      #1344.2
        cmpl      %ecx, %r13d                                   #1344.2
        jb        ..B10.86      # Prob 82%                      #1344.2

For L1-contained data (loop trip count of about 900 elements), the loop runs at a very poor rate of about 3 cycles per element. The scalar code should be faster than that! Neglecting alignment issues, the vector code should look more like:

    vmovups (address1), %ymm1
    vaddps (address1), %ymm1, %ymm2
    vmulps %ymm1, %ymm2, %ymm3
    vmovups %ymm3, (address3)
    addq
    cmpl
    jb

which should take between 1 and 2 cycles for 8 elements, or 0.125 to 0.25 cycles per element -- at least 10x faster than the code currently runs.

I have played around with loop length and alignment pragmas with no impact. Is there some other reason that the compiler would be generating this ridiculously complex VGATHER-based code?

McCalpinJohn · ‎12-09-2015

In the previous example, the vectorization report says that a gather is generated for the slidingsum accesses because of an "indirect access".

The access is clearly not "indirect" in the usual sense -- I am not loading a value from memory and then using that to compute a different address -- unless the compiler thinks that stores to the "w_avg" array might be modifying the Nwin or Ngap variables? But if that were the case I would have expected some aliasing messages in the optimization report, but there are none. (All the arrays are accessed through "restrict" pointers, while Nwin and Ngap are passed by value into this routine.)

As I suspected, adding a "#pragma novector" reduced the execution time from 3 cycles per element to 2 cycles per element.

McCalpinJohn · ‎12-09-2015

Manually vectorizing the loop (with intrinsics) reduces the execution time to about 0.5 cycles per element -- a bit slower than I would like, but 6x faster than the compiler-generated VGATHER-based code and 4x faster than the scalar code.

It looks like the compiler is not doing an optimal job of removing the common sub-expressions from the vector loads in this case, with the assembly code looking like:

                                # LOE rdx rdi r9 r11 r12 eax ecx ebx r8d r10d r13d r14d r15d xmm5 xmm6 ymm0
..B9.83:                        # Preds ..B9.83 ..B9.82
        lea       1(%rax,%r14), %esi                            #1361.12
        vmovups   (%rdx,%rsi,4), %ymm1                          #1359.28
        movl      %r14d, %esi                                   #1052.1
        negl      %esi                                          #1052.1
        subl      %r15d, %esi                                   #1052.1
        addl      %eax, %esi                                    #1361.12
        addl      $8, %eax                                      #1358.35
        vaddps    (%rdx,%rsi,4), %ymm1, %ymm2                   #1361.12
        vmulps    %ymm0, %ymm2, %ymm3                           #1362.12
        vmovups   %ymm3, (%rdi,%r11,4)                          #1363.21
        addq      $8, %r11                                      #1358.35
        cmpl      %ecx, %eax                                    #1358.29
        jb        ..B9.83       # Prob 82%                      #1358.29

If I could remove some of the unnecessary index fiddling, the performance would probably be closer to the 0.25 cycles per element (2 cycles per loop) that I originally expected.

McCalpinJohn · ‎12-09-2015

I still don't know why the compiler decides to generate the code for these loops using VGATHER instructions, but at least I figured out how to prevent it.

Using a temporary variable to hold the computed offset for the duration of the loop seems to do the trick.

The loop above becomes:

    register int offset_plus, offset_minus;
    offset_plus = Ngap+1;
    offset_minus = Ngap+Nwin;
    for (i=Nwin+Ngap; i<N-Nwin-Ngap; i++) {
        w_avg = avgscale2*(slidingsum[i+offset_plus]+slidingsum[i-offset_minus]);
    }

The code vectorizes cleanly with no messages about "gather" and no "vgather" instructions. Execution time is about 0.45 cycles per element if I exclude the overhead of the timer calls. The compiler unrolls enough to perform 16 loop iterations per pass, so about 7.2 cycles for the 15 instructions.

The code is still not quite optimal. For some reason the compiler is incrementing 5 registers by 16 each time through the loop and still needs to perform a compare instruction before the final branch. Given that there are only 3 memory references (and the difference across the addresses is constant) this seems inefficient. Hmmm..... Two of the registers are used for the loads, with immediate offsets used for the second 8 elements in each input stream. Two of the registers are used for the two stores -- maybe the store instruction does not allow the same sort of immediate offset? The fifth register is used for the loop trip count. Seems redundant....

TimP · ‎12-09-2015

I submitted a case 5 months ago where the compiler implements stl transform by vgather when targeting knc. Ips6000059929. For knc it is normal to use vgather for remainder loop, so the cause may be entirely different.

McCalpinJohn · ‎12-10-2015

I found the problem & generated a trivial reproducer (appended)... Compile with:

icc -O3 -xCORE-AVX2 -std=c99 -qopt-report=5 -S VGATHERPS_reproducer.c

Then look for the loops that were compiled with VGATHERPS using:

grep gather VGATHERPS_reproducer.optrpt

Summary:

The icc 15.0.3 compiler will generate a loop using VGATHERPS instructions (for the CORE-AVX2 target) if an array offset uses an unsigned variable.

Notes:

It does not matter if the unsigned variable is a register variable, a const variable, or an ordinary local variable.
It does not matter if the loop index is also unsigned.
All nine cases that I tested using unsigned variables in array index calculations generate VGATHERPS code, while none of the four cases using signed variables do so.
Limited checks with the Intel 16 compiler show the same behavior.

#include <stdio.h>
#include <stdint.h>

int VGATHERPS_reproducer(
	int			N,
	int			N1,
	uint32_t			N2,
	float* restrict		source,
	float* restrict		target
)
{
	int i;
	uint32_t j;
	int var_offset;
	unsigned int unsigned_offset;
	const unsigned int unsigned_const_offset=N2+1;
	register int reg_offset;
	const int const_offset=N2+1;
	float scale;

	scale = 1.0/(float)N1;

    switch(N1) {
		// -------------------------------------------------------------------------------
		// The first examples all generate VGATHERPS instructions

		case 1: // my original code -- note that the array offset includes an unsigned value
			for (i=0; i<N1+N2; i++) {
				target = scale*source[i+N2+1];
			}
			break;

		case 2: // simplify the upper loop bound - no change
			for (i=0; i<N; i++) {
				target = scale*source[i+N2+1];
			}
			break;

		case 3:  // upper loop bound includes only signed integers - no change
			for (i=0; i<N-N1; i++) {
				target = scale*source[i+N2+1];
			}
			break;

		case 4: // upper loop bound is a mix of signed and unsigned integers - no change
			for (i=0; i<N-N2; i++) {
				target = scale*source[i+N2+1];
			}
			break;

		case 5: // simplify offset calculation to a simple add - no change
			for (i=0; i<N-N2; i++) {
				target = scale*source[i+N2];
			}
			break;

		case 6: // simplified loop bound and loop offset - offset is unsigned - no change
			for (i=0; i<N; i++) {
				target = scale*source[i+N2];
			}
			break;

		case 7: // my original code, but with unsigned loop index (to match the offset)
			for (j=0; j<N1+N2; j++) {
				target = scale*source[j+N2+1];
			}
			break;

		case 8: // use a pre-computed const variable for the offset - unsigned
			for (i=0; i<N; i++) {
				target = scale*source[i+unsigned_const_offset];
			}
			break;

		case 9: // put offset in an unsigned local variable
			unsigned_offset = N2+1;
			for (i=0; i<N1+N2; i++) {
				target = scale*source[i+unsigned_offset];
			}
			break;


		// -------------------------------------------------------------------------------
		// The following cases do not generate VGATHERPS instructions

		case 10: // simplified loop bound and loop offset - offset is signed -- no gathers
			for (i=0; i<N; i++) {
				target = scale*source[i+N1];
			}
			break;


		case 20: // put offset in a register variable (signed) -- no gathers
			reg_offset = N2+1;
			for (i=0; i<N1+N2; i++) {
				target = scale*source[i+reg_offset];
			}
			break;

		case 21: // put offset in a const variable (signed) -- no gathers
			for (i=0; i<N1+N2; i++) {
				target = scale*source[i+const_offset];
			}
			break;

		case 22: // put offset in an ordinary local variable (signed) -- no gathers
			var_offset = N2+1;
			for (i=0; i<N1+N2; i++) {
				target = scale*source[i+var_offset];
			}
			break;

		// -------------------------------------------------------------------------------
		// New cases to test

	}
	return(0);
}

TimP · ‎12-10-2015

I didn't notice the unsigned offsets. I remember the compiler team arguing that unsigned indexing had extreme adverse implications, particularly in 32-bit mode (back when gather instruction requests were routinely rejected as not offering any value).

You've probably noticed my tiresome advice to other people about avoiding 64-bit unsigned indexing if they want performance. It seems to be more of a sticking point than the platform vendors will account for. My first major account when working at Intel eventually said they would not support any platform which distinguished between int and size_t. And people wonder why standard Fortran has no explicit support for unsigned data types.

Thomas_W_Intel · ‎12-14-2015

Thanks a lot for the report and the detailed analysis. I'll move the thread to the compiler forum, so that compiler support engineers can easily pick it up there.

Kind regards

Thomas

McCalpinJohn · ‎12-16-2015

More details....

I built a test case generator so I could investigate more cases.

I started with simple loop:

for (${LOOP_INDEX}=0; ${LOOP_INDEX}<${LOOP_SIZE}; ${LOOP_INDEX}++) {
    ${TARGET}[${LOOP_INDEX}] = scale*${SOURCE}[${LOOP_INDEX}+${OFFSET_VAR}]; 
}

I generated every permutation of [int, unsigned int, long, unsigned long] variables for the LOOP_INDEX, OFFSET_VAR, and LOOP_SIZE variables, and every combination of [float* restrict, double* restrict] for the SOURCE and TARGET variables.

Of the 256 test cases defined by this parameter space, 36 were interpreted by the compiler (icc 16.0.1, but the 15.0.3 results look the same) as being indirect accesses and therefore generated gather instructions.

It took a while to figure out the pattern...

Gathers are generated if and only if:

Both LOOP_INDEX and OFFSET_VAR are 32 bit variables (int or unsigned int), and
At least one of LOOP_INDEX and OFFSET_VAR are unsigned, and
At least one of SOURCE and TARGET are pointers to float (not pointers to double).

This is weird, but it exactly matches the results.

This has been submitted as issue 6000146067

KitturGanesh · ‎12-17-2015

@John, thanks for filing this interesting issue (6000146067) and is being looked at by Anoop who will communicate with you offline via that issue as well. -Kittur