spill avoidance (register pressure from loop fusion)

TimP · ‎11-07-2016

In the attached screen snip from VTune running on KNL, there appear to be stalls due to spills, mostly using AVX-512 moves to copy data chunks from (OpenMP shared) data arrays to stack. I wondered if this happens because the compiler appears to fuse hundreds of short aligned for loops within the single parallel region into a single loop (note code from source line 833 next to 1070), and whether there is any way to avoid it. It seems wasteful to make these stack copies of cache lines. I have tried several combinations of #pragma nofusion and explicitly fusing a reasonable number of for loops, but it makes no difference when viewed as .asm file.

There are also many cases of a locally generated cache line being spilled and later reloaded(like the first 2 stores in the attached snip, one of which is simply zeroing out that stack address). I don't see large numbers of clock tick event counts associated with those spills which don't follow immediately after loads.

With 32 separate named zmm registers available, the compiler should be capable of fusing many loops without incurring register pressure, but it seems to have found a way to go way past the limit. I don't know enough about KNL to guess whether stalls may be incurred by having so many different memory operations in the same loop. The code runs about 50% faster with the numactl setting to fast on-board memory (rather than default cached mode), so there does appear to be some memory bandwidth limitation.

I'm asking here rather than on MIC forum because it looks like a compiler question which isn't specific to MIC.

Yuan_C_Intel · ‎11-23-2016

Hi,Tim

Does those spilled zmm registers be used to store other local arrays, like those private arrays in a parallel loop?

Could you provide a reproducer or any source code piece to further explain the issue? From the screenshot of assembly only, we cannot decide whether compiler did wrong here. We need more context information from the source to understand the situation.

I noticed some case of zmm spills when there is a function call follows, but may not be the same as yours.

Thanks.

TimP · ‎11-23-2016

The owners of the code did not respond as to whether they would agree to put it on IPS. I will try to see if a small example will replicate this.

There are a reasonable number of global arrays, along with hundreds of local intermediate results which are given explicit vector length, and each of those intermediates is populated by its own short loop e.g.

double *restrict g1, g2, g3,.... // typically 10 global arrays

#define SIMD 8

__declspec(align(64))

double l1[SIMD], l2[SIMD], l3[SIMD],.... // hundreds of these

__assume_aligned(g1, 64)

__assume_aligned(g2, 64)

__assume_aligned(g3, 64)

.....

#pragma omp parallel for

#pragma vector aligned

for(int i=0; i<ncells; i += SIMD){

for(m=0; m<SIMD; ++m) l1=.....;

for(m=0; m<SIMD; ++m) l2=.....;

for(m=0; m<SIMD; ++m) l3=.....;

...

for(m=0; m<SIMD; ++m) g1[i+m]=.....;

......

}

so the compiler recognizes each of those inner loops as a group of simd operations with no remainder, but everything is spilled multiple times to stack. This style may have been chosen for Nvidia compiler nvcc. There are commented out private clauses (but we have seen before that Intel compilers don't like to see a large number of privates, particularly not arrays). It ends up spending most of the time copying data to and from stack. The compiler appears to have decided, probably correctly, that no prefetch will be useful. I thought some use of #pragma distribute point might be worth consideration, but these have no effect.

It seems the job could be done more cleanly by defining those locals as scalars inside the for scope, dispensing with the inner loops e.g.

for(int i=0; i<ncells; i++){

double l1,l2,l3,....

l1=.....

g1=...

}

but then it's not clear whether the loop could be split automatically to deal with register pressure.

I suppose I might try making a pseudo-code example to see whether the spilling behavior occurs as soon as there are 40 or so of these local arrays.

TimP · ‎11-23-2016

In case it's of interest, compiler option requirement for the combination of Microsoft and gcc specific stuff is somewhat out of the ordinary:

icl -S -O3 -Qopenmp -Qgcc-dialect:490 -QxMIC-AVX512 -Qstd=c99 *.c

I chose the 490 simply because it was the biggest value I could find which was accepted.

jimdempseyatthecove · ‎11-23-2016

TimP,

Does it help if you replace the for(m= loops with CEAN statements?

Jim Dempsey