mitigating permute costs in AVX 256?

Todd_W_ · ‎01-15-2017

Hello, I'm investigating conversion of a number of compute kernels from AVX 128 to AVX 256 and would appreciate any guidance which might be available on getting a small number of operations on port 0+1 to parallel more effectively with vpermilpd and vpermute2f128. The kernels consist of a load, 5-15 double precision multiply accumulates which interact heavily with the contents of the xmm or ymm registers, and one store per loop trip. Currently, the data they act on is interleaved in memory in such a way it interacts seamlessly with AVX 128. Testing indicates with similarly ideal memory layouts AVX 256 versions of the kernels run 60-70% faster than AVX 128 (despite ymm preventing turbo on the CPU clock). However, it's a major change to support CPU dispatch capable of selecting the most efficient memory layout based on how the calculations are configured. And taking advantage of that new layout requires implementation of a wholly new processing pipeline.

So I've been looking at a number of deinterleave/interleave approaches to using AVX 256. To be bald about it, they all seem to suck. In the more sensitive kernels a single vpermute2f128 can reduce performance by a factor of 2.1 on Haswell, converting a 65% speed up into a 20% slowdown. Other kernels are not so touchy but, even with careful arrangement to minimize permute costs, they work out at least 16% slower than AVX 128. More commonly it's a 20-25% penalty. Use of AVX2 gather instructions is worse, starting around 30% penalty and sometimes approaching as much as 70%. The best of the load+store tricks I can work out for deinterleave+interleave come in around 35% slower than AVX 128.

I suspect this indicates moving ahead with an AVX 256 memory layout's the thing to do. But, as it's a big job, I'm curious. Anybody think I'm missing something here? If a deinterleave+interleave version of AVX 256 offered, oh, 20+% improvement over AVX 128 it'd be worth implementing despite being well under the AVX 256 best case. Constraining to Haswell or later is OK. So use of AVX2 and FMA instructions is an option, though I'm not seeing they help much (FMA 256 + AVX 256 kernels are still 10+% slower than AVX 128). However, a Xeon restriction's out. So AVX-512 isn't available. Some mocks suggest if there was no penalty for gather+scatter relative to load+store AVX 256 would be about 45% faster (there's still overhead from in lane permutes but it's not punitive). However, I'm not seeing a way to obtain this level of performance on Haswell. And, by extension, Broadwell or Skylake.

For reference, here's a typical AVX 128 kernel.

vmovupd     xmm9,xmmword ptr [rax]  
add         rdx,0FFFFFFFFFFFFFFFEh  
vmulpd      xmm10,xmm7,xmm9  
vmulpd      xmm11,xmm6,xmm8  
vmulpd      xmm3,xmm5,xmm3  
vaddpd      xmm10,xmm10,xmm11  
vmulpd      xmm0,xmm2,xmm0  
vaddpd      xmm3,xmm10,xmm3  
vmulpd      xmm10,xmm4,xmm1  
vsubpd      xmm11,xmm3,xmm10  
vsubpd      xmm10,xmm11,xmm0  
vmovupd     xmmword ptr [rax],xmm10  
add         rax,0FFFFFFFFFFFFFFF0h  
vmovapd     xmm3,xmm8  
vmovapd     xmm8,xmm9  
vmovapd     xmm0,xmm1  
vmovapd     xmm1,xmm10  
cmp         rdx,r8  
jge         [vmovupd xmm9 at start of loop]

And one of its simpler AVX 256 equivalents below. There are slightly more performant but more complex flavours of this. For all of them what IACA suggests is the permutes going to port 5 introduce enough latency to stall ports 0 and 1 and create critical paths which run through most of the loop body. Looking in VTune indicates this about doubles CPI, cancelling out AVX 256 halving the AVX 128 trip count. Since the permutes mean more instructions in the AVX 256 body the loop works out slower.

    add         rdx,0FFFFFFFFFFFFFFFCh  
    vperm2f128  ymm11,ymm9,ymm5,21h  
    vmulpd      ymm7,ymm3,ymm9  
    vmulpd      ymm11,ymm2,ymm11  
    vmulpd      ymm5,ymm1,ymm5  
    vaddpd      ymm7,ymm7,ymm11  
    vperm2f128  ymm8,ymm10,ymm0,21h  
    vaddpd      ymm5,ymm7,ymm5  
    vmulpd      ymm7,ymm6,ymm8  
    vmulpd      ymm0,ymm4,ymm0  
    vsubpd      ymm8,ymm5,ymm7  
    vsubpd      ymm5,ymm8,ymm0  
    vperm2f128  ymm0,ymm5,ymm10,21h  
    vmulpd      ymm7,ymm6,ymm0  
    vsubpd      ymm0,ymm5,ymm7  
    vmovupd     ymmword ptr [rax],ymm0  
    add         rax,0FFFFFFFFFFFFFFE0h  
    vmovapd     ymm5,ymm9  
    cmp         rdx,r8  
    jge         [add rdx at start of loop]

McCalpinJohn · ‎01-17-2017

For single-precision complex work, I have converted everything to separate real and imaginary storage precisely to avoid the lane-crossing penalties on Haswell. It is surprising how many filter calculations are faster when implemented with separate real and imaginary storage, even including the cost of de-interleaving on the input and re-interleaving on the output. Obviously these cut into the speedup, factors, but it does not take a very wide filter operator for the explicit data re-arrangement to be a net win....

Todd_W_ · ‎01-18-2017

Thanks, John; I'll proceed with the storage changes as I can find the time. It happens the data's promoted to double from lower precision at input and demoted again at output. Both promote and demote are inexpensive relative to the rest of the operations and already necessarily involve some permutes. So changing them round should have negligible effects on performance; the limiting factor's really the programming time to implement the bookkeeping required.

McCalpinJohn · ‎01-19-2017

Promotion from 128-bit packed single to 256-bit double crosses the 128-bit lane boundary, so I would expect it to incur a slight latency penalty. Agner Fog's instruction tables show a latency of 5 cycles (compared to 4 cycles on SNB and IVB), but it is fully pipelined with a throughput of one instruction per cycle on all of the mainstream processors (requiring one dispatch slot on Port 0 for the conversion and one on Port 5 for the cross-lane transfer). As long as you don't have a tight dependency tree on the results, this should not be a problem. The extra latency of the cross-lane transfers is a more serious performance issue when you are doing things like summing across a vector register. Even the (excellent) tree-based code generated by the Intel compiler has a fairly long critical path and lots of idle cycles.

Todd_W_ · ‎01-19-2017

My current promote and demote implementation change data size by 2-4x, depending on the configuration needed. In the cases I've measured, AVX 256 promote/demote is slower than AVX 128 by an amount indistinguishable from the clock speed reduction due to ymm access. I'd need to take a closer look at the loops to see exactly why the trip count reduction is cancelled but Advisor, IACA, and VTune all seem to struggle at that level of detail. So it's pretty time consuming to write, run, and analyze all the probes to pick apart the problem to infer answers. Since promote/demote is far from being the bottleneck and I'm trying to stick to reasonably processor agnostic intrinsics, well...