Hello, I'm investigating conversion of a number of compute kernels from AVX 128 to AVX 256 and would appreciate any guidance which might be available on getting a small number of operations on port 0+1 to parallel more effectively with vpermilpd and vpermute2f128. The kernels consist of a load, 5-15 double precision multiply accumulates which interact heavily with the contents of the xmm or ymm registers, and one store per loop trip. Currently, the data they act on is interleaved in memory in such a way it interacts seamlessly with AVX 128. Testing indicates with similarly ideal memory layouts AVX 256 versions of the kernels run 60-70% faster than AVX 128 (despite ymm preventing turbo on the CPU clock). However, it's a major change to support CPU dispatch capable of selecting the most efficient memory layout based on how the calculations are configured. And taking advantage of that new layout requires implementation of a wholly new processing pipeline.
So I've been looking at a number of deinterleave/interleave approaches to using AVX 256. To be bald about it, they all seem to suck. In the more sensitive kernels a single vpermute2f128 can reduce performance by a factor of 2.1 on Haswell, converting a 65% speed up into a 20% slowdown. Other kernels are not so touchy but, even with careful arrangement to minimize permute costs, they work out at least 16% slower than AVX 128. More commonly it's a 20-25% penalty. Use of AVX2 gather instructions is worse, starting around 30% penalty and sometimes approaching as much as 70%. The best of the load+store tricks I can work out for deinterleave+interleave come in around 35% slower than AVX 128.
I suspect this indicates moving ahead with an AVX 256 memory layout's the thing to do. But, as it's a big job, I'm curious. Anybody think I'm missing something here? If a deinterleave+interleave version of AVX 256 offered, oh, 20+% improvement over AVX 128 it'd be worth implementing despite being well under the AVX 256 best case. Constraining to Haswell or later is OK. So use of AVX2 and FMA instructions is an option, though I'm not seeing they help much (FMA 256 + AVX 256 kernels are still 10+% slower than AVX 128). However, a Xeon restriction's out. So AVX-512 isn't available. Some mocks suggest if there was no penalty for gather+scatter relative to load+store AVX 256 would be about 45% faster (there's still overhead from in lane permutes but it's not punitive). However, I'm not seeing a way to obtain this level of performance on Haswell. And, by extension, Broadwell or Skylake.
For reference, here's a typical AVX 128 kernel.
vmovupd xmm9,xmmword ptr [rax] add rdx,0FFFFFFFFFFFFFFFEh vmulpd xmm10,xmm7,xmm9 vmulpd xmm11,xmm6,xmm8 vmulpd xmm3,xmm5,xmm3 vaddpd xmm10,xmm10,xmm11 vmulpd xmm0,xmm2,xmm0 vaddpd xmm3,xmm10,xmm3 vmulpd xmm10,xmm4,xmm1 vsubpd xmm11,xmm3,xmm10 vsubpd xmm10,xmm11,xmm0 vmovupd xmmword ptr [rax],xmm10 add rax,0FFFFFFFFFFFFFFF0h vmovapd xmm3,xmm8 vmovapd xmm8,xmm9 vmovapd xmm0,xmm1 vmovapd xmm1,xmm10 cmp rdx,r8 jge [vmovupd xmm9 at start of loop]
And one of its simpler AVX 256 equivalents below. There are slightly more performant but more complex flavours of this. For all of them what IACA suggests is the permutes going to port 5 introduce enough latency to stall ports 0 and 1 and create critical paths which run through most of the loop body. Looking in VTune indicates this about doubles CPI, cancelling out AVX 256 halving the AVX 128 trip count. Since the permutes mean more instructions in the AVX 256 body the loop works out slower.
add rdx,0FFFFFFFFFFFFFFFCh vperm2f128 ymm11,ymm9,ymm5,21h vmulpd ymm7,ymm3,ymm9 vmulpd ymm11,ymm2,ymm11 vmulpd ymm5,ymm1,ymm5 vaddpd ymm7,ymm7,ymm11 vperm2f128 ymm8,ymm10,ymm0,21h vaddpd ymm5,ymm7,ymm5 vmulpd ymm7,ymm6,ymm8 vmulpd ymm0,ymm4,ymm0 vsubpd ymm8,ymm5,ymm7 vsubpd ymm5,ymm8,ymm0 vperm2f128 ymm0,ymm5,ymm10,21h vmulpd ymm7,ymm6,ymm0 vsubpd ymm0,ymm5,ymm7 vmovupd ymmword ptr [rax],ymm0 add rax,0FFFFFFFFFFFFFFE0h vmovapd ymm5,ymm9 cmp rdx,r8 jge [add rdx at start of loop]
For single-precision complex work, I have converted everything to separate real and imaginary storage precisely to avoid the lane-crossing penalties on Haswell. It is surprising how many filter calculations are faster when implemented with separate real and imaginary storage, even including the cost of de-interleaving on the input and re-interleaving on the output. Obviously these cut into the speedup, factors, but it does not take a very wide filter operator for the explicit data re-arrangement to be a net win....
Thanks, John; I'll proceed with the storage changes as I can find the time. It happens the data's promoted to double from lower precision at input and demoted again at output. Both promote and demote are inexpensive relative to the rest of the operations and already necessarily involve some permutes. So changing them round should have negligible effects on performance; the limiting factor's really the programming time to implement the bookkeeping required.
Promotion from 128-bit packed single to 256-bit double crosses the 128-bit lane boundary, so I would expect it to incur a slight latency penalty. Agner Fog's instruction tables show a latency of 5 cycles (compared to 4 cycles on SNB and IVB), but it is fully pipelined with a throughput of one instruction per cycle on all of the mainstream processors (requiring one dispatch slot on Port 0 for the conversion and one on Port 5 for the cross-lane transfer). As long as you don't have a tight dependency tree on the results, this should not be a problem. The extra latency of the cross-lane transfers is a more serious performance issue when you are doing things like summing across a vector register. Even the (excellent) tree-based code generated by the Intel compiler has a fairly long critical path and lots of idle cycles.
My current promote and demote implementation change data size by 2-4x, depending on the configuration needed. In the cases I've measured, AVX 256 promote/demote is slower than AVX 128 by an amount indistinguishable from the clock speed reduction due to ymm access. I'd need to take a closer look at the loops to see exactly why the trip count reduction is cancelled but Advisor, IACA, and VTune all seem to struggle at that level of detail. So it's pretty time consuming to write, run, and analyze all the probes to pick apart the problem to infer answers. Since promote/demote is far from being the bottleneck and I'm trying to stick to reasonably processor agnostic intrinsics, well...