
Advisor 2017 Update 1 (build 486553) checks fail with internal error

Todd_W_
Beginner

Hi, I'm trying to figure out why I get a 20% Haswell slowdown porting some C++ IIR filter loops from AVX128 to AVX256.  Advisor's no help, as both Check Dependencies and Check Memory Access Patterns always fail with internal errors in all the variations I've had time to try so far.  Find Trip Counts and FLOPS fails too, as it eventually causes a crash in the target process; it seems to add no information over Survey Target (which is the only part of the workflow clearly working) but might reach the loops in question.  Output from

"C:\Program Files (x86)\IntelSWTools\Advisor\bin32\advixe-feedback.exe" -create-bug-report Advisor2017InternalErrors

attached.

If there's a workaround I could try to get unblocked, short of factoring the code out into a standalone .exe, that'd be great.  VTune is able to run the same profiling target without issue in all of its analyses but lacks the specificity I'm looking for as to why replacing _mm_load/store_pd() with _mm256_load/store_pd(), per optimization manual guidance, seems to result in degraded memory access.
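For context, the port in question is essentially the pattern below.  This is an illustrative sketch rather than the real kernel (the real loops carry IIR feedback between iterations), and the function and variable names are placeholders.

#include <immintrin.h>
#include <malloc.h>    // _aligned_malloc / _aligned_free
#include <cstddef>

// Illustrative only: a sequential load-modify-store sweep over a block
// allocated with _aligned_malloc(bytes, 32); n is assumed to be a
// multiple of the vector width so there is no remainder loop.
void scale_avx128(double* p, double k, size_t n)
{
    const __m128d vk = _mm_set1_pd(k);
    for (size_t i = 0; i < n; i += 2)
    {
        __m128d v = _mm_load_pd(p + i);     // 16-byte-aligned load
        v = _mm_mul_pd(v, vk);
        _mm_store_pd(p + i, v);             // store back to the same address
    }
}

void scale_avx256(double* p, double k, size_t n)
{
    const __m256d vk = _mm256_set1_pd(k);
    for (size_t i = 0; i < n; i += 4)
    {
        __m256d v = _mm256_load_pd(p + i);  // requires 32-byte alignment
        v = _mm256_mul_pd(v, vk);
        _mm256_store_pd(p + i, v);
    }
}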

TimP
Honored Contributor III

Have you checked alignment?

I haven't had success with those Advisor options either, but I doubt they would shed additional light on your problem.

Todd_W_
Beginner

Alignment in what sense?  The loads and stores would seg fault if not 32-byte aligned, so _aligned_malloc() is behaving as expected.  Profiling sweeps indicate the loops run fastest on block sizes around 12 kB.  Access is a sequential read-modify-write, with each iteration beginning with a load and finishing with a store back to the same address.  So VTune unsurprisingly indicates insufficient data to accurately report 4k aliasing (because there shouldn't be any).  VTune also finds no L3 or DRAM pressure, and front-end binding is around 5% for both AVX128 and AVX256.

To be a little more specific, it looks like possibly an _mm256_load_pd() stall.  Changing the most interesting loop from AVX128 to AVX256 for comparison, back-end binding increases from 53% to 68% and CPI increases from 0.66 to 1.19.  The AVX256 version needs to execute about 70% of the instructions of the AVX128 version per iteration, due to the nature of IIR feedback and the greater width creating some deinterleave/interleave overhead.  It also keeps the CPU out of turbo, resulting in an average clock about 7% lower than AVX128.  This works out to a slowdown of 25% or so, in good agreement with measured execution time.  Inside the loop the actual calculations don't move much, but going from _mm_store_pd() to _mm256_store_pd() increases CPI from 0.33 to 1.9.  VTune misses _mm_load_pd(), but _mm256_load_pd()'s CPI is 31.  Yet VTune general exploration indicates AVX256 decreases cache binding by 3.5% and puts core binding up from 29% to 48%.  Memory access analysis agrees with this shape but reports an average latency of 6 cycles for both 128 and 256.  So I have some uncertainty over how problematic the load really is.

Advisor does flag the AVX256 version as a possible bad memory access pattern, but it fails to reach a conclusion since the memory access check's internal errors prevent generation of results.  I also tried IACA, but it's unable to find its marks, so no port detail is available from it.
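For reference, the IACA marks are placed roughly as below; the kernel body is just a placeholder.  iacaMarks.h ships with IACA, and as I understand the header, 64-bit MSVC (no inline __asm) should use the IACA_VC64_START / IACA_VC64_END variants instead.

#include <immintrin.h>
#include <cstddef>
#include "iacaMarks.h"   // from the IACA install; defines IACA_START / IACA_END

void kernel(double* p, double k, size_t n)    // placeholder body
{
    for (size_t i = 0; i < n; i += 4)
    {
        IACA_START                            // mark the top of the analysed loop body
        __m256d v = _mm256_load_pd(p + i);
        v = _mm256_mul_pd(v, _mm256_set1_pd(k));
        _mm256_store_pd(p + i, v);
    }
    IACA_END                                  // mark just past the loop
}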

TimP
Honored Contributor III

Interesting.  Sorry, I overlooked that you were forcing aligned access, which the compiler doesn't normally do.  There is also the possible question of loop body code alignment, and whether unrolling needs adjustment in comparisons such as yours (and whether you have a remainder loop which might be affected).  Advisor will (or should) complain about remainder loops if it sees time spent there, regardless of whether the remainder is coded efficiently.

In a case of mine there is no speedup from 256-bit-wide operations in a parallel linear search, and Advisor claims a possible bad access pattern.  I haven't tried to investigate whether cache performance is what prevents a speedup for the wider operations.

In another case where I compare ifort vs. icl, the two Intel compilers make different choices between 128- and 256-bit-wide operations for the identical task, and the 128-bit choice may be slightly faster, but that case involves mixed strides.

Todd_W_
Beginner

It's no trouble.  There are no remainder loops, as the blocks are exact multiples of the loop stride.  Instructions aren't aligned, but I don't see that there's anything to be done about that.  Something which does look like a trouble indicator to me is that loop unrolling is 1-2% slower than not unrolling.  The obvious difference is that the intrinsics for the multiply accumulate in an IIR biquad compile as

AVX128    AVX256
vmulpd    vmulpd
vmulpd    vmulpd
vaddpd    vmulpd
vmulpd    vmulpd
vaddpd    vmulpd
vmulpd    vaddpd
vsubpd    vaddpd
vmulpd    vsubpd
vsubpd    vsubpd

despite being in the same order in the AVX256 C++.  What I was hoping for from the tools was some indication of whether the load CPI is in fact spurious, and whether uop reordering can find port parallelism in AVX256 comparable to that of AVX128 despite the adds getting pushed to the end.  The CPI data suggests the register renaming for the latter is too complex for the hardware to figure out; a fully controlled analysis would probably require assembly.  I can say that _mm_prefetch(), or explicit prefetch by calling _mm256_load_pd() an iteration ahead in loops which aren't register bound, produces only a 0-3% speedup.  That's consistent with the reported lack of L1 (and L3 and DRAM) binding from VTune.
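For reference, the multiply accumulate per biquad is essentially the shape below.  The coefficient and delay-line names are placeholders, and the real kernel also shifts the x/y history and handles the channel de/interleave, so treat it as a sketch rather than the actual code.

#include <immintrin.h>

// One biquad multiply accumulate, AVX256 variant.  Source order is
// mul, mul/add, mul/add, mul/sub, mul/sub: 5 vmulpd, 2 vaddpd, 2 vsubpd.
static inline __m256d biquad_mac(__m256d b0, __m256d b1, __m256d b2,
                                 __m256d a1, __m256d a2,
                                 __m256d x0, __m256d x1, __m256d x2,
                                 __m256d y1, __m256d y2)
{
    __m256d y0 = _mm256_mul_pd(b0, x0);             // b0 * x[n]
    y0 = _mm256_add_pd(y0, _mm256_mul_pd(b1, x1));  // + b1 * x[n-1]
    y0 = _mm256_add_pd(y0, _mm256_mul_pd(b2, x2));  // + b2 * x[n-2]
    y0 = _mm256_sub_pd(y0, _mm256_mul_pd(a1, y1));  // - a1 * y[n-1]
    y0 = _mm256_sub_pd(y0, _mm256_mul_pd(a2, y2));  // - a2 * y[n-2]
    return y0;
}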

The module's currently a mix of C++ and C++/CLI.  Factoring the C++ parts into a separate DLL just to speculatively see what happens with ICC's interpretation of the intrinsics versus VS2015.3's is more than I have time for right now.  There's also ippsIIR() to look into, though substantial rewrites around interleaving would be involved.
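If I do get to ippsIIR(), the setup would look roughly like the sketch below.  The function names and the taps layout (B0, B1, B2, A0, A1, A2 for an order-2 section) are from my reading of the IPP signal processing reference, so treat them as assumptions to verify against the installed ipps headers.

#include <ipps.h>

// Sketch only: one order-2 IIR section run over a block of doubles.
bool run_ipps_iir(const Ipp64f* src, Ipp64f* dst, int len, const Ipp64f taps[6])
{
    int order = 2, bufSize = 0;
    if (ippsIIRGetStateSize_64f(order, &bufSize) != ippStsNoErr)
        return false;

    Ipp8u* buf = ippsMalloc_8u(bufSize);
    IppsIIRState_64f* state = nullptr;
    if (ippsIIRInit_64f(&state, taps, order, nullptr, buf) != ippStsNoErr)
    {
        ippsFree(buf);
        return false;
    }

    const IppStatus status = ippsIIR_64f(src, dst, len, state);  // filter the block
    ippsFree(buf);
    return status == ippStsNoErr;
}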

TimP
Honored Contributor III

Intel C++ doesn't always supply an ALIGN 16 at the top of the loop body; when that is missing, it may reduce the effectiveness of unrolling.  I haven't worked enough at collecting VTune evidence about loop stream detection to know whether that is implicated, although it is my suspicion.

A frequent reason for performance differences between ICL and g++ has been one of them omitting the loop alignment directive.  Occasionally it might help to change g++ build parameters to make the conditional alignment occur more frequently (more optional padding).  I haven't seen Microsoft CL use any alignment padding.  You should be able to see alignment padding (if any) by disassembling the .obj or by looking for directives in the pseudo .asm code emitted by the compiler.  The padding may involve meaningless loads as well as nops.
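For example, something along these lines; the file names are placeholders and the switches should be adjusted to your build.

:: MSVC and Intel C++ on Windows: emit an annotated assembly listing next to the .obj
cl  /O2 /arch:AVX2 /FAs iir_kernel.cpp
icl /O2 /QxCORE-AVX2 /FAs iir_kernel.cpp

:: or disassemble an object file you already have
dumpbin /disasm iir_kernel.obj > iir_kernel.asm

:: g++ equivalent; -falign-loops requests the conditional loop padding
g++ -O2 -mavx -S -falign-loops=32 iir_kernel.cpp -o iir_kernel.s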

In the past, Intel C++ didn't employ automatic unrolling as effectively with intrinsics as with plain C source code.  On the other hand, as you say, it doesn't follow intrinsics as literally as you might have expected.

I'm not aware of concerns about the action of hardware register renaming.  I don't think there's any way to analyze it with the tools normally available to developers.  The main concern in the past has been where use of partial registers could inhibit renaming, which doesn't appear to be a problem for your case.
 

Todd_W_
Beginner

Yes, I suspect Intel's internal architecture simulator is needed.  Some weather cancellations gave time for refactoring and a bit of a look at behaviour with ICC.

  • ICC 17.0 issues vmulpd and vaddpd/vsubpd in pairs for both AVX128 and AVX256.  This has no effect on performance in either case, suggesting the dominant factor may be the reorder buffer.  It also has no effect on the loop CPI reported by VTune, but it does radically change the CPI VTune reports on an instruction-by-instruction basis across multiple run measurements, suggesting instruction granularity may be too low level to contain usable information.
  • IACA indicates AVX256 should be 15% faster than AVX128 and FMA (also 256-bit) 9% faster, yet FMA256 measures around 10% slower than AVX128.  To speculate, there may be Haswell microarchitecture difficulties in efficiently running 256-bit multiplies and adds in parallel on ports 0 and 1 which do not occur with 128 bits but are mitigated by locking the MACs into a rigid dependency chain/critical path with FMAs (see the FMA sketch after this list).  IACA also indicates each individual FMA uses both ports 0 and 1.  That's potentially an interesting hint, but I haven't seen anything sufficiently detailed about Haswell ALU internals in Intel's publications or from folks like Agner Fog to follow up on it.
  • Loop unrolling still produces slower code.  So does #pragma loop_count.
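The FMA sketch referenced above, with placeholder names as before; each fused multiply-add consumes the previous result, so the five MACs form a single dependency chain.

#include <immintrin.h>

// FMA256 variant of the biquad multiply accumulate: a strict serial chain.
static inline __m256d biquad_mac_fma(__m256d b0, __m256d b1, __m256d b2,
                                     __m256d a1, __m256d a2,
                                     __m256d x0, __m256d x1, __m256d x2,
                                     __m256d y1, __m256d y2)
{
    __m256d y0 = _mm256_mul_pd(b0, x0);    // b0 * x[n]
    y0 = _mm256_fmadd_pd(b1, x1, y0);      // + b1 * x[n-1]
    y0 = _mm256_fmadd_pd(b2, x2, y0);      // + b2 * x[n-2]
    y0 = _mm256_fnmadd_pd(a1, y1, y0);     // - a1 * y[n-1]
    y0 = _mm256_fnmadd_pd(a2, y2, y0);     // - a2 * y[n-2]
    return y0;
}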

What's the process for instruction alignment?  All the docs I'm finding are for data alignment.
