Poor Code Gen of FMA3 instructions in SPEC FP 06 using Intel 14.0.0 compiler suite

perfwise · ‎10-02-2013

I have compiled SPEC FP 06 using the Intel 14.0.0 compiler suite. Performance is great, but looking at the code gen distribution through SDE, I note that only about 0.1% of the instructions executed are FMA3. When I compiled with Open64 in the past, about 7% of the instructions executed were FMA variants, and compiling with FMA3 rather than without it improved performance by approximately 5%. I'm using the -xCORE-AVX2 compiler flag on my Haswell, but the compiler is not "efficiently leveraging" FMA3. Is there another flag I must use to get the Intel 14.0.0 compiler to generate FMA instructions? I'm quite confident there's a missed opportunity here and wanted to bring it to someone's attention.

I posted this in this forum because it's an ISA issue in the compiler and not isolated to the C or Fortran compilers.

Perfwise

perfwise · ‎10-03-2013

Any help here on how to generate FMA3 code in greater abundance? The SPEC06 FP benchmarks "cactusadm" and "soplex" get substantial performance improvements from running with FMA in Open64. FMA accounts for 15% and 4% of the dynamic instructions executed in these codes, respectively. I believe more than 0.1% of the instructions in SPEC06 FP are candidates for FMA, and I'm wondering if there's something I'm missing as to why I don't observe more. Is this a known compiler issue when compiling with -xCORE-AVX2, or is there another flag I need to specify? I didn't see any in the -help codegen output of the 14.0.0 compiler. Thanks for any advice or constructive suggestions.

perfwise

TimP · ‎10-03-2013

I haven't tested cactus-adm for a long time, but I don't recall seeing a big change in performance between fma and no-fma. A compiler may choose to evaluate

A=B*C+D

E=B*C-F

without fma even when fma is available, possibly depending on how register availability works out. fma at best saves about 3 cycles out of 11, for cases of purely sequential dependencies. fma3 may see fewer opportunities than fma4, since it requires an operand to be over-written. It's more interesting to look at asm code than to talk about it.

perfwise · ‎10-03-2013

TimP,

Yes, in your example it saves 3 cycles, but 3 out of 11 cycles is quite a bit if you're bound by chains of dependent operations. If this is a throughput or other related issue, then you may also care that the # of cycles from SC allocation to completion and deallocation is lower as well. So the benefits are shortened latency and fewer cycles during which certain tokens or resources are tied up. Lastly, Intel has move elimination in both the GPR and FPU, so even if you needed to replicate those values because this is FMA3 rather than FMA4, it's really a moot point. When I have some concrete examples illustrating the lost opportunity I'll post them, but I think this is something that's missing right now and just wanted to point it out. How much code was FMA'd on Itanium? I think it was significantly more than 0.1% of SPECFP, right?

perfwise

Bernard · ‎10-03-2013

BTW what is the throughput of the FMA instruction?

perfwise · ‎10-03-2013

Throughput is 2 per cycle, which is how I achieve 14.8 FLOP/cycle in double-precision dgemm. The latency is 5 clks, the same as the MUL latency. This is why TimP says * followed by (+ or -) is 8 clks, 3 more than the equivalent FMA latency. Whether latency or throughput is more important varies, but it's hard to understand how only 0.1% of the instructions are FMA in code targeted at Haswell. There's probably a significant uplift still to be had if someone were to look at this.
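As a rough worked check of those figures, assuming 256-bit vectors holding 4 doubles each (the vector width is not stated in this post) and counting each fused multiply-add as 2 FLOP per element; the 3 clk add latency is inferred from the 8 clk total quoted above:

```latex
\begin{align*}
\text{peak DP FLOP/cycle} &= 2\ \tfrac{\text{FMA}}{\text{cycle}}
  \times 4\ \tfrac{\text{doubles}}{\text{FMA}}
  \times 2\ \tfrac{\text{FLOP}}{\text{double}} = 16 \\
\text{dgemm efficiency} &= \frac{14.8}{16} = 0.925 \\
\text{dependent mul then add} &= 5 + 3 = 8\ \text{clks},
  \qquad \text{FMA} = 5\ \text{clks}
\end{align*}
```

Under those assumptions, 14.8 FLOP/cycle is about 92.5% of peak.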

perfwise

Bernard · ‎10-03-2013

Thanks for the info.

I initially thought that the FMA latency would combine the fadd and fmul latencies; now I see that the engineers managed to keep it within the bounds of the mul latency. Moreover, I thought that at the decode stage FMA could be broken into 2 or 3 uops (the last case when a memory access is involved). After checking Agner's pdf I see that at the hardware level FMA is represented by one uop, probably fused.

Bernard · ‎10-03-2013

Regarding the fma throughput vs. latency question, it really depends on the nature of the code being executed. If one stream contains a lot of interdependencies, increasing latency on one execution port, then the CPU arbitration logic could reorder the instruction flow and execute non-dependent instructions on the second port (Port1).

perfwise · ‎10-04-2013

FMA is being generated in the compilation of SPEC FP. With -AVX on Haswell it is generated only to a small degree (possibly in library code), but when running binaries compiled with -xCORE-AVX2, from what I'm observing in SDE, there's a very sizeable % of FMA instructions. So the issue is solved: it was simply a case of using the latter flag and then looking at the SDE analysis of the resulting binaries.

Perfwise