Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Can I force the compiler to pay attention to my intrinsics?

Black Belt

I have been working on optimization of a DGEMM kernel in the hopes of being able to understand power-limiting on Xeon E5 v3 processors.

Using the DGEMM from Intel's MKL library, I can typically cause a Xeon E5 v3 to enter power-limited frequency throttling when using more than about 1/2 of the cores on the chip.  The details depend a bit on the base frequency, the number of cores, and the TDP, but it is usually around 1/2 of the cores.

Using Intel's MKL DGEMM is, of course, the obvious thing to do, and I would happily use it, except that I want my own source version so that I can fiddle with register blocking and cache blocking to see how these impact power consumption.

After spending about two weeks staring at this, I think I know how to register-block the code.   Starting with a naive DGEMM code in inner product order:

	for (i=0; i<SIZE; i++) {
		for (j=0; j<SIZE; j++) {
			for (k=0; k<SIZE; k++) {
				c[i][j] += a[i][k]*b[k][j];
			}
		}
	}

After much paper and pencil, it looks like the best optimization is to unroll the "i" loop by 4, the "j" loop by 12, and the "k" loop by 4.  This will give an inner loop with 48 FMAs operating on 12 4-wide AVX accumulators, with 16 "load with broadcast" for elements of the "a" array and 12 vector loads for the "b" array.   Each of the "a" values is used 3 times (48 FMAs / 16 values = 3) and each of the vectors of "b" is used 4 times (48 FMAs / 12 vectors = 4 uses).  
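For anyone checking my arithmetic, the unroll bookkeeping above can be written out as compile-time constants (the names are mine, purely illustrative):

```c
/* Unroll factors from the text: i by 4, j by 12, k by 4; W = doubles per
 * 256-bit AVX register.  Everything below follows from these four numbers. */
enum { UI = 4, UJ = 12, UK = 4, W = 4 };

enum {
    FMAS   = UI * (UJ / W) * UK,   /* 48 four-wide FMAs in the inner loop  */
    ACCS   = UI * (UJ / W),        /* 12 AVX accumulators for "c"          */
    BCASTS = UI * UK,              /* 16 load-with-broadcast of "a"        */
    BLOADS = (UJ / W) * UK         /* 12 vector loads of "b"               */
};
/* Reuse: FMAS/BCASTS == 3 uses per "a" value, FMAS/BLOADS == 4 per "b" vector. */
```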

The resulting inner loop should require exactly 16 AVX registers -- 12 to hold the "accumulators" from the "c" array, 3 to hold a set of 3 vectors from "b", and 1 to hold the current broadcast value from "a".

So I wrote this very carefully using AVX/AVX2/FMA intrinsics, laid out as cleanly as I can.  Before the inner loop I load 12 __m256d variables with the values of "c" from memory.  I then load 3 vectors of "b", and for each of 4 consecutive elements of "a" I load-broadcast the element and perform 3 FMAs.  Then I repeat that sequence with the next set of 3 vectors from "b", etc.  After the inner loop I copy the values from the 12 accumulators back to their respective memory locations.
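To make the intended structure concrete, here is a scalar sketch of the same 4x12 register blocking (no intrinsics, and with the k loop left rolled), assuming row-major n x n arrays with n a multiple of the block sizes -- the function name and layout are mine, not anything from MKL.  Each group of 4 acc[i][j] values would live in one AVX register; each "a" element is reused across 12 j values and each "b" value across 4 i values, mirroring the reuse counts above.

```c
#include <stddef.h>

/* Scalar sketch of the 4x12 register-blocked DGEMM micro-kernel (C += A*B).
 * acc[4][12] stands in for the 12 four-wide AVX accumulators. */
void dgemm_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i0 = 0; i0 < n; i0 += 4)
        for (size_t j0 = 0; j0 < n; j0 += 12) {
            double acc[4][12] = {{0}};
            for (size_t k = 0; k < n; k++)
                for (size_t i = 0; i < 4; i++) {
                    double ai = a[(i0 + i) * n + k];   /* "load with broadcast" */
                    for (size_t j = 0; j < 12; j++)
                        acc[i][j] += ai * b[k * n + (j0 + j)];   /* FMA */
                }
            for (size_t i = 0; i < 4; i++)
                for (size_t j = 0; j < 12; j++)
                    c[(i0 + i) * n + (j0 + j)] += acc[i][j];
        }
}
```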

I compiled on Linux with "icc 15.0.1 (20141023)" and just the "-xCORE-AVX2" compiler flag (adding "-O3" or "-O1" made no difference).  The result is a disaster!

  • The compiler hoists 13 of the 16 broadcasts to the top of the loop, even though each of those values is only used in 3 independent FMAs (in consecutive lines of the source code). 
    • This uses 12 more registers (out of 16!) than are required to hold these values, and causes the compiler to run out of registers, so it has to spill and restore many values.  (The compiler places 10 stores in the loop, when there should be zero.)
  • The compiler ignores the intrinsics that load the "b" values into temporaries (where they should be used 4 times) and instead puts those loads into the arithmetic instructions -- so they get loaded 4 times each instead of once.
  • The compiler splits up 12 of the 48 FMAs into separate VMULPD plus VADDPD pairs -- why?
  • The loop as a whole has 74 load operations, when only 12+16=28 were actually specified in my source code.
    • This is 46 extra loads
    • 36 are from loading each element of "b" 4 times instead of loading into a register once (as I specified in the source code)
    • 10 are from restoring the 10 spilled values that the compiler could not find a place for.

It could have been worse, I suppose -- at least this time the compiler did not decide to sprinkle gratuitous VINSERTF128 instructions all over the place like it sometimes does. (I can tell I am getting cranky now...)  The Intel 15 documentation says:

"Providing the same benefit as using inline assembly, intrinsics improve code readability, assist instruction scheduling, and help reduce debugging"

I don't see this at all!!!   Maybe if you only use one intrinsic in a loop?  Any non-trivial code seems to get completely re-written by the compiler.  Not as badly as it rewrites the original C code in this case, but useless nonetheless.

I am assuming that I am out of luck here, and that I need to give up and learn to write this directly in assembly language.   Any indications to the contrary would be greatly appreciated!

5 Replies
Black Belt

It seems that gcc does a much better job of building code that follows the guidelines provided by the intrinsics.

With gcc 4.7.1 (old, but the newest one that we have installed on the system under test), the generated loop contained the desired 48 FMAs, with the 16 load/broadcasts left in the correct places, and the 12 vector loads actually placed in registers as I had intended.

In each of the last two blocks of 12 FMAs, the gcc compiler spilled one AVX register and restored it using a memory argument to a subsequent FMA instruction.  This should not cause a performance problem.

More problematically (and perversely), the gcc output includes 28 "LEA" instructions, four of which have MOV, SAL, and SUB instructions to complete the (unnecessary) address computations.  I have not timed the code yet, but I expect that these strange address computations will get in the way.   If they arise from a failure to understand the 2D indexing, then I will probably be able to get rid of them by flattening the data structures into 1D arrays with explicit index computations to emulate 2D arrays.
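In case it helps anyone, the "flattening" I have in mind looks like this (a minimal sketch, with an assumed block dimension and illustrative names): after computing one base pointer, every access is a small constant displacement that the compiler can fold into the immediate offset field of the load, instead of per-element address arithmetic for the 2D case.

```c
#include <stddef.h>

enum { DIM = 96 };   /* assumed block dimension, illustrative */

/* 2D indexing: the compiler reasons about each b[k][j0+j] separately,
 * which is where the extra LEA/MOV/SAL/SUB instructions can come from. */
void load_row_2d(const double b[][DIM], int k, int j0, double v[12])
{
    for (int j = 0; j < 12; j++)
        v[j] = b[k][j0 + j];
}

/* Flattened 1D indexing: one base address, then constant offsets 0..11. */
void load_row_1d(const double *b, int k, int j0, double v[12])
{
    const double *bp = b + (size_t)k * DIM + j0;
    for (int j = 0; j < 12; j++)
        v[j] = bp[j];
}
```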

Black Belt

gcc 5.0 appears to produce just 1 spill, vs. 5 for ICL 15.0.2, if I'm counting right.

Did you compare the performance of the icc vs. gcc compilation?  

Strangely, if I set ICL to AVX rather than AVX2, it doesn't expand any FMA intrinsics (I thought it might expand all of them).  Nor does it then appear to do anything objectionable, other than the spills (unless you still object to minor re-orderings).

The disassembler seems to give a cleaner view of the generated code than compiling with icc -S.


Black Belt

I verified that "flattening" the loop with manual 1D indexing helped the gcc code considerably.    It is still fairly stupid about indexing, generating code to compute register values for indexed loads when the address offsets could easily be computed at compile time and made part of the immediate offset.    This adds about 24 more total instructions than it should -- including 10 LEAs (better than the 28 I got with 2D arrays!), 8 adds, and 6 register moves.  I have not finished my analysis, but it is certainly possible that these instructions are getting in the way -- for L2-contained data the IPC of the "flattened" code using gcc 4.7.1 (-O3) is a hair under 3.1, with an FMA rate that corresponds to about 68% of the peak rate of 2 per cycle.

With icc, the number of instructions is somewhat lower (90 vs 104).  The icc code has 8 more spills, and 12 extra instructions due to splitting 12 of the FMAs into separate VMULPD plus VADDPD, but saves instruction count by putting the 12 loads of b[] into the arithmetic instructions (a false economy, since this forces them to be repeated 4 times each), and saves about 20 instructions by computing the offsets and integrating them into the immediate offset field of the loads at compile time.  Overall, the much higher rate of cache accesses reduces the performance of the icc version considerably.  With L2-contained data, the "flattened" code using icc 15.0.1 delivers about 42% of peak -- almost 40% slower than the gcc version.

I am still trying to learn about using inline assembler.  I am fairly comfortable with it for simple things like generating RDTSC or RDPMC instructions, but creating a 130-instruction block that includes pre-loading the data, iterating over the ~100 instructions of the k loop, and post-storing the data -- well that seems like a completely different category!
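For the "simple things" end of the spectrum, this is the kind of GNU-style extended asm I mean (assuming an x86-64 compiler with GCC/icc inline-asm support); a full 130-instruction FMA block would use the same syntax, just with far more instructions and register constraints to get right.

```c
/* Read the time-stamp counter via inline assembly.  RDTSC returns the
 * low 32 bits in EAX and the high 32 bits in EDX, which the constraints
 * "=a" and "=d" bind to lo and hi. */
static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}
```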

The unrolled inner loop is supposed to execute in 24 cycles, so this is still too small to put in an independent function, but I don't want to write everything in assembler since I still have not added the extra levels of blocking for cache and TLB -- it will probably end up being a 9-deep loop nest?  The inner 3 levels could make a reasonable independent assembly language procedure computing for a single L2 block.  At a block size of 96x96, the 2*N^3 operation count comes to 1.77 million FP operations, which is about 110,000 cycles at the peak rate of 16 FP Ops per cycle.
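The per-block arithmetic above can be checked mechanically (assuming Haswell's peak of 16 DP FLOPs/cycle = 2 FMA units x 4 doubles x 2 ops; function names are mine):

```c
/* Operation count for one n x n x n L2 block of DGEMM, and the cycle
 * estimate at a given peak FLOPs/cycle rate. */
long block_flops(long n)             { return 2L * n * n * n; }
long block_cycles(long n, long peak) { return block_flops(n) / peak; }
/* block_flops(96) = 1,769,472 (~1.77 million); at 16 FLOPs/cycle that is
 * block_cycles(96, 16) = 110,592 (~110,000) cycles. */
```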

Black Belt

I guess even on Haswell, only 1 256-bit memory access can run per clock cycle, so those fma with memory operands must run on consecutive clock cycles, possibly accounting for groups of instructions at 50% of peak rate.  The additional references almost certainly hit in L1D, and it would seem the additional cycles of latency wouldn't be important here, but throughput certainly is important. Such code generation might give a misleadingly high cache hit rate and cause us to miss the problem during cursory application of VTune.

Anecdotally, I used to hear that intrinsics code was often written with little attention to optimizing memory operands (including not giving attention to the penalties for unaligned AVX data prior to Haswell).  The conclusion to draw from that ought to be that the compiler shouldn't de-optimize memory references simply on account of intrinsics.  And why should icc generate apparently better code when -mavx (not AVX2) is set?

Black Belt

(1) Haswell cores are supposed to be able to perform two 256-bit loads plus one 256-bit store from DCache every cycle, and the bank conflicts that made high L1 DCache bandwidth utilization so difficult to achieve on Sandy Bridge/Ivy Bridge are supposed to be completely absent.

I have not had a chance to explore L1 DCache bandwidth on Haswell systematically, but I do have a code that sustains more than one 256-bit load per cycle with no stalls.

The code is based on the "helloflops3.c" code from Jeffers' and Reinders' book on Xeon Phi programming:

  • The code is a simple SAXPY loop (fa += a*fb) that is small enough to be fully unrolled and held in registers on the Xeon Phi, then executed millions of times.
  • With single-precision variables and a LOOP_COUNT of 128, each of the arrays fits into 8 512-bit registers, with 1 register used for the broadcast scalar, so 17 of the 32 512-bit registers are used.
  • Xeon Phi has a single FMA unit with a 6-cycle latency; the 8 registers used to hold the "fa[]" vector allow plenty of concurrency (especially since you have to run 2 OpenMP threads per core -- doubling the number of accumulators in use -- to be able to issue in every cycle).

To use this code on Haswell, I switched from single-precision floating-point to double-precision and reduced the vector length ("LOOP_COUNT") so that the inner loop could be fully unrolled and held in registers.  The two FMA units have a 5-cycle latency, so I need 10 (4-wide) "accumulators" to tolerate the pipeline latency.  This gives a minimum LOOP_COUNT of 40 (10 registers * 4 doubles/register).  It is easy to see that the largest plausible loop length would be 56, which would require 14 registers for the accumulators (the elements of "fa[]"), 1 for the broadcast temporary, and leave 1 register open (for reuse of one element of "fb[]", perhaps).

  • LOOP_COUNT=40 results in 10 (4-wide) AVX2 FMAs in the inner loop.  10 registers are used to hold the array being updated, 1 register is used to hold the broadcast scalar, leaving 5 to hold 20 values of "fb[]".    The other 20 values of "fb" are loaded as memory arguments in 5 of the FMAs.  This version runs consistently at 99.9% of peak, averaging 0.999 loads per cycle (5 loads in 5 cycles).
  • LOOP_COUNT=44 has 11 FMAs in the loop.   11 registers hold the elements of "fa", 1 register holds the broadcast temporary, and the remaining 4 hold 16 of the 44 elements of "fb".   The remaining 28 elements of "fb" are loaded using (aligned) memory operands in 7 of the 11 FMAs.  This version also runs at 99.9% of peak, averaging 7 loads every 5.5 cycles (1.271 loads/cycle).
  • LOOP_COUNT=48 has 12 FMAs in the loop.  The compiler-generated code has the same structure as the previous two cases: 12 registers used to hold the 48 values of "fa", 1 register used to hold the broadcast scalar, and 3 registers used to hold 12 elements of "fb".  In this case, 9 of the FMAs have memory arguments.   I expected this to run at full speed also, since 9 loads in 6 cycles is only 1.5 loads per cycle, but the actual behavior was more complex.  About 75% of the time the code ran at ~97% of peak (~6.18 cycles per iteration, and ~1.46 loads/cycle), but the other 25% of the time it ran at ~77%-83% of peak.   I have not figured out what is going wrong here, but the code averaged more than one load per cycle in even the worst case.
  • LOOP_COUNT=52 or larger did not get unrolled by the compiler, so the inner loop included the full 2 loads plus 1 store of the underlying DAXPY kernel.  Performance was much lower than the ~50% of peak I would have expected from sustaining (2 loads + 1 FMA + 1 store) per cycle, but I have not investigated further.
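For anyone who wants to reproduce the structure without the book's source, here is a scalar sketch of the kernel under test (my naming; LOOP_COUNT=40 is the 10-accumulator case).  With AVX2 the inner loop compiles to 10 four-wide FMAs, and the outer repeat loop stands in for the "executed millions of times".

```c
/* Scalar sketch of the modified helloflops3 kernel: fa[i] += a * fb[i],
 * with LOOP_COUNT small enough that the compiler can keep all of fa[]
 * in registers (40 doubles = 10 AVX registers * 4 doubles each). */
enum { LOOP_COUNT = 40 };

void daxpy_kernel(double *fa, const double *fb, double a, long reps)
{
    for (long r = 0; r < reps; r++)
        for (int i = 0; i < LOOP_COUNT; i++)
            fa[i] += a * fb[i];   /* one FMA per 4 elements on AVX2 targets */
}
```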


(2) Using all 10 cores on a Xeon E5-2660 v3 processor, I can exceed the chip's power limit with the 10-FMA and 11-FMA versions of this synthetic code. 

  • The 11-FMA version reduces the frequency a little bit more than the 10-FMA version, but neither version uses enough power to drop the frequency very much -- the 10-FMA version only drops a bit below the 2.9 GHz all-core AVX max Turbo frequency, and the 11-FMA version is still slightly above 2.8 GHz. 
  • In contrast, the Intel DGEMM code running 10 threads on the same hardware uses enough power to drop the frequency to below 2.7 GHz.
  • The Intel LINPACK benchmark code (bundled with the Intel 2015 compilers) uses enough power to drop the frequency to about 2.3 GHz. 

Runs with VTune show that the DGEMM and LINPACK codes are spending almost all their time in exactly the same MKL DGEMM functions, but I have not yet figured out how to obtain the block sizes used by the two codes.   I presume I can run a debugger and set a breakpoint in the code so I can read the registers holding the loop trip counts, but I have not gotten quite that far yet.   Anyway, this difference in power consumption across these cases is the reason that I am trying to build my own highly tuned version of DGEMM for experimentation.


(3) Although Haswell can sustain well over 1 load per cycle from L1 cache, my L2-contained DGEMM test case has its data coming from L2.  The array that gets broadcast should be the easier of the two -- the broadcasts are element-wise, so each of the four streams will have an initial L1 miss that brings in 8 elements, broadcasts 1, and leaves the rest in the L1 for subsequent broadcasts.  The other array is loaded as four sets of three contiguous vectors -- 12 elements from each of four rows.  This should activate the L1 hardware prefetcher to bring in the next line, but I have not worked through the details.

Any of these data items that are not prefetched will generate a latency of at least 11 cycles, and at a desired instruction rate of almost 3 per cycle it probably would not take very many of these to exceed the latency tolerance provided by the OOO mechanisms of the processor core.  Clearly more detailed analysis is needed in this area.  (And obviously the Intel MKL people have already done that analysis, but until I work through it myself I won't really understand it well enough to be able to modify the code in ways that help me to understand the power consumption issues.)