I'm currently having a hard time trying to optimize a performance oriented library for Intel Core.
With most essential ingredients in place, I'm now progressing more slowly, making small modifications, checking performance change at each iteration.
Problem is, I'm in a difficult phase : modifying a part of a code can trigger large performance differences into other *unrelated* parts of the same code.
This is not a measurement noise : it's perfectly reproducible, and significant (between 10-20% depending on scope tested).
At first, it looks completely random, but since it's reproducible, there must be an explanation.
The following blog post sent me in the right direction :
It points at instruction loop alignment.
This makes sense : modifying a part of a code can move others part at different adress location, hence impact their alignment and performance.
As I'm primarily compiling on GCC 4.8.4 at -O3, the performance setting -falign-loops is supposed to be automatically triggered.
It is, but by default, its value is 16.
However, since I'm also using recent Intel Core CPU for performance tests (broadwell, haswell), the alignment should be 32 (according to above article).
Apparently, the GCC guys did not receive instructions to modify this part of performance heuristic for modern CPU.
As a consequence, hot-loops are 16-bytes aligned, giving them a 50% chance to be properly 32-bytes aligned, or not.
Hence the "randomness".
A test convinced me it was the right explanation :
I compiled using -falign-loops=32 a limited portion of the code with a single critical hot loop.
This time, the performance was strong and remained solid, whatever the peripherical changes in other parts of the code.
Unfortunately, what is right for a small portion of code is not for the full corpus.
Forcing -falign-loops=32 for the entire code base results in a measurable loss of performance.
My guess is that all these alignments introduce too many mock instructions (noop) to ensure proper alignment,
including for numerous loops which are not critical.
This is a blow. And I don't find a solution to improve this situation.
GCC targeted instruction (__attribute or #pragma GCC) do not work for loop alignment.
I'm currently almost blind, since every code change results in substantial performance modifications which are not tied to the modification itself.
I don't know if I'm improving/degrading performance at large, or just for a single atypical compiler / source suite (this library is supposed to be integrated into many other codes).
Questions are :
- Is this issue known ?
- Is there a (portable) solution to ensure critical loops (and only them) can be aligned on 32-bytes boundaries ?
- Parallel Computing
The loop body alignment parameters for gcc are set in gcc/config/i386/i386.c. For example, if you are compiling with -march=core2 or similar, gcc source code comes with the settings align_loop=16, align_loop_max_skip=10. If you examine .s code generated by gcc, you will see the conditional alignment directives for 16-byte alignment conditional on maximum of 10 padding bytes. You will see that k6 gets 32-byte alignment using a maximum of 7 bytes padding. You can also look at the .o to see the end effect of conditional alignment.
I saw that Intel compilers had been improving performance on recent CPUs by using frequent unconditional padding to 16-byte alignment, while gcc performance may fall off in the case where a loop is unrolled by 4 but the loop placement falls just short of where 10 bytes padding would be sufficient for alignment. So I rebuilt gcc with align_loop_max_skip=12 and got a marginal improvement in performance on my haswell laptop.
Among the reasons for limiting the number of padding bytes is to limit overall growth of code size and improve instruction cache locality. As you can see by the name of the gcc directive, Intel originally recommended a conditional padding scheme for Pentium 2, to avoid the situation where the first partial cache line of the loop has less than 7 bytes. Too much padding not only increases code size, but takes more time when the loop is entered from the top across the padding bytes.
I believe, on the current CPUs which include Loop Stream Detector, that misaligned loops will limit the effectiveness of loop unrolling; in my tests of mostly vectorized code, max-unroll-times=2 is generally best with the gcc default of 10 conditional padding bytes (sometimes limiting the micro-op cache size), while Intel compilers (from 15.0) do better with -qunroll4, on account of most of the critical loops having unconditional alignment. It's not possible to make a direct comparison between the Intel and gcc loop unrolling schemes due to the totally different way of handling remainder loops.
As you indicated, there were Intel CPUs in the past which wanted 32-byte loop alignment to maximize performance. As far as I know, this was seldom dealt with, and it may be the architects decided they needed to make the architecture less dependent on this. It does appear that the architects succeeded in setting things up so that the compiler unroll limits become useful most of the time.
I haven't understood how compilers decide which loops should get (conditional) alignment. Where there is no loop alignment, evidently, changing alignments leading up to a loop may affect performance of the loop. The benefit of loop alignment seems to depend, among other things, on the specific instructions in the loop, so one would hope that the compiler skips alignment only when alignment is likely not to matter.
In the interests of "code modernization" as I understand the term, it's important to reduce the criticality of these effects, letting compilers deal with them as most people expect. We are generally looking for things which work across a range of current CPUs, rather than spending development time optimizing performance of old ones.
What is striking in this description is that it seems one can only hope the compiler will "guess" properly, depending on its own heuristics. Yes, in best situations, it should.
On the hand, the coder seems in excellent position to know which loop is going to be "hot" or not, hence would deserve to be aligned or not. The fact that it's not possible to do anything about it is disappointing.
More concretely, for this particular piece of code and today, it puts me into a fairly difficult situation : each time a minor code change is done, performance is going to move substantially faster or slower, but for reasons completely different than the code change itself.
I'm therefore no longer in a position to know if latest changes are good or not, since the far-away-loop-alignment side-effect is much larger, leading to wrong conclusions.
This is not necessarily a case of hot or not. From my experience that it is often the case that the best padding and/or alignment is best found by runtime testing (e.g. VTune and/or wall clock).
As an example of why this is so, if the loop contains any flow control it can potentially be more important for the cache line alignment to be sensitive to the placement of the position of the displaced branch than to the top of the loop. IOW there will be no general rule that will always be correct.
Additional factors are:
Alignment of entry points of functions. Each function may have a different optimal entry point cache line offset.
Inlining and loop unrolling. Sometimes inlining and/or overly aggressive loop unrolling bloats a loop such that it no longer fits within the L1I cache.