Here's my problem. I'm generating a C++ file with large arithmetic expressions (mostly adds and muls) from a high-level specification of a numerical problem. Basically, I have a loop nest with at least one such expression in the innermost loop. The innermost loop is decorated with a #pragma simd to enforce vectorisation; the outermost loop is decorated with a #pragma omp for for OpenMP parallelism. In some complex cases, an expression can consist of thousands of arithmetic operations. There are common sub-expressions, which are already captured and assigned to temporaries.
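To make the shape of the generated code concrete, here's a minimal hand-written sketch (the function name, array layout, and the toy expression are my own stand-ins; the real generated expressions contain thousands of terms):

```cpp
#include <cstddef>

// Hypothetical shape of a generated kernel: outer loop parallelised with
// OpenMP, inner loop forced to vectorise, common sub-expressions already
// hoisted into temporaries by the generator.
void kernel(double *u, const double *v, std::size_t nx, std::size_t ny) {
  #pragma omp for
  for (std::size_t i = 1; i + 1 < nx; ++i) {
    #pragma simd
    for (std::size_t j = 1; j + 1 < ny; ++j) {
      // Temporaries produced by the generator's own CSE pass.
      const double t0 = v[i*ny + j] * v[i*ny + j];
      const double t1 = t0 + v[(i-1)*ny + j];
      // ...followed by a very large add/mul expression over t0, t1, ...
      u[i*ny + j] = 2.0*t0 + 3.0*t1 + t0*t1;
    }
  }
}
```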
We'd like to compile this kind of code with `-O3 -xHost`, but the compilation is insanely slow, as in, taking hours to produce an executable. If I instead use `-O2`, the compilation time drops (as expected), but it's still unacceptably high. Disabling vectorisation helps significantly, but again, not enough. Using a #pragma ivdep in place of #pragma simd also helps (by avoiding data-dependency analysis), but still not enough. Summing up, using `-O2 -xHost` and #pragma ivdep causes a remarkable drop in compilation time, but we are still far from the target (tens of seconds). Without the high-level common sub-expression elimination, the compilation time is significantly worse. So here's the question: what can I do to improve this? I'd like to switch on/off the individual optimisations applied at O2/O3 until I find an adequate compromise, but apparently this is not possible -- at least, I couldn't find useful information about how to do so in the various manuals.
An extreme solution would be disabling optimizations entirely... but that'd be too bad. Another possibility would be *not* generating such expressions and instead expressing the computation as a composition of function calls, but this would be terrible for performance (e.g., much harder to vectorise, redundancies hidden across calls, missed loop-invariant code motion opportunities, ...).
Another hypothesis for such a horrible compilation time is that the compiler spends a lot of time applying scalar replacement -- i.e., partially unpicking the high-level common sub-expression elimination that we apply.
I think this is an interesting problem because it's really stretching the optimization capabilities of the Intel compiler.
I can try to attach a self-contained example to reproduce the issue, but before that I'd need to consult with my boss to see what I can share.
Ah, the above occurs with both icc 2016 and 2017.
What are the runtimes of the resultant code using your discussed optimization options and #pragma use? In particular, does '-O3 -xHost' provide any runtime benefit (or how much) over -O0 and -O2?
Have you determined if the extremely slow compilation is, or is not, due to exceeding resident RAM and entering paging mode?
We would need to have a reproducible test case to investigate your issue. You can provide us a preprocess (.i) file along with your command line options and the OS you compile on.
Hi Jim, Viet,
Thanks for the quick reply. Here's the complete picture.
For the test case available here (a preprocessed file, as requested by Viet), https://gist.github.com/FabioLuporini/d5dbefb060b527bd78b1b456420bdad9, the following are the times I get:
I've repeated the tests a few times. As you can see, no drop from O3 to O2. These times were obtained by compiling the linked code on Ubuntu with the following cmd line (icc 16.0.2, 4 OpenMP threads with affinity properly set):
icpc -Ox -g -fPIC -Wall -xhost -shared -qopenmp filename.cpp -o filename.so
where -Ox was any of -O0, -O1, -O2 and -O3 (I haven't tried -Ofast).
Jim: I'm not sure, but I don't think I'm ending up swapping (htop didn't show anything particularly bad). Also, replacing #pragma simd with #pragma ivdep results in the same vector code performance (no surprise I'd say).
Thanks for looking into this
EDIT: I almost forgot: this is a relatively simple test, although it suffices to expose the problem. With more complex tests, the compilation times above grow significantly.
Does your largest test case also show no runtime benefit between -O2 and -O3, as your small test case does?
If the difference is not significant, then use -O2, but have Intel investigate this (Viet appears to be doing this).
Assuming -O3 yields additional improvement:
Since the compile time is (I assume) approximately O(N**2) or O(N!) with N being the number of operands, you might find a significant reduction in compile time by splitting the humongous statement across two functions (non-inlined). It should be easy enough to make a quick test. Note, doing so might forgo a few common sub-expression eliminations. IOW, an intervening text-processing pass (with specific foreknowledge of the expected patterns) might significantly speed up the build process.
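A minimal sketch of what such a split might look like (the function names and the toy expressions are placeholders; `__attribute__((noinline))` is the GCC/ICC spelling of the non-inline hint):

```cpp
// Splitting one huge expression into two non-inlined halves, so the
// optimizer sees two smaller problems instead of one enormous one.
__attribute__((noinline))
static double part1(double a, double b) {
  const double t0 = a * b;       // stand-in for the first half
  return t0 + a * t0;
}

__attribute__((noinline))
static double part2(double a, double b) {
  const double t1 = a + b;       // stand-in for the second half
  return t1 * t1 - b;
}

double big_expression(double a, double b) {
  return part1(a, b) + part2(a, b);
}
```

Note the two halves can no longer share temporaries, which is exactly the lost-CSE cost mentioned above.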
RE: common sub-expression elimination.
(Code unseen) I must assume that any such humongous statement is generated by a program. As such, it appears the author of that program did not have it perform common sub-expression elimination itself, and passed this responsibility off to the receiver of the code (icpc). If you have the source of the code generator, it might be trivial to change it to produce output with common sub-expressions already eliminated (note, this can potentially be done without analysis of the code), and thus not require the higher optimization levels. IOW, the code generator may already know the common sub-expressions but is too "lazy" to generate and use temporaries.
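For illustration, here's a toy before/after of generator-side CSE (both function names and the expression are made up for this sketch):

```cpp
// Before: the generator emits the repeated sub-expression inline,
// leaving its elimination entirely to the compiler.
double without_cse(double a, double b) {
  return (a*b + 1.0) * (a*b + 1.0) + (a*b + 1.0);
}

// After: the generator already knows it emitted a*b + 1.0 three times,
// so it assigns the value to a temporary itself; icpc then has far
// fewer redundant sub-trees to analyse.
double with_cse(double a, double b) {
  const double t0 = a*b + 1.0;
  return t0 * t0 + t0;
}
```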
I agree with Jim. If the runtime is not significant, then use -O2 for now.
I have escalated it to the compiler developer with tracker number: DPD200415697. I'll keep you posted.
With -O2 the compile time decreases, that's true, but in slightly more complex test cases it quickly becomes unacceptable, just like -O3.
Jim: I tried in the past splitting the loop nest into smaller loop nests with fewer operands, but the compile time actually increased (note that this splitting, as you also say, results in a higher operation count globally). I haven't tried splitting into separate functions, though. I could try to hack something up, but I feel this approach doesn't scale easily to more advanced problems (e.g., 1) how do I know in advance how many functions I would need to get acceptable compile times, and 2) how much do I lose in performance?).
Also, to clarify: the common sub-expression elimination that I was talking about is (automatically) applied already by our code generator -- i.e., *before* icpc receives the code. Without it, the compile time is basically plus infinity :-) even with O2.
"Due to the (assumed) compile time is approximately O N**2 or O N! with N being the number of operands" -> what are the transformations that might have such a computational cost ? Skimming through the optimizations (allegedly) applied at O3, I can't think of anything that would result in such a cost. Something costly must also be performed at O2, otherwise I couldn't explain the growth in compile time in more complex codes. I think the compiler should provide per-optimization knobs to work around precisely these issues, otherwise individual transformations become bottleneck for the entire optimization level.
When splitting into separate functions, in this case, be sure to disable interprocedural optimizations: IPO would defeat your intention of shrinking the search space for common sub-expressions by merging the functions back together.
Hello Viet, Jim,
Viet: is there any news from this front?
Jim: I see your point about disabling IPO if I split the big expression into smaller functions. However, don't you agree that this approach (splitting into multiple routines) is overkill? It will end up increasing the operation count, which is something I'm trying to avoid given the present complexity. I believe that *the right* thing to do here (not just the best) is to teach the Intel compiler not to waste time performing certain passes when expressions are too complex (or to give up if a certain time threshold is exceeded).
Could you also confirm that, with the current release, there's no way of disabling individual optimisations (as in, performing all of the O3 optimisations except, for instance, loop blocking)?