Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Scalable compilation and inlining of very large codes

Gabriel_R_4
Beginner

Hi all,

I am using ICC to compile very large programs (think 1 GB of code). In order to achieve reasonable compile times I artificially break my code down into small-ish functions, as very large functions take exponentially more time to compile. This is not nearly enough, so I further split my program into different files with a few thousand functions each, in order to run multiple instances of ICC, each generating its own .o file. Later, a skeleton function calls each of the generated functions and everything is linked together. So, to summarize, I have lots of functions split across multiple files which are later called by a large main().
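Schematically, the layout looks like this (all file and function names here are made up for illustration):

    /* part_0001.c ... part_NNNN.c: each holds a few thousand generated functions */
    void f_00001(double *x) { /* straight-line generated code */ }
    void f_00002(double *x) { /* ... */ }

    /* main.c: the skeleton that calls every generated function in order */
    void f_00001(double *x);
    void f_00002(double *x);
    /* ... */
    int main(void)
    {
        static double x[1000000];
        f_00001(x);
        f_00002(x);
        /* ... thousands more calls ... */
        return 0;
    }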

I am finding that this approach incurs a substantial overhead due to function calling in main(), so I would like to actually inline the calls to the artificial functions I generated. This is a snake biting its own tail: if I do this at the .c level, the problem of exponential compilation time comes back. So I guess I have three questions at this point:

1. Is there any way to perform link-time inlining, similar to the optimizations provided by GCC's LTO? I have been unsuccessful using GCC to link and inline my program so far (see the sketch after this list).

2. If the answer to 1) is yes, can the LTO inlining be forced?

3. And in any case, can anyone think of an approach to achieve scalable compilation and inlining for this type of code?
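For reference, regarding 1), this is roughly the GCC flow I have been attempting (file names illustrative):

    gcc -O2 -flto -c part_0001.c part_0002.c   # emits LTO bytecode into the .o files
    gcc -O2 -flto -c main.c
    gcc -O2 -flto main.o part_*.o -o program   # cross-file inlining happens at this link step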

 

Thanks in advance!

jimdempseyatthecove
Honored Contributor III

Intel C++ has cross-file InterProcedural Optimization (aka IPO):

Interprocedural Optimization (IPO)
----------------------------------


/Qip[-]   enable(DEFAULT)/disable single-file IP optimization
          within files

/Qipo  enable multi-file IP optimization between files

/Qipo-c   generate a multi-file object file (ipo_out.obj)

/Qipo-S   generate a multi-file assembly file (ipo_out.asm)

/Qip-no-inlining
          disable full and partial inlining

/Qip-no-pinlining
          disable partial inlining

/Qipo-separate
          create one object file for every source file (overrides /Qipo)

/Qipo-jobs<n>
          specify the number of jobs to be executed simultaneously during the
          IPO link phase

Note that adding inter-file IPO will lengthen the compile and link phases.

After the initial build of many source files, any subsequent build following minor edits to one or a few non-header source files should take less time. IPO creates intermediate files that are somewhat like precompiled headers. The link phase, however, has more work to do in merging these files, as opposed to simply appending objects to the image.
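For example, a minimal multi-file IPO flow would look something like this (source names illustrative; on Linux the spelling is -ipo instead of /Qipo):

    icl /Qipo /c a.cpp b.cpp c.cpp
    icl /Qipo a.obj b.obj c.obj /Feapp.exe

The real inlining work is deferred to the link step, which is why that phase gets longer.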

Jim Dempsey

 

Gabriel_R_4
Beginner

Thanks a lot, Jim. After playing a bit with these flags, I can confirm that using -ipo and the various flags for controlling inlining limits and factors, I have managed to do what I intended, but only for limited program sizes. I'm compiling with:

-ipo -no-inline-max-size -no-inline-max-total-size -no-inline-max-per-routine -no-inline-max-per-compile -no-inline-factor

And marking all call sites with #pragma forceinline. Now everything is inlined. Unfortunately, the compilation segfaults even for small-ish codes (e.g., approximately 75 MB of object code to be inlined will cause the compilation to fail). So, regrettably, this is not scalable enough for my purposes.
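For reference, the call sites in the skeleton look roughly like this (function names made up):

    #pragma forceinline
    f_00001(x);
    #pragma forceinline
    f_00002(x);
    /* ... thousands more annotated calls ... */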

Best,

G.

jimdempseyatthecove
Honored Contributor III

Have you verified that inlining to the extent you want (everything) yields faster performance?

Be mindful that a smallish function that is inlined and used in many places will tend to cause the code to spill out of the instruction cache.

foo(...) // inlined
..
foo(...) // inlined
..
foo(...) // inlined

In the above case, each inlined foo resides at a different address, requiring the code to be fetched from a higher-level cache or, from your program description, from RAM.
However, without inlining, the code for the second and subsequent function calls may already reside in the L1 instruction cache or, failing that, in the L2 or L3 cache.

See https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/282481 for some informative discussion.

The same is true of loop unrolling.

Too often a new programmer, or an inexperienced CS professor, will assume that if inlining or unrolling is shown to be good in some cases, it is equally good in all cases.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

An additional problem with C++: fewer opportunities for caching function code.

Background:

In the early 1990s, when C++ was first coming into wide use, a compiler vendor by the name of Borland produced one of the better C++ compilers. There were other good vendors too, but in my opinion MS wasn't at the top of the list. When using templates in the Borland compiler, all source modules using a specific template function would generate their own copy of the same-named template function. IOW, the collection of object files would have many duplicate function names and code. The linker, though, was smart enough to diff the object code and discard the duplicates (or, in rare cases, complain that the function did not match).

When the MS compiler came into wider use, you could not expand a template function in different source files, as the linker would complain about duplicate function signatures. This effectively forces the programmer to declare the template function as explicitly inlined, causing highly utilized template functions to be duplicated throughout the application, making the instruction cache (and L2 and L3) less effective. Not to mention the code bloat.

While a programmer can code around this (have the template function call an external non-template function; see the sketch below), programmers are lazy and let the compiler and linker work it out, inefficient as that may be.
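A minimal sketch of that workaround (names illustrative): keep the template itself a trivially inlinable shim and put the heavy code in one out-of-line function, so it exists exactly once in the image.

    // work.h: thin template shim, cheap to inline everywhere
    #include <cstddef>

    void do_work_impl(void* p, std::size_t size);  // defined once, in work.cpp

    template <typename T>
    void do_work(T& obj)
    {
        do_work_impl(&obj, sizeof(T));  // the heavy code is shared, not duplicated
    }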

Jim Dempsey

Gabriel_R_4
Beginner

Thanks again, Jim! Those are indeed very useful insights about the classical tradeoffs of inlining. However, I am not dealing with "classical" codes. These are sparse codes which have been "fully unrolled", so to speak, in order to achieve better vectorization and to remove the index arrays from the code. Each function is unique and called exactly once, which is why inlining will not worsen the I-cache behavior. In fact, it's already awful ;-)
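To give the flavor (indices and values made up), instead of a classical indexed kernel, the generated code hard-wires every nonzero:

    /* classical sparse kernel, driven by index arrays: */
    for (int i = 0; i < nnz; ++i)
        y[row[i]] += val[i] * x[col[i]];

    /* our "fully unrolled" generated code: */
    y[3] += 0.50 * x[7];
    y[3] += 1.25 * x[19];
    y[8] += 0.75 * x[2];
    /* ... one line per nonzero, split into thousands of functions ... */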

If you are curious, we studied the tradeoffs of this type of code in this paper: https://dl.acm.org/doi/10.1145/3314221.3314615

Best,

G.

PS. Very nice background info on C++ template code generation! I had never thought about that issue with template instantiation :-)

jimdempseyatthecove
Honored Contributor III

Does your model iterate on the same sparse matrix geometry, known at compile time? IOW, one where you could pre-compute the tiling of sections for a specific shape (e.g., triangular)? If so, the compiled version could partition by sizes.

If the shapes are irregular, this becomes more difficult, but I think one could algorithmically partition the commonly dimensioned sparse matrix into multiple smaller sparse matrices that fit your compilation algorithms, which you can then individually pair, process, and reduce. This has the added benefit of being able to cache-align sections of the sparse matrix to best utilize the system's SIMD instructions (and cache).

One of two methods:

loop:
   TraditionalDefinedSparseMatrix -> multipleInternalRepresentation
   process(multipleInternalRepresentation)
   multipleInternalRepresentation -> TraditionalDefinedSparseMatrix
   otherWork(TraditionalDefinedSparseMatrix)
end loop

IOW, improve the performance of the matrix multiply at the expense of two copy operations inside the loop.

or:

TraditionalDefinedSparseMatrix -> multipleInternalRepresentation
CreateVirtualTraditionalDefinedSparseMatrix(multipleInternalRepresentation)
loop:
   process(multipleInternalRepresentation)
   otherWork(VirtualTraditionalDefinedSparseMatrix)
end loop
multipleInternalRepresentation -> TraditionalDefinedSparseMatrix

IOW, improve the performance of the matrix multiply at the expense of two copy operations outside the loop and indirect access inside the loop.

The requirements of the application would suggest one method (or neither) over the other.
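In C-like terms, the first pattern is just (all types and helper functions hypothetical):

    for (int iter = 0; iter < maxIters; ++iter)
    {
        toInternal(A, parts, nParts);    // TraditionalDefinedSparseMatrix -> internal pieces
        process(parts, nParts);          // SIMD-friendly work on cache-aligned partitions
        fromInternal(parts, nParts, A);  // internal pieces -> TraditionalDefinedSparseMatrix
        otherWork(A);                    // consumer that wants the traditional layout
    }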

Jim Dempsey

Viet_H_Intel
Moderator

"Unfortunately, the compilation segfaults even for small-ish codes (e.g., approximately 75 MB of object code to be inlined will cause the compilation to fail)."

Can you provide a test case for us to look at this SegFault?

 

Gabriel_R_4
Beginner

I have only just read this reply, many months later. If you are still interested in a test case, please let me know and I will try to dig up the codes that caused this issue.

Viet_H_Intel
Moderator

Yes, please provide us a test case, the compiler version, and the compiler options needed to reproduce the issue.

Thanks,


Viet_H_Intel
Moderator

Xcode 12 is now supported in oneAPI. We will no longer respond to this thread.

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Thanks,


