Abstract:
Compilers allow varying levels of optimization, but it is often impractical to apply the most aggressive levels to all files of large commercial applications. Applying high optimization levels to functions that consume little time increases compile time and adds a degree of risk, without even the possibility of a measurable gain in performance. Additionally, some compilers are capable of issuing optimization reports, but the size of such reports for entire projects makes them unusable.
A solution to these dilemmas can be found in applying aggressive compiler optimizations only to those source files that contribute in a significant manner to total elapsed time. To this end, the use of a performance analyzer to guide makefile modifications allows a developer to quickly focus on only those code sections where their efforts will be rewarded.
The article will discuss the coordinated use of performance analysis data to drive optimizing-compiler usage, with the aim of improving performance while minimizing effort, time and risk.
I have checked this idea, and I have found that the files that do not contribute to the computational time can safely be compiled with the /O1 option (minimize .exe size). This is nice because the /O1 option has the following advantages:
1. Compilation with /O1 is much faster than with /O2 and /O3.
2. Compilation with /O1 typically gives a 20% smaller .exe.
3. Sometimes /O2 and /O3 impair performance.
Best Regards,
Lars Petter Endresen
If your application is not too cumbersome to run the important sections under VTune, you can obtain the required performance data, along with more opportunities to diagnose the causes of performance variations.
Ifort, since version 8.0, doesn't have a space-saving -O1 optimization level for the SSE architectures. The greatest saving in space and compile time comes from removing the vectorization options (-QxW et al.), which often comes at a greater cost in performance than with other compilers.
You state "Ifort, since version 8.0, doesn't have a space saving -O1 optimization level for the SSE architectures." while the compiler documentation states that "O1 Enables optimizations for speed and disables some optimizations that increase code size and affect speed".
Our empirical observation is that the .exe size is reduced by around 20% when 11 of the 599 files are compiled with /O3 while the rest is compiled with /O1. With all 599 files compiled with /O2 the application is slightly slower, the compilation time is doubled, and the size of the .exe is increased from 8488 bytes to 10184 bytes.
Lars Petter
Reduced code size certainly is an advantage for code that is seldom executed. I don't see nearly as much code size reduction as you did, but it could happen if every loop were to come out with fewer versions.
Thanks for pointing out the difference in these options.
This is surely a question of the "science fiction" type...
:-)
As processor cache sizes are increasing at a steady rate while processor speeds have come to a (temporary?) halt these days, maybe executables that fit entirely into cache could increase performance a lot? (I have heard about processors with 24 MB cache just around the corner.) So let us assume that the next generation of Intel Pentium Turbo processors released in a few years has about 8 MB of cache; would it then be a good idea to reduce the .exe code size so that the .exe always stays in cache?
:-)
Regards
Lars Petter
Sure, if you can make your entire program fit in cache AND you're the only thing running on the system, that speeds things up a lot. But with prefetching, you can do fine if you use the cache wisely - it isn't necessary to restrict code to what fits in cache.
Yes, PGO is a very good idea for the files that take a significant part of the CPU time. That is my experience for source files that are floating-point intensive (division, math functions) and contain many branches.
Regarding compilation time, a reduction from 20 minutes to 10 minutes (for a full rebuild) really makes a difference for the software developers in our company.
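For anyone who has not tried PGO: it is a three-step build - compile the hot files with instrumentation, run the instrumented program on a representative workload, and then recompile the same files using the collected profile. A rough sketch of the Windows command lines is shown below; the file names are only placeholders, and the exact spelling of the instrumentation option (assumed here to be /Qprof_gen, the companion of /Qprof_use) should be checked against the documentation of your compiler version.

    rem 1. Compile the hot file(s) with profile instrumentation, then link as usual.
    ifort /O2 /QxN /Qprof_gen /c solver.f90

    rem 2. Run the instrumented .exe on a typical case so the profile data is written.
    myapp.exe typical_case.inp

    rem 3. Recompile the same file(s) using the collected profile, then link again.
    ifort /O2 /QxN /Qprof_use /c solver.f90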
Lars Petter
Hello, and thanks for an interesting topic.
lpe@scandpower.no wrote:
Regarding compilation time, a reduction from 20 minutes to 10 minutes (for a full rebuild) really makes a difference for the software developers in our company.
It depends on which wins for you in the battle of compilation time vs. execution time, doesn't it? I clearly understand that for developers who have to compile big projects many times, compilation time is more important, but what about execution? Sometimes (and often) it is more expensive. Imagine that execution takes a day or a week. Good and quite long (!) optimization will save you hours and even days.
What I really understand is that it all comes down to your own problem domain and product area. As for me, I use /Qipo and /O3.
Stanislav
Sorry for the confusion. I was not saying that compilation time is more important than execution time; execution time is indeed more important, in particular for our customers. What I was trying to say is that the proper combination of /O1, /O2 and /O3 actually reduces both compilation time and execution time - you can have your cake and eat it too!
Compilation time was reduced from 20 minutes to 10 minutes, and (as I stated in message 6 in this discussion) standard compilation with /O2 made the application slightly slower (a few percent) than the performance-focused compiler usage approach. Indeed, there are many combinations of options that may be particularly useful for some files, like /O1, /O2, /O3, /Qprec-div-, /Qip, /Qipo and PGO, but they only serve to increase the compilation time for other files.
Performance-focused compiler usage is indeed interesting, as it can simultaneously speed up compilation and execution.
Regards,
Lars Petter Endresen
I have just checked that compiling our software (599 Fortran files) with maximum optimization enabled, /O3 /QxN /Qprec-div- /Qipo /Qprof_use (the latter option is PGO), decreases the performance of the software by around 20% relative to the standard "/O2 /QxN /Qprec-div-" options. However, the technique with Intel VTune and performance-focused compiler usage actually speeds up the program by around 2.5% relative to compilation with "/O2 /QxN /Qprec-div-". Another issue is that compilation with "/O3 /QxN /Qprec-div- /Qipo /Qprof_use" actually took around one hour...
Lars Petter
Hello,
Thank you for the explanation, Lars Petter. Is it the same for all your files: is "/O2 /QxN /Qprec-div-" faster than "/O3 /QxN /Qprec-div- /Qipo /Qprof_use"? 20% is very serious. I didn't know that aggressive optimization could hurt so much, even with PGO. But how do you determine which optimization parameters to use for which files?
Unfortunately I can't download the attached paper. I'll be patient and try again later.
Stanislav
Hello,
I have found that (typically) the top 10 functions are good candidates for aggressive optimization (/O2 and /O3). Using Intel VTune you can easily find the top 10 functions by clockticks.
I have also tried to download the paper again, but it seems to be 0 bytes at the moment. I am copying the first page here, which contains the basic idea of the approach.
Regards,
Lars Petter
### from Perf_compile1_nofigures.doc by David Levinthal and Vladimir Tsymbal ###
Abstract:
Compilers allow varying levels of optimization, but it is often impractical to apply the most aggressive levels to all files of large commercial applications. Applying high optimization levels to functions that consume little time increases compile time and adds a degree of risk, without even the possibility of a measurable gain in performance. Additionally, some compilers are capable of issuing optimization reports, but the size of such reports for entire projects makes them unusable.
A solution to these dilemmas can be found in applying aggressive compiler optimizations only to those source files that contribute in a significant manner to total elapsed time. To this end, the use of a performance analyzer to guide makefile modifications allows a developer to quickly focus on only those code sections where their efforts will be rewarded.
The article will discuss the coordinated use of performance analysis data to drive optimizing-compiler usage, with the aim of improving performance while minimizing effort, time and risk.
Article:
Compiling large applications (even ones as small as, say, 500 source files) can be problematic. While producing the fastest possible executing binary is the final goal, there are competing requirements from the impact of compilation time, algorithm debugging, numerical stability, and sometimes even compiler stability. Raising the compilation options to their highest and most aggressive levels may sound like a reasonable thing to do, but the reality of these other issues frequently makes that impossible. Further, for applications with very large active binary footprints, the ability to keep a large part of the core binary in cache can be the critical performance factor. Thus the optimal set of compiler options is unlikely to be a simple, single choice.
The complexity is usually due to the application's ability to handle a variety of problems and their data sets, essentially acting like many different applications glued together. A large part of the complexity in the source exists to handle the initialization and problem setup code. These files frequently represent a large fraction of the source and much of the difficulty, but virtually none of the consumed CPU cycles.
In spite of all of the above, most makefiles have only a single set of options for building all of the files in the project. Because of this, the standard approach to optimization is to simply raise the aggressiveness of the compilation options, usually from O2 to O3. A serious debugging effort then ensues to determine which files cannot be optimized, followed by the creation of special make rules for those files.
A simpler and more powerful approach is to run the base build through a performance analyzer, such as the Intel VTune Performance Analyzer, and create a list of the source files that account for ~85-95% of the CPU utilization. This is frequently only around 1% of all the files. If the developer immediately creates individual rules for each of these files, they are in a more flexible and powerful position. Changing optimizations will now only result in a small number of files being recompiled and, further, a full rebuild will be enormously faster as no time is wasted optimizing functions that consume no time.
There are many obvious cases where the benefits of such an approach are immediately apparent. One is the issue of pointer disambiguation in C/C++, where it is assumed as part of the language standard that pointers may alias each other, so the compiler must assume that the data areas pointers reference may overlap in some manner. Let's examine what this means.
### from Perf_compile1_nofigures.doc by David Levinthal and Vladimir Tsymbal ###
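To make the pointer-aliasing point from the quoted article concrete, here is a minimal C sketch (my own illustration, not taken from the article; the function and variable names are invented):

    /* The C99 "restrict" qualifiers promise that a, b and c never overlap, so the
       compiler is free to vectorize the loop. Without them it must assume that
       writing c[i] could modify a[] or b[], and it has to generate more
       conservative code. */
    void scale_add(int n, const double *restrict a,
                   const double *restrict b, double *restrict c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = 2.0 * a[i] + b[i];
    }

In Fortran the compiler can generally assume that dummy array arguments do not overlap, which is one reason this particular problem is mostly a C/C++ issue.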
In a large project with many files it is not an easy task to determine the best combination of compiler options for the various files, in particular if the final program can be used in ways the programmer does not expect. The best thing to do is to optimize the program for the most typical use cases. This involves a little trial and error, as it is not certain that the most aggressive optimization level will improve the performance of the file that is responsible for most of the CPU time.
In a makefile or in Developer Studio it is not so difficult to alter the compiler options for each file individually; please consult the compiler documentation to achieve this.
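As an illustration, with GNU make it can look something like the sketch below (the hot file names are invented placeholders; a real project would list the files identified with VTune):

    # Default rule: compile everything with /O1 to keep the .exe small and the build fast.
    # (GNU make syntax; recipe lines must start with a tab character.)
    FC     = ifort
    FFLAGS = /O1 /QxN /Qprec-div-

    %.obj: %.f90
            $(FC) $(FFLAGS) /c $<

    # Override the options only for the few hot files found with VTune.
    HOT_FFLAGS = /O3 /QxN /Qprec-div-
    HOT_OBJS   = solver.obj flux.obj matrix.obj

    $(HOT_OBJS): %.obj: %.f90
            $(FC) $(HOT_FFLAGS) /c $<

With such a layout, a full rebuild spends the expensive optimization time only on the hot files, and changing the options for them recompiles only a handful of files.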
Lars Petter Endresen
How do you determine which parameters to set, and which parameters to set for the other files (or should they be left at the base build)?
My experience is that it can be wise to leave as many files as possible in the project at the /O1 option, as this clearly makes cache usage more efficient for the files that are compiled with /O2 and /O3: a less frequently called part of the code is better kept small (/O1 - minimize size) so that it does not fill the cache with instructions or data that are not frequently needed.
It is beneficial if the /O2 and /O3 optimized part of the code, which accounts for most of the CPU time, always has as much cache as possible available.
Best Regards,
Lars Petter Endresen
