Intel® Fortran Compiler

Performance-focused Compiler Usage

Corey_A_Intel
Employee

Abstract:

Compilers allow varying levels of optimization, but it is often impractical to apply the most aggressive levels to all files of a large commercial application. Applying high optimization levels to functions that consume little time increases compile time and adds a degree of risk, without any realistic prospect of a measurable gain in performance. Additionally, some compilers are capable of issuing optimization reports, but the size of such reports for entire projects makes them unusable.

A solution to these dilemmas can be found in applying aggressive compiler optimizations only to those source files that contribute in a significant manner to total elapsed time. To this end, the use of a performance analyzer to guide makefile modifications allows a developer to quickly focus on only those code sections where their efforts will be rewarded.

The article discusses the coordinated use of performance analysis data to drive optimizing-compiler usage, with the aim of improving performance while minimizing effort, time, and risk.

Intel_C_Intel
Employee
Hello,
This is a very useful approach, as it is not at all trivial to determine which source code files should be compiled with /O1, /O2, or /O3 in a large project with many hundreds of files. We have experienced that /O3 may actually reduce the performance for some files while it clearly increases the performance of others.
I enjoyed reading the article and will check the approach in our software with nearly 1000 Fortran files.
Regards,
Lars Petter Endresen
Intel_C_Intel
Employee
Hello,

I have checked this idea, and I have found that the files that do not contribute significantly to the computational time can safely be compiled with the /O1 option (minimize .exe size). This is nice because the /O1 option has the following advantages:

1. Compilation with /O1 is much faster than /O2 and /O3.
2. Compilation with /O1 typically gives a 20% smaller .exe.
3. Sometimes /O2 and /O3 impair performance.

Best Regards,

Lars Petter Endresen
TimP
Honored Contributor III
This idea seems much less of an innovation in the Unix tradition, where more basic tools are available to carry it out. Run gprof and verify correctness on all important workloads with all candidate optimization levels. If you are lucky, you need to build the application only once with each candidate optimization to obtain all the performance data by subroutine. Set up a Makefile specifying which subroutines get each combination of options. For a large application, when correctness deficiencies are exposed for some of the options, this may take a while.
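
For anyone unfamiliar with that route, a minimal sketch of the first profiling pass might look like the following (GNU tooling assumed for illustration; the program and file names are only placeholders):

# build the whole application once with profiling instrumentation
# (module compilation order is ignored here for brevity)
gfortran -O2 -pg -o myapp *.f90
# run a representative workload; this writes gmon.out
./myapp typical_input.dat
# flat profile: which subroutines account for most of the run time
gprof ./myapp gmon.out | head -40

The flat profile is usually enough to pick out the handful of subroutines worth compiling at higher optimization levels.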
If your application is not too cumbersome to run the important sections under Vtune, you could obtain the required performance data, along with more opportunities to diagnose causes of performance variations.
Ifort, since version 8.0, doesn't have a space-saving -O1 optimization level for the SSE architectures. The greatest space and compile-time saving is found by removing vectorization options (-QxW et al.), which often comes at a greater cost in performance than with other compilers.
Intel_C_Intel
Employee
Hello

You state "Ifort, since version 8.0, doesn't have a space saving -O1 optimization level for the SSE architectures." while the compiler documentation states that "O1 Enables optimizations for speed and disables some optimizations that increase code size and affect speed".

Our empirical observation is that the .exe size is reduced by around 20% when 11 of the 599 files are compiled with /O3 while the rest are compiled with /O1. With all 599 files compiled with /O2, the application is slightly slower, the compilation time is doubled, and the size of the .exe is increased from 8488 bytes to 10184 bytes.

Lars Petter
TimP
Honored Contributor III
I see that the -O1 option does sometimes reduce the unrolling of vectorized loops, and occasionally suppresses multiple vectorized versions for alignment. The reduced unrolling often performs better for short loops. Aligned versions usually have a significant performance advantage, if the conditions for executing them are met.

Reduced code size certainly is an advantage for code that is seldom executed. I don't see nearly as much code size reduction as you did, but it could happen if every loop were to come out with fewer versions.
Thanks for pointing out the difference in these options.
Intel_C_Intel
Employee
Hello,

This is surely a question of the "science fiction" type...

:-)

As processor cache sizes are increasing at a steady rate while processor clock speeds have come to a (temporary?) halt these days, maybe executables that fit entirely into cache could improve performance a great deal? (I have heard about processors with 24 MB cache just around the corner.) So let us assume that the next generation of Intel Pentium Turbo processors released in a few years has about 8 MB of cache; would it then be a good idea to reduce the .exe code size so that the .exe always stays in cache?

:-)

Regards

Lars Petter
Steven_L_Intel1
Employee
It has always been a good idea to provide "code locality" - modules of the program that tend to execute together should be near each other in memory. The Intel compilers, when used with Profile-Guided Optimization, try to do this, but you can help too by putting some thought into how your program is structured.

Sure, if you can make your entire program fit in cache AND you're the only thing running on the system, that speeds things up a lot. But with prefetching, you can do fine if you use cache wisely - it isn't necessary to restrict the code to what fits in cache.
Intel_C_Intel
Employee
Hello,

Yes, PGO is a very good idea for the files that take a significant part of the CPU time. This is my experience for source code files that are floating-point intensive (division, math functions) and include many branches.

Regarding compilation time a reduction from 20 minutes to 10 minutes (for a full rebuild) really makes a difference for the software developers in our company.

Lars Petter
jim_dempsey
Beginner
I was unable to download your .doc file (appeared as 0 bytes).
Will there be additions to compiler directives regarding performance hints? Example:
!DEC$ SELDOM
IF(expression) call foo(arg)
!DEC$ SELDOM
IF(expression) then
...
ENDIF
where the compiler can then place the statements out of line (i.e., via implicit GOTOs).
Equivalent to:
IF(expression) GOTO Seldom0001
SeldomDone0001:
IF(expression) GOTO Seldom0002
SeldomDone0002:
...
RETURN
Seldom0001:
call foo(arg)
goto SeldomDone0001
Seldom0002:
...
goto SeldomDone0002
Jim Dempsey
Steven_L_Intel1
Employee
We're having trouble with attachments in the forum. Please be patient while we work on this. I'll ask Corey to repost the article.
Intel_C_Intel
Employee

Hello, and thanks for an interesting topic.


lpe@scandpower.no wrote:

Regarding compilation time a reduction from 20 minutes to 10 minutes (for a full rebuild) really makes a difference for the software developers in our company.

It depends on what wins for you in the battle of compilation time vs. execution time, doesn't it? I clearly understand that for developers who have to compile big projects many times, compilation time is more important, but what about execution? Sometimes (and often) it is more expensive. Imagine that execution takes a day or a week. A good, and perhaps quite long (!), optimization will save you hours or even days.

What I really understand is that it all depends on your own problem domain and product area. As for me, I use /Qipo and /O3.

Stanislav
Intel_C_Intel
Employee
Hello,

Sorry for the confusion. I was not saying that compilation time is more important than execution time; execution time is indeed more important, in particular for our customers. What I was trying to say is that the proper combination of /O1, /O2, and /O3 actually reduces both compilation time and execution time, so you can have it both ways: have your cake and eat it too!

Compilation time was reduced from 20 minutes to 10 minutes, and (as I stated in message 6 in this discussion) standard compilation with /O2 made the application slightly slower (by a few percent) than the performance-focused compiler usage approach. Indeed there are many combinations of options that may be particularly useful for some files, like /O1, /O2, /O3, /Qprec-div-, /Qip, /Qipo, and PGO, but they only serve to increase the compilation time for other files.

Performance-focused compiler usage is indeed interesting, as it can simultaneously speed up both compilation and execution.

Regards,

Lars Petter Endresen


Intel_C_Intel
Employee
Hello,

I have just checked that compiling our software (599 Fortran files) with maximum optimization enabled, /O3 /QxN /Qprec-div- /Qipo /Qprof_use (the latter option is PGO), decreases the performance of the software by around 20% relative to the standard "/O2 /QxN /Qprec-div-" options. However, the technique with Intel VTune and performance-focused compiler usage actually speeds up the program by around 2.5% relative to compilation with "/O2 /QxN /Qprec-div-". Another issue is that compilation with "/O3 /QxN /Qprec-div- /Qipo /Qprof_use" actually took around one hour...

Lars Petter
Intel_C_Intel
Employee

Hello,

Thank you for the explanation, Lars Petter. Is it the same for all your files: is "/O2 /QxN /Qprec-div-" faster than "/O3 /QxN /Qprec-div- /Qipo /Qprof_use"? 20% is very serious. I didn't know that aggressive optimization could hurt so much, even with PGO. But how do you determine which optimization parameters to use for which files?

Unfortunately I can't download the attached article. I will be patient and try again later.

Stanislav

Intel_C_Intel
Employee
>>> But how do you determine which optimization parameters to use for which files?

Hello,

I have found that (typically) the top 10 functions are good candidates for aggressive optimization (/O2 and /O3). Using Intel VTune, you can easily find the top 10 functions by clockticks.

I have also tried to download the paper again, but it seems to be 0 bytes at the moment. I am copying the first page here, which contains the basic idea of the approach.

Regards,

Lars Petter

### from Perf_compile1_nofigures.doc by David Levinthal and Vladimir Tsymbal ###


Abstract:

Compilers allow varying levels of optimization, but it is often impractical to apply the most aggressive levels to all files of a large commercial application. Applying high optimization levels to functions that consume little time increases compile time and adds a degree of risk, without any realistic prospect of a measurable gain in performance. Additionally, some compilers are capable of issuing optimization reports, but the size of such reports for entire projects makes them unusable.

A solution to these dilemmas can be found in applying aggressive compiler optimizations only to those source files that contribute in a significant manner to total elapsed time. To this end, the use of a performance analyzer to guide makefile modifications allows a developer to quickly focus on only those code sections where their efforts will be rewarded.

The article discusses the coordinated use of performance analysis data to drive optimizing-compiler usage, with the aim of improving performance while minimizing effort, time, and risk.

Article:

Compiling large applications (even as small as, say, 500 source files) can be problematic. While producing the fastest possible executable is the ultimate goal, there are competing requirements from the impact of compilation time, algorithm debugging, numerical stability, and sometimes even compiler stability. Raising the compilation options to their highest and most aggressive levels may sound like a reasonable thing to do, but the reality of these other issues frequently makes that impossible. Further, for applications with very large active binary footprints, the ability to keep a large part of the core binary in cache can be the critical performance factor. Thus the optimal set of compiler options is unlikely to be a simple, single choice.

The complexity is usually due to the application's ability to handle a variety of problems and their data sets, essentially acting like many different applications glued together. A large part of the complexity in the source exists to handle the initialization and problem setup code. These files frequently represent a large fraction of the source, and much of the difficulty, but virtually none of the consumed CPU cycles.

In spite of all of the above, most makefiles have only a single set of options for building all of the files in the project. Because of this, the standard approach to optimization is to simply raise the aggressiveness of the compilation options, usually from O2 to O3. A serious debugging effort then ensues to determine which files cannot be optimized, followed by the creation of special make rules for those files.

A simpler and more powerful approach is to simply run the base build through a performance analyzer, such as the Intel VTune Performance Analyzer, and create a list of the source files that account for ~85-95% of the CPU utilization. This is frequently only around 1% of all the files. If the developer immediately creates individual rules for each of these files, they are in a more flexible and powerful position. Changing optimizations will now result in only a small number of files being recompiled and, further, a full rebuild will be enormously faster, as no time is wasted optimizing functions that consume no time.

There are many obvious cases where the benefits of such an approach are immediately apparent. One is the issue of pointer disambiguation in C/C++, where it is assumed as part of the language standard that pointers may alias each other, and the compiler must assume that the data areas that pointers reference will overlap in some manner. Let's examine what this means.

### from Perf_compile1_nofigures.doc by David Levinthal and Vladimir Tsymbal ###
Intel_C_Intel
Employee
Hello
So we look for the source files which account for ~85-95% of the CPU utilization and aggressively optimize them. But who says that these files (even 1% of them) are not in a risk group? Or perhaps I did not understand what "individual rules for each of these files" means. Probably, because there are not so many of these files, we can research in detail which optimization parameters should be set for them. How do you determine which parameters to set, and what parameters to set for the other files (or should we leave them in the base build)? And what about DLLs?
Intel_C_Intel
Employee
Hello.

In a large project with many files it is not an easy task to determine the best combination of compiler options for the various files, in particular if the final program can be used in a manner that is not expected by the programmer. The best thing to do is to optimize the program for the most typical use cases. This involves a little trial and error, as it is not certain that the most aggressive optimization level will improve the performance of the file that is responsible for most of the CPU time.

In a makefile or in Developer Studio it is not so difficult to alter the compiler options for each file individually; please see the compiler documentation for how to achieve this.
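
As an illustration only (the file names and the exact mix of options are just placeholders, not taken from our actual project), a makefile fragment that keeps /O1 as the project default and singles out the hot files found by VTune could look roughly like this:

FC     = ifort
# default options for all the "cold" files: small .exe, fast compilation
FFLAGS = /O1 /QxN

# hot-spot files identified by VTune get their own rules with aggressive options
solver.obj: solver.f90
        $(FC) /O3 /QxN /Qprec-div- /c solver.f90
assembly.obj: assembly.f90
        $(FC) /O2 /QxN /Qprec-div- /c assembly.f90

# everything else falls through to the default inference rule
.SUFFIXES: .f90 .obj
.f90.obj:
        $(FC) $(FFLAGS) /c $<

With this layout, experimenting with the options for a hot file means recompiling only that file, and a full rebuild no longer spends time aggressively optimizing code that hardly runs.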

Lars Petter Endresen
Intel_C_Intel
Employee
Hello,

How do you determine which parameters to set, and what parameters to set for the other files (or should we leave them in the base build)?

My experience is that it can be wise to leave as many files as possible in the project at the /O1 option, as this clearly makes cache usage more efficient for the files that are compiled with /O2 and /O3: it is better to keep the less frequently called parts of the code small (/O1, minimize size) in order not to fill the cache with instructions or data that are rarely needed.
It is beneficial if the /O2- and /O3-optimized parts of the code, which represent most of the CPU time, always have as much cache available as possible.

Best Regards,

Lars Petter Endresen
Intel_C_Intel
Employee
Thanks for the explanation, Lars. Now it's time to try all of this out.
Intel_C_Intel
Employee
Hello,
Recently I have had some time to experiment more with "Performance-focused Compiler Usage", and it now seems that also applying Profile-Guided Optimization to the hot-spot files (the files compiled with /O2 or /O3) may be beneficial. Typically this can be the last step in the optimization process - adding PGO too early may make it difficult to interpret the results. Thus a project may have the following options for the various files, based on VTune profiling:
1. /O1 /QxN
2. /O2 /QxN /Qip /Qprof_use (PGO)
3. /O3 /QxN /Qip /Qprof_use (PGO)
Sometimes /Qip may be replaced with /Qipo, but in our case /Qip was better (/Qipo increased the .exe size by 10%). To summarize, both:
1. Performance-focused Compiler Usage, and
2. Profile-Guided Optimization
may serve to guide the compiler to optimize hot spots in the code in a proper way, simultaneously increasing compilation and execution speed.
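
For completeness, the PGO pass for the hot files follows the usual instrument, run, recompile cycle; a rough sketch (solver.f90 is only a placeholder for one of the hot files, and the option spellings may differ slightly between compiler versions):

rem 1. compile the hot files with profiling instrumentation, then link as usual
ifort /O3 /QxN /Qprof_gen /c solver.f90
rem 2. run one or two representative cases; this writes the .dyn profile data
rem 3. recompile the same files using the collected profile, then relink
ifort /O3 /QxN /Qip /Qprof_use /c solver.f90
rem the cold files stay at /O1 throughout and never need the extra pass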
Lars Petter Endresen

