PGO/IPO confusion, and some other questions.

Marctraider_M_ · ‎12-18-2014

Hi all! I've been using the Intel Compiler for a while now, and the runtime performance increase is pretty satisfactory.

Despite this I still have some questions which need some better explanation because I've been really confused with all the different documentation out there.

PGO/IPO:
How do these two interact? From what I've understood, you can use them together, so how exactly does one do that? The compiler usually emits a warning that IPO multi-file optimizations are disabled while using PGO. Or are you supposed to use only one of them and see which gives the best speedup?

Also, what is the best way to profile an application for best performance? Does one run the application and head for the area that uses the most CPU, or the area that you most want optimized? Or does one exercise all the functions in the program while still spending most of the time in the code you most want optimized, and leave that code running until you decide to end the profiling run?

IPP/TBB/MKL etc:
Do these libraries optimize code automatically, or does the code need to be changed before they are used?

Parallel compilation:
So far I've understood that you can use either /Qparallel or OpenMP, or a combination of the two, to trigger automated parallel compilation. Am I right to conclude that OpenMP requires an external DLL file and /Qparallel alone does not?

Compiler Processor specific instruction optimizations:
The option in Visual Studio that configures the optimized code path: does selecting (for instance) AVX or SSE4.2 also build in code for lower instruction sets like SSE4.1, SSE3, SSE2 by default? Or do you have to specify all of them manually with /QaxSSE4.1,SSSE3,AVX etc.?

According to the manual this would be the case, but I'm not 100% sure. Also, /arch seems to be deprecated and gets overridden by /Qax according to Visual Studio?

Thanks! sorry for all the newbie questions :)

TimP · ‎12-18-2014

With Intel compilers, the instrumented (profile-generation) build used for Profile Guided Optimization disables many optimizations, including IPO. The data collection is optimization neutral: you can rebuild with /Qprof_use and a variety of optimizations, such as IPO, using the same execution profile data. Performance of the instrumented build while collecting prof_gen data is relatively low and not meaningful. This is unlike the corresponding situation with g++, for example, where profile data have to be collected for each combination of compile options. Some of the static profile features that once required PGO have since been built into the usual compiler options such as -O3.
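As a rough illustration of the workflow described above, here is a minimal sketch of a data-dependent function together with a three-step build in the comments. The file name classify.cpp and the function fraction_above are hypothetical, and the commands assume the hyphenated option spelling (/Qprof-gen, /Qprof-use) and the icl driver on Windows; check your compiler version's documentation before relying on them.

```cpp
// Hypothetical file: classify.cpp
// Assumed PGO build sequence (Intel compiler on Windows):
//   1. icl /Qprof-gen classify.cpp            instrumented build, slow
//   2. run the program on representative input (profile data is written)
//   3. icl /O2 /Qipo /Qprof-use classify.cpp  optimized build, IPO enabled
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the fraction of values above the threshold. How often the
// branch is taken depends on the input data, which is the kind of
// information an execution profile records.
double fraction_above(const std::vector<int>& values, int threshold) {
    assert(!values.empty());  // avoid dividing by zero below
    std::size_t hits = 0;
    for (int v : values) {
        if (v > threshold) {  // branch probability unknown at compile time
            ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(values.size());
}
```

The same profile data from step 2 can be reused for several step 3 builds with different optimization options, which is the point made above about the collection being optimization neutral.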

In performance optimization, you usually want to collect data to determine where there is significant potential for improvement. In the interest of time spent profiling, you may well cut down the data set temporarily so that it doesn't spend the full time in all areas of interest. Not all advice about how to determine where to spend your effort is necessarily productive; for example, undue emphasis is sometimes placed on CPI (cycles per instruction).

Once you have engaged performance libraries appropriately in your application, the scope for further optimization there may be limited, e.g. to choosing numbers of threads and affinities, but now there are additional possibilities such as MKL_DIRECT_CALL.

/Qparallel is a means of engaging OpenMP automatically, using the same OpenMP .dll either way.

The idea of asking the compiler to generate multiple architecture paths has been over-rated. There are penalties in code size and switching for adding additional paths. While the compiler attempts to prune the list of separate paths, it doesn't always make the best decision.

The default with /Qax is to support both SSE2 (implicitly) and the explicit architecture. You have the option to change the implicit SSE2 to something newer such as SSE4.1, in case that is the oldest architecture you want to support. In the end, you probably have to test to find out where in your application adding an additional architecture may be useful.

KitturGanesh · ‎12-18-2014

Tim has nicely responded to your question. In general, IPO is a multi-pass optimization that performs a static topological analysis of the application. It enables inter-procedural optimizations within the current file (with just -ip) or across files (with -ipo), such as better register usage, procedure inlining, dead code elimination and constant propagation. That said, it is better to use IPO together with PGO, since the profile can guide function inlining, although this may increase build time and binary size.

The static analysis above cannot answer questions about execution, such as which code is touched and how often. PGO can, because it performs a dynamic analysis of your application. PGO tells the compiler which areas of an application are executed most frequently, so the compiler can be more selective and specific in optimizing it (for example, reducing instruction cache problems and branch mispredictions). So, if your application source tree does not change too often and its run time is dominated by a consistent set of frequently executed paths, you should try PGO.

You can find a lot of knowledge base articles in the Intel Developer Zone regarding optimizing applications using Intel Compiler and the various parallel programming models you can use for exploiting parallelism in your code (Intel Cilk Plus, Intel TBB, MKL, IPP, OpenMP etc) depending on the context of your application. A few useful ones:

https://software.intel.com/en-us/articles/being-successful-with-the-intel-compilers-you-need-to-know/
Technical presentations on various features at: http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations
Vectorization essentials: http://software.intel.com/en-us/articles/vectorization-essentials
Using Cilk Plus: http://software.intel.com/en-us/intro-to-vectorization-using-intel-cilk-plus
http://software.intel.com/en-us/articles/call-site-dependence-for-elemental-functions-simd-enabled-functions-in-c#!

And so on......

Hope the above helps as well.

_Kittur

Marctraider_M_ · ‎12-18-2014

Thanks for the answers, guys. That explains a lot. And yeah, I forgot that profiling an app disables all optimizations until the final compilation (facepalm).

So let me sum it up: /Qparallel is a means of enabling OpenMP, but I recall reading in some doc that you can try either one of them, separately or combined (either Intel's automatic method or OpenMP directives). Also, why would there be an option in Visual Studio to enable/disable the OpenMP language then?

And if I understood correctly regarding the /Qax flag (optimized code path): unless you define each architecture manually, separated by commas as the documentation describes, it will take SSE2 as the base path and your specified architecture as the optimized code path, and nothing in between?

Let's say I specified all possible options here: wouldn't that be better for a program, especially if it is supposed to be released to a wide audience? Or is Tim's assessment correct that the compiler does not always pick the best path for the fastest runtime execution speed, in which case the compiler is unable to decide or calculate which path would be fastest?

Thx :)

TimP · ‎12-19-2014

I think when OpenMP is enabled, auto-parallelization is disabled in the scope of the OpenMP region. As you say, it is feasible to use both in the same source file. There may be possibilities of unintentional nested parallelism across files which I haven't seen discussed.

The Visual Studio option to enable OpenMP is simply the usual hook that lets a correctly written OpenMP application also build and run in serial mode when the option is turned off.
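To make the serial-mode point concrete, here is a minimal sketch of a function that gives the same result whether or not OpenMP is enabled at compile time. The names parallel_sum and available_threads are made up for this example; _OPENMP is the standard macro that OpenMP-enabled compilers define.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>  // only available when OpenMP is enabled
#endif

// Sums a vector. With OpenMP enabled, the loop iterations are shared
// among threads; with it disabled, the pragma is ignored and the loop
// runs serially with the same result.
long long parallel_sum(const std::vector<int>& values) {
    assert(values.size() <= static_cast<std::size_t>(INT_MAX));
    const int n = static_cast<int>(values.size());
    long long total = 0;
#pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += values[static_cast<std::size_t>(i)];
    }
    return total;
}

// Reports how many threads a parallel region could use; 1 in serial mode.
int available_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Calls into the OpenMP runtime such as omp_get_max_threads() are the part that must be guarded; the directive itself needs no guard because a compiler without OpenMP enabled skips it.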

Yes, the default with /Qax is (at most) 2 architecture code paths. Clearly, that is a recommended option.

The option to specify multiple paths, which may have to be tested individually, on multiple platforms, for quality assurance and efficacy, is not put forward as one which facilitates rapid development.

In my experience, when the compiler prunes code paths, it may choose to produce SSE2 only even when the developer wants something more. That of course could be a good choice for code which isn't performance critical, but the developer can accomplish that by specifying a single code path for those source files.

SSE3 is likely to be useful only in applications with complex arithmetic, for platforms which don't support newer Intel instruction sets. SSSE3 probably doesn't get much testing, as there have been no CPUs available in recent years which would use it.

KitturGanesh · ‎12-19-2014

A little more elaboration:

The /Qparallel switch enables auto-parallelization, so that the compiler looks for loops/tasks in the code that can be parallelized. The compiler finds the relevant loops, performs a data flow analysis to verify that parallel execution is correct, and then uses work-sharing concepts similar to those in OpenMP. The compiler also supports OpenMP (including 4.x), which you enable explicitly with the /Qopenmp switch. In that case the user needs to identify the portions of the code that are suitable for parallelism and add the appropriate OpenMP directives, but in return gets finer control than with auto-parallelization (/Qparallel), where the compiler takes control.
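As a small sketch of what that data flow analysis has to distinguish, compare the two loops below. The function names are invented for this example, and whether a given compiler version actually parallelizes the first loop depends on its own analysis and cost thresholds, so treat this as an illustration of the dependence idea only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each iteration reads in[i] and writes out[i] only. No iteration depends
// on another, so this loop is a candidate for auto-parallelization.
void scale(const std::vector<double>& in, std::vector<double>& out,
           double factor) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * factor;
    }
}

// Iteration i needs the accumulator value produced by iteration i - 1.
// This loop-carried dependence means the iterations cannot simply be
// handed to different threads.
void running_total(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        acc += in[i];
        out[i] = acc;
    }
}
```

With explicit OpenMP directives the decision is yours instead: putting a parallel-for directive on the second loop would compile, but it would introduce a race, which is the kind of responsibility the finer control brings with it.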

Yes, the default is SSE2, but you can use /Qax to generate multiple code paths to target processors. For example, "/QaxAVX /QxSSE4.2" produces an AVX code path on top of an SSE4.2 (Nehalem) baseline. Additionally, you can use the manual CPU dispatch routines to target multiple processors as well. The articles

https://software.intel.com/en-us/articles/how-to-compile-for-intel-avx/
https://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/

go over some of those aspects (vectorization, use of the code path switches, etc.).
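For readers unfamiliar with the dispatch idea, here is a hand-written sketch of choosing between a baseline path and an alternative path once at run time. It is not the Intel manual dispatch mechanism and not what /Qax generates: all function names are hypothetical, dot_unrolled is only a stand-in for a path compiled for a newer instruction set, and the feature check assumes the GCC/Clang builtin __builtin_cpu_supports, falling back to the baseline elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline path: plain loop, safe on any CPU.
static float dot_baseline(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Stand-in for an optimized path (here just unrolled by four).
static float dot_unrolled(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
    return sum;
}

// Assumption: GCC/Clang builtin on x86; other compilers take the baseline.
static bool cpu_has_avx() {
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__INTEL_COMPILER) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}

using DotFn = float (*)(const float*, const float*, std::size_t);

// Picks one code path based on the CPU the program is running on.
DotFn select_dot() { return cpu_has_avx() ? dot_unrolled : dot_baseline; }

float dot(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    static const DotFn impl = select_dot();  // decided once, then reused
    return impl(a.data(), b.data(), a.size());
}
```

Every extra path is another function body in the binary plus a selection step, and each path needs its own testing on matching hardware, which is the cost Tim describes for adding architectures.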

_Kittur

Bernard · ‎12-30-2014

>>>Also, what is the best method of trying to profile an application for best performance?>>>

For this purpose you can use the Intel VTune profiler. Of course you can always write your own test cases, but then you will not gain insight into the various CPU performance counters, which are exposed to software through MSRs (model-specific registers).

http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf

https://software.intel.com/sites/default/files/m/6/5/2/c/f/6734-vtune_getting_started_linux.pdf

https://software.intel.com/sites/default/files/article/394181/using-intel-vtune-amplifier-xe-on-4th-generation-intel-core-processors.pdf