Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Full optimization of binaries vs. -ax options

drMikeT
New Contributor I
1,160 Views

Hello,

I am using the multi-target Intel compiler options "-xBase -axArch1,Arch2, …". I have heard that with this multi-target option we cannot get the full breadth and depth of optimizations that are possible when a single specific target is specified. Does this mean that if, for instance, I use other compiler optimization flags alongside the -ax ones, the compiler will not actually carry them out, even if they are applicable to all or some of the specified targets?

Can you please provide a detailed answer or a pointer to Intel documents that explain this in depth?

Thank you!

Michael

0 Kudos
13 Replies
TimP
Honored Contributor III
1,160 Views

I haven't spent much time on similar questions lately.

I have seen cases in the past where the compiler decided to use only the base architecture when several were requested, even though significant advantages could be seen in a specific single architecture build.

Due to the possibility of excessive code expansion, I think the compiler is intended to make a rudimentary evaluation of whether each version is of value and to prune the requested selection.  I wouldn't count, for example, on the compiler recognizing cases where SSE3 is the best choice.

Performance problems caused by excessive code size, such as I-cache misses, frequently outweigh the potential advantage of multiple paths.

Multiple paths also present QA issues. A customer may choose a platform where the executed path differs from any you tested.

I'm still not certain whether to believe all of what the docs say. They seem to say that, for example,

-march=corei7 -axCORE-AVX2

would not generate an AVX2 path.

Also it seems annoying that some non-Intel compilers want AVX2 spelled simply that way, while Intel wants CORE-AVX2.

0 Kudos
drMikeT
New Contributor I
1,160 Views

Hi Tim,

I think that even though the binary contains code for all targets, the dynamic code path (and hence the demand on L1-I) should only traverse code for the current architecture. What I need to understand is whether the optimizers take a serious shortcut and ignore other optimization options when we request multi-target binaries. I think the "-march" option applies to Intel-compatible (not genuine Intel) processors. In that case I am not expecting the compiler to make any tremendous effort to provide top optimizations.

We really need to hear from the Compiler team and focus, of course, on the latest Intel compilers (15.1 and 16.x).

 

thanks

mike 

0 Kudos
Thomas_W_Intel
Employee
1,160 Views

I'm moving this thread to the forum for the Intel compiler. You should get more answers there.

0 Kudos
Kittur_G_Intel
Employee
1,160 Views

Hi Mike,
The -ax<code> switches (e.g. -axavx) generate AVX instructions that are used on systems supporting them, while other Intel or compatible non-Intel systems end up using only SSE instructions. The relevant optimizations are applied for each target that supports them. The article at https://software.intel.com/en-us/articles/how-to-compile-for-intel-avx is not current: it doesn't go beyond Sandy Bridge and hence doesn't mention AVX2. A related topic that might come in handy is manual dispatching; the article at https://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors-with-support-for-intel-avx should give more info. There have also been many enhancements to the optimization reports, and https://software.intel.com/en-us/articles/getting-the-most-out-of-your-intel-compiler-with-the-new-optimization-reports should give you additional info on using them. BTW, I'll ping my peer who authored the AVX article referenced above and ask him to add any further input on your question.

_Kittur

0 Kudos
Martyn_C_Intel
Employee
1,160 Views

Hi Mike,

One more article that bears on this is https://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/ , section 3.

Using the -ax options does not in general prevent the compiler from performing other, unrelated optimizations. Occasionally, there could be an effect; for example, the creation of an additional dispatch layer might have some impact on inlining. Dispatch works at the function level - the compiler only creates an additional, processor-specific version of a function if it thinks that is likely to result in a performance benefit. But when it does so, it generates and optimizes the entire function for that processor target, with no special restrictions.
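For readers who want to picture function-level dispatch, here is a minimal hand-written sketch of the idea. It is not the Intel compiler's actual dispatch mechanism. It assumes a GCC- or Clang-compatible compiler on x86 and uses their `target` attribute and `__builtin_cpu_supports` builtin; the names `sum_baseline`, `sum_avx2`, `select_sum` and `dispatched_sum` are hypothetical. On other compilers or architectures it simply falls back to the baseline version.

```cpp
#include <cassert>
#include <cstddef>

// Baseline version: compiled for the default target of the build.
static double sum_baseline(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Assumption: GCC/Clang-style per-function targets are available on x86.
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HYPOTHETICAL_X86_DISPATCH 1
// Processor-specific version: the whole function body is compiled and
// optimized with AVX2 enabled, independently of the baseline version.
__attribute__((target("avx2")))
static double sum_avx2(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}
#endif

using SumFn = double (*)(const double*, std::size_t);

// Pick the most specific version the running CPU supports.
static SumFn select_sum() {
#ifdef HYPOTHETICAL_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}

// Dispatch stub: the CPU check runs once, later calls cost one indirect call.
double dispatched_sum(const double* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    static const SumFn fn = select_sum();
    return fn(x, n);
}
```

Callers only ever see `dispatched_sum`; the extra indirect call is the kind of per-function overhead that adds up when many tiny functions carry multiple code paths.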

Building an application with multiple code paths does carry some overhead, as Tim points out. The binary is larger and there is the dispatch code that branches to the appropriate function version. I've seen applications built with a dual code path that had 1-2% overhead compared to a single path build for the same processor, but of course that is application dependent. Applications with many tiny functions with multiple code paths will tend to have more overhead.

I advise against generating large numbers of code paths. This increases both overhead and complexity, and the potential benefit of additional paths for slightly different instruction sets is small. It may make sense to generate one path for SSE, one for AVX and possibly one for AVX2 (much of the time, the compiler won't need to generate all 3 versions). It is very unlikely to be worthwhile targeting SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 etc. independently.

The -m options do provide good optimization for non-Intel processors as well as for Intel processors. However, they can be used only for the baseline (default) code path, not for the additional paths generated by the -ax switch.

Finally, if you should come across an example where a significant optimization is performed with the -x switch, but not with the corresponding -ax switch, we'd like to hear about it, and investigate if you can provide a test case. The optimization report is a good indication of which major optimizations were performed and it also notes which functions had multiple code paths generated. It's normal that the compiler chooses not to generate multiple code paths for some functions.

Martyn Corden

Intel Developer Support

0 Kudos
Kittur_G_Intel
Employee
1,160 Views

Thanks Martyn, for the additional elaboration which makes sense especially on overhead involved with multiple code paths and the ROI.

@Mike, as Martyn suggested, if you come across a reproducer in which an optimization is performed with -x but not with -ax, please let us know and attach it to this thread so we can investigate further. Much appreciated.

_Kittur

0 Kudos
drMikeT
New Contributor I
1,160 Views


Kittur, thanks for the references to these articles.

Michael

0 Kudos
drMikeT
New Contributor I
1,160 Views

Hi Martyn

good to hear from you again.

I usually build binaries with -xSSE4.2 -axCORE-AVX-I,CORE-AVX2 -O[23] in order to have code paths for Westmere, Sandy/Ivy Bridge and Haswell/Broadwell.

I understand the slight additional overhead to select a function with the target processor ISA.

My question on the Intel compilers' behavior vs these options was prompted by a statement that when there are multiple function versions, the compiler cannot do a good job optimizing the binary. 

I have not myself noticed any significant performance degradation with multiple versions for functions.

Thanks for the pithy answer and pointers. I will provide further feedback if I notice something unexpected.

 

Take care

Michael
0 Kudos
TimP
Honored Contributor III
1,160 Views

In the usual case, the code generated for Ivy Bridge would be the same as for Sandy Bridge, but I doubt that a Sandy Bridge would be allowed to take that branch. I think there may be cases where the SSE path is faster.

In my experience, SSE4.1 code was usually faster on Westmere, although the differences may have been reduced in recent compilers.

0 Kudos
Martyn_C_Intel
Employee
1,160 Views

To focus on the most important part of Tim's comment:

There is very little difference in the code generated by the compiler for -xcore-avx-i and for -xavx (or the corresponding paths under -ax...). Unless your application involves conversions to and from float16, I doubt that you would see any difference at all. However, code compiled with -xcore-avx-i will not be allowed to run on a Sandy Bridge; and if you compile with -xsse4.2 -axcore-avx-i,core-avx2 and run on Sandy Bridge, the application will take the SSE4.2 code path and not make use of AVX, which might lead to a significant loss of performance. I recommend that you use -xsse4.2 -axcore-avx2,avx instead. This should work just as well on Ivy Bridge, and much better on Sandy Bridge.
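The path selection described above can be sketched with a toy model. This is only an illustration of the reasoning, not the compiler's real dispatcher. The feature bits are a deliberately simplified assumption (CORE-AVX-I is modeled as AVX plus F16C for the float16 conversions, and Sandy Bridge as having AVX but not F16C), and the names `CodePath`, `pick_path`, `original_paths` and `recommended_paths` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Simplified feature bits (an assumption; real targets imply more features).
constexpr std::uint32_t SSE42 = 1u << 0;
constexpr std::uint32_t AVX   = 1u << 1;
constexpr std::uint32_t F16C  = 1u << 2;  // float16 conversions, new with Ivy Bridge
constexpr std::uint32_t AVX2  = 1u << 3;

// Simplified processor models.
constexpr std::uint32_t SANDY_BRIDGE = SSE42 | AVX;
constexpr std::uint32_t IVY_BRIDGE   = SSE42 | AVX | F16C;
constexpr std::uint32_t HASWELL      = SSE42 | AVX | F16C | AVX2;

struct CodePath {
    std::string name;        // target name as given to -x / -ax
    std::uint32_t required;  // features the CPU must have to take this path
};

// Paths are ordered from the baseline (-x) to the most advanced (-ax) target.
// Returns the most advanced path whose requirements the CPU meets; the
// baseline is assumed to be runnable on the CPU in this model.
std::string pick_path(const std::vector<CodePath>& paths, std::uint32_t cpu) {
    assert(!paths.empty());
    std::string chosen = paths.front().name;
    for (const CodePath& p : paths) {
        if ((cpu & p.required) == p.required) chosen = p.name;
    }
    return chosen;
}

// Models -xsse4.2 -axcore-avx-i,core-avx2
std::vector<CodePath> original_paths() {
    return {{"SSE4.2", SSE42},
            {"CORE-AVX-I", SSE42 | AVX | F16C},
            {"CORE-AVX2", SSE42 | AVX | F16C | AVX2}};
}

// Models -xsse4.2 -axcore-avx2,avx
std::vector<CodePath> recommended_paths() {
    return {{"SSE4.2", SSE42},
            {"AVX", SSE42 | AVX},
            {"CORE-AVX2", SSE42 | AVX | F16C | AVX2}};
}
```

In this model, `pick_path(original_paths(), SANDY_BRIDGE)` falls back to the SSE4.2 path, while `pick_path(recommended_paths(), SANDY_BRIDGE)` selects the AVX path, which is the difference the recommendation is about.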

 

Martyn

0 Kudos
drMikeT
New Contributor I
1,160 Views

Martyn, I meant to type "-xsse4.2 -axavx,core-avx2". Indeed, I do not care much about Ivy Bridge.

 

regards

Mike

0 Kudos
Kittur_G_Intel
Employee
1,160 Views

Mike, that makes sense. Please update us on what you find in your runs with those options, thanks.
_Kittur

0 Kudos
drMikeT
New Contributor I
1,160 Views

I will report back if and as soon as I run into anything out of the ordinary on performance.

Thanks, all, for your responses!

Mike

0 Kudos