Optimal Compiler Flags

Karthik_K_ · ‎04-16-2017

I see -O2 says "Optimize for maximum speed". In addition, there are also different arch flags for different instruction sets. Would -O2 be the master flag that builds the most optimal code across all platforms? Essentially, I am looking to generate the most optimal code across different architectures (code size is not an issue).

TimP · ‎04-17-2017

If you don't set an arch option, you get SSE2 code which will run on all CPUs like Pentium 4 or newer, which includes all 64-bit CPUs.

If you target Windows, a 32-bit build will run on both 32- and 64-bit Windows OS installations. On Linux, it's not that simple.

It's not easy to guess your intention well enough to give a simple answer, although your comment about code size helps. For example, assuming you will never encounter a CPU older than 15 years, you could make a dual-path build which covers everything reasonably well by setting /QaxAVX /arch:SSE3 (or the Linux or Mac equivalent). Choosing SSE3 rather than SSE2 as the minimum arch matters mainly if complex arithmetic is important.
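One way to confirm which baseline ISA a given set of options actually selected is to inspect the compiler's predefined macros. The sketch below assumes the GCC-style macro names (`__SSE2__`, `__SSE3__`, `__AVX__`, `__AVX2__`) that GCC, clang, and icc on Linux should define; `baseline_isa` is a hypothetical helper name. As far as I know it reflects only the baseline path, not any alternate path added by an -ax style option.

```c
/* Returns the highest ISA the compiler was allowed to assume for the
 * baseline code path, judged purely from predefined macros.
 * Assumption: GCC-style macro names; _M_X64 covers 64-bit MSVC-style builds. */
const char *baseline_isa(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__) || defined(_M_X64)
    return "SSE2";
#else
    return "unknown";
#endif
}
```

Printing the returned string from a test program built with the candidate options shows whether the baseline is what you expected.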

SergeyKostrov · ‎04-17-2017

>>...I see -O2 says "Optimize for maximum speed". In addition, there are also different arch flags for different instruction sets.
>>Would O2 be the master flags that builds the most optimal code across all the platforms?

It is better to use a carefully tuned set of compiler options, with the same C/C++ compiler, for each platform, ISA, etc. You could try IA32-based code generation, and you will see a significant performance penalty when that code is executed on the latest generations of Intel Core or Xeon CPUs.

TimP · ‎04-17-2017

The IA32 architecture option is optimal only for Pentium2 and, if not using float data type, for Pentium III. You should be able to get that full range of compatibility with dual path (for 32-bit target only) with options like -QaxAVX -arch:IA32 but I doubt such a combination is popular enough to be well tested.

Karthik_K_ · ‎04-17-2017

I'll provide some details. What I am looking for is to build different code paths optimized for different architectures. If SSE2 is the highest ISA (found at run time on the machine), run the SSE2 code path. If AVX is the highest, run the AVX code path, and so on. I thought the compiler did that already by detecting the architecture where the code is executed and choosing the right path. Also, 64-bit is what I am looking for. Who runs 32-bit code anyways? :)

If this cannot be done automatically, can you please point me to any cheat sheet of the most optimal flags for each architecture, starting with SSE2?
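For reference, finding the highest ISA at run time can be sketched with the `__builtin_cpu_supports()` built-in that GCC-compatible compilers provide on x86. This is only an illustration: `highest_runtime_isa` is a hypothetical name, the list of levels checked is an arbitrary choice, and other compilers or architectures fall through to "unknown".

```c
/* Returns the name of the highest ISA level (among those checked) that the
 * CPU running this code reports. Assumption: a GCC-compatible compiler on
 * x86; anything else reports "unknown". */
const char *highest_runtime_isa(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                 /* set up the CPU feature data */
    if (__builtin_cpu_supports("avx2"))   return "AVX2";
    if (__builtin_cpu_supports("avx"))    return "AVX";
    if (__builtin_cpu_supports("sse4.2")) return "SSE4.2";
    if (__builtin_cpu_supports("sse2"))   return "SSE2";
    return "pre-SSE2";
#else
    return "unknown";
#endif
}
```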

nemequ · ‎04-17-2017

GCC ≥ 6 supports a target_clones attribute, which sounds like what you're after. I'm not sure whether ICC supports it, but since icc 17.0.2 masquerades as GCC 6.3 (by setting __GNUC__ and __GNUC_MINOR__), if it doesn't work I'd consider that to be a bug in ICC.
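A minimal sketch of what target_clones looks like, assuming GCC ≥ 6 on x86-64 Linux with glibc (the attribute relies on ifunc support). The `CLONE_FOR_ISAS` macro, the `saxpy` function, and the choice of targets are illustrative only; on any other toolchain the macro expands to nothing and a single ordinary function is built.

```c
#include <stdlib.h>   /* size_t; also defines __GLIBC__ on glibc systems */

#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER) && \
    __GNUC__ >= 6 && defined(__x86_64__) && defined(__linux__) && \
    defined(__GLIBC__)
#  define CLONE_FOR_ISAS __attribute__((target_clones("avx2", "avx", "default")))
#else
#  define CLONE_FOR_ISAS   /* attribute dropped: one plain version is built */
#endif

/* One source definition. Where the attribute is active, the compiler emits
 * one clone per listed target plus a resolver that picks a clone at load
 * time based on the CPU it finds. */
CLONE_FOR_ISAS
void saxpy(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Callers just call `saxpy()`; no hand-written dispatch code is involved.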

What people normally do, AFAIK, is compile the same file multiple times, using the preprocessor to generate different symbol names, then throw together a quick resolver to do the runtime detection. This has the advantage that any compiler (even MSVC) can handle it with a bit of effort… Unfortunately, the code to check CPU features isn't really standard; modern GCC (and clang, icc, etc.) have the __builtin_cpu_supports() built-in, and IIRC MSVC expects you to use the __cpuid() intrinsic. If you want something more portable I have some code you can steal.
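The "compile the same file multiple times" part might look roughly like this. It is a sketch under assumptions: `KERNEL_SUFFIX`, `KERNEL_NAME`, and `sum_floats` are made-up names, and the build is assumed to pass something like `-DKERNEL_SUFFIX=_avx2 -mavx2` for one object file and `-DKERNEL_SUFFIX=_sse2` for another.

```c
#include <stddef.h>

/* Build this file once per ISA, each time with a different KERNEL_SUFFIX and
 * the matching architecture flag; every build exports its own symbol name. */
#ifndef KERNEL_SUFFIX
#  define KERNEL_SUFFIX _generic   /* hypothetical default for a plain build */
#endif

#define PASTE2(a, b) a##b
#define PASTE(a, b)  PASTE2(a, b)  /* extra level so KERNEL_SUFFIX expands */
#define KERNEL_NAME(base) PASTE(base, KERNEL_SUFFIX)

/* Becomes sum_floats_generic, sum_floats_avx2, ... depending on the build. */
float KERNEL_NAME(sum_floats)(const float *x, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}
```

The resolver then only has to pick between the differently named symbols after checking the CPU features.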

One thing you should be careful of is putting the resolver at too low of a level; it's generally a good idea to put the resolver on a fairly high-level function which isn't called as often. That gives the compiler a bit more freedom for optimization, and of course you cut out a lot of the overhead associated with the resolver.

Anoop_M_Intel · ‎04-17-2017

Hi Karthik,

Looks like you are looking for the auto CPU dispatch feature in the Intel Compiler, as documented at https://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations. The best approach is to identify the top hotspot functions in your application and create multiple versions of those functions targeting different architectures, as demonstrated at https://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors-with-support-for-intel-avx. This ensures that you are not targeting all of the application code at different architectures, and thus reduces the binary size.
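As a rough illustration of the manual approach, here is how a hotspot function might be versioned with the Intel compiler's cpu_dispatch / cpu_specific declarations. Treat it as a sketch: the `CPU_DISPATCH` and `CPU_SPECIFIC` wrapper macros and the `scale_hotspot` function are hypothetical, the cpuid names `generic` and `core_2nd_gen_avx` are the ones I believe the classic Intel compiler accepts, and the Intel branch has not been verified here. With any other compiler the code falls back to a single plain function.

```c
#if defined(__INTEL_COMPILER) && defined(_WIN32)
#  define CPU_DISPATCH(...) __declspec(cpu_dispatch(__VA_ARGS__))
#  define CPU_SPECIFIC(id)  __declspec(cpu_specific(id))
#elif defined(__INTEL_COMPILER)
#  define CPU_DISPATCH(...) __attribute__((cpu_dispatch(__VA_ARGS__)))
#  define CPU_SPECIFIC(id)  __attribute__((cpu_specific(id)))
#endif

#ifdef CPU_DISPATCH
/* Empty stub: the compiler inserts the run-time dispatch code here. */
CPU_DISPATCH(generic, core_2nd_gen_avx)
void scale_hotspot(float *x, int n, float k) {}

/* Version used on processors that match nothing more specific. */
CPU_SPECIFIC(generic)
void scale_hotspot(float *x, int n, float k)
{
    for (int i = 0; i < n; ++i)
        x[i] *= k;
}

/* Version targeted at AVX-capable processors. */
CPU_SPECIFIC(core_2nd_gen_avx)
void scale_hotspot(float *x, int n, float k)
{
    for (int i = 0; i < n; ++i)
        x[i] *= k;
}
#else
/* Fallback for other compilers: one ordinary version. */
void scale_hotspot(float *x, int n, float k)
{
    for (int i = 0; i < n; ++i)
        x[i] *= k;
}
#endif
```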

nemequ · ‎04-17-2017

I just checked, ICC doesn't support target_clones :(. Anoop, any hope of adding support for it? Also, maybe it would be better to add support for clang's __has_attribute (and __has_builtin, __has_warning, __has_feature, etc.) and stop trying to match GCC version numbers, like clang has done… I seem to run into situations where ICC doesn't support features from the GCC version it advertises pretty often, I doubt I'm the only one.

>>Looks like you are looking for the auto cpu dispatch feature in Intel Compiler as documented at https://software.intel.com/en-us/articles/performance-tools-for-software....

That looks closer than GCC's ifunc stuff, since it handles the CPU detection and dispatching stuff automatically, but if you don't care about portability the dispatch function is trivial (just a block of conditionals based on __builtin_cpu_supports). Unfortunately it's still nowhere near what target_clones does.
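To make "a block of conditionals" concrete, here is a minimal sketch of such a dispatch function for GCC-compatible compilers on x86-64. All names (`sum_avx2`, `sum_base`, `resolve_sum`, `sum_floats`, the `TARGET` macro) are made up for illustration, and the two variants are simply hand-copied loops built for different targets.

```c
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#  define X86_DISPATCH 1
#  define TARGET(isa) __attribute__((target(isa)))
#else
#  define X86_DISPATCH 0
#  define TARGET(isa)              /* no per-function targets elsewhere */
#endif

/* Two copies of the same loop; only the target ISA differs. */
TARGET("avx2") float sum_avx2(const float *x, size_t n)
{
    float t = 0.0f;
    for (size_t i = 0; i < n; ++i) t += x[i];
    return t;
}

float sum_base(const float *x, size_t n)
{
    float t = 0.0f;
    for (size_t i = 0; i < n; ++i) t += x[i];
    return t;
}

typedef float (*sum_fn)(const float *, size_t);

/* The resolver: conditionals on CPU features, best ISA first. */
sum_fn resolve_sum(void)
{
#if X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_base;
}

/* Public entry point: resolve on first use, then reuse the pointer.
 * Not synchronised; a real program may prefer to resolve at start-up. */
float sum_floats(const float *x, size_t n)
{
    static sum_fn impl;            /* NULL until the first call */
    if (!impl)
        impl = resolve_sum();
    return impl(x, n);
}
```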

Karthik, based on the question it doesn't seem like you actually have different versions of the code, but rather that you want to compile the same code into multiple versions, relying on the compiler to optimize them as appropriate? If so, with the stuff Anoop linked to, you'll still need a way of including the code multiple times and setting the options differently each time; if you use the target attribute you can use the preprocessor (either put the function bodies in macros, or #include a file multiple times).
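The "function bodies in macros" variant could be sketched like this for GCC-compatible compilers on x86-64. The `DEFINE_SCALE` macro, the `TARGET` wrapper, and the `scale_*` names are hypothetical; on other platforms the target attribute is dropped and the three functions are identical.

```c
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#  define TARGET(isa) __attribute__((target(isa)))
#else
#  define TARGET(isa)
#endif

/* The function body is written once, inside a macro ... */
#define DEFINE_SCALE(suffix, isa)                                   \
    TARGET(isa) void scale_##suffix(float *x, size_t n, float k)    \
    {                                                               \
        for (size_t i = 0; i < n; ++i)                              \
            x[i] *= k;                                              \
    }

/* ... and stamped out once per ISA, each copy with its own name. */
DEFINE_SCALE(sse2, "sse2")
DEFINE_SCALE(avx,  "avx")
DEFINE_SCALE(avx2, "avx2")
```

Only call a variant after checking that the CPU supports its ISA. Bodies inside macros are awkward to debug, which is one reason to prefer #include-ing a file several times for anything non-trivial.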

>>The best approach is identify the top hotspot functions in your application and create multiple version of those functions targeting different architectures as demonstrated at https://software.intel.com/en-us/articles/how-to-manually-target-2nd-gen.... This will make sure you are not creating targeting all the application code for different architectures and thus reduces the binary size.

This seems to contradict what I wrote earlier, but to be clear there is a balance, and in my experience people tend to err on the side of making the architecture-specific parts too small. You want the function to be low-level enough that it doesn't bloat the application too much, but it should also be high enough that it doesn't get in the way of the optimizer. For example, if the function is low-level enough that an inline attribute might be appropriate, it's probably much too low-level. OTOH, it's probably a bad idea to compile a bunch of versions of your whole application.

Karthik_K_ · ‎04-17-2017

Thanks all for the responses so far.

Anoop - The __declspec(cpu_dispatch(cpuid,cpuid,…)) is the most promising for my case. Is there a way to pass this at a file level as an optimization flag? That way I wouldn't have to edit the files [as mentioned, I am not concerned about code size right now - just want to see how much ICC would give over the existing GCC-compiled binary].

Evan - What you wrote is correct. I don't have different versions of the code. I just want to build it optimally. As a first pass, I want to try the best options that will work across different architectures [before worrying about code bloat, finding hot code paths, etc.].

SergeyKostrov · ‎04-18-2017

I do the same in a different and absolutely portable way using a two-layered architecture:

    ...
    //{{UCM_INIT( HrtALProcessingUnit )
    //{{
    #undef _HWCFG_UPU
    #undef _HWCFG_GPU
    #undef _INTEL_PII
    #undef _INTEL_PIV
    #undef _INTEL_ATM
    #undef _INTEL_IVB
    #undef _INTEL_KNL
    // #define _HWCFG_UPU 0 // Unknown Processing Unit
    #define _HWCFG_GPU 1 // Generic Processing Unit ( Default )
    // #define _INTEL_PII 2 // Intel Pentium II
    // #define _INTEL_PIV 3 // Intel Pentium 4
    // #define _INTEL_ATM 4 // Intel Atom N270
    // #define _INTEL_IVB 5 // Intel Ivy Bridge
    // #define _INTEL_KNL 6 // Intel Xeon Phi
    //}}
    #endif
    ...

This defines which CPU needs to be targeted (take into account that different ISAs could be used, and the max ISA is selected next, not the min ISA). Then, at layer two:

    ...
    #if __INTEL_COMPILER_BUILD_DATE == 20170213 // Intel C++ v17.0.2
      #ifdef _RTENABLE_DIAGNOSTICS
        #pragma message ( "# Diagnostics: __INTEL_COMPILER_BUILD_DATE == 20170213" )
      #endif
      // #define _RTPU_ISA IA32
      // #define _RTPU_ISA_CODE _RTPU_ISA_IA32 // 101
      // #define _RTPU_ISA MMX
      // #define _RTPU_ISA_CODE _RTPU_ISA_MMX // 102
      // #define _RTPU_ISA SSE
      // #define _RTPU_ISA_CODE _RTPU_ISA_SSE // 103
      // #define _RTPU_ISA SSE2
      // #define _RTPU_ISA_CODE _RTPU_ISA_SSE2 // 104
      // #define _RTPU_ISA SSE4.2
      // #define _RTPU_ISA_CODE _RTPU_ISA_SSE4_2 // 107
      // #define _RTPU_ISA AVX
      // #define _RTPU_ISA_CODE _RTPU_ISA_AVX // 108
      // #define _RTPU_ISA CORE-AVX2
      // #define _RTPU_ISA_CODE _RTPU_ISA_AVX2 // 109
      #define _RTPU_ISA MIC-AVX512
      #define _RTPU_ISA_CODE _RTPU_ISA_AVX512 // 110 // Default ISA
    #endif
    ...
    #if ( defined ( _WIN32_MSC ) || defined ( _WINCE_MSC ) || defined ( _WIN32_MGW ) || \
          defined ( _WIN32_BCC ) || defined ( _COS16_TCC ) || defined ( _WIN32_WCC ) )
      #define _RTTARGET_ISA( tisa )
    #endif
    #if ( defined ( _WIN32_ICC ) && !defined ( _GOS64_GCC ) )
      #if ( defined ( _RTUNSUPPORTED_BYCOMPILER ) )
        #define _RTTARGET_ISA( tisa )
      #else
        #define _RTTARGET_ISA( tisa ) intel optimization_parameter target_arch=##tisa
      #endif
    #endif
    #if ( defined ( _WIN32_ICC ) && defined ( _GOS64_GCC ) )
      #if ( defined ( _RTUNSUPPORTED_BYCOMPILER ) )
        #define _RTTARGET_ISA( tisa )
      #else
        #define _RTTARGET_ISA( tisa ) intel optimization_parameter target_arch=tisa
      #endif
    #endif
    ...

All that stuff is controlled at the project level, and since I use more than 18 different C/C++ compilers, a lot of time was spent finding that solution. In all cases where _RTTARGET_ISA is not supported, that is, defined as empty, additional settings are done manually in projects. Unfortunately, full automation of ISA selection is not possible.

SergeyKostrov · ‎04-18-2017

>>...The __declspec(cpu_dispatch(cpuid,cpuid,…)) is the most promising for my case...

I'm not sure that this is a portable solution when support for multiple C++ compilers is needed. As I've already mentioned, in some cases ISA selection is done at the project settings level.