I see -O2 says "Optimize for maximum speed". In addition, there are also different arch flags for different instruction sets. Would -O2 be the master flag that builds the most optimal code across all platforms? Essentially, I am looking to generate the most optimal code across different architectures (code size is not an issue).
If you don't set an arch option, you get SSE2 code which will run on all CPUs like Pentium 4 or newer, which includes all 64-bit CPUs.
If you target Windows, a 32-bit build will run on both 32- and 64-bit Windows OS installations. On Linux, it's not that simple.
It's not easy to guess your intention well enough to give a simple answer, although your comment about code size helps. For example, assuming you will never encounter a CPU older than 15 years, you could make a dual path build which covers everything reasonably well by setting /QaxAVX /arch:SSE3 (or the Linux or Mac equivalent). The importance of choosing SSE3 rather than SSE2 as minimum arch applies mainly if complex arithmetic is important.
The IA32 architecture option is optimal only for Pentium2 and, if not using float data type, for Pentium III. You should be able to get that full range of compatibility with dual path (for 32-bit target only) with options like -QaxAVX -arch:IA32 but I doubt such a combination is popular enough to be well tested.
I'll provide some details. What I am looking for is to build different code paths optimized for different architectures. If SSE2 is the highest ISA (found at run time on the machine), run the SSE2 code path. If AVX is the highest, run the AVX code path, and so on. I thought the compiler did that already by finding the architecture where the code is executed and choosing the right path. Also, 64-bit is what I am looking for. Who runs 32-bit code anyways? :)
If this cannot be done automatically, can you please point me to any cheat sheet for the most optimal flags for each architecture, starting from SSE2?
GCC ≥ 6 supports a target_clones attribute, which sounds like what you're after. I'm not sure if ICC supports it or not, but icc 17.0.2 masquerades as GCC 6.3 (by setting __GNUC__ and __GNUC_MINOR__), so if it doesn't work I'd consider that to be a bug in ICC.
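For reference, here is a minimal sketch of what target_clones looks like in GCC. The function name saxpy, the MULTIVERSION macro and the particular list of targets are just illustrations I made up, and the guard assumes x86-64 GNU/Linux (the attribute relies on ifunc support); elsewhere it quietly falls back to a single plain version.

```c
#include <stddef.h>

/* Use target_clones only where it is likely to work (assumption: x86-64
   GNU/Linux and a compiler that reports the attribute). Otherwise the
   macro expands to nothing and only one version is built. */
#if defined(__has_attribute)
#  if __has_attribute(target_clones) && defined(__x86_64__) && defined(__gnu_linux__)
#    define MULTIVERSION __attribute__((target_clones("avx2", "avx", "default")))
#  endif
#endif
#ifndef MULTIVERSION
#  define MULTIVERSION
#endif

/* With the attribute active, the compiler emits one copy of this function
   per listed target and selects one of them at load time. */
MULTIVERSION
void saxpy(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Callers just call saxpy() as usual; there is no dispatch code to write by hand. The "default" entry is required by GCC and is what runs on CPUs that support none of the other listed targets.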
What people normally do, AFAIK, is compile the same file multiple times, using the preprocessor to generate different symbol names, then throw together a quick resolver to do the runtime detection. This has the advantage that any compiler (even MSVC) can handle it with a bit of effort… Unfortunately, the code to check CPU features isn't really standard; modern GCC (and clang, icc, etc.) have the __builtin_cpu_supports() built-in, and IIRC MSVC expects you to use the __cpuid() intrinsic. If you want something more portable I have some code you can steal.
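A rough sketch of such a resolver, assuming GCC or clang on x86. The names dot, dot_baseline, dot_avx2 and resolve_dot are hypothetical, and to keep the example in one file the two variants are produced with the target attribute instead of compiling the file twice with different options.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  define HAVE_X86_DISPATCH 1
#  define TARGET(isa) __attribute__((target(isa)))
#else
#  define HAVE_X86_DISPATCH 0
#  define TARGET(isa)
#endif

typedef float (*dot_fn)(const float *, const float *, size_t);

/* Stand-ins for the per-ISA copies: same source, different code generation. */
static float dot_baseline(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

TARGET("avx2")
static float dot_avx2(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* The resolver: runtime detection, returning the best available variant. */
static dot_fn resolve_dot(void)
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return dot_avx2;
#endif
    return dot_baseline;
}

/* Public entry point. The choice is cached after the first call; this
   simple cache is not synchronized, so a threaded program would want to
   resolve once at startup instead. */
float dot(const float *a, const float *b, size_t n)
{
    static dot_fn impl;
    if (!impl)
        impl = resolve_dot();
    return impl(a, b, n);
}
```

With MSVC the __builtin_cpu_supports() call would have to be replaced by a __cpuid()-based check, which is not shown here.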
One thing you should be careful of is putting the resolver at too low of a level; it's generally a good idea to put the resolver on a fairly high-level function which isn't called as often. That gives the compiler a bit more freedom for optimization, and of course you cut out a lot of the overhead associated with the resolver.
Looks like you are looking for the auto cpu dispatch feature in Intel Compiler as documented at https://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations. The best approach is to identify the top hotspot functions in your application and create multiple versions of those functions targeting different architectures, as demonstrated at https://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors-with-support-for-intel-avx. This makes sure you are not compiling all of the application code for every architecture, and thus reduces the binary size.
I just checked, ICC doesn't support target_clones :(. Anoop, any hope of adding support for it? Also, maybe it would be better to add support for clang's __has_attribute (and __has_builtin, __has_warning, __has_feature, etc.) and stop trying to match GCC version numbers, like clang has done… I seem to run into situations where ICC doesn't support features from the GCC version it advertises pretty often, I doubt I'm the only one.
Looks like you are looking for the auto cpu dispatch feature in Intel Compiler as documented at https://software.intel.com/en-us/articles/performance-tools-for-software....
That looks closer than GCC's ifunc stuff, since it handles the CPU detection and dispatching automatically, but if you don't care about portability the dispatch function is trivial (just a block of conditionals based on __builtin_cpu_supports). Unfortunately it's still nowhere near what target_clones does.
Karthik, based on the question it doesn't seem like you actually have different versions of the code, but rather that you want to compile the same code into multiple versions, relying on the compiler to optimize them as appropriate? If so, with the stuff Anoop linked to, you'll still need a way of including the code multiple times and setting the options differently each time; if you use the target attribute you can use the preprocessor (either put the function bodies in macros, or #include a file multiple times).
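To make the macro variant concrete, here is a small sketch assuming GCC or clang on x86. DEFINE_SCALE, ISA_TARGET, NO_TARGET, CPU_HAS and the scale_* names are all made up for the example; the function body is written once and stamped out once per instruction set.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  define ISA_TARGET(isa) __attribute__((target(isa)))
#  define CPU_HAS(isa) __builtin_cpu_supports(isa)
#else
#  define ISA_TARGET(isa)
#  define CPU_HAS(isa) 0
#endif
#define NO_TARGET /* baseline: whatever the command line options select */

/* The body lives in a macro so each expansion gets its own symbol name
   and its own target attribute. */
#define DEFINE_SCALE(suffix, attr)                                \
    attr static void scale_##suffix(float *v, float k, size_t n)  \
    {                                                             \
        for (size_t i = 0; i < n; ++i)                            \
            v[i] *= k;                                            \
    }

DEFINE_SCALE(base, NO_TARGET)
DEFINE_SCALE(avx, ISA_TARGET("avx"))
DEFINE_SCALE(avx2, ISA_TARGET("avx2"))

/* Trivial dispatcher: a block of conditionals, best ISA first. */
void scale(float *v, float k, size_t n)
{
    if (CPU_HAS("avx2"))
        scale_avx2(v, k, n);
    else if (CPU_HAS("avx"))
        scale_avx(v, k, n);
    else
        scale_base(v, k, n);
}
```

The #include approach works the same way, except that the body sits in its own file and the suffix and attribute are set with #define before each #include.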
The best approach is to identify the top hotspot functions in your application and create multiple versions of those functions targeting different architectures, as demonstrated at https://software.intel.com/en-us/articles/how-to-manually-target-2nd-gen.... This makes sure you are not compiling all of the application code for every architecture, and thus reduces the binary size.
This seems to contradict what I wrote earlier, but to be clear there is a balance, and in my experience people tend to err on the side of making the architecture-specific parts too small. You want the function to be low-level enough that it doesn't bloat the application too much, but it should also be high-level enough that it doesn't get in the way of the optimizer. For example, if the function is low-level enough that an inline attribute might be appropriate, it's probably much too low-level. OTOH, it's probably a bad idea to compile a bunch of versions of your whole application.
Thanks all for the responses so far.
Anoop - The __declspec(cpu_dispatch(cpuid,cpuid,…)) is the most promising for my case. Is there a way to pass this somehow at a file level as an optimization flag? That way I wouldn't have to edit the files [as mentioned, I am not concerned about code size right now - just want to see how much ICC would give over the existing GCC compiled binary]
Evan - What you wrote is correct. I don't have different versions of the code. I just want to build it optimally. As a first pass, I want to try the best options that will work across different architectures [before worrying about code bloat, finding hot code paths etc]