To attain maximum performance

Andrew_Smith · ‎02-22-2015

My code currently targets a baseline architecture of SSE2 (/arch:SSE2), with one additional AVX code path (/QaxAVX).

I want Haswell (AVX2) users to benefit, but without forcing processors that support only AVX to fall back to SSE2. Does that make sense?

The help for /Qax says it adds multiple code paths, without specifying how many. So I thought that adding /QaxCORE-AVX2 as well as /QaxAVX would give me 3 code paths in total, but the compiler says /QaxCORE-AVX2 overrides /QaxAVX. What should I do?

Steven_L_Intel1 · ‎02-22-2015

You say /QaxCORE-AVX2,AVX. This will add two optimized code paths and one generic (SSE2 by default) code path. You should do some testing to see if adding the AVX2 option really makes a difference for your application. The documentation says:

"You can use more than one of the feature values by combining them. For example, you can specify -axSSE4.1,SSSE3 (Linux OS and OS X) or /QaxSSE4.1,SSSE3 (Windows OS)."

TimP · ‎02-22-2015

The compiler should perform some pruning so as to avoid adding all 3 paths where it is counter-productive. I suppose there may be a reason to have 2 paths even where vectorization doesn't occur. I agree with Steve's comment about checking whether there is an advantage in adding the 3rd path.

The most common difference between AVX and AVX2 code generation is the treatment of unaligned vector references, which AVX splits to 128-bit memory references on account of Sandy Bridge being very slow with 256-bit unaligned references.

ifort seems to prefer AVX code for dot_product, on account of some cases where the additional latency of fma makes the AVX2 code slower. gfortran errs on the other side, using the fma both where it's faster and where it's slower than ifort.

The example in the documentation strikes me as a little odd. If you have an application which takes advantage of SSE3, it makes more sense to make that the "fall back" option and leave out the minor differentiation of SSSE3. For years now, ifort has not held back on optimization for SSE3. Also there haven't been any CPUs produced without SSE3 support for more than a decade.

Steven_L_Intel1 · ‎02-22-2015

FWIW, I noticed that the IDZ article on selecting processor options didn't mention being able to specify two paths - I added that.

jimdempseyatthecove · ‎02-23-2015

To attain maximum performance when the architecture is unknown until run time, I suggest you break your program into two parts:

a) A stub main program that determines the architecture and then, under program control, loads a specific DLL from a group of DLLs.
b) A series of DLLs, each built for a single architecture.

Note that you do not link the DLLs to your program. Instead, you load the DLL after program start using a program-selected name, then make a call to get the entry point, e.g. main_sse2, main_avx, ...

This would eliminate all the architecture tests and branches from your code. While this may not be convenient, it will produce faster code (if you have a lot of compiler generated code that tests for architecture).
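A minimal sketch of the stub side of this scheme, written in C purely for illustration. The DLL file names (kernel_sse2.dll and so on), the arch_level enum, and the helper functions are all hypothetical; only the main_sse2 / main_avx entry-point style comes from the description above. The detection uses the GCC/Clang builtin __builtin_cpu_supports and falls back to SSE2 elsewhere, so a stub built with another compiler would need its own CPUID check. The exported symbol name of a Fortran entry point depends on the compiler's naming convention, so treat the entry names as assumptions too.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical feature levels, ordered from least to most capable. */
typedef enum { ARCH_SSE2 = 0, ARCH_AVX = 1, ARCH_AVX2 = 2 } arch_level;

/* Hypothetical naming scheme: one DLL and one entry point per level. */
typedef struct {
    const char *dll_name;
    const char *entry_name;
} arch_target;

static const arch_target targets[] = {
    { "kernel_sse2.dll", "main_sse2" },
    { "kernel_avx.dll",  "main_avx"  },
    { "kernel_avx2.dll", "main_avx2" },
};

/* Return the best level this CPU supports; SSE2 is the assumed baseline. */
arch_level detect_arch(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return ARCH_AVX2;
    if (__builtin_cpu_supports("avx"))  return ARCH_AVX;
#endif
    return ARCH_SSE2;
}

/* Map a feature level to the DLL and entry point built for it. */
const arch_target *select_target(arch_level level)
{
    assert((int)level >= 0 && (int)level <= (int)ARCH_AVX2);
    return &targets[level];
}

#ifdef _WIN32
typedef void (*entry_fn)(void);

/* Load the chosen DLL at run time and call its entry point.
   Returns 0 on success, -1 if the DLL or the symbol is missing. */
int run_target(const arch_target *t)
{
    HMODULE lib = LoadLibraryA(t->dll_name);
    if (lib == NULL) return -1;
    entry_fn fn = (entry_fn)GetProcAddress(lib, t->entry_name);
    if (fn == NULL) { FreeLibrary(lib); return -1; }
    fn();
    FreeLibrary(lib);
    return 0;
}
#endif
```

The stub's main would then be roughly run_target(select_target(detect_arch())). Each DLL would be compiled once with a single architecture option and no /Qax, so the architecture decision is made once at startup rather than at each dispatch point inside the generated code.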

Jim Dempsey

Targeting best performance on any available SIMD architecture