We have four clusters whose nodes use Intel Xeon processors of different vintages:
Intel(R) Xeon(R) CPU E5-2697
Intel(R) Xeon(R) E5-2690
Intel(R) Xeon(R) X5675
Intel(R) Xeon(R) E5530
We are using the 16U3 version of the Intel ifort compiler.
Are there compilation optimization parameters I should use, looking for ultimate performance, that would produce an executable best for each machine?
Or is there one set of optimization parameters that should produce just as good an executable, one that I could compile on any machine and execute on another (or all) machines? Again, our prime concern is ultimate performance rather than portability.
Typically, setting the "highest" target ISA that works on all your machines may be a satisfactory tactic. If you have both Sandy Bridge and Ivy Bridge machines (you didn't give full identification), it's unlikely there would be any advantage in setting the Ivy Bridge ISA. The gain from AVX2 code (if you have a v3 machine) is unlikely to exceed 5%, but you could test two builds on those machines to make your decision. The gain for AVX2 over AVX may be wiped out if you use the multiple target path option (which also results in a larger executable). As you have both AVX-capable and non-AVX machines, compiling for the Nehalem box may give up some performance on the AVX machines, possibly enough that dual-path options such as -axAVX -msse4.2 could prove advantageous.
Taking advantage of compiler optimization reports plus run-time profiling using Intel Parallel Advisor ought to produce a clearer picture.
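Since the full processor identification is missing, one way to find the "highest" ISA common to all machines is to look at the CPU flags each node reports. The following is only a minimal sketch, assuming Linux nodes where /proc/cpuinfo has a "flags" line; the function names and the three-level ISA table are hypothetical and cover just the levels discussed in this thread.

```python
import re

# ISA levels relevant to this thread, oldest first; each implies the earlier ones.
ISA_ORDER = ["SSE4.2", "AVX", "CORE-AVX2"]

# /proc/cpuinfo flag taken here as the marker for each level (Linux naming).
# Assumption: the avx2 flag is used as a simple stand-in for the CORE-AVX2 target.
CPUINFO_FLAG = {"SSE4.2": "sse4_2", "AVX": "avx", "CORE-AVX2": "avx2"}


def isa_from_cpuinfo(text):
    """Return the newest ISA level advertised in /proc/cpuinfo text, or None."""
    match = re.search(r"^flags[ \t]*:[ \t]*(.*)$", text, re.MULTILINE)
    flags = set(match.group(1).split()) if match else set()
    supported = [isa for isa in ISA_ORDER if CPUINFO_FLAG[isa] in flags]
    return supported[-1] if supported else None


def highest_common_isa(node_isas):
    """Return the newest ISA level that every listed node supports."""
    return min(node_isas, key=ISA_ORDER.index)
```

On each cluster you would pass the contents of /proc/cpuinfo to isa_from_cpuinfo, collect the four results, and hand them to highest_common_isa to get the baseline target.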
We use the run-time dispatch functionality and have never seen performance degradation relative to the native versions. The executable is larger, of course, because it contains multiple versions of any function that the compiler thinks will get a benefit from the "higher" ISA.
Run-time dispatch is by function, so if you have code that spends a lot of time passing pointers to short functions as arguments, the overhead could be a problem. I am not aware of any such codes in my shop (but some might be hiding inside interpreters or JITs).
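To make the per-function overhead concrete, here is a purely conceptual sketch of dispatch by function. It is not how the compiler implements it; the names dispatched, ISA_RANK, and the two dot-product variants are hypothetical stand-ins. The point is only that the ISA check sits in front of every call, so it matters most when the function body is very short.

```python
# Relative ordering of the two ISA levels used in this illustration.
ISA_RANK = {"SSE4.2": 0, "AVX": 1}


def dispatched(variants, detected_isa):
    """Wrap several builds of one function behind a per-call ISA check.

    variants maps an ISA name to the implementation built for that ISA.
    """
    def stub(*args):
        # This selection runs on every call; for a tiny function body it can
        # cost as much as the work being dispatched.
        usable = [isa for isa in variants if ISA_RANK[isa] <= ISA_RANK[detected_isa]]
        best = max(usable, key=ISA_RANK.get)
        return variants[best](*args)
    return stub


def dot_baseline(x, y):
    """Stand-in for the baseline (SSE4.2) build of a short function."""
    return sum(a * b for a, b in zip(x, y))


def dot_avx(x, y):
    """Stand-in for the AVX build; same result, hypothetically faster."""
    return sum(a * b for a, b in zip(x, y))
```

For example, dot = dispatched({"SSE4.2": dot_baseline, "AVX": dot_avx}, "AVX") gives a single callable that forwards to the AVX variant on an AVX machine and to the baseline elsewhere.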
For the four processors above, the options "-xsse4.2 -axAVX" should generate the best code. (The "-msse4.2" option will generate code that will run on both Intel and non-Intel processors, and may not be optimized as well as with the "-xsse4.2" option that generates code that will only run on Intel processors.)
If either of the Xeon E5 systems is v2, v3, or v4, then additional options might be helpful. There is seldom a benefit to specialization for Xeon E5 v2, but Xeon E5 v3/v4 will want a third flag: "-axCORE-AVX2"
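The flag choice above can be written down as a small helper. This is just a sketch under my own assumptions: the function name ifort_isa_flags is hypothetical, it only knows the three ISA levels mentioned in this thread, and it combines multiple dispatch targets into one comma-separated -ax option rather than passing -ax twice.

```python
# ISA levels discussed in this thread, oldest first.
ISA_ORDER = ["SSE4.2", "AVX", "CORE-AVX2"]


def ifort_isa_flags(node_isas):
    """Build '-x<baseline> [-ax<higher,...>]' from the ISA level of each node type."""
    levels = sorted(set(node_isas), key=ISA_ORDER.index)
    baseline, higher = levels[0], levels[1:]
    flags = ["-x" + baseline]
    if higher:
        # All higher targets go into one -ax option, newest first.
        flags.append("-ax" + ",".join(reversed(higher)))
    return " ".join(flags)
```

For the four processors as listed (two AVX-capable, two SSE4.2-only) this yields "-xSSE4.2 -axAVX"; if one of the E5 systems turns out to be v3 or v4, it yields "-xSSE4.2 -axCORE-AVX2,AVX".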