processor optimisations

forall · ‎02-18-2005

I will be running floating-point-intensive CVF code (with average runtime up to several days per process) on several processors, including a Pentium-M 1.7GHz, a (dual) Opteron 2GHz and a (dual) Xeon 3.2GHz. Is it worthwhile fiddling with the processor-optimisation options, given that none of these processors existed when CVF was written? All the compilation will be done on the P-M and I will be transferring the exes to the other machines. I am trying to determine whether to embark on a potentially lengthy benchmarking exercise.

Is the situation different for the Intel Compiler?

many thanks!

Steven_L_Intel1 · ‎02-18-2005

For CVF, go ahead and tell it to compile for Pentium 4. I think that will be your best choice.

For the Intel compilers, you have more choices that make much more difference. If you were going to run on Xeon and Opteron, I would suggest /xW /O3. You'll get better performance on the Xeon with /xN (or /xP if it's a Nocona-type Xeon), but specifying these won't let your code run on Opteron.

forall · ‎02-18-2005

thanks Steve,
Incidentally, if I set "optimise for host" on the pentium-m, will it pick pentium-4 or will it default to blend?

dima333a · ‎02-18-2005

I am not sure, but I think that in some instances optimization for "Pentium III" may work better than optimization for "Pentium 4" on a Pentium M.

forall · ‎02-18-2005

probably a dumb question, but how do I specify /xW (and what does it mean)? Also, what is the difference between 'allow' and 'require' pentium extensions?
thanks

Steven_L_Intel1 · ‎02-18-2005

For the purposes of CVF, I would expect the Pentium 4 mode to be preferable. The only difference between P3 and P4 for CVF is how often data prefetch instructions are issued. If you use "host" on a Pentium-M system with CVF, my guess is that it will think it is a P4, but I'll be honest and say I've never tried it to see what will happen. I would advise against using "host" unless you are building and running on the same system.

The Intel compilers have two different processor extension modes. "Require" means that you promise that the program will be run on a processor with those extensions. The command line syntax starts with /x followed by a single letter (B, W, N, P, etc.) designating the required set of processor extensions. If you run a program compiled in this mode on a processor other than the indicated type, the program may behave unpredictably or get a run-time error. If you choose codes B, N or P, an additional run-time check is added to the main program that checks for a supported Intel processor, and if not found, it exits with an error. The other codes do not cause such a run-time check.

The other mode is "use processor extensions if available, otherwise use generic code". These are the switches starting with /ax followed by one or two letters. The compiler will generate up to three code paths, two processor-specific and one generic, and the program will detect the processor type at startup and select the appropriate path. Non-Intel processors take the generic path in such cases.
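As a rough mental model of the dispatch just described, the sketch below picks one of three code paths from CPU identification results. It is not Intel's runtime: the module `dispatch_sketch`, the function `select_path`, its logical arguments and the priority order between the two processor-specific paths are all hypothetical, and in a real build the check lives inside compiler-generated code rather than in your source.

```fortran
! Conceptual model only. None of these names come from the Intel runtime.
module dispatch_sketch
  implicit none
  integer, parameter :: PATH_GENERIC = 0   ! code that runs on any processor
  integer, parameter :: PATH_FIRST   = 1   ! first processor-specific path
  integer, parameter :: PATH_SECOND  = 2   ! second processor-specific path
contains
  ! Choose a code path from (hypothetical) CPU identification results.
  pure integer function select_path(is_intel, has_first, has_second) result(path)
    logical, intent(in) :: is_intel, has_first, has_second
    path = PATH_GENERIC
    if (.not. is_intel) return      ! non-Intel processors take the generic path
    if (has_second) then            ! assumed priority: second path wins if usable
      path = PATH_SECOND
    else if (has_first) then
      path = PATH_FIRST
    end if
  end function select_path
end module dispatch_sketch

program demo_dispatch
  use dispatch_sketch
  implicit none
  print *, 'Intel, first extension only :', select_path(.true.,  .true., .false.)
  print *, 'non-Intel, same extensions  :', select_path(.false., .true., .false.)
end program demo_dispatch
```

The point of the model is the early return: a non-Intel processor ends up on the generic path even when it supports the same instruction set extensions.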

I don't see the word "allow" in the text I am looking at, so I don't know what you're referring to.

Intel_C_Intel · ‎02-18-2005

Dear forall,
You may want to browse through the on-line article at http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm as well, since automatic vectorization has the potential to boost the performance of FP intensive Fortran codes.
Aart

forall · ‎02-18-2005

Steve,

I don't see any "W" in the Fortran optimisation options: only K, N, B, P (i.e., it seems I have /K instead of W).

when I said "allow" in the last email I meant "use" - sorry for the confusion.

It seems that "use" is the best way to go, since I assume this will generate the best code for an Intel Xeon and generic code for the Opteron. I assume there is some increase in the size of the code but that's no big deal. Am I correct?

Now, regarding the "optimise for intel processor" (options GB, G5 etc.), what are the best options for a (dual) Xeon and (dual) Opteron? Blend for the Opteron and P-III for the Xeon (which from what I understand is closer to P-III than to P-4)? How does this option interact with the "use" and "require" extensions?

thanks.

forall · ‎02-18-2005

thanks Aart - will read this article in detail. As a starting point, will enable the /Qparallel options. Could there be disadvantages in doing so for all the codes? thanks

Intel_C_Intel · ‎02-18-2005

Dear forall,
Code that has been automatically parallelized queries the runtime to determine what number of threads is best for the actual architecture it is running on. On a dual core, core with HT technology, or both, this should typically yield speedup (unless nothing is automatically parallelized, of course). Multithreaded code may exhibit a slight slowdown when run on a single core, however, even though our team tries to minimize this overhead.
You may want to consider adding OpenMP directives to make the parallelism in the program explicit if the implicit parallelism is not extracted automatically by the compiler.
Aart
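For readers who have not used OpenMP directives, here is a minimal sketch of what making the parallelism explicit can look like. The module `gauss_omp`, the subroutine `gauss_parallel` and the loop body are illustrative assumptions, not code from this thread. The directive lines are ordinary comments unless OpenMP is enabled in the compiler (/Qopenmp for Intel Fortran on Windows), so the same source still builds and runs serially.

```fortran
module gauss_omp
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Fill ex(i) = exp(-arg(i)**2); iterations are independent, so they
  ! can be shared between threads when OpenMP is enabled.
  subroutine gauss_parallel(arg, ex)
    real(dp), intent(in)  :: arg(:)
    real(dp), intent(out) :: ex(:)
    integer :: i
    !$omp parallel do default(none) shared(arg, ex) private(i)
    do i = 1, size(arg)
      ex(i) = exp(-arg(i)**2)
    end do
    !$omp end parallel do
  end subroutine gauss_parallel
end module gauss_omp

program demo_omp
  use gauss_omp
  implicit none
  real(dp) :: arg(4), ex(4)
  arg = (/ 0.0_dp, 0.5_dp, 1.0_dp, 2.0_dp /)
  call gauss_parallel(arg, ex)
  print '(4f10.6)', ex
end program demo_omp
```

Whether this pays off depends on how much work each loop does; for short loops the threading overhead mentioned above can outweigh the gain.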

Steven_L_Intel1 · ‎02-18-2005

I am not sure which documentation you are looking at. W should be there. And I made an error above - for Windows, the switches start with /Qx and /Qax.

If you are going to run on Opteron and Xeon, use /QxW and not any of the other processor options.

forall · ‎02-18-2005

Steve,

I don't have a problem with generating 3 different sets of exes (for the Pentium-M vs Xeons vs Opterons) if it's worth it in terms of speed (e.g., if the differences hover around ~10-20% I wouldn't worry about it).

But I am somewhat confused:

If I go to project properties > configuration props > fortran > optimisation I have 3 pulldown options for processor-dependent optimisations:

1. "optimise for intel processor". This sets flags /GB, /G5, etc. I've been using P-4 (/G7) for Pentium-M runs, but am not sure what's the best choice for the Xeon and Opteron;

2. "use extensions". This has flags /QaxK, /QaxN, /QaxB, /QaxP. If I understood you correctly there should be an option /QaxW, but it's definitely not there. I am using Intel Fortran 8.1 standard ed. (not EM64T - should I be?). I was going to use /QaxB for the code for the Pentium-M, /QaxW for the Opteron and /QaxN (or /QaxP) for the Xeon (how do I find out if it's a "Nocona" chip?)

3. "require extensions". Same flags as above but without the "a", e.g., /QxK. Again, I definitely can't see /QxW. I am not going to use these flags just in case the exe accidentally ends up on the wrong processor and generates bad results.

Did I understand your suggestions correctly?

also, I am going to use optimisation /O3 as recommended.

thanks.

Intel_C_Intel · ‎02-18-2005

Dear forall,
If the size of the application and compile-time requirements allow, also consider the /Qipo switch to enable inter-procedural optimizations of the complete program. In fact, the shorthand /fast currently expands into /QxP /O3 /Qipo /Qprec-div-. For your "use" setting, you may want to start with /QaxP /O3 /Qipo (since /Qprec-div- only helps if your application performs a lot of FP divisions and is numerically stable enough to allow a few additional optimizations on them).
Aart

Steven_L_Intel1 · ‎02-18-2005

Ok, I see what you mean.

The /QxW option is not available from the property page. Officially, it has been superseded by /QxN, but if you are compiling for Opteron, you can't use that. So click on Command Line and type in /QxW manually.

The "Optimize for" option is similar to CVF's /tune switch. It has a smaller effect and adjusts for the fact that some processors "prefer" certain instructions over others, even though both support both instructions. You should select "Optimize for" for the processor you expect to run on most often - it may cause other processors to run a bit slower but the program will still run. This is something you need to test for yourself.

There is a lot of information on these topics in "Volume 2" of the Intel Fortran Programmer's Manual.
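Since the effect of these options has to be measured, a small timing harness is one way to start before committing to a full benchmarking exercise. The sketch below is an assumption-laden placeholder: the module `bench_kernel`, the function `run_kernel`, the array size and the arithmetic inside the loop are all made up for illustration, and the numbers only mean something once the stand-in kernel is replaced by a routine that is representative of the real application.

```fortran
module bench_kernel
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Stand-in floating-point kernel (hypothetical); swap in a routine
  ! from your own code before drawing conclusions.
  function run_kernel(x, nrep) result(total)
    real(dp), intent(in) :: x(:)
    integer,  intent(in) :: nrep
    real(dp) :: total
    integer :: rep
    total = 0.0_dp
    do rep = 1, nrep
      total = total + sum(x*x + 1.0_dp/(1.0_dp + x))
    end do
  end function run_kernel
end module bench_kernel

program time_kernel
  use bench_kernel
  implicit none
  integer, parameter :: n = 1000000, nrep = 50
  real(dp), allocatable :: x(:)
  real(dp) :: total
  integer :: i, t0, t1, rate

  allocate(x(n))
  do i = 1, n
    x(i) = real(i, dp) / real(n, dp)
  end do

  call system_clock(t0, rate)      ! start count and ticks per second
  total = run_kernel(x, nrep)
  call system_clock(t1)            ! end count

  ! Printing the result keeps the compiler from discarding the work.
  print *, 'checksum    =', total
  print *, 'elapsed (s) =', real(t1 - t0, dp) / real(rate, dp)
end program time_kernel
```

Build the same source once per candidate switch set (for example /QxW /O3 versus /QaxP /O3), run each exe on each target machine, and compare both the elapsed time and the checksum, since aggressive options can change floating-point results slightly.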

forall · ‎02-19-2005

ok, I've tried /QxW but am getting the error "Link: warning: ignoring unknown option '-/QxW'" at the link stage. I checked in the command prompt and it's typed as /QxW. Will now try /QxN for the Xeon.

Steven_L_Intel1 · ‎02-19-2005

It sounds as if you typed that in on the Linker page. You want Fortran.

TimP · ‎02-19-2005

The switch /QxB was added to optimize for Pentium M Banias and Dothan CPUs, as they have a limited issue rate for SSE instructions, but still get a benefit from SSE vectorization. The option will run on other Intel CPUs, but usually not with as good performance as /QxW. I'm wondering if my impression is correct, that /QxB is not getting much attention, and is no longer consistently better for Pentium M performance. Apparently, future mobile core processors will have no need for it.

forall · ‎02-20-2005

ok, I did have both /QxP and /QxW simultaneously, which probably caused some problems. However, I still cannot get any of these switches to work.

I should point out that I have the following setup: a static library that is compiled separately with switch /QxW and then I link to it a Fortran Windows application, also with /QxW (other options same as far as I can see, have checked them). Definitely typed /QxW in the Fortran command prompt. The compilation is carried out on a Pentium-Dothan 735, with the exes to be moved to a Xeon and Opteron (I was trying /QxN for the Xeons and /QxW for the Opteron as suggested earlier by Steve).

I rebuilt the projects from scratch just in case and now get the following error when building the Fort-Windows app with /QxW: "error LNK2019: unresolved external symbol _vmldExp2 referenced in function _sumgauss". function sumgauss is rather simple and contains the following line "ex(:)=exp(-arg(:)**2)" which I assume is the cause of the problem (Exp2??). There were no problems building the static library.

Now, when using /QxN (or /QxP) instead of /QxW (in both the lib and the Fortran Windows app) the compilation of the library seems to abort at some file (which didn't have problems at other settings). No internal error - just 'compilation aborted (code 1)'. The same file in both cases.
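To separate the link error from the rest of the project, it can help to reproduce it with a few lines. The reconstruction below is a guess at the shape of the routine: only the assignment `ex(:)=exp(-arg(:)**2)` is quoted in the post, while the module `gauss_mod`, the argument list of `sumgauss` and the driver program are hypothetical. If this small program builds and links cleanly with the same switch from a plain compiler command line, the source is probably fine and the project's link step is the more likely suspect.

```fortran
module gauss_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical signature; only the array assignment is taken from the post.
  subroutine sumgauss(arg, ex, total)
    real(dp), intent(in)  :: arg(:)
    real(dp), intent(out) :: ex(:)
    real(dp), intent(out) :: total
    ex(:) = exp(-arg(:)**2)   ! whole-array exp, a natural candidate for vectorization
    total = sum(ex)
  end subroutine sumgauss
end module gauss_mod

program repro
  use gauss_mod
  implicit none
  real(dp) :: arg(3), ex(3), total
  arg = (/ 0.0_dp, 1.0_dp, 2.0_dp /)
  call sumgauss(arg, ex, total)
  print *, 'sum of exp(-arg**2) =', total
end program repro
```

The "2" in the unresolved symbol name may well refer to a two-element vector form of exp rather than to a function called Exp2, but that reading is an assumption here.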

Message Edited by forall on 02-19-2005 07:00 PM

TimP · ‎02-20-2005

/QxN is practically the same as /QxW, except for the restriction to Intel CPUs. Using ifort or ICL with any of those vectorizing options to drive the link should cause it to link against one of the svml_disp libraries. That will take care of the short vector exponential. If you used another way of linking, such as a default CL link, it wouldn't automatically include the Intel svml library.

forall · ‎02-20-2005

I assume by 'vectorising options' you mean a switch like /Qparallel. However after doing this I still get unresolved vmldExp2.

TimP · ‎02-20-2005

No, the options /QxW, /QxN, /QxP, /QxB and the like invoke SSE2 vectorization, and include the corresponding short vector math library in the command passed by ifort to the linker.
The additional library invoked by /Qparallel or /Qopenmp is libguide.