Hello compiler gurus. I am having trouble migrating a climate model from ifort 10.1 on a Xeon X7350 machine to a new E5530 Nehalem machine with ifort 11.1. On the older Xeon, the working model was compiled 32-bit with -O3 and has run well for the last year or so. On the new Nehalem machine with ifort 11.1 I have tried both 32-bit and 64-bit builds, but I can only get the model to work at -O0. Anything above -O0 with the 64-bit build core dumps, and anything above -O0 with the 32-bit build produces all NaNs. The climate model is rather complex, so tracking down the source of the NaNs is non-trivial.
Here are the flags that we've been using on the Xeon X7350 with a 32-bit ifort/icc 10.1:
-O3 -w -openmp -r8
I have tried adding the -axSSE4.2 flag on the Nehalem machine, but I still can't get past -O0.
I am not sure where to go from here. The model runs, but only at -O0. We have about 12 of these Nehalem machines and they run the model for months at a time, so every bit of performance counts. The model relies on NetCDF and Udunits, but since it runs fine at -O0 I believe both of those libraries are built correctly.
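For context, the build boils down to something like this (source file names, paths, and library names below are simplified placeholders, not our exact setup):
ifort -O3 -w -openmp -r8 -I${NETCDF}/include -c model_sources.f90
ifort -O3 -w -openmp -r8 model_sources.o -L${NETCDF}/lib -lnetcdf -L${UDUNITS}/lib -ludunits -o climate_model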
Can anyone recommend a set of flags to try to get this thing optimized? The brass ring would be 64-bit and -O3, but I would be happy with 32-bit and -O2.
Thanks
1 Solution
I would step up gradually. I just finished tearing my hair out over a case where (on Intel 64)
-O2 -fp-model source
is good enough. The performance difference between 10.1 and 11.1 was not large enough to show up repeatably.
I'm confident enough in the 64-bit compiler that I would try to concentrate on it.
The only common major optimization given up by -fp-model source is sum (and dot product) vectorization.
If you need -ftz you can append that. -prec-div, -prec-sqrt, and observance of parentheses (all implied by -fp-model source) may solve NaN problems when an application gets close to its underflow limits. The Harpertown and Nehalem CPUs don't depend nearly as strongly on those risky options for performance as older CPUs did.
Checking for any questionable code that the compiler can turn up with -check or -diag-enable sc could also be a useful step.
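For example, a one-off diagnostic build along these lines may turn things up (file names are placeholders, and sc2 is one of the source-checker levels, if I remember them right):
ifort -O0 -g -traceback -check all -openmp -r8 -c model_sources.f90
ifort -O0 -diag-enable sc2 -openmp -r8 -c model_sources.f90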
I do suspect the combination -O3 -xSSE4.2 brings in some loop-nest transformations which are best restricted to code where their performance and correctness effects can be studied carefully.
If you get the application working stably with safer compiler flags, or maybe even as low as -O1, you may gain performance from higher optimizations only in a limited number of source files.
Of course, you would keep -r8 if that worked for you historically.
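Concretely, the progression might look something like this (file names here are just placeholders):
ifort -O0 -openmp -r8 -c model_sources.f90                        # known-good baseline
ifort -O2 -fp-model source -openmp -r8 -c model_sources.f90       # first real step up
ifort -O2 -fp-model source -ftz -openmp -r8 -c model_sources.f90  # add -ftz only if you need it
ifort -O3 -xSSE4.2 -fp-model source -openmp -r8 -c hot_spot.f90   # reserve this for individual files whose output you have verified
Only move to the next step once the results at the previous one still look sane.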
4 Replies
Quoting - tim18
Thanks, the -fp-model source flag did the trick and fixed the NaNs, so I was able to get up to -O3 on the 32-bit build. The improvement isn't massive (maybe 10%), but every bit counts when we run for months.
In the process of doing these 32 vs 64-bit tests we saw that in the case of our model, 64-bit was ~3x slower than 32-bit. I was rather surprised by that. Either way, we have 32-bit running with -O3 so I am happy. Thanks for your help.
Quoting - Jay
In the process of doing these 32 vs 64-bit tests we saw that in the case of our model, 64-bit was ~3x slower than 32-bit.
Quoting - Jay
When using 32-bit for speed, you might want to use the trick surveyors use to improve accuracy.
Incorporate geographic markers, then use relative measurements from a nearby marker. For Earth, 1km or 100m geographic marker separations might work well.
Essentially, the geographic markers define a grid. 32-bit calculations can be used between points within a single grid zone as well as between adjacent grid zones. For anything farther away, you might as well be working with aggregates of pressure (or whatever the interacting force is).
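A minimal Fortran sketch of the idea (all names and numbers are made up, just to show the shape of it):
program marker_demo
  implicit none
  integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
  real(dp) :: marker                ! absolute marker position, kept in double precision
  real(sp) :: off_a, off_b          ! small offsets from the marker, safe in 32-bit
  real(sp) :: separation
  marker = 5.2345678e6_dp           ! e.g. metres from some global origin
  off_a  = 12.5_sp                  ! point A, 12.5 m past the marker
  off_b  = 87.25_sp                 ! point B, 87.25 m past the marker
  ! Within a zone, differences use only the small offsets, so single precision
  ! loses far less accuracy than subtracting two large absolute coordinates would.
  separation = off_b - off_a
  print *, 'separation within zone (m):', separation
end program marker_demo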
Jim
