Does compiler optimization affect OpenMP performance?
Hi, I am parallelising QCD codes to make effective use of both processors of a node. With a few codes, compiling with the -g option or -O0 (suppress all optimization) was very efficient (85%-100% gain in performance, i.e. execution time), but the moment I compile with the -O2 option and run, there is only a 15%-20% improvement in execution time. This happens only with a few programs; I have other programs that do well at both the -O0 and -O2 optimization levels.
1. In some programs, one more observation in the above situation was that the reduction in execution time going from -O0 to -O2 was larger for the serial code than for the OpenMP code, e.g. serial: -g option 5 units of time, -O2 option 2.6 units of time; OpenMP: -g option 2.5 units of time, -O2 option 2 units of time.
So OpenMP is still better than serial, and the gain from OpenMP is better with -g than with -O2: the improvement factor is smaller with the -O2 option.
2. I also have a few programs that perform worse with the -O2 option.
Can somebody please explain the cause of such behaviour and a possible solution? Does the problem arise from the code structure or from architectural features like cache, bus bandwidth, etc.?
Yes, it often happens that performance improvements from OpenMP parallelism and from compiler optimizations don't add up.
If you are parallelizing the same loop which you are asking the compiler to optimize, you shorten the loops worked on by each CPU. For example, if you split a loop of length 100 in 2, you leave a loop length of 50 for each CPU. If the optimization is done by taking it in groups of 8, that optimization becomes less effective.
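To put numbers on the example above, here is a minimal sketch in C. The helper split_loop and its struct are hypothetical names made up for this illustration, and the sketch assumes the loop divides evenly among the threads.

```c
#include <assert.h>

/* Hypothetical helper: how a loop of `length` iterations breaks into full
   groups of `width` plus leftover iterations once it has been split evenly
   across `threads` threads. Assumes length is divisible by threads. */
typedef struct {
    int per_thread;   /* iterations each thread executes */
    int full_groups;  /* complete groups of `width` per thread */
    int leftover;     /* iterations per thread outside any full group */
} chunking;

chunking split_loop(int length, int threads, int width)
{
    assert(length > 0 && threads > 0 && width > 0);
    assert(length % threads == 0);

    chunking c;
    c.per_thread = length / threads;
    c.full_groups = c.per_thread / width;
    c.leftover = c.per_thread % width;
    return c;
}
```

split_loop(100, 1, 8) gives 12 full groups and 4 leftover iterations, while split_loop(100, 2, 8) gives each thread only 6 full groups and 2 leftover iterations, so whatever fixed setup cost the optimized loop carries is spread over half as many groups per thread.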
The parts of your program which can't parallelize with OpenMP also are likely not to benefit from other optimizations, so you have a kind of Amdahl's law effect, with a part of your program not responding to either way of improving performance.
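A rough way to see this effect is a two-part timing model, sketched below in C. The function names and the timings in the note that follows are made up for illustration, and the model assumes the threaded loop scales perfectly, which real code seldom does.

```c
#include <assert.h>

/* Run time of a program made of a part that cannot be threaded
   (serial_time) and a loop that threads perfectly (loop_time). */
double run_time(double serial_time, double loop_time, int threads)
{
    assert(threads > 0);
    return serial_time + loop_time / threads;
}

/* Speedup of the threaded run over the single-thread run. */
double speedup(double serial_time, double loop_time, int threads)
{
    return run_time(serial_time, loop_time, 1) /
           run_time(serial_time, loop_time, threads);
}
```

With hypothetical timings of 0.5 units for the part that does not thread and 4.5 units for the loop, speedup(0.5, 4.5, 2) is about 1.82. If optimization cuts only the loop, say to 1.5 units, speedup(0.5, 1.5, 2) drops to 1.6, even though the optimized threaded run is still the fastest of the four cases.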
If you have nested loops, combining OpenMP parallelization of the outer loop with optimization of the inner loop is much more effective if the inner loop has stride 1, while the outer loop retains enough iterations for effective threading. If your OpenMP performance happens to be limited by the need to pass data between the caches of the 2 processors, you will approach a performance ceiling which can be raised only by improving the data access behavior.
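As an illustration of that loop arrangement (not taken from the original poster's code), here is a small matrix-vector product in C. The function name matvec and the row-major layout are assumptions of this sketch. Without OpenMP enabled the pragma is ignored and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = sum over j of a[i*cols + j] * x[j], for a row-major matrix a. */
void matvec(const double *a, const double *x, double *y, int rows, int cols)
{
    assert(a != NULL && x != NULL && y != NULL);
    assert(rows >= 0 && cols >= 0);

    /* Outer loop: rows are independent, so they are shared among threads. */
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        const double *row = a + (size_t)i * cols;
        double sum = 0.0;

        /* Inner loop: consecutive j touch consecutive addresses (stride 1),
           which is the access pattern the optimizer handles best. */
        for (int j = 0; j < cols; j++)
            sum += row[j] * x[j];

        y[i] = sum;
    }
}
```

Each thread writes only its own y[i] and walks whole rows, so the threads do not need to exchange matrix data. If the loops were swapped so that the inner loop strode across rows, the inner loop would lose its stride 1 access.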
As you suggest, if a single optimized thread of your program uses 50% or more of the memory bus capacity, and 2 CPUs share the same resources, parallel scaling suffers. There also you could have a performance ceiling.
On Intel Xeon style processors, if you have these memory problems, you should give attention first to having the threads write to distinct memory regions (no false sharing), and then to using the Write Combine buffers (WCB) efficiently. Northwood is optimized for at most 4 open data streams writing to separate cache lines in a loop, Prescott for 6. If you are using HT, these buffers are shared between the 2 logical processors. It is very difficult to optimize WCB use except with stride 1 arrays, similar to the requirements for SSE vectorization.
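One common way to keep threads writing to distinct memory regions is to pad per-thread data out to a full cache line. The sketch below is in C. The 64-byte line size, the MAX_THREADS bound and the names padded_sum and padded_total are assumptions of this example, and the right line size depends on your processor.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes; check your CPU */
#define MAX_THREADS 8   /* hypothetical upper bound for this sketch */

/* One accumulator per thread, padded so that the values of two
   neighbouring accumulators sit a full cache line apart. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_sum;

/* Sum n values using `threads` separate accumulators. */
double padded_total(const double *data, int n, int threads)
{
    assert(data != NULL && n >= 0);
    assert(threads > 0 && threads <= MAX_THREADS);

    padded_sum partial[MAX_THREADS] = {{0}};

    #pragma omp parallel for num_threads(threads)
    for (int t = 0; t < threads; t++) {
        /* Iteration t reads its own contiguous block of data
           and writes only to partial[t]. */
        int begin = (int)((long)n * t / threads);
        int end = (int)((long)n * (t + 1) / threads);
        for (int i = begin; i < end; i++)
            partial[t].value += data[i];
    }

    double total = 0.0;
    for (int t = 0; t < threads; t++)
        total += partial[t].value;
    return total;
}
```

Without the pad member, neighbouring accumulators would be 8 bytes apart and several would land in the same cache line, so each write by one CPU could invalidate the line held by the other.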
More than I should have said without better clues about your application and platform.
> Northwood is optimized for at most 4 open data streams to write to
> separate cache lines in a loop, Prescott for 6.
I've been wondering about this for some time... how do you tell what processor you really have? I was previously wondering about this with regard to -xN and -xP (which of these -fast defaults to has changed, IIRC, going from 8.0 to 8.1 of the C Compiler).
If you don't have a Prescott CPU, a -xP build from an Intel compiler would refuse to run. I hope that if -fast defaults to -xP, it would do so only when compiling on a Prescott CPU. The first Prescott CPUs became available on the market less than 2 months ago. Among the easier ways to distinguish them is by L2 cache size, Northwood having 512kB and Prescott having at least 1MB. Statistics in /proc/cpuinfo should give you such information.
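If you would rather check this from a program than by eye, a minimal sketch in C follows. It assumes the x86 Linux layout of /proc/cpuinfo, where a line reads like "cache size : 512 KB"; the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the size in kB from one cpuinfo line such as
   "cache size\t: 512 KB". Returns -1 for any other line. */
long cache_kb_from_line(const char *line)
{
    assert(line != NULL);

    long kb;
    if (strncmp(line, "cache size", 10) != 0)
        return -1;

    const char *colon = strchr(line, ':');
    if (colon == NULL || sscanf(colon + 1, "%ld", &kb) != 1)
        return -1;
    return kb;
}

/* Scan a cpuinfo-style file and return the first cache size found,
   or -1 if the file cannot be read or has no such line. */
long cache_kb_from_file(const char *path)
{
    char line[512];
    long kb = -1;

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    while (kb < 0 && fgets(line, sizeof line, f) != NULL)
        kb = cache_kb_from_line(line);

    fclose(f);
    return kb;
}
```

Calling cache_kb_from_file("/proc/cpuinfo") and comparing the result against the sizes quoted above (512 for Northwood, 1024 or more for Prescott) applies the same rule of thumb; it is only meaningful for telling those two processors apart.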
This is always a tricky issue when it comes up. In the past, I've seen things like this happen. I suspect that the problem might be within the compiler and is caused by some configuration that is unique to your code. It may be the number of statements before or after the loop, the arrangement of statements, the number of blank lines between the pragmas, or some other weird little "corner" situation. This makes creation of a separate test case for demonstration purposes nearly impossible.
To create such a case, may I suggest starting with the original code and removing chunks around the loop (likely the code before the loop). You can set up a sort of binary chop to attempt to narrow down the source of the problem. Since this is part of a larger application, there may be consequences for doing this, though, so you may not be able to perform this kind of investigation without drastic changes to other parts of the app to get a stable enough run while trying to find the problem in the loop. If it happens on the first run through this part of the code, simply exit once you've seen whether or not the problem remains.
If you can try the more recent release of the compiler and the problem is resolved, then you needn't worry about it. If the problem remains, you're back to square one.
If you have uncovered a compiler problem, you should be able to submit the subroutine to Intel Premier Support (modifying the variable names and removing comments in order to disguise the code). The support people should be able to compile the code fragment you give them and compare the generated code to see if there is some adverse interaction between optimizations and OpenMP parallelization triggered from your source.