I am parallelizing a large Fortran loop with a deep call hierarchy using OpenMP. I get great speedups when I run the Debug build on my machine, but the improvement all but disappears with a Release build. Absolute execution times are, of course, consistently better with the Release build. I know that results from a Debug build should be taken with a pinch of salt, but are there compiler optimization switches that might give me a similar speedup in the Release build?
I am using the Intel Fortran compiler with MS Visual Studio.
With runtime checks enabled in Debug mode, this is an apples-to-oranges comparison.
For a better comparison, create a new configuration based on Debug and call it DebugOpenMPFast. In that configuration, turn off the runtime checks (array bounds, uninitialized variables, etc.) and optimize for maximum speed. Note that stepping through optimized code with the debugger is somewhat difficult, so you may want to compile your main program and a few selected routines with optimizations off. Good candidates are routines such as a major control loop that contain no compute-intensive code.
My configurations are:
Debug (full debugging, no optimizations, no OpenMP)
DebugFast (full debugging, debugged routines optimized, no OpenMP)
DebugOpenMP (full debugging, no optimizations, OpenMP)
DebugOpenMPFast (full debugging, debugged routines optimized, OpenMP)
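As a rough sketch, the four configurations might map onto Intel Fortran command-line options on Windows as shown here. The exact sets are assumptions: the post does not say which runtime checks the unoptimized configurations enable, and "maximum speed" is taken to mean /O2. In Visual Studio the same settings live in the project property pages.

```text
Debug            /Od /debug:full /check:all
DebugFast        /O2 /debug:full /check:none
DebugOpenMP      /Od /debug:full /check:all  /Qopenmp
DebugOpenMPFast  /O2 /debug:full /check:none /Qopenmp

(assumption) In the "Fast" configurations, override individual source files
such as the main program with /Od so they remain easy to step through.
```

Comparing DebugFast against DebugOpenMPFast then isolates the effect of OpenMP, because both sides are optimized and neither pays for runtime checks.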
Or you may have just fallen into one of the common pitfalls of parallel programming. If you write a single-threaded program that is very inefficient, you will find that running multiple copies of it gives you great scaling. This is because the inefficient code leaves lots of performance on the table, often in the form of large latencies, mostly from waiting for memory. When you add more threads, the processor can do a better job of keeping busy. That's essentially what you're doing when you compile the code for debug. We've seen a few cases in the past where a project starts by parallelizing the code, gets great scaling, and later improves the performance of the underlying code, only to discover that the scaling is gone.
That's why we recommend that any parallelization project start by squeezing as much performance as you can out of the single-threaded code before parallelizing. Then you are less likely to be lulled into a false sense of accomplishment when you do start the parallelization effort.
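One way to keep yourself honest is to time the same kernel serially and in parallel within a single build, and to report speedup only against the optimized serial time. The following is a minimal sketch, not the original poster's code: the module scaling_demo, the routine work_sum, and the problem size are hypothetical stand-ins for the real loop, and the program has to be built with OpenMP enabled (for example /Qopenmp with Intel Fortran on Windows) so that omp_lib is available.

```fortran
module scaling_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Hypothetical stand-in for the real compute loop.
  ! The if() clause runs the loop on one thread when parallel is .false.
  function work_sum(n, parallel) result(total)
    integer, intent(in) :: n
    logical, intent(in) :: parallel
    real(dp) :: total
    integer :: i
    total = 0.0_dp
    !$omp parallel do reduction(+:total) if(parallel)
    do i = 1, n
      total = total + sqrt(real(i, dp))
    end do
    !$omp end parallel do
  end function work_sum

  ! Speedup of a parallel run relative to a serial baseline.
  pure function speedup(t_serial, t_parallel) result(s)
    real(dp), intent(in) :: t_serial, t_parallel
    real(dp) :: s
    s = t_serial / t_parallel
  end function speedup

end module scaling_demo

program measure_scaling
  use omp_lib, only: omp_get_wtime
  use scaling_demo
  implicit none
  integer, parameter :: n = 50000000   ! arbitrary size, chosen for illustration
  real(dp) :: t0, t_serial, t_parallel, s_serial, s_parallel

  ! Serial baseline, timed in the same build as the parallel run.
  t0 = omp_get_wtime()
  s_serial = work_sum(n, .false.)
  t_serial = omp_get_wtime() - t0

  ! Parallel run of the identical loop.
  t0 = omp_get_wtime()
  s_parallel = work_sum(n, .true.)
  t_parallel = omp_get_wtime() - t0

  ! Printing the sums keeps the optimizer from discarding the work.
  print '(a, 2es24.15)', 'sums (serial, parallel): ', s_serial, s_parallel
  print '(a, f10.4, a)', 'serial time:   ', t_serial, ' s'
  print '(a, f10.4, a)', 'parallel time: ', t_parallel, ' s'
  print '(a, f10.2)',    'speedup:       ', speedup(t_serial, t_parallel)
end program measure_scaling
```

Running this once in an unoptimized build and once in an optimized build shows how much of the apparent scaling survives once the serial code is fast. The two sums can differ in the last few digits because the reduction adds the terms in a different order.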