OpenMP Speedups in Debug version

harris_raymann · 02-14-2008

This is a great forum. I am finding answers in the archive to a lot of issues I was dealing with.

I am parallelizing a large FORTRAN loop consisting of deep call hierarchies using OpenMP. I am getting great speedups when I run the Debug version on my machine. The issue is that this improvement goes down the drain with a Release build. Of course, my execution times are consistently better on the Release build. Although results from a Debug version should be taken with a pinch of salt, I am wondering whether there are compiler optimization switches that might give me a similar performance boost on the Release version.

I am using Intel FORTRAN compiler with MS VS.

Harris

jimdempseyatthecove · 02-14-2008

This is an apples and oranges comparison when using runtime checks in debug mode.

For a better comparison, create a new configuration based on Debug and call it DebugOpenMPFast. In that configuration, turn off runtime checks (array bounds, uninitialized variables, etc.) and set optimization for maximum speed. Note that stepping through optimized code with the debugger is somewhat difficult, so you may want to compile your main program and a few selected routines with optimizations off. These would be routines, such as a major control loop, that do not contain compute-intensive code.

My configurations are

Debug (full debugging, no optimizations, no OpenMP)
DebugFast (full debugging, debugged routines optimized, no OpenMP)
DebugOpenMP (full debugging, no optimizations, OpenMP)
DebugOpenMPFast (full debugging, debugged routines optimized, OpenMP)
Release
ReleaseOpenMP
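
One way to keep the comparison apples to apples is to compute the OpenMP speedup only between configurations that differ in nothing but the OpenMP switch. Here is a minimal sketch in Python. The timings are placeholders, not measurements, and the helper name openmp_speedup is made up for illustration.

```python
# Placeholder wall-clock times in seconds. These are NOT measurements;
# replace them with timings from your own runs of each configuration.
timings = {
    "Debug": 100.0,
    "DebugOpenMP": 30.0,
    "DebugFast": 20.0,
    "DebugOpenMPFast": 8.0,
    "Release": 10.0,
    "ReleaseOpenMP": 4.0,
}

# Each pair differs only in whether OpenMP is enabled.
pairs = [
    ("Debug", "DebugOpenMP"),
    ("DebugFast", "DebugOpenMPFast"),
    ("Release", "ReleaseOpenMP"),
]


def openmp_speedup(serial_config, openmp_config, times):
    """Return serial time divided by OpenMP time for one matched pair."""
    return times[serial_config] / times[openmp_config]


for serial_config, openmp_config in pairs:
    ratio = openmp_speedup(serial_config, openmp_config, timings)
    print(f"{serial_config} -> {openmp_config}: {ratio:.2f}x")
```

Comparing, say, Debug against ReleaseOpenMP would mix the effect of optimization with the effect of threading, which is the comparison to avoid.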

Jim Dempsey

TimP · 02-14-2008

If your application supports vectorization, you should at least enable that before evaluating OpenMP. You can override the debug options with options that enable vectorization (e.g. -O2 -QxW), at the expense of more compile time, less visibility of data under debug, and so on. Adding -fp:precise to those options would help reduce the risk of different numerical results.

robert-reed · 02-25-2008

Or you may have just fallen into one of the common pitfalls of parallel programming. If you write a single-threaded program that is very inefficient, you will find that running multiple copies of it will give you great scaling. This is because the inefficient code leaves lots of performance on the table, often in the form of large latencies, mostly waiting for memory. When you add more threads, the processor can do a better job of keeping busy. That's essentially what you're doing when you compile the code for debug. We've seen a few cases in the past where a team starts by parallelizing the code, gets great scaling, and later improves the performance of the underlying code only to discover that the scaling is gone.

That's why we recommend, in any parallelization project, starting by squeezing as much performance out of the single-threaded code as you can before parallelizing. Then you are less likely to be lulled into a false sense of accomplishment when you do start the parallelization effort.
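
To make the effect concrete, here is a toy model in Python. It is a sketch under stated assumptions, not a description of any real application: run time is split into per-thread compute work that divides evenly across threads and a shared cost (for example, time bound by memory) that does not shrink as threads are added. All numbers are invented for illustration.

```python
def run_time(compute, shared, threads):
    """Toy model: compute work divides across threads, shared cost does not."""
    return compute / threads + shared


def speedup(compute, shared, threads):
    """Speedup of the same build at `threads` threads versus one thread."""
    return run_time(compute, shared, 1) / run_time(compute, shared, threads)


# Invented numbers in seconds. The shared cost is assumed to be identical
# in both builds; only the compute part shrinks when the compiler optimizes.
SHARED = 2.0
DEBUG_COMPUTE = 38.0    # unoptimized code: lots of work per thread
RELEASE_COMPUTE = 2.0   # optimized code: far less work per thread
THREADS = 4

debug_speedup = speedup(DEBUG_COMPUTE, SHARED, THREADS)
release_speedup = speedup(RELEASE_COMPUTE, SHARED, THREADS)

print(f"debug speedup on {THREADS} threads:   {debug_speedup:.2f}x")
print(f"release speedup on {THREADS} threads: {release_speedup:.2f}x")
print(f"debug time:   {run_time(DEBUG_COMPUTE, SHARED, THREADS):.2f} s")
print(f"release time: {run_time(RELEASE_COMPUTE, SHARED, THREADS):.2f} s")
```

In this model the debug build scales much better, yet the release build still finishes sooner in absolute terms, which matches the pattern described in the original question. The speedup that matters is the one measured against the fastest single-threaded build.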