This is an apples-and-oranges comparison when runtime checks are enabled in the debug build.
For a fairer comparison, create a new configuration based on Debug; call it DebugOpenMPFast. In that configuration, turn off runtime checks (array bounds, uninitialized variables, etc.) and set the optimizer for maximum speed. Note that stepping through optimized code with the debugger is somewhat difficult, so you may want to compile your main program and a few selected routines with optimizations off. Good candidates are major control loops that contain no compute-intensive code.
My configurations are:
Debug (full debugging, no optimizations, no OpenMP)
DebugFast (full debugging, debugged routines optimized, no OpenMP)
DebugOpenMP (full debugging, no optimizations, OpenMP)
DebugOpenMPFast (full debugging, debugged routines optimized, OpenMP)
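For concreteness, the four configurations above might map to compiler switches roughly like the following. This is only a sketch assuming gfortran from the command line; the exact option spellings differ for other compilers (the Intel compiler on Windows uses /check, /O2, and /Qopenmp, for example), and in an IDE you would set the equivalent properties per configuration.

```
Debug:           -g -O0 -fcheck=all
DebugFast:       -g -O0 -fcheck=all   (debugged files: -O2, checks off)
DebugOpenMP:     -g -O0 -fcheck=all -fopenmp
DebugOpenMPFast: -g -O0 -fcheck=all -fopenmp   (debugged files: -O2, checks off)
```

The point of the per-file override in the Fast configurations is that you keep full debuggability in the routines you are still working on while the already-debugged hot code runs at realistic speed.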
Or you may have just fallen into one of the common pitfalls of parallel programming. If you write a single-threaded program that is very inefficient, you will find that running multiple copies of it gives you great scaling. This is because the inefficient code leaves lots of performance on the table, often in the form of large latencies, mostly waiting for memory. When you add more threads, the processor can do a better job of keeping busy. That's essentially what you're doing when you compile the code for debug. We've seen a few cases in the past where a parallelization project starts by parallelizing the code, gets great scaling, and later improves the performance of the underlying code only to discover that the scaling is gone.
That's why we recommend that any parallelization project start by squeezing as much performance out of the single-threaded code as you can before parallelizing. Then you are less likely to be lulled into a false sense of accomplishment when you do start the parallelization effort.