Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here.

OpenMP slower than no OpenMP

New Contributor I

Here is a Friday post with a sufficient lack of information that it will probably be impossible to answer. I have some older Fortran code whose performance I'm trying to improve. VTune shows that 75% of the serial execution time is consumed calculating the numerical Jacobian of an expensive function. It is easy to parallelize. I first tried MPI, and that does show modest improvement when a few processes are added, but it does not scale very well, probably because of the large Jacobian matrix that must be broadcast to all the processes.

So, I also tried OpenMP, thinking that it might do slightly better since it does not need to broadcast the matrix. However, when I run the serial code with the OMP directives disabled, it runs four times faster than the code with the OMP directives enabled but using only one thread. If more threads are used, some improvement occurs, but it never gets better than the code without the OMP directives.

My question: Does OpenMP incur a large overhead even if only one thread is used so there is no forking?  

Black Belt

The question might be more appropriate for a help forum pertaining to the compiler of your choice, although you don't give much information. A possible reason for the performance loss you describe is that OpenMP may prevent a compiler optimization, such as loop interchange, loop fusion, or dead-code elimination. You are correct that some OpenMP errors, such as race conditions, are likely not to show up when running a single thread.

New Contributor I

Thanks, Tim. Sorry for the lack of info; this is one of those cases where a small example that reproduces the problem is not easy to come up with, and providing the whole code is not practical. I am using the latest Intel Parallel Studio XE Cluster Edition, running on Windows 10 (I have not tested on my CentOS cluster yet). I'll continue to experiment with the various Intel tools to see if I can locate the slowdown, but your hypothesis of a lost compiler optimization sounds interesting.