Debugging OpenMP programs

mullervki · ‎02-05-2019

Hello,

I have a program - call it A - that makes heavy use of OpenMP. I see the CPU time going down as I increase the number of threads in my "#pragma omp for" loops.

When this program is called from within another program - call it B - that also uses OpenMP, the CPU time no longer goes down the way it does when A runs in stand-alone mode.

I could use help with two questions:

1) Does anybody know of something that program B could be doing that would prevent program A from parallelizing properly?

2) Is there a way that I can compile/link my program (perhaps both A and B), so that every call/instance of an OpenMP directive is sent to a console window so I can see what program B may be doing that is making A no longer run in parallel?

Thanks.

-Arthur

jimdempseyatthecove · ‎02-06-2019

Describe what you mean by "When this program is called from within another program - call it B - that also uses OpenMP". Be specific.

Are you referring to your parallel program A calling a parallel library B (e.g. MKL)? (within same process)
Are you referring to your parallel program A spawning/messaging a different parallel program B? (separate processes)
Or something different?

Jim Dempsey

mullervki · ‎02-06-2019

Hi Jim,

Thanks for the reply - and for the good follow-up questions.

A is my stand-alone program. But I can easily change it to be a function called from program B, and this is what I did. In stand-alone mode, program A shows that everything is being properly parallelized: multi-threading efficiency with OpenMP is high.

But once I just call the entire program A as a function from program B, the entire OpenMP parallelization appears to not be happening. It's like program B is making some call to an omp_* function that is preventing program A (now a function inside program B), from running efficiently.

-Arthur

Menger__Bill · ‎02-06-2019

It sounds like you are saying no parallelization occurs when called from the wrapper program? Or are you saying you use OpenMP in both the wrapper program and the function? If the latter, then you are "overthreading" and are being limited by external factors, such as number of memory channels, which memory banks are assigned to which chip and thread (if you have a dual-socket server), or things of that nature. For instance, if I am on a 20 core 8 memory channel server and call for 20 threads with my function in OpenMP but then call the function from a main that is compiled with the -parallel flag, I'll likely get 20x20 threads but realistically I'll only see about a 6-7x speedup. 8x would be the most I'd probably ever see unless I could do most of my work within L1 cache and registers.

If the former, then you probably have something configured incorrectly in the main that is causing OMP_NUM_THREADS to be 1, or a similar issue.

Menger__Bill · ‎02-06-2019

A diagnostic you can try is to run "top" (then hit 1 and H to show all cores and subsequently all threads) and run your program. See how many threads you are getting.

mullervki · ‎02-06-2019

Hi Bill,

Unfortunately, I control program A, but have no control over program B. So I don't really know how A gets integrated into B.

As for a diagnostic tool, I was looking for something that I could enable during compilation - perhaps a compiler flag - or an environment variable to set at runtime that would dump to the console every time a thread is started/ended, plus other information that could be useful. The idea is to generate the output of this tool on my end, then compare it with what the people who control B get. Ideally, I could see some omp call issued before B gets to A that could serve as a clue.

-Arthur

Menger__Bill · ‎02-06-2019

Ok, in that case, you might try putting in something like this (if in C)

fprintf(stderr, "%s:%d: THREAD# %d\n", __FILE__, __LINE__, omp_get_thread_num());

The routine you need is omp_get_thread_num(), declared in omp.h. Place this line inside your #pragma omp parallel region and see how many threads you are getting.
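
A minimal sketch of this kind of instrumentation, assuming a C compiler with OpenMP support. The macro OMP_TRACE and the function traced_loop are hypothetical names used only for illustration; traced_loop stands in for one of A's parallel loops. The serial fallbacks let the file build even when no OpenMP flag is given.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the file also builds without an OpenMP compiler flag. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Print "file:line: thread i of n" for the calling thread. */
#define OMP_TRACE() \
    fprintf(stderr, "%s:%d: thread %d of %d\n", __FILE__, __LINE__, \
            omp_get_thread_num(), omp_get_num_threads())

/* Hypothetical stand-in for one of A's parallel loops.
   Returns the team size that actually executed the region. */
int traced_loop(int n)
{
    assert(n >= 0);
    int team = 1;
#pragma omp parallel
    {
        /* One trace line per thread in the team. */
        OMP_TRACE();
        /* Exactly one thread records the team size. */
#pragma omp single
        team = omp_get_num_threads();
#pragma omp for
        for (int i = 0; i < n; ++i) {
            /* A's real loop body would go here. */
        }
    }
    return team;
}
```

If the stand-alone build prints several lines per region and the build embedded in B prints only "thread 0 of 1", the region is being serialized.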

If you are in F90/F95 there is a similar call to omp from Fortran but I don't remember it.

mullervki · ‎02-06-2019

Bill,

Is it possible that all threads are being created as requested but for some reason unknown to me (machine limit, program B limitations, etc.) the threads are not really running in parallel, but sequentially?

-Arthur

McCalpinJohn · ‎02-06-2019

If the "outer" program uses OpenMP and the "inner" program is called from inside a parallel region of the "outer" program, then the "inner" program's parallel regions are "nested". I don't need (or want) "nested parallelism" in any of my programs, so I have not learned the details about what OpenMP does in these cases, but there are two issues that can be investigated:

1. It is possible that (due to parallelism in the "outer" program), your function is being executed by a single thread. You have parallel regions, but instead of spreading across multiple threads, all of the "chunks" of work are being executed in sequence by a single thread. To check this, I use something like:

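#ifdef _OPENMP
    /* Team size as reported by the runtime (k is an int declared earlier). */
#pragma omp parallel
    {
#pragma omp master
        {
            k = omp_get_num_threads();
            printf("Number of Threads requested = %i\n", k);
        }
    }
#endif

#ifdef _OPENMP
    /* Team size as counted: each thread increments k exactly once. */
    k = 0;
#pragma omp parallel
#pragma omp atomic
    k++;
    printf("Number of Threads counted = %i\n", k);
#endif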

2. Another possibility is that you are getting the threads you are asking for, but they are being bound to run on a subset of the processor cores. In the "nested parallelism" case, the processors may be allocated in the "outer" parallel region, and the "inner" parallel region is limited to running on the set of cores allocated to the "outer" thread that called the "inner" function. In the Linux world, you can check this by setting up an OpenMP parallel region and having each thread call "sched_getaffinity()". I usually put this in an OMP CRITICAL section so that the output from the various threads does not get interleaved.
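
The affinity check in item 2 could look roughly like the sketch below. It assumes Linux with glibc; allowed_cpus and report_affinity are hypothetical helper names. On platforms where the glibc affinity macros are not visible, the helper returns -1 instead of failing to compile.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes sched_getaffinity() and CPU_COUNT in glibc */
#endif
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the file also builds without an OpenMP compiler flag. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Number of CPUs the calling thread may run on, or -1 if unknown. */
static int allowed_cpus(void)
{
#if defined(__linux__) && defined(CPU_COUNT)
    cpu_set_t mask;
    CPU_ZERO(&mask);
    /* pid 0 means "the calling thread". */
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
#else
    return -1;   /* affinity query not available on this platform */
#endif
}

/* Every thread of a parallel region reports its affinity.
   Returns the smallest CPU count seen, or -1 if a query failed. */
int report_affinity(FILE *out)
{
    assert(out != NULL);
    int smallest = allowed_cpus();
#pragma omp parallel
    {
        int n = allowed_cpus();
        /* The critical section keeps the output lines from interleaving. */
#pragma omp critical
        {
            fprintf(out, "thread %d may run on %d CPU(s)\n",
                    omp_get_thread_num(), n);
            if (n < smallest)
                smallest = n;
        }
    }
    return smallest;
}
```

If every thread reports a single CPU when A runs inside B, but many CPUs when A runs stand-alone, the threads exist but are confined to one core.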

jimdempseyatthecove · ‎02-06-2019

>>But once I just call the entire program A as a function from program B

The statement above, to me, states you have two processes involved, as opposed to a single process with a main program B calling a function A.

I had asked earlier if this is two (multiple processes/programs) .OR. a single process, with a single main and object files/shared libraries from both B and A. Which is it???

If multiple processes, there are several different ways for process B to instantiate process A. Depending on the method, A can be instantiated with the environment of B, or A can be instantiated with a different environment. B's environment may be such that it precludes A from having an OpenMP thread pool with more than 1 thread. .OR. both processes may start with a full complement of threads. In the event of having two processes with the full count of threads, you now have 2x the number of available hardware threads. Thus, if B is running in a parallel region, or has just exited a parallel region, when one of the threads from B spawns A, then A attempts to start up its thread pool while B's threads are in spin-wait state (about 200 ms after the end of a parallel region by default with the Intel runtime). IOW you have oversubscription while B's threads are spin-waiting.

IFF single process, IOW you remove main of A or rename main of A to something else (formerly_main_of_A), things are different. If B is (or has) OpenMP threading, and B's threads within a parallel region call A, then you have nested parallelism (see John McCalpin's post), and if A and B do not coordinate, you could have an oversubscription of (# HW threads) * (# HW threads) if nested parallelism is enabled (OMP_NESTED=TRUE). If nested is false, and A is called from within a parallel region of B, then each parallel region in A runs sequentially on the one thread of B that called it.
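
Whether A is entered from inside one of B's parallel regions can be checked from within A itself. The following is a minimal sketch, assuming OpenMP 3.0 or later; report_call_context is a hypothetical helper name. Note that omp_in_parallel() only reports active regions, so a region of B that runs with a single thread does not count.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the file also builds without an OpenMP compiler flag. */
static int omp_in_parallel(void)      { return 0; }
static int omp_get_level(void)        { return 0; }
static int omp_get_active_level(void) { return 0; }
static int omp_get_max_threads(void)  { return 1; }
#endif

/* Call at the top of A's entry point, before A's first parallel region.
   Returns nonzero if the caller is already inside an active parallel region. */
int report_call_context(const char *label)
{
    assert(label != NULL);
    int nested = omp_in_parallel();
    fprintf(stderr,
            "%s: in_parallel=%d level=%d active_level=%d max_threads=%d\n",
            label, nested, omp_get_level(), omp_get_active_level(),
            omp_get_max_threads());
    return nested;
}
```

A nonzero return inside B, together with max_threads=1 or a nonzero level, would point to the nested case described above.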

On the other hand, if B is OpenMP parallel, but only the master thread of B calls upon A, then you may have success by setting the block time to 0 (KMP_BLOCKTIME=0), so that idle threads stop spin-waiting immediately after a parallel region ends.

There are other things to consider... depending on how you are piecing B and A together. We need the specifics.

Jim Dempsey

Menger__Bill · ‎02-06-2019

beautifully stated.

mullervki · ‎02-07-2019

>>I had asked earlier if this is two (multiple processes/programs) .OR. a single process, with a single main and object files/shared libraries from both B and A. Which is it???

Apologies for not having made myself clear. While testing, A is a stand-alone program. When I pass my source to the B developers they simply put all of A inside a function and make a function call. So when A is part of B it is NOT a separate process: it becomes an integral part of B as B now just calls A as a function.

That's why it's important to know what's going on in B as far as OpenMP is concerned (which I don't). When B calls A as a function, there may have been previous calls to omp functions, and that may be preventing A from working efficiently as a multi-threaded program.

>>If nested is false, and A is called from within a parallel region of B then A is run sequentially.

If OMP_NESTED=FALSE, would new threads be created in A but run sequentially, or would A end up NOT creating new threads? I ask because it appears that threads are being created, but the lack of multi-threaded performance may indicate that the threads are created but may be running sequentially.

Regarding KMP...: unfortunately, I'm not familiar with any of the KMP settings. Perhaps some help here would be appropriate. Are these environment variables I can set?

Perhaps a good approach now would be to ask the developers of B the following questions:

1) What OMP/KMP environment variables are they setting, and to what value?

2) What omp_ functions are being called prior to calling A as a function? Any function in particular I should call their attention to?
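
Question 1 can also be answered from inside the process. The sketch below prints a handful of OpenMP-related environment variables as the running process sees them; dump_omp_environment is a hypothetical helper name, and the variable list is only a selection. It does not detect settings that B changes through omp_set_* calls, which is what question 2 is about.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Print OpenMP-related environment variables visible to this process.
   Returns how many of them are set. */
int dump_omp_environment(FILE *out)
{
    static const char *const names[] = {
        "OMP_NUM_THREADS", "OMP_NESTED", "OMP_MAX_ACTIVE_LEVELS",
        "OMP_THREAD_LIMIT", "OMP_DYNAMIC", "OMP_PROC_BIND",
        "KMP_BLOCKTIME", "KMP_AFFINITY"
    };
    const int count = (int)(sizeof names / sizeof names[0]);
    int set = 0;

    assert(out != NULL);
    for (int i = 0; i < count; ++i) {
        const char *value = getenv(names[i]);
        fprintf(out, "%s=%s\n", names[i], value ? value : "(unset)");
        if (value)
            ++set;
    }
    return set;
}
```

Calling it once at the start of A, in both the stand-alone and the embedded build, gives two listings that can be compared line by line.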

Thanks.

-Arthur

jimdempseyatthecove · ‎02-07-2019

>>If OMP_NESTED=FALSE, would new threads be created in A but run sequentially, or would A end up NOT creating new threads?

This depends:

1) if A were called from a serial region of B then A can have parallel regions running in parallel
2) if A were called from a parallel region of B, then for each such call, the parallel regions of A would run serially within that call, while the calls themselves run in parallel across B's threads. IOW no nesting: parallel region expansion in either A or B, but not both.
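
Case 2 can be reproduced in isolation. The sketch below is an illustration under assumptions, not a description of what B does: the outer region plays the role of B, the inner region plays the role of A, and nesting is switched off by limiting the runtime to one active level. inner_team_without_nesting is a hypothetical name, and the function is meant to be called from serial code.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the file also builds without an OpenMP compiler flag. */
static int  omp_get_thread_num(void)         { return 0; }
static int  omp_get_num_threads(void)        { return 1; }
static int  omp_get_max_active_levels(void)  { return 1; }
static void omp_set_max_active_levels(int n) { (void)n; }
#endif

/* Returns the size of the inner team when only one active level is allowed.
   The size of the outer team is stored in *outer_size. */
int inner_team_without_nesting(int *outer_size)
{
    assert(outer_size != NULL);
    int saved = omp_get_max_active_levels();
    int inner_size = 0;
    *outer_size = 0;

    /* One active level only: nested regions get a team of one thread. */
    omp_set_max_active_levels(1);

    /* Outer region: stands in for B. */
#pragma omp parallel num_threads(2)
    {
        int outer_id = omp_get_thread_num();
        if (outer_id == 0)
            *outer_size = omp_get_num_threads();

        /* Inner region: stands in for A, which asks for 4 threads. */
#pragma omp parallel num_threads(4)
        {
            /* Inside this region omp_get_thread_num() is team-relative. */
            if (outer_id == 0 && omp_get_thread_num() == 0)
                inner_size = omp_get_num_threads();
        }
    }

    omp_set_max_active_levels(saved);   /* restore the previous setting */
    return inner_size;
}
```

When the outer team really has more than one thread, the inner region is expected to report a team of one, and omp_get_thread_num() inside it returns 0 in every call.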

Jim Dempsey

mullervki · ‎02-07-2019

Jim,

I will try to find out how A is being called from within B: parallel or serial region.

My question referred to the case where A is being called from a parallel region of B. If my pragma says, for example "num_threads(4)", would I see omp_get_thread_num() returning 0 through 3 - even though the threads are run sequentially one after the other? Or would OpenMP not even thread the area and always show either 0 (that is, the thread number in A), or a single non-zero number referring to the thread number in B - since B is assumed to be a parallel section?

But I can see already that finding out whether A is called from within a parallel or serial section of B is critical. Thanks!

-Arthur

jimdempseyatthecove · ‎02-07-2019

When nested=true, then:

1) if A were called from a serial region of B then A can have parallel regions running in parallel
2) if A were called from a parallel region of B, then for each such call, the parallel regions of A would each have a thread pool per calling thread from B. IOW nesting: parallel region expansion in either A or B or both.

Note, in addition to OMP_NESTED=TRUE you can also specify a maximum nest level/depth (OMP_MAX_ACTIVE_LEVELS, or omp_set_max_active_levels() from code).

Also, it is the responsibility of the programmer (code) and/or operator (environment variables, or command line options) to specify how and where each nest level is to be placed, and the extent of each thread pool. For example:

Assume the system has 16 hardware threads and you desire to have parallel regions of B call A and have A's parallel regions run in parallel. Set OMP_NESTED=TRUE and have...

B restricted to 2 threads and A restricted to 8 threads, or
B restricted to 3 threads and A restricted to 5 threads, or
B restricted to 4 threads and A restricted to 4 threads, or
B restricted to 5 threads and A restricted to 3 threads, or
B restricted to 8 threads and A restricted to 2 threads, or
...
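
The pairs in this list match an integer division of the hardware thread count by B's team size, rounded down. A minimal sketch of that arithmetic follows; inner_threads_per_outer is a hypothetical helper name, and the rule itself is an assumption read off the list above rather than anything mandated by OpenMP.

```c
#include <assert.h>

/* Threads each call of A may use when B runs `outer_threads` threads
   on a machine with `hw_threads` hardware threads (rounded down, at least 1). */
int inner_threads_per_outer(int hw_threads, int outer_threads)
{
    assert(hw_threads > 0 && outer_threads > 0);
    int inner = hw_threads / outer_threads;
    return inner > 0 ? inner : 1;
}
```

The result could be passed to a num_threads clause on A's parallel regions, so that B's threads times A's threads never exceeds the hardware thread count.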

Additionally, IFF you know that A's threads are going to perform thread blocking operations (I/O, messaging, timed waits), you may consider oversubscribing the number of A's threads.

Jim Dempsey

jimdempseyatthecove · ‎02-07-2019

num_threads(n) on omp pragma means "at most n threads"
#pragma omp parallel ... (potentially) creates/enters a parallel region consisting of 1 or more threads in a thread team. This thread team is in the context of the thread issuing the #pragma omp parallel ...
omp_get_thread_num() returns the team member number of the thread team (parallel region) and .NOT. a global thread number.
omp_get_num_threads() returns the number of threads in the thread team and not the total number of threads created by all threads.

Additional note regarding the outermost layer of nested levels.

IFF B is a C# program that repeatedly calls A, or calls some other portion of B (in C or Fortran) which then calls A, then ensure that the thread in C# that makes these calls is instantiated only once (i.e. the thread is created once for the first call and reused on subsequent calls). Failure to do this will result in each new thread creating an additional base-level OpenMP thread pool, and in continually consuming resources (handles, thread contexts, memory, paging, overhead, oversubscription, and other messy things).

Jim Dempsey

mullervki · ‎02-07-2019

Jim,

>>num_threads(n) on omp pragma means "at most n threads"

Good information. I didn't know that. So it's possible that the number of threads being used in my "for" loops is actually smaller than what I had requested. I would need to print omp_get_num_threads() in each location as an instrumented program to be embedded into B. In fact, I would print both omp_get_num_threads() AND the requested value to make sure they're the same.
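
A minimal sketch of that comparison, assuming the requested count is passed through a num_threads clause; team_size_for_request and its label parameter are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the file also builds without an OpenMP compiler flag. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Open a parallel region asking for `requested` threads and report
   how many the runtime actually granted. Returns the granted count. */
int team_size_for_request(const char *label, int requested)
{
    assert(label != NULL && requested > 0);
    int granted = 1;
#pragma omp parallel num_threads(requested)
    {
        /* Exactly one thread records the team size. */
#pragma omp single
        granted = omp_get_num_threads();
    }
    fprintf(stderr, "%s: requested %d thread(s), got %d\n",
            label, requested, granted);
    return granted;
}
```

The same two numbers, printed next to each of A's parallel loops, can be compared between the stand-alone run and the run inside B.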

I have queried B developers into their use of omp_/kmp_ function calls and environment variables. Let's see what they have to say.

Thanks to everyone!

-Arthur

jimdempseyatthecove · ‎02-07-2019

Also, ensure that both A and B are using the same OpenMP runtime library, and that it is the one they were built to use (e.g. do not mix gcc's OpenMP runtime and Intel's in one process).

Consider using

if(omp_get_thread_num()==0) cout << omp_get_level() << ":" << omp_get_num_threads() << endl;
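
A C variant that also walks the enclosing nesting levels is sketched below, assuming OpenMP 3.0 or later. print_team_hierarchy is a hypothetical helper name; the omp_* routines it calls are standard.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the file also builds without an OpenMP compiler flag. */
static int omp_get_level(void)                    { return 0; }
static int omp_get_team_size(int level)           { return level == 0 ? 1 : -1; }
static int omp_get_ancestor_thread_num(int level) { return level == 0 ? 0 : -1; }
#endif

/* Print, for the calling thread, its thread number and team size at every
   nesting level from 0 (the initial thread) up to the current level.
   Returns the current nesting level. */
int print_team_hierarchy(FILE *out)
{
    assert(out != NULL);
    int depth = omp_get_level();
    for (int level = 0; level <= depth; ++level)
        fprintf(out, "level %d: thread %d of %d\n", level,
                omp_get_ancestor_thread_num(level), omp_get_team_size(level));
    return depth;
}
```

Called from inside one of A's parallel regions, a depth of 2 or more means A is running nested inside a region of B, and the per-level team sizes show where the threads went.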

There are other omp_.... functions you may find of use. See:

https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-openmp-run-time-library-routines

Jim Dempsey