OpenMP deadlocks for long running jobs

mriedman · ‎03-13-2017

We observe deadlocks in long-running OpenMP jobs across several different executables built with ifort15. Reproducing the issue typically takes more than 12 hours of running with 12 threads on a Haswell platform.

In all cases the master thread deadlocks at an OpenMP PARALLEL directive in the inner iteration loop of a BiCGStab solver. That loop has 5 parallel blocks featuring sum and max reductions. We have several variants of that solver and the issue reproduces with every one of them. When the deadlock occurs, the solver has done a total of >100 million iterations, which translates to almost a billion thread forks and joins.
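For readers without access to the source, the shape of one such parallel block can be sketched roughly as follows. This is C rather than Fortran, the function and variable names are hypothetical, and it is not the actual solver code; it only illustrates a parallel region that combines a sum and a max reduction, so that every call costs one fork and one join.

```c
/* Hypothetical stand-in for one parallel block of the solver's inner loop.
 * Each call opens and closes one OpenMP parallel region (one fork, one join).
 * Compile with OpenMP enabled (e.g. -qopenmp or -fopenmp); without it the
 * pragma is ignored and the loop simply runs serially. */
static void sum_and_max(const double *x, int n, double *sum_out, double *max_out)
{
    double sum = 0.0;     /* sum reduction, e.g. a dot product term */
    double maxabs = 0.0;  /* max reduction, e.g. a residual max-norm */

    #pragma omp parallel for reduction(+:sum) reduction(max:maxabs)
    for (int i = 0; i < n; i++) {
        double a = (x[i] < 0.0) ? -x[i] : x[i];
        sum += x[i];
        if (a > maxabs) maxabs = a;
    }

    *sum_out = sum;
    *max_out = maxabs;
}
```

Calling a routine like this five times per solver iteration, in a tight loop, would approximate the fork/join pressure described above.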

Has anybody come across a similar issue? Any hints for further troubleshooting are appreciated.

Master thread stack trace:

     pthread_cond_wait,                                          FP=7ffc465e2c80
     __kmp_suspend_64,                                           FP=7ffc465e2de0
     _Z26__kmp_hyper_barrier_gather12barrier_typeP8kmp_infoiiPFvPvS2_ES2_, FP=7ffc465e2f10
     _Z18__kmp_join_barrieri,                                    FP=7ffc465e2fd0
     __kmp_internal_join,                                        FP=7ffc465e2ff0
     __kmp_join_call,                                            FP=7ffc465e3040
     __kmpc_fork_call,                                           FP=7ffc465e3130
     bicgstab_solv`bicgstab_solv_all,                            FP=7ffc466f1490

Worker threads stack trace:

     pthread_cond_wait,                                          FP=2b15bc3fe770
     __kmp_suspend_64,                                           FP=2b15bc3fe8d0
     _Z26__kmp_hyper_barrier_gather12barrier_typeP8kmp_infoiiPFvPvS2_ES2_, FP=2b15bc3fea00
     _Z18__kmp_join_barrieri,                                    FP=2b15bc3feac0
     __kmp_launch_thread,                                        FP=2b15bc3feb00
     _Z19__kmp_launch_workerPv,                                  FP=2b15bc3fecd0
     start_thread,                                               FP=2b15bc3fed70
     __clone,                                                    FP=2b15bc3fed78

jimdempseyatthecove · ‎03-13-2017

Have you tried a newer version than V15?

Also, with some of the older versions of pthread, in non-OpenMP multi-threaded programming environments, I've experienced a similar situation: the application hung in pthread_cond_wait, yet inspection with the debugger showed that the condition was satisfied. In other words, there was an apparent race condition between one thread entering pthread_cond_wait and a different thread signaling the condition. If the signal occurs within a small window near the time the other thread enters pthread_cond_wait, the thread issuing the wait misses the event. My solution in the code I produced was to replace pthread_cond_wait with pthread_cond_timedwait and, if a timeout occurred, to use alternate means to determine whether the condition was met. This is not possible for you to implement; however, if your system has an older version of pthread (or an Intel Fortran runtime library with pthread), upgrading may resolve the issue.

Jim Dempsey

mriedman · ‎03-13-2017

Thanks, Jim, I will try ifort17 and possibly play with -qopenmp-lib as well. Testing on RHEL7 actually gives me a rather new libpthread.

If the issue is a race condition, then it could theoretically show up at any time, not just after >12 hours of runtime. That's why I also suspect some kind of resource exhaustion or integer overflow issue.

Michael

jimdempseyatthecove · ‎03-13-2017

When I saw the missed-signal situation, it was observed only during stress tests on my machine on overnight runs, after several hours of expressly trying to force the error to occur. This was several years ago; I cannot say whether the error has been corrected (my code still has the pthread_cond_timedwait). My analysis of the situation was that the race condition occurs at the confluence (near-simultaneous occurrence) of the calls to pthread_cond_wait and pthread_cond_signal.

Try placing this untested code at the end of the parallel section(s) that experience the hang.

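#include <unistd.h> // usleep

static int delayShared = 0; // must be visible to all threads of team
...
// each thread currently in this section gets a different delay (in microseconds),
// so the threads leave the parallel section at slightly different times
int delay = __sync_fetch_and_add(&delayShared, 1);
if (delay) usleep(delay);
__sync_fetch_and_add(&delayShared, -1);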

The above code isn't quite bulletproof. A better solution would be to use RDTSC and ensure that threads space their exits by a microsecond or so.

This kind of hack should not be required. However, if the fix works, you can get on with your daily business.

Jim Dempsey

mriedman · ‎03-20-2017

The resolution is to use ifort16 for linking. I did not bother to rebuild entirely; relinking is sufficient. The job terminated normally after 55 hours.