We observe deadlocks in long-running OpenMP jobs with different executables built with ifort15. Reproducing the issue typically takes >12 hours running with 12 threads on a Haswell platform.
In all cases the master thread deadlocks at an OpenMP PARALLEL directive in the inner iteration loop of a BiCGStab solver. That loop contains 5 parallel blocks featuring sum and max reductions. We have several variants of that solver and the issue reproduces with all of them. By the time the deadlock occurs the solver has completed a total of >100 million iterations, which translates to almost a billion thread forks and joins.
Has anybody come across a similar issue? Any hints for further troubleshooting are appreciated.
Master thread stack trace:
Worker threads stack trace:
Have you tried a newer version than V15?
Also, on some of the older versions of pthread, in non-OpenMP multi-threaded programming environments, I've experienced a similar situation where the application hung in pthread_cond_wait even though inspection with the debugger showed the condition had been satisfied. IOW an apparent race condition (small window) between one thread entering pthread_cond_wait and a different thread signaling the condition: if the signal occurs within a small window around the time the other thread enters pthread_cond_wait, the waiting thread can miss the event. My solution in the code I produced was to replace pthread_cond_wait with pthread_cond_timedwait, and on timeout to use alternate means to determine whether the condition was met. That is not something you can implement here; however, if your system has an older version of pthread (or an Intel Fortran runtime library linked against one), upgrading may resolve the issue.
Thanks, Jim, I will try ifort17 and possibly play with -qopenmp-lib as well. Testing on RHEL7 actually gives me a rather new libpthread.
If the issue is a race condition, it could in theory show up at any time, not just after >12 hours of runtime. That's why I also suspect some kind of resource exhaustion or integer overflow issue.
When I saw the missed-signal situation, it was observed only during stress tests on my machine on overnight runs: several hours of expressly trying to force the error to occur. This was several years ago, and I cannot say whether the error was ever corrected (my code still uses pthread_cond_timedwait). My analysis of the situation was that the race condition occurs at the confluence (near-simultaneous occurrence) of the calls to pthread_cond_wait and pthread_cond_signal.
Try placing this untested code at the end of the parallel section(s) that experience the hang.
static int delayShared = 0; // must be visible to all threads of team
int delay = __sync_fetch_and_add(&delayShared, 1); // unique exit rank per thread
for (volatile int spin = 0; spin < delay * 10000; ++spin) {} // stagger the exits by rank
if (delay == omp_get_num_threads() - 1) delayShared = 0; // last thread out resets for the next region
The above code isn't quite bulletproof. A better solution would be to use RDTSC and ensure that threads space their exits by a microsecond or so.
This kind of hack should not be required. However, if the fix works, you can get on with your daily business.