
Has anyone else had OpenMP programs hang at the end of execution?

Mark_Lewy

Before I open a case with Intel Support, has anybody else encountered this problem?

OS: Windows 10 Pro 1909

Compiler: XE2019U5 x64

Dev environment: VS2017 15.9.21

This was working with XE2018U3.

On rare occasions our simulation program hangs; here's a typical stack trace from a dump file:

     libiomp5md.dll!__kmp_suspend_initialize_thread(kmp_info * th) Line 359    C++
>    libiomp5md.dll!__kmp_free_team(kmp_root * root, kmp_team * team, kmp_info * master) Line 5941    C++
     libiomp5md.dll!__kmp_internal_end_library(int gtid_req) Line 4154    C++
     libiomp5md.dll!DllMain(HINSTANCE__ * hInstDLL, unsigned long fdwReason, void * lpReserved) Line 774    C++
     [External Code]    
     libifcoremd.dll!00007ffaf7224f66()    Unknown
     libifcoremd.dll!00007ffaf71966d1()    Unknown
     sim.exe!BBEXIT(long * STATUS) Line 64    Unknown
     sim.exe!CS_ENGINE_mp_CS_FINALISE() Line 410    Unknown
     sim.exe!SIM() Line 17    Unknown
     [External Code]    

Line 64 in our subroutine bbexit is a STOP statement and is our standard way of exiting sim.exe. Rerunning the offending simulation usually succeeds, and a run normally takes only a few seconds to complete.

In the soak test where we found this problem, sim was being used for real-time modelling, so hanging simulations are undesirable!

This looks to me like a regression in the OpenMP RTL, but it could also be a problem in our code. Has anyone else had similar issues?

jimdempseyatthecove

>>Line 64 in our subroutine bbexit is a STOP statement and is our standard way of exiting sim.exe

Is that STOP being executed within a parallel region? (should be easy enough to test).
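A minimal sketch of such a test, assuming the OpenMP runtime routine omp_in_parallel() is called just before the STOP. The routine name report_context is hypothetical and is not part of sim.exe. The program has to be built with OpenMP enabled (/Qopenmp with Intel Fortran on Windows) so that the omp_lib module is available.

```fortran
program check_stop_context
    use omp_lib, only: omp_in_parallel
    implicit none

    ! In real code this call would sit immediately before the STOP in bbexit.
    call report_context('STOP')
    stop

contains

    ! Hypothetical helper: reports whether the caller is inside an
    ! active OpenMP parallel region.
    subroutine report_context(label)
        character(*), intent(in) :: label

        if (omp_in_parallel()) then
            write (*, '(3a)') 'WARNING: ', label, ' reached inside an active parallel region'
        else
            write (*, '(3a)') 'OK: ', label, ' reached in serial code'
        end if
    end subroutine report_context

end program check_stop_context
```

If the warning branch is ever taken in the real program, the STOP is being reached from within a parallel region.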

>>sim.exe!CS_ENGINE_mp_CS_FINALISE() Line 410 

Is UDT CS_ENGINE (which I assume is a shared/global object) being finalized from within a parallel region?

Jim Dempsey

Mark_Lewy

Hi Jim,

Thanks for the reply. cs_engine::cs_finalise (a module subroutine) is not being called from within a parallel region.

Here's the source of sim.f90, the main program:

program sim
    use cs_engine

    implicit none
    save

    call cs_initialise
    call cs_perform_timestep
    call cs_finalise

end program

And this is cs_finalise (from module cs_engine):

    subroutine cs_finalise
        integer :: ios
        integer :: retval

        if (perr /= 0) then
            call bbwrit ('FATAL')
            call bbexit (mme_finish_fail)
        else
            call sim_iw2d_pre_finalise
            call evloop_finalise
            call sim_iw2d_finalise(retval)

            if (simffl(sfnprn) /= 0) then
                call closdn (0, aday, elaps, get_summary_unit())
            end if

            call massbal_close ! SWMM5 - originally called from swmm_end()
            call rdii_closeRdii ! SWMM5 - originally called from swmm_end() after massbal_close()

            call delete_objects ! SWMM5 objects

            call carchk

            call log_run_statistics
            close (get_log_unit(), iostat=ios)

            call bbwrit ('EXITING')
            if (fail) then
                call bbexit (mme_finish_incomplete)
            else if (get_warn()) then
                call bbexit (mme_finish_warning)
            else
                call bbexit (mme_finish_ok)
            end if
        end if
    end subroutine

I believe the call to bbexit is for the mme_finish_ok case, but it shouldn't matter either way.

Mark

jimdempseyatthecove

The stack dump shows program sim inside the call to cs_finalise, and subroutine cs_finalise inside one of the calls to bbexit.
bbexit is apparently executing the STOP, which is attempting to shut down the OpenMP thread pool, and that is where it has hung.

Let me assume that the hang occurs in your Release Build.

Select the compiler option to generate Debug Information (for your Release Build).
Select the Linker option to NOT strip Debug Information (keep debug information).
Rebuild
Keep MS VS in edit mode (in other words, do nothing after the Rebuild).
Without closing MS VS, launch a CMD window.
Run your sim program as many times as necessary until your program hangs.
(if this requires many iterations, write a Batch script that loops running your program)
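The looping step above could be driven by something like the following, written in Fortran rather than as a batch script. It is only a sketch: the command string 'sim.exe' and the run limit are placeholders, and any arguments the real simulation needs would have to be added. Because each call waits for the run to finish, a hung run leaves the driver blocked with sim.exe still alive, ready for the debugger to attach.

```fortran
program soak_loop
    use iso_fortran_env, only: output_unit
    implicit none

    ! Assumptions: placeholder command line and an arbitrary upper bound on runs.
    character(*), parameter :: cmd = 'sim.exe'
    integer, parameter :: max_runs = 100000

    integer :: run, exit_code, cmd_status

    do run = 1, max_runs
        write (output_unit, '(a,i0)') 'starting run ', run
        flush (output_unit)

        exit_code = 0
        cmd_status = 0
        ! wait=.true. blocks until the child process ends, so a hang stops the loop here.
        call execute_command_line(cmd, wait=.true., exitstat=exit_code, cmdstat=cmd_status)

        if (cmd_status /= 0) then
            write (output_unit, '(a,i0)') 'could not run command, cmdstat = ', cmd_status
            exit
        end if

        write (output_unit, '(a,i0,a,i0)') 'run ', run, ' finished with exit status ', exit_code
    end do

end program soak_loop
```

The last "starting run" line without a matching "finished" line identifies the run that hung.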

Now then, when it hangs, do not Ctrl-C the run. Instead...

Back to MS VS: Debug | Attach to Process | select the sim.exe (or batch name).

Then use the Threads window in the debugger: set focus on each thread in turn and examine its call stack.

Note that some, or possibly all, of the additional threads may already have terminated. What you are looking for is any non-main threads that are still running, where they are, and why they may be hung.

One possible candidate is that you are reaching the STOP statement while a thread is waiting on a condition variable. How it reaches that point is yet to be determined.

Note that while the compiler should warn or error on a RETURN issued within a parallel region, or a GOTO/EXIT/CYCLE that escapes a parallel region, I cannot say whether an I/O statement with ERR=nn, EOR=nn, or END=nn branching out of the region is caught by the compiler.
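If any I/O does happen inside a parallel region, one way to sidestep the question is to use IOSTAT= and test the status inside the region, so that no branch can leave it. The following is a minimal sketch that assumes nothing about the I/O actually done in sim.exe; the records and variable names are invented for illustration.

```fortran
program iostat_in_parallel
    implicit none

    ! Invented sample data: the third record cannot be parsed as a real.
    character(len=8) :: records(4) = [character(len=8) :: '1.5', '2.5', 'bad', '4.0']
    real :: values(4)
    integer :: i, ios, n_bad

    values = 0.0
    n_bad = 0

    ! Each iteration handles its own I/O status; there is no ERR=/END=/EOR=
    ! label, so control cannot branch out of the parallel region.
    !$omp parallel do private(ios) reduction(+:n_bad)
    do i = 1, size(records)
        read (records(i), *, iostat=ios) values(i)
        if (ios /= 0) n_bad = n_bad + 1
    end do
    !$omp end parallel do

    write (*, '(a,i0)') 'records that failed to parse: ', n_bad

end program iostat_in_parallel
```

Failures are counted with a reduction and reported only after the region has ended, in serial code.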

Jim Dempsey

Mark_Lewy

The stack trace is from a dump file created by attaching to a hanging (release build) simulation; there is only one (main) thread running.  We already generate PDB files for our release builds.

To put this into context, we are running about 25,000 simulations per day on the soak test system. We have observed only three or four hangs over a period of approximately 8 days, so it occurs only very rarely.

I've submitted a support query for this, and we're also going to try running the soak tests with an earlier version of the engine, from just after the change to XE2019U5, to see whether that hangs too. That will take a while.

jimdempseyatthecove

>>the stack trace is from a dump file created by attaching to a hanging (release build) simulation

Do not abort the program to get a dump file.
After attaching to the process, click || (pause all).
Then check the call stack of every thread.

Jim Dempsey
