<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Quick note regarding timing: in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112980#M5402</link>
    <description>&lt;P&gt;Quick note regarding timing: it transpires I was also capturing a call to MPI_Barrier - and it is this function that is responsible for the performance hit. Perhaps some configuration is now required?&lt;/P&gt;</description>
    <pubDate>Thu, 24 May 2018 07:17:54 GMT</pubDate>
    <dc:creator>Figura__Ed</dc:creator>
    <dc:date>2018-05-24T07:17:54Z</dc:date>
    <item>
      <title>MPI library crash when spawning &gt; 20 processes using MPI_COMM_SPAWN</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112968#M5390</link>
      <description>&lt;P&gt;I'm currently running Intel MPI (5.0.3.048) on a Windows 10 (64-bit), 8-core machine with 32GB RAM. I am using MPI_COMM_SPAWN from a C++ app (launched using mpiexec.exe -localonly -n 1) to spawn N MPI workers - in fact, I call MPI_COMM_SPAWN N times, each time for a single worker (FT pattern). If I try to spawn 21 or more workers, I often get a crash from the MPI library itself. This is not consistent, i.e. sometimes I can spawn 32 workers with no problems, sometimes I get a problem with 21. Has anyone else come across such a problem? Can anyone suggest what the issue might be?&lt;/P&gt;</description>
      <pubDate>Thu, 13 Oct 2016 10:20:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112968#M5390</guid>
      <dc:creator>Ed_F_1</dc:creator>
      <dc:date>2016-10-13T10:20:57Z</dc:date>
    </item>
    <item>
      <title>One more piece of information</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112969#M5391</link>
      <description>&lt;P&gt;One more piece of information. My app generates a dmp file. When I open this in Visual Studio, it wants to load impi_full.pdb, which I don't have with my distribution. I do have impi.pdb - so I renamed that temporarily, and the Visual Studio debugger shows that the problem is in MPIDI_CH3U_Handle_connection.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Oct 2016 12:43:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112969#M5391</guid>
      <dc:creator>Ed_F_1</dc:creator>
      <dc:date>2016-10-14T12:43:47Z</dc:date>
    </item>
    <item>
      <title>I have just tried the 2017</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112970#M5392</link>
      <description>&lt;P&gt;I have just tried the 2017 version - same problem. However, I am now able to use SEH in my app to allow my program to run with however many workers were successfully spawned. One other thing I noticed is that the spawning itself appears to be much slower than in version 5.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Oct 2016 14:04:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112970#M5392</guid>
      <dc:creator>Ed_F_1</dc:creator>
      <dc:date>2016-10-17T14:04:43Z</dc:date>
    </item>
    <item>
      <title>Hello Ed,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112971#M5393</link>
      <description>&lt;P&gt;Hello Ed,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; Could you provide the source of the code you are using and also the output after you set I_MPI_DEBUG=5?&lt;/P&gt;

&lt;P&gt;thanks&lt;/P&gt;

&lt;P&gt;Mark&lt;/P&gt;</description>
      <pubDate>Wed, 19 Oct 2016 00:33:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112971#M5393</guid>
      <dc:creator>Mark_L_Intel</dc:creator>
      <dc:date>2016-10-19T00:33:10Z</dc:date>
    </item>
    <item>
      <title>Hi. I'm afraid I can't</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112972#M5394</link>
      <description>&lt;P&gt;Hi. I'm afraid I can't provide the actual source code. What I'm essentially doing in my main app is calling MPI_COMM_SPAWN inside a loop, as many times as there are required workers, e.g. 32 times. Each invocation sets maxProcs to 1; I also pass the working directory through MPI_Info. Inside the loop I check the return value of the spawn call; if successful, I send a couple of messages to the worker. Once the loop is finished, I then prepare to send other work to the workers. The worker is simple - it receives the expected couple of messages and then listens for the actual work. The crash occurs during one of the calls to MPI_COMM_SPAWN; however, it does not always crash.&lt;/P&gt;

&lt;P&gt;Here is the exception message and stack trace of the most recent crash (from Visual Studio 2015):&lt;/P&gt;

&lt;P&gt;Unhandled exception at 0x00007FFD405EA1A3 (impi.dll) in gServerErr_161020-102908.dmp: 0xC0000005: Access violation reading location 0x0000000000000000.&lt;/P&gt;

&lt;P&gt;If there is a handler for this exception, the program may be safely continued.&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;impi.dll!MPID_nem_newtcp_module_cleanup&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;impi.dll!MPID_nem_newtcp_module_cleanup&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;impi.dll!MPID_nem_newtcp_module_cleanup&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;impi.dll!MPIU_ExProcessCompletions&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;impi.dll!MPID_nem_newtcp_module_connpoll&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;impi.dll!MPID_nem_tcp_poll&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;impi.dll!00007ffd40786f38()&lt;/P&gt;

&lt;P&gt;I have also set I_MPI_DEBUG=7 - here is the tail of the diagnostic output:&lt;/P&gt;

&lt;P&gt;STDOUT: [0] MPI startup(): Intel(R) MPI Library, Version 5.0 Update 3 &amp;nbsp;Build 20150128&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Copyright (C) 2003-2015 Intel Corporation. &amp;nbsp;All rights reserved.&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Multi-threaded optimized library&lt;BR /&gt;
	STDOUT: [0] MPI startup(): shm and tcp data transfer modes&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Internal info: pinning initialization was done&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Device_reset_idx=8&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Allgather: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Allgatherv: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Allreduce: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Alltoall: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Alltoallv: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Alltoallw: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Barrier: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Bcast: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Exscan: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Gather: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Gatherv: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Reduce_scatter: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Reduce: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Scan: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Scatter: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Scatterv: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Rank &amp;nbsp; &amp;nbsp;Pid &amp;nbsp; &amp;nbsp; &amp;nbsp;Node name &amp;nbsp;Pin cpu&lt;BR /&gt;
	STDOUT: [0] MPI startup(): 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; 13516 &amp;nbsp; &amp;nbsp;PSEUK1207 &amp;nbsp;{0,1,2,3,4,5,6,7}&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Recognition=0 Platform(code=32 ippn=0 dev=1) Fabric(intra=1 inter=1 flags=0x0)&lt;BR /&gt;
	STDOUT: [0] MPI startup(): I_MPI_DEBUG=7&lt;BR /&gt;
	STDOUT: [0] MPI startup(): I_MPI_PIN_MAPPING=1:0 0&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Intel(R) MPI Library, Version 5.0 Update 3 &amp;nbsp;Build 20150128&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Copyright (C) 2003-2015 Intel Corporation. &amp;nbsp;All rights reserved.&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Multi-threaded optimized library&lt;BR /&gt;
	STDOUT: [0] MPI startup(): shm and tcp data transfer modes&lt;BR /&gt;
	STDOUT: [0] MPI startup(): Internal info: pinning initialization was done&lt;BR /&gt;
	STDERR: The following diagnostic file has been created: 'gServerErr_161020-102908.dmp'&lt;/P&gt;

&lt;P&gt;I never get a problem spawning up to 20 workers; but 21 and above produce these random crashes.&lt;/P&gt;

&lt;P&gt;Other notes:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;I am initialising the MPI environment using the multiple threading level option (MPI_THREAD_MULTIPLE)&lt;/LI&gt;
	&lt;LI&gt;We are using Boost MPI as the wrapper; but as Boost 1.55 does not wrap MPI_COMM_SPAWN, we call it directly&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 20 Oct 2016 10:48:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112972#M5394</guid>
      <dc:creator>Ed_F_1</dc:creator>
      <dc:date>2016-10-20T10:48:09Z</dc:date>
    </item>
    <item>
      <title>Hello Ed,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112973#M5395</link>
      <description>&lt;P&gt;Hello Ed,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; It sounds like some cleanup is not happening. Would you mind modifying this code&lt;/P&gt;

&lt;P&gt;&lt;A href="http://mpi-forum.org/docs/mpi-2.0/mpi-20-html/node98.htm"&gt;http://mpi-forum.org/docs/mpi-2.0/mpi-20-html/node98.htm&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;to your liking (spawn &amp;gt; 20 workers, etc.), running it on your system, and reporting back the results?&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;

&lt;P&gt;Mark&lt;/P&gt;</description>
      <pubDate>Thu, 20 Oct 2016 23:17:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112973#M5395</guid>
      <dc:creator>Mark_L_Intel</dc:creator>
      <dc:date>2016-10-20T23:17:06Z</dc:date>
    </item>
    <item>
      <title>Many thanks for the</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112974#M5396</link>
      <description>&lt;P&gt;Many thanks for the suggestion. I have taken that code and adapted it to my circumstances, but I cannot get it to fail. The only other major difference is that in my app I have a secondary thread running in the master, and in the worker I also have a secondary thread running - these threads fulfill different tasks. I'll see if I can modify the code to factor that in.&lt;/P&gt;</description>
      <pubDate>Fri, 21 Oct 2016 10:08:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112974#M5396</guid>
      <dc:creator>Ed_F_1</dc:creator>
      <dc:date>2016-10-21T10:08:40Z</dc:date>
    </item>
    <item>
      <title>Some more notes: I have taken</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112975#M5397</link>
      <description>&lt;P&gt;Some more notes: I have taken the MPI example code, changed the structure to match my application more closely, and tested it - this works fine. From the other end, I have tried to simplify my application so that, in effect, it is only calling MPI_Comm_spawn() in a loop - the command-line, single-threaded version still crashes once the loop hits its 21st iteration. I'm not sure what to try next. I don't understand whether the problem is due to the 'complexity' of the spawned worker app or complexity in the main program. One thing is for sure: the call to MPI_Comm_spawn() results in something inside impi.dll trying to dereference a null pointer.&lt;/P&gt;</description>
      <pubDate>Mon, 24 Oct 2016 10:48:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112975#M5397</guid>
      <dc:creator>Ed_F_1</dc:creator>
      <dc:date>2016-10-24T10:48:11Z</dc:date>
    </item>
    <item>
      <title>I have to apologise for these</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112976#M5398</link>
      <description>&lt;P&gt;I have to apologise for these little snippets of news. I am running my app through a Cygwin shell. As I mentioned earlier, this crashes reproducibly if I attempt to run with &amp;gt; 20 workers. However, by simply setting I_MPI_DEBUG to 10 (i.e. export I_MPI_DEBUG=10), my program runs without any problems. In fact, if I set I_MPI_DEBUG to &amp;lt; 5 I still get a crash; setting I_MPI_DEBUG &amp;gt;= 5 - no crash! If anyone has any ideas, that would be greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Mon, 24 Oct 2016 11:03:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112976#M5398</guid>
      <dc:creator>Ed_F_1</dc:creator>
      <dc:date>2016-10-24T11:03:59Z</dc:date>
    </item>
    <item>
      <title>Hello Ed,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112977#M5399</link>
      <description>&lt;P&gt;Hello Ed,&lt;/P&gt;

&lt;P&gt;We’re currently facing the same issue with our application. The controller starts multiple workers using MPI_Comm_spawn. Most often, when the 21&lt;SUP&gt;st&lt;/SUP&gt; worker is started, the controller crashes in impi.dll. At that moment, the newly created worker is busy calling MPI_Init_thread.&lt;/P&gt;

&lt;P&gt;The crash is: Exception thrown at 0x00007FFA5EB4A013 (impi.dll) in controller.exe: 0xC0000005: Access violation reading location 0x0000000000000000.&lt;/P&gt;

&lt;P&gt;If we replace the multithreaded release version of impi.dll with the multithreaded debug version, the problem does not occur.&lt;/P&gt;

&lt;P&gt;We are using Intel MPI version 2018.0.2.0 on Windows.&lt;/P&gt;

&lt;P&gt;Have you found the cause/fix for this issue?&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;

&lt;P&gt;Mark&lt;/P&gt;</description>
      <pubDate>Wed, 09 May 2018 11:34:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112977#M5399</guid>
      <dc:creator>MarkV</dc:creator>
      <dc:date>2018-05-09T11:34:36Z</dc:date>
    </item>
    <item>
      <title>Hi Mark,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112978#M5400</link>
      <description>&lt;P&gt;Hi Mark,&lt;/P&gt;

&lt;P&gt;Thanks for that info. I'm trying to get the 2018 version. I have tried 2017 but still have the same problems with that. And I still do not know the cause!&lt;/P&gt;

&lt;P&gt;Ed&lt;/P&gt;</description>
      <pubDate>Tue, 22 May 2018 14:39:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112978#M5400</guid>
      <dc:creator>Figura__Ed</dc:creator>
      <dc:date>2018-05-22T14:39:41Z</dc:date>
    </item>
    <item>
      <title>Further update: I have tried</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112979#M5401</link>
      <description>&lt;P&gt;Further update: I have tried 2018 Update 2 (on Windows 10). First thing: it is about 20x slower than version 5.0.3 at spawning worker processes. My application uses MPI as-is, out of the box: I call MPI_Comm_spawn() and tell it to spawn 1 child process - this is done in a loop N times. Perhaps there are some configuration variables that need tweaking, but this came as a shock. Secondly, I appear to observe the same behaviour as Mark, i.e. with the debug DLLs I can spawn 32 worker processes (albeit rather slowly). I have not tested exhaustively whether this always works, but with the release DLL I got quite a few failures.&lt;/P&gt;

&lt;P&gt;Unfortunately, I need to get to the bottom of the performance hit before considering switching to this version.&lt;/P&gt;

&lt;P&gt;I also have the option of launching my app in MPMD mode. This works with any number of workers without a problem. It is extremely fast with 5.0.3 and quite the opposite with 2018 Update 2.&lt;/P&gt;</description>
      <pubDate>Wed, 23 May 2018 13:17:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112979#M5401</guid>
      <dc:creator>Figura__Ed</dc:creator>
      <dc:date>2018-05-23T13:17:02Z</dc:date>
    </item>
    <item>
      <title>Quick note regarding timing:</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112980#M5402</link>
      <description>&lt;P&gt;Quick note regarding timing: it transpires I was also capturing a call to MPI_Barrier - and it is this function that is responsible for the performance hit. Perhaps some configuration is now required?&lt;/P&gt;</description>
      <pubDate>Thu, 24 May 2018 07:17:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-library-crash-when-spawning-gt-20-processes-using-MPI-COMM/m-p/1112980#M5402</guid>
      <dc:creator>Figura__Ed</dc:creator>
      <dc:date>2018-05-24T07:17:54Z</dc:date>
    </item>
  </channel>
</rss>

