I was running a program on Skylake nodes. On one node (np=2, ph=2), the program completes successfully. On two nodes (np=2, ph=1), however, I get the following assertion failure:
rank = 1, revents = 8, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
Does anyone know the possible causes of this type of assertion failure? The odd thing is that all my colleagues who use csh can run the program without errors, while all colleagues who use bash (including me) always see the same failure at the same line (2988).
- Tags:
- Cluster Computing
- General Support
- Intel® Cluster Ready
- Message Passing Interface (MPI)
- Parallel Computing
Intel mpiexec 2019.0.6 does not show option -ph.
Are you intending to use -ppn instead (Processes Per Node)?
Jim Dempsey
Sorry, ph (per host) was an option used by our wrapper to call the binary. It is equivalent to ppn.
Hi
This type of error occurs when one of the MPI processes is terminated by a signal (for example, SIGTERM or SIGKILL); the surviving rank then sees an error condition (POLLERR) on its TCP connection to that process.
Possible causes include a host reboot, an unexpected signal, or out-of-memory (OOM) manager errors, among others.
Could you check whether you are able to ssh to other nodes?
Could you also look at this thread and see whether it helps: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/747448
Thanks
Prasanth
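As a side note on reading the log line `revents = 8` from the original report: the value is a poll(2) bitmask, and the assertion checks that the POLLERR bit is not set. A minimal Python sketch for decoding such a mask is shown below. The helper name `decode_revents` is hypothetical, and the numeric values of the flags are platform-specific, so the result should be checked on the same kind of system that produced the log.

```python
import select

# Flag names defined by poll(2). Their numeric values depend on the platform,
# so they are looked up from the select module instead of being hard-coded.
_POLL_FLAGS = ["POLLIN", "POLLPRI", "POLLOUT", "POLLERR", "POLLHUP", "POLLNVAL"]


def decode_revents(revents):
    """Return the names of the poll flags that are set in a revents bitmask."""
    return [name for name in _POLL_FLAGS if revents & getattr(select, name)]


if __name__ == "__main__":
    # 8 is the value reported in the assertion message above.
    print(decode_revents(8))
```

If the decoded list contains `POLLERR`, the socket to the peer rank reported an error, which fits the explanation that the peer process went away.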
The solution in that post didn't resolve the issue I had. I put another program (np=2, ph=1) into the job script and it terminated successfully, so I don't think it's related to the hardware.
This ticket can be closed now. I found that the issue was related to a limited stack size. Some dynamic arrays are not passed into subroutines, so they end up being allocated as static arrays inside those subroutines. This leads to a memory issue that eventually crashes one of the nodes.
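For anyone hitting the same symptom, one way to compare environments is to print the stack limit that a process actually inherits. The sketch below uses Python's standard `resource` module; the function names `describe_limit` and `stack_limits` are hypothetical. It assumes a Unix-like system, and it is only a guess that the csh and bash users in this thread had different limits set in their shell startup files, since the thread does not confirm that.

```python
import resource


def describe_limit(value):
    """Format an rlimit value (in bytes) as a readable string."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limits():
    """Return the (soft, hard) stack size limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack soft limit: {soft}, hard limit: {hard}")
```

Running this from the job script on each node, under both shells, shows whether the limits differ. The same information is available interactively with `ulimit -s` in bash and `limit stacksize` in csh.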
Hi,
Thanks for the confirmation. We will go ahead and close this thread. Feel free to reach out to us with any further queries.
--Rahul