Alex10
Beginner
227 Views

Error message from dapl_module_poll.c while running MPI job

Dear experts,

 

I have run into an error while running a parallel code compiled with Intel MPI. To start my jobs I use an environment variable, $DO_PARALLEL, with the following content:

[bash]

mpiexec -machinefile /tmp/user/mpihosts-22 -np 16 -env I_MPI_DEBUG 5

[/bash]

Our cluster uses PBS to submit jobs.

The behavior is rather unpredictable: sometimes my code runs without problems, while at other times it fails with the following error:

[bash]

OS: Scientific Linux SL release 5.5 (Boron)

[0:n010106] unexpected DAPL connection event 0x4008 from 34

Assertion failed in file ../../dapl_module_poll.c at line 4287: 0

internal ABORT - process 0

[9:n010404] unexpected disconnect completion event from [0:n010106]

[11:n010404] unexpected disconnect completion event from [0:n010106]

[22:n010312] unexpected disconnect completion event from [0:n010106]

Assertion failed in file ../../dapl_module_util.c at line 1593: 0

Assertion failed in file ../../dapl_module_util.c at line 1593: 0

Assertion failed in file ../../dapl_module_util.c at line 1593: 0

[7:n010106] unexpected disconnect completion event from [15:n010404]

Assertion failed in file ../../dapl_module_util.c at line 1593: 0

internal ABORT - process 7

[/bash]

I guess that this is a communication problem.

Each node is equipped with 8 Quad-Core Intel® Xeon® Processor 5400 Series processors and has 16 GB of memory.

I did a little research on the internet and concluded that this might be a communication issue. The errors started appearing when I began sending large arrays to the slaves. I would appreciate any ideas or explanations of the reason behind this rather strange behavior.

 

Thanks,

Alex
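Since the failures seem to depend on which nodes a job lands on, it may help to tally the hosts named in the error lines. The following is a minimal sketch, assuming the log uses the `[rank:host]` tags seen in the output above; `count_hosts` is a hypothetical helper name, and `grep -o` is a GNU/BSD extension rather than strict POSIX.

```shell
#!/bin/sh
# count_hosts: hypothetical helper, not part of Intel MPI.
# Reads a job log on stdin and counts how often each host name
# appears inside a [rank:host] tag, most frequent first.
count_hosts() {
    # Pull out every [rank:host] tag, one per line.
    grep -o '\[[0-9][0-9]*:[A-Za-z0-9_.-][A-Za-z0-9_.-]*]' |
        # Keep only the host part of each tag.
        sed 's/^\[[0-9]*:\(.*\)]$/\1/' |
        sort | uniq -c | sort -rn
}

# Example (job.log is an assumed file name holding the MPI output):
#   count_hosts < job.log
```

A host that shows up in the logs of most failing jobs but in none of the successful ones would be a reasonable first candidate to check.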

11 Replies
James_T_Intel
Moderator

Hi Alex,

What are the contents of your /etc/dat.conf file?  What version of the Intel® MPI Library are you using?  Can you run with I_MPI_DEBUG=5 and provide the output?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
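The first question can be answered with a quick check on a compute node. This is a sketch only: `show_dat_conf` is a hypothetical helper, and `/etc/dat.conf` is assumed as the default location because that is the path asked about above.

```shell
#!/bin/sh
# show_dat_conf: hypothetical helper, not part of Intel MPI or DAPL.
# Prints the active (non-comment, non-blank) lines of a DAT
# configuration file, or reports that the file is missing.
show_dat_conf() {
    conf=${1:-/etc/dat.conf}
    if [ ! -r "$conf" ]; then
        echo "no readable DAT configuration at $conf"
        return 1
    fi
    # Drop comment lines and blank lines so only provider entries remain.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$conf"
}

# Example, run on a compute node rather than the login node:
#   show_dat_conf
#   env | grep '^I_MPI'
```

Running it on a compute node matters because the login node and the compute nodes may not share the same configuration.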

Alex10
Beginner

Hi James,

 

Actually, the output I provided was generated with I_MPI_DEBUG=5. It is included by default in the environment variable $DO_PARALLEL that I use to start my MPI run. As in my first message, here is an echo of it:

[bash]

echo $DO_PARALLEL

mpiexec -machinefile /tmp/user/mpihosts-22 -np 16 -env I_MPI_DEBUG 5

[/bash]

I am using version 11.1 of the Intel mpif90 compiler. I have no /etc/dat.conf file.

 

Thanks,

Alex

James_T_Intel
Moderator

Hi Alex,

Strange.  You should have additional output with I_MPI_DEBUG=5.  Can you send the output from

[plain]env | grep I_MPI[/plain]

11.1 is a compiler version.  I need the version of the Intel® MPI Library, which can be found by using

[plain]mpirun -v[/plain]

James.

Alex10
Beginner

Hi James,

[bash]

mpirun –version

Intel(R) MPI Library for Linux, 64-bit applications, Version 4.0  Build 20100422

env | grep I_MPI

I_MPI_F77=ifort

I_MPI_F90=ifort

I_MPI_CC=icc

I_MPI_CXX=icpc

I_MPI_FC=ifort

[/bash]

Alex

James_T_Intel
Moderator

Hi Alex,

Is there any chance you could try with the latest version of the Intel® MPI Library, Version 4.1 Update 1?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Alex10
Beginner

Hi James,

Unfortunately, I can’t do that. The only Intel(R) MPI Library available on this machine is 4.0. Searching the forum here, I found a topic that gave me some clues about what to do. I now have the following options in my PBS script:

[plain]

export I_MPI_MPD_RSH=ssh

export I_MPI_USE_DYNAMIC_CONNECTIONS=0

export I_MPI_FABRICS_LIST="ofa,dapl,tcp,tmi"

export I_MPI_FALLBACK_DEVICE=1

[/plain]

I was able to run two jobs with this modification. I can’t conclude that it solves the problem, though, because none of the nodes that caused the problems was assigned to those jobs by PBS. Also, is it possible to see the source code of dapl_module_poll.c?

 

Thanks,

Alex

James_T_Intel
Moderator

Hi Alex,

I'm glad you were able to find a workaround.  If you do run into further problems, let us know.

Unless you have the source code already, then no.  We typically don't share the source code for our proprietary software products.

You could also get an evaluation of the latest version and install it into your user folder.

James.

Alex10
Beginner

Hi James,

Well, as I indicated in my previous message, the workaround I found is not a real solution to the problem. My calculation failed again today. In general, isn’t it possible to tell what is going on by looking at the line where the error occurred?

Thanks,

Alex

James_T_Intel
Moderator

Hi Alex,

Unfortunately, not really.  That error simply indicates that something caused the DAPL connection to fail.  Can you try to get the output from a failing run with I_MPI_HYDRA_DEBUG=1?

James.

Alex10
Beginner

Hi James,

I will try this. The situation is really strange: my code failed, then 2 minutes later I resubmitted the job and now it runs. It looks like a quite random pattern.

Thanks,

Alex

James_T_Intel
Moderator

Hi Alex,

I'd recommend checking your IB setup then.  You can try using I_MPI_FABRICS=shm:tcp to bypass it, and if this runs consistently, then I'd definitely suspect an IB problem.

James.
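Because the failure is intermittent, a single successful run with shm:tcp proves little; several runs are needed to judge "consistently". Below is a minimal sketch of such a check. `try_fabric` is a hypothetical helper, and it assumes the launcher passes I_MPI_FABRICS from its own environment on to the MPI processes.

```shell
#!/bin/sh
# try_fabric: hypothetical helper, not part of Intel MPI.
# Usage: try_fabric FABRIC RUNS COMMAND [ARGS...]
# Runs COMMAND the given number of times with I_MPI_FABRICS set to
# FABRIC for that command only, then prints a one-line summary.
# Returns 0 only if every run succeeded.
try_fabric() {
    fabric=$1
    runs=$2
    shift 2
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! env I_MPI_FABRICS="$fabric" "$@"; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "fabric=$fabric runs=$runs failures=$failures"
    [ "$failures" -eq 0 ]
}

# Example (./my_app is an assumed program name):
#   try_fabric shm:tcp 5 mpiexec -machinefile /tmp/user/mpihosts-22 -np 16 ./my_app
```

If all runs pass over shm:tcp while the default fabric keeps failing on the same set of nodes, that points toward the InfiniBand side rather than the application.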
