Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI-IO error when running on Lustre with a high number of stripes and processes

Wencan_W_
Beginner

Hi,

I'm trying to run pNetCDF on Lustre. The test code and the pNetCDF library are both compiled with Intel MPI Library v4.0.2. Our Lustre file system has 40 OSTs.

When running with a stripe count of 1, or with 32 processes, the test code works well and outputs data correctly.

However, when I set the stripe count to 40 and run with 64 processes, the test code crashes with:

[plain]rank 19 in job 1 c25b09_39645 caused collective abort of all ranks
    exit status of rank 19: killed by signal 9[/plain]

The test code is attached. Thank you in advance.
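For reference, the stripe count on the output directory was set with lfs, roughly like this (the directory path below is only a placeholder):

[bash]# set a 40-way stripe on the directory that will hold the netCDF output (path is illustrative)
lfs setstripe -c 40 /path/to/lustre/output_dir
# check the layout that new files created in the directory will inherit
lfs getstripe /path/to/lustre/output_dir[/bash]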

7 Replies
Wencan_W_
Beginner

I used GDB to debug and got this error message:

[plain]Program received signal SIGFPE, Arithmetic exception.
34: 0x00002aaac33327e0 in ADIOI_LUSTRE_Get_striping_info ()
34: from /apps/intel/impi/4.0.2.003/intel64/lib/libmpi_lustre.so[/plain]
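(For anyone trying to reproduce this: one quick way to get the same backtrace without an interactive GDB session is to let the failing rank dump core and then read the core file afterwards; the executable and core file names below are only placeholders.)

[bash]# allow core dumps, rerun the failing case, then pull a backtrace from the resulting core file (names are illustrative)
ulimit -c unlimited
gdb -batch -ex "bt" ./perform_test_pnetcdf core.12345[/bash]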

It seems to be the same issue as http://lists.mcs.anl.gov/pipermail/mpich-discuss/2010-September/007947.html.

Is this a bug in Intel MPI Library v4.0.2, and has it been fixed in a newer version?

James_T_Intel
Moderator

Hi Wencan,

I can't find any indication of this being a known issue. There is a Lustre-related issue in the latest version that might cause a problem for you (an undefined symbol in one of our libraries). I would recommend trying version 4.0.3 first, and 4.1.0.030 if 4.0.3 does not work. Please let me know which versions you try and what the results are.
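If it helps, switching to another installed version is usually just a matter of sourcing that version's environment script before rebuilding and rerunning; the install path below is only a placeholder for wherever 4.0.3 lives on your cluster:

[bash]# point the environment at the Intel MPI version under test (path is illustrative)
source /apps/intel/impi/4.0.3.xxx/intel64/bin/mpivars.sh
# confirm which mpiexec is now first in PATH
which mpiexec[/bash]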

Can you please attach your test code?  It did not get properly attached to the first post.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Wencan_W_
Beginner

Thank you for your help.

James_T_Intel
Moderator

Hi Wencan,

I am only able to set striping up to 18 on the cluster I am using.  At 18 stripes, I am unable to reproduce this behavior.  Please run with I_MPI_DEBUG=5 and send the output.
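Something along these lines should do it; the executable name, arguments, and process count below are just placeholders for your own test case:

[bash]# rerun the failing configuration with Intel MPI debug output enabled (names and counts are illustrative)
mpiexec -genv I_MPI_DEBUG 5 -n 64 ./your_test_program > debug_run.log 2>&1[/bash]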

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Wencan_W_
Beginner

Hi.

I ran with 64 processes and got the output, which is attached.

The error occurs when the program runs in loop 1 and outputs the 2nd netCDF file. 

Thank you for your help.

James_T_Intel
Moderator

Hi Wencan,

It appears you are using LSF* as your job scheduler.  We do have some known issues with LSF*.  I don't think they're related to this case, but can you try a few things just in case?  First, try running in an interactive job.  Due to one of the known issues, you will probably need to add

[plain]-genv LD_LIBRARY_PATH $LD_LIBRARY_PATH[/plain]

to your mpirun command.  Also, please try running completely outside of LSF*.

Could you also send the output from stderr (for a failing job)?  It would be best if you have stdout and stderr in the same file.
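For example, redirecting both streams to a single file from inside an interactive job would look roughly like this; the executable name and log file are placeholders:

[bash]# forward the library path (LSF workaround) and capture stdout and stderr in one log file (names are illustrative)
mpiexec -genv LD_LIBRARY_PATH $LD_LIBRARY_PATH -genv I_MPI_DEBUG 5 -n 64 ./your_test_program > run.log 2>&1[/bash]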

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Wencan_W_
Beginner

Hi,

I ran the mpiexec command as:

[bash]mpiexec -genv I_MPI_EXTRA_FILESYSTEM on -genv I_MPI_EXTRA_FILESYSTEM_LIST lustre -genv I_MPI_DEBUG 5 -genv LD_LIBRARY_PATH $LD_LIBRARY_PATH -n $num ./perform_test_pnetcdf $x_proc $y_proc $output &> $out[/bash]

and got the new output.
