Hi,
I'm trying to run pNetCDF on Lustre. The test code and the pNetCDF library are both compiled with Intel MPI Library v4.0.2. Our Lustre file system has 40 OSTs.
When running with stripes = 1 or processes = 32, the test code works well and outputs data correctly.
However, when I set stripes = 40 and run with 64 processes, the test code crashes with:
rank 19 in job 1 c25b09_39645 caused collective abort of all ranks
exit status of rank 19: killed by signal 9
The test code is attached. Thank you in advance.
I used GDB to debug and got this error message:
Program received signal SIGFPE, Arithmetic exception.
34: 0x00002aaac33327e0 in ADIOI_LUSTRE_Get_striping_info ()
34: from /apps/intel/impi/4.0.2.003/intel64/lib/libmpi_lustre.so
It seems to be the same issue as http://lists.mcs.anl.gov/pipermail/mpich-discuss/2010-September/007947.html.
Is this a bug in Intel MPI Library v4.0.2, and has it been fixed in a newer version?
Hi Wencan,
I can't find any indication of this being a known issue. There is an issue related to Lustre in the latest version that might cause a problem for you (undefined symbol in one of our libraries). I would recommend trying version 4.0.3 first, and 4.1.0.030 if 4.0.3 does not work. Please let me know if you try any and what the results are.
Can you please attach your test code? It did not get properly attached to the first post.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi Wencan,
I am only able to set striping up to 18 on the cluster I am using. At 18 stripes, I am unable to reproduce this behavior. Please run with I_MPI_DEBUG=5 and send the output.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi Wencan,
It appears you are using LSF* as your job scheduler. We do have some known issues with LSF*. I don't think they're related to this case, but can you try a few things just in case? First, try running in an interactive job. Due to one of the known issues, you will probably need to add
-genv LD_LIBRARY_PATH $LD_LIBRARY_PATH
to your mpirun command. Also, please try running completely outside of LSF*.
Could you also send the output from stderr (for a failing job)? It would be best if you have stdout and stderr in the same file.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
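For anyone following along, the requests in this thread (forward LD_LIBRARY_PATH with -genv, enable I_MPI_DEBUG=5, and capture stdout and stderr in one file) could be combined roughly as in the Python launcher below. This is only a sketch: the executable name ./pnetcdf_test, the log file name run.log, and the helper function names are placeholders that do not come from the thread, and it assumes mpirun is on the PATH when the run is actually launched.

```python
import os
import shlex
import subprocess


def build_mpirun_command(executable, nprocs, debug_level=5):
    """Return an mpirun argument list with the options suggested in the thread."""
    return [
        "mpirun",
        # Forward the caller's library path to all ranks (the LSF workaround).
        "-genv", "LD_LIBRARY_PATH", os.environ.get("LD_LIBRARY_PATH", ""),
        # Request Intel MPI debug output at the given level.
        "-genv", "I_MPI_DEBUG", str(debug_level),
        "-n", str(nprocs),
        executable,
    ]


def run_and_log(cmd, log_path):
    """Run cmd with stdout and stderr written to one file; return the exit code."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode


if __name__ == "__main__":
    # "./pnetcdf_test" and "run.log" are hypothetical names used for illustration.
    cmd = build_mpirun_command("./pnetcdf_test", nprocs=64)
    print(shlex.join(cmd))
    # Uncomment on a machine where mpirun is available:
    # run_and_log(cmd, "run.log")
```

Printing the command first makes it easy to check the exact arguments before submitting a failing 64-process run, and the single log file keeps the debug output and the abort message in order.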