I am trying to run an atmospheric code on Itanium-2 (2 x 4-way SMP) with Rocks. The code is mixed F77/F90 and uses domain decomposition for parallelisation; it contains only point-to-point MPI communications.
I have used the following MPI implementations, with the same results:
a. Intel MPI 1.0 (IFC 8.0 & IFC 9.0)
b. Intel MPI 2.0 (beta) (IFC 9.0)
c. MPICH 1.2.7 (IFC 8.0 & IFC 9.0)
1. Default settings (0-8 CPUs): MPI initialises but the code exits with a segmentation fault ("Rank # of Task # on cluster.hpc.org_XXX caused collective abort of all processes. Code exits with signal 11").
This occurs while calling a subroutine that has a long list of variable declarations (F77).
2. After setting the stack size (ulimit -s) to unlimited:
a. Up to 4 CPUs: the root process enters the above subroutine and waits at the first MPI communication; the other processes do not enter it, so the code hangs.
b. Above 4 CPUs (across nodes): all processes enter the above subroutine and carry out their MPI communications normally, but the same situation as 2a occurs on a call to another, similar subroutine (numerous variable declarations).
I have carried out some tests by reducing the variable declarations in these subroutines and found that the above error occurs once the number of declarations passes a certain limit.
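One common way such routines exhaust the stack is when large work arrays are dimensioned from the dummy arguments (automatic arrays), which the compiler places on the stack, so the process can die on entry before the first executable statement. A stripped-down sketch of that pattern (the names and dimensions below are hypothetical, not from the actual code):

      subroutine phys_step(nx, ny, nz, t, q)
      implicit none
      integer nx, ny, nz
      real t(nx, ny, nz), q(nx, ny, nz)
c     Automatic work arrays sized from the dummy arguments; a real
c     routine may declare many more of these, which is what appears
c     to exhaust the default stack at call time.
      real work1(nx, ny, nz), work2(nx, ny, nz)
      real flux(nx, ny), tend(nx, ny, nz)
      work1 = 0.0
      work2 = 0.0
      flux  = 0.0
      tend  = t
      q = q + tend
      return
      end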
I used to get a similar error on an SGI Altix (Itanium-2, 16-way SMP, IFC 8.0) with LAM MPI, which was solved by using the SGI MPI (MPT).
Can anyone suggest a solution, or say what additional details are required?
regards
mcd_75 -
Do you get the same errors on a non-Itanium system? What about on a non-Rocks cluster?
I expect that you'd see the same results if you tried MPICH2, since the Intel MPI versions are based on it and you've already seen the problem with MPICH 1.2.7.
You should report this error through the Intel Premier Support (http://premier.intel.com) technical support site for Intel MPI. Also, since it looks as though the problem might originate in the underlying MPICH, you could check the MPICH web site to see whether a workaround for this problem has been posted or is otherwise known.
As for things you might try: can you put the parameter lists into COMMON blocks that each subroutine has access to? This would avoid passing so many parameters to each affected routine. Of course, it will require some code overhaul to make the transfer.
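To sketch the idea (the array names and sizes here are made up purely for illustration):

      subroutine phys_step
      implicit none
      integer nx, ny, nz
      parameter (nx = 128, ny = 128, nz = 60)
c     The big arrays sit in a named COMMON block rather than on the
c     argument list, so every routine that declares /bigwork/ sees
c     the same static storage and nothing large is pushed onto the
c     stack or the argument list at call time.
      real t(nx, ny, nz), q(nx, ny, nz), work(nx, ny, nz)
      common /bigwork/ t, q, work
      work = 0.0
      q = q + work
      return
      end

The same COMMON declaration would then be repeated (or pulled in from an include file) in every routine that currently receives those arrays as arguments.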
--clay
I have successfully tested this code on IBM systems (POWER4, POWER5, IBM MPI & MPICH 1.2.6).
I get the same error on a Xeon system.
I used to get a similar error on an SGI Altix (Itanium-2, 16 CPUs, Linux, LAM MPI), but it went through when I used the SGI MPI (MPT).
Modifying the code at this stage is not feasible; it is mostly F77, except where I introduced F90 during parallelisation.
I'll put this up on the Premier Support site and see.
thanks
MCD