Possible Memory Error

andjhawkins · ‎02-04-2010

A colleague gave me some code that works for them, but when I run it I get a segmentation fault. We are both using the Intel 10 Fortran compiler and running on the same supercomputer.

The error occurs on the first executable line of the following Fortran routine. A print statement immediately before the call to this subroutine prints as expected.

Any suggestions? Thanks in advance!

subroutine SparseGMRES_up(
& lhsK,
& Utol, colu, rowu,
& rhsGu,
& solu,
& Kspaceu, Kspaceu_mn, Ksp_usd,
& icntu,
& NNODZu,
& Pu, Qu, Ru, NDOF,
& TuMu, TuSu, TvMu, TvSu, TwMu, TwSu,
& MCPu, NCPu, OCPu,
& lpu,lpv,lpw
& )

use mpi

implicit none

real*8 HBrg(Kspaceu+1,Kspaceu)
real*8 Rcos(Kspaceu), Rsin(Kspaceu)
real*8 lhsKBdiag(NNODZu,NDOF)
real*8 uBrg1(NNODZu,NDOF,Kspaceu+1)

real unorm_ref

integer
& n, i, j, k, iK, iKs, jK, lK, kK, Kspaceu, count, idof,
& icntu, Kspaceu_mn,Ksp_usd, k_f, k_f2,
& NNODZu, Pu, Qu, Ru, NSD, NDOF,
& is, lenseg,
& MCPu, NCPu, OCPu,
& lpu,lpv,lpw

integer, dimension(NNODZu+1):: colu
integer, dimension(NNODZu*8*(Pu+1)*(Qu+1)*(Ru+1)):: rowu

real*8
& rhstmp1(NNODZu,NDOF),
& eBrg(Kspaceu+1),
& temp1u(NNODZu,NDOF),
& yBrg(Kspaceu),
& rhsGu(NNODZu,NDOF), solu(NNODZu,NDOF),
& rr, unorm,
& rru, rrul,
& epsnrm, beta, rrglob,
& ercheck, tmp, tmp1, Utol,
& lhsK(NDOF*NDOF,icntu),
& Binv(NDOF,NDOF)

real*8
& TuMu(2*Pu,Pu), TuSu(2*Pu,Pu),
& TvMu(2*Qu,Qu), TvSu(2*Qu,Qu),
& TwMu(2*Ru,Ru), TwSu(2*Ru,Ru)

rhstmp1 = RHSGu !THIS LINE IS NEVER SUCCESSFULLY EXECUTED

!------------- zero out common values --------------
lhsKBdiag = 0d0
uBrg1 = 0d0
Rcos = 0d0
Rsin = 0d0
HBrg = 0d0

Compiler details: mpif90 -v
mpif90 for 1.2.7 (release) of : 2005/06/22 16:33:49
Version 10.1
ld /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/3.4.6/crtbegin.o --eh-frame-hdr -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out /opt/apps/intel/10.1/fc/lib/for_main.o -rpath-link /opt/apps/intel10/mvapich/1.0.1/lib/shared -L/opt/apps/intel10/mvapich/1.0.1/lib/shared -L/opt/apps/intel10/mvapich/1.0.1/lib -lmpichf90nc -lmpichfarg -lmpich -L/opt/ofed//lib64 -rpath=/opt/ofed//lib64 -libverbs -libumad -lpthread -lpthread -lrt -L/opt/apps/intel/10.1/fc/lib -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64 -lifport -lifcore -limf -lsvml -lm -lipgo -lintlc -lc -lgcc_s -lgcc -lirc_s -ldl -lc /usr/lib/gcc/x86_64-redhat-linux/3.4.6/crtend.o /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crtn.o
/opt/apps/intel/10.1/fc/lib/for_main.o: In function `main':
/export/users/nbtester/efi2linuxx86_nightly/branch-10_1/20080604_000000/libdev/frtl/src/libfor/for_main.c:(.text+0x26): undefined reference to `MAIN__'
rm /tmp/ifort39BLvflibgcc

rm /tmp/ifortnJxF2ignudirs

rm /tmp/ifortR2QW6pgnudirs

System details: Dell Linux Cluster contains 5,840 cores within 1,460 Dell PowerEdge 1955 compute blades (nodes), 16 PowerEdge 1850 compute-I/O server-nodes, and 2 PowerEdge 2950 (2.66GHz) login/management nodes. Each compute node has 8GB of memory, and the login/development nodes have 16GB. The system storage includes a 103TB parallel (WORK) Lustre file system, and 106.5TB of local compute-node disk space (73GB/node). An InfiniBand switch fabric, employing PCI Express interfaces, interconnects the nodes (I/O and compute) through a fat-tree topology, with a point-to-point bandwidth of 1GB/sec (unidirectional speed).

Compute nodes have two processors, each a Xeon 5100 series 2.66GHz dual-core processor with a 4MB unified (Smart) L2 cache. Peak performance for the four cores is 42.6 GFLOPS. Some of the key features of the Core micro-architecture are: dual-core, L1 Instruction cache, 14 unit pipeline, eight pre-fetch units, Macro Ops Fusion, double-speed integer units, Advanced Smart (sharing) L2 cache, and 16 new SSE3 instructions. The memory system uses Fully Buffered DIMMS (FB-DIMMS) and a 1333 MHz (10.7 GB/sec) front side bus.

TimP · ‎02-04-2010

If you haven't yet investigated stack overflow and its remedies, you should read the previous discussions of it on the Linux Fortran forum. Possibly you aren't using the same stack limits as your colleagues.
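
To judge whether the stack limit is a plausible culprit, it can help to tally how many bytes the routine's local arrays would need. These arrays are dimensioned by dummy arguments, and ifort typically places such arrays on the stack by default. The following is a minimal free-form sketch, not part of the original code. The values given to nnodzu, ndof and kspaceu are hypothetical placeholders, since the thread does not state the real sizes, and should be replaced with the values actually passed to SparseGMRES_up.

```fortran
program estimate_local_arrays
  implicit none
  ! 64-bit integer kind so the element count cannot overflow
  integer, parameter :: i8 = selected_int_kind(18)
  ! HYPOTHETICAL sizes (assumptions, not taken from the thread)
  integer(i8), parameter :: nnodzu  = 200000_i8
  integer(i8), parameter :: ndof    = 4_i8
  integer(i8), parameter :: kspaceu = 100_i8
  integer(i8) :: nwords, nbytes

  ! Count the real*8 elements of each local array declared in the routine
  nwords = (kspaceu + 1) * kspaceu         &  ! HBrg
         + 2 * kspaceu                     &  ! Rcos, Rsin
         + nnodzu * ndof                   &  ! lhsKBdiag
         + nnodzu * ndof * (kspaceu + 1)   &  ! uBrg1
         + 2 * nnodzu * ndof               &  ! rhstmp1, temp1u
         + (kspaceu + 1) + kspaceu         &  ! eBrg, yBrg
         + ndof * ndof                        ! Binv

  nbytes = 8_i8 * nwords                      ! real*8 occupies 8 bytes
  print '(a,f12.1,a)', 'local arrays need about ', &
        real(nbytes, kind(1d0)) / 1048576d0, ' MiB'
end program estimate_local_arrays
```

In this tally uBrg1 dominates, because it scales with NNODZu*NDOF*(Kspaceu+1). Comparing the printed figure with the stack limit in effect on the compute nodes shows whether a stack overflow is plausible.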

Apparently, you are creating large automatic arrays. These are notorious for offering no means of error checking; hence the usual recommendation to use ALLOCATABLE arrays and check the status returned by ALLOCATE, so you know where the problem occurs.
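
The ALLOCATABLE recommendation could look roughly like the sketch below, which is an illustration rather than a rewrite of the original routine. It is free-form, it covers only two of the work arrays (uBrg1 and rhstmp1), and the sizes are hypothetical placeholders. In the real subroutine the dimensions would come from the dummy arguments NNODZu, NDOF and Kspaceu.

```fortran
program allocatable_work_arrays
  implicit none
  ! HYPOTHETICAL sizes (assumptions, not taken from the thread)
  integer, parameter :: nnodzu = 1000, ndof = 4, kspaceu = 50
  ! Declared ALLOCATABLE instead of automatic, so storage comes from the heap
  real(kind(1d0)), allocatable :: uBrg1(:,:,:), rhstmp1(:,:)
  integer :: ierr

  ! stat= returns a nonzero code on failure instead of crashing
  allocate(uBrg1(nnodzu, ndof, kspaceu + 1), stat=ierr)
  if (ierr /= 0) then
    print *, 'allocation of uBrg1 failed, stat = ', ierr
    stop 1
  end if

  allocate(rhstmp1(nnodzu, ndof), stat=ierr)
  if (ierr /= 0) then
    print *, 'allocation of rhstmp1 failed, stat = ', ierr
    stop 1
  end if

  ! Same initialisation as in the original routine
  uBrg1   = 0d0
  rhstmp1 = 0d0
  print *, 'allocated elements: ', size(uBrg1) + size(rhstmp1)

  deallocate(uBrg1, rhstmp1)
end program allocatable_work_arrays
```

Checking ierr after each ALLOCATE identifies which array could not be obtained, whereas an oversized automatic array simply faults on entry to the routine.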

You also show a linker message indicating that you didn't supply a Fortran main program. Your colleagues must have resolved that if they have been successful.

andjhawkins · ‎02-05-2010

I have investigated stack overflow, but my limit was already set to unlimited, so I don't believe that is the cause. Would you have any other ideas?

I believe the linker message comes from a wrapper around mpif90, but I will investigate further.

I will also try changing the arrays to allocatable.

Thank you for your suggestions.

Andrea