Solved: "double free or corruption" error with ifort but works with other compilers

David_Luet · ‎05-30-2011

I am helping someone run a large, parallel Fortran FEA code on our cluster. I compiled his code and Open MPI with the Intel compilers (both version 12.0.3 and 11.0.083). The code works on 4 processes, as either (1 node, 4 cores) or (4 nodes, 1 core), but it fails on 8 processes. The error we get is:
*** glibc detected *** /home1/david/DynaTest/src/dynaflow/dynaflow.v02_mpi_intel_64bits: double free or corruption (!prev): 0x0000000013fcf290 ***
*** glibc detected *** /home1/david/DynaTest/src/dynaflow/dynaflow.v02_mpi_intel_64bits: double free or corruption (!prev): 0x000000001d3a7bc0 ***
*** glibc detected *** /home1/david/DynaTest/src/dynaflow/dynaflow.v02_mpi_intel_64bits: munmap_chunk(): invalid pointer: 0x0000000013e75850 ***
I have attached the full error message in a text file.
I have tried different versions of Open MPI: 1.3.2, 1.4.2 and 1.4.3. Same results.

On the other hand, compiling with the PGI compilers (11.2-1) works fine on as many cores as I want. The compilation options for both compilers are: -g -O0.

I have run the code with a debugger and it fails on a deallocate:
allocate(jb(nnzloc),stat=ierr)
....
call setup1(nloc,nbnd,a,ja,ia,nproc,proc,ix,ipr,aloc,jaloc,
*ialoc,b,jb,ib,iwk,map,type,nnzmax,nl)
....
deallocate(jb,stat=ierr)    ! <-- this is where it fails

The function setup1 comes from a library for parallel sparse matrices that I could post.
We are running CentOS 5.5, kernel: 2.6.18-194.32.1.el5, gcc version 4.1.2 20080704. The InfiniBand network is from QLogic, with OpenFabrics Enterprise Distribution (OFED) version 1.5.2. The CPU is a 64-bit, quad-core Intel (E5345 @ 2.33GHz).

Intel + openmpi works on other, simpler, codes.

I am running out of things to try so I am looking for any clues on how to make it work. Maybe some special compile options? There might be a problem with the code but it's hard for me to argue that since it works with PGI.

Thanks,
David

jimdempseyatthecove · ‎05-30-2011

David,

This looks like either

a) something in your code stomped on the array descriptor for jb
.or.
b) something stomped on the memory immediately preceding the memory (heap) node for the allocation of the array jb

As a diagnostic, try this:

after the allocate, print out the rank, size and loc of jb
after the call to setup1 (before the deallocate of jb), do the same

If the rank, size or loc differ, then the array descriptor got trashed.
If they are the same, then likely something wrote to an address preceding the allocation (e.g. wrote to jb(0) or jb(-1), ...).

Note, the fact that the code works on one system does not indicate that the code does not trash memory. It only means the symptom did not show up.

Jim Dempsey
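
A minimal sketch of the diagnostic described above, for anyone who wants to try it. It is not taken from the original code: the program name, the `report` helper, and the value of `nnzloc` are hypothetical, and the call to setup1 is only indicated by a comment. `loc` is a widely supported compiler extension (ifort, gfortran, PGI), not standard Fortran, and the rank is obtained as `size(shape(arr))` so that older compilers accept it.

```fortran
program check_descriptor
  implicit none
  integer, allocatable :: jb(:)
  integer :: nnzloc, ierr

  nnzloc = 100   ! hypothetical size, for illustration only

  allocate(jb(nnzloc), stat=ierr)
  if (ierr /= 0) stop 'allocate failed'
  call report('after allocate   ', jb)

  ! ... in the real code, the call to setup1 sits here ...

  call report('before deallocate', jb)
  deallocate(jb, stat=ierr)
  if (ierr /= 0) print '(a,i0)', 'deallocate failed, stat=', ierr

contains

  ! Print what the array descriptor currently claims about the array.
  subroutine report(label, arr)
    character(len=*), intent(in) :: label
    integer, allocatable, intent(in) :: arr(:)
    print '(a,": rank=",i0," size=",i0," lbound=",i0," loc=",i0)', &
         label, size(shape(arr)), size(arr), lbound(arr, 1), loc(arr)
  end subroutine report

end program check_descriptor
```

If the two lines of output differ, the descriptor was overwritten (case a). If they match and the deallocate still fails, the damage is more likely in the heap bookkeeping next to the block (case b), which this check cannot see directly.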

David_Luet · ‎05-30-2011

Jim,
You put me on the right track and I found the problem. I ran valgrind and found an "Invalid write of size 4" in setup1. I went into the code and increased the memory allocated to the problematic array, and it now works.
Thanks a lot for your help.
David

jimdempseyatthecove · ‎05-31-2011

David,

Look at your code to make certain that it is correct. Increasing an array size by one element may or may not have been the correct thing to do, even though it made the symptom go away. It is not unusual for a site to have FORTRAN code maintainers who are more familiar with C/C++ than with FORTRAN. Even if this programmer is yourself, be careful: FORTRAN arrays begin at (1) by default, whereas C/C++ arrays begin at [0]. When porting a function from C/C++ (or making interoperable calls), it is all too easy to write one element before the allocation. If that memory has only temporary or transient usage, you might never see a symptom, at least not until the stray write disturbs something the program uses later, e.g. the C Runtime Library's hidden heap node information or some other variable used by your program after the damage.

Jim Dempsey
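
To illustrate the indexing point, here is a small sketch, assuming a loop ported from C that runs from 0 to n-1. The program and the names `work` and `n` are hypothetical and not from the code discussed in this thread. Declaring the lower bound explicitly keeps the C-style indices inside the allocation, instead of writing to element 0 of an array that starts at 1.

```fortran
program zero_based_port
  implicit none
  integer, allocatable :: work(:)
  integer :: n, i, ierr

  n = 8   ! hypothetical length, for illustration only

  ! Give the array C-style bounds 0..n-1 explicitly.
  allocate(work(0:n-1), stat=ierr)
  if (ierr /= 0) stop 'allocate failed'

  ! The ported loop can keep its 0-based indices unchanged.
  do i = 0, n - 1
     work(i) = i * i
  end do

  print '(a,i0,a,i0)', 'bounds: ', lbound(work, 1), ' .. ', ubound(work, 1)
  print '(a,i0)', 'last element: ', work(n-1)

  deallocate(work)
end program zero_based_port
```

Compiling with run-time bounds checking (`-check bounds` with ifort, `-fcheck=bounds` with gfortran) can catch an out-of-range index where the compiler knows the array bounds. It may not help inside older library routines that declare their dummy arguments as assumed-size arrays, which is one reason a tool like valgrind can still be needed.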
