I am helping someone run a large, parallel Fortran FEA code on our cluster. I compiled his code and Open MPI with the Intel compilers (both versions 12.0.3 and 11.0.083). The code works on 4 processes, either (1 node, 4 cores) or (4 nodes, 1 core each), but it fails on 8 processes. The error we get is:
*** glibc detected *** /home1/david/DynaTest/src/dynaflow/dynaflow.v02_mpi_intel_64bits: double free or corruption (!prev): 0x0000000013fcf290 ***
*** glibc detected *** /home1/david/DynaTest/src/dynaflow/dynaflow.v02_mpi_intel_64bits: double free or corruption (!prev): 0x000000001d3a7bc0 ***
*** glibc detected *** /home1/david/DynaTest/src/dynaflow/dynaflow.v02_mpi_intel_64bits: munmap_chunk(): invalid pointer: 0x0000000013e75850 ***
I have attached the full error message in a text file.
I have tried different versions of Open MPI: 1.3.2, 1.4.2 and 1.4.3. Same results.
On the other hand, compiling with the PGI compilers (11.2-1) works fine on as many cores as I want. The compilation options for both compilers are: -g -O0.
I have run the code with a debugger and it fails on a deallocate:
allocate(jb(nnzloc),stat=ierr)
....
call setup1(nloc,nbnd,a,ja,ia,nproc,proc,ix,ipr,aloc,jaloc,
*ialoc,b,jb,ib,iwk,map,type,nnzmax,nl)
....
deallocate(jb,stat=ierr)   ! <-- this is where it fails
The subroutine setup1 comes from a library for parallel sparse matrices that I could post.
We are running CentOS 5.5, kernel 2.6.18-194.32.1.el5, gcc version 4.1.2 20080704. The InfiniBand network is from QLogic, with OpenFabrics Enterprise Distribution (OFED) version 1.5.2. The CPUs are Intel, 64-bit, quad-core (E5345 @ 2.33GHz).
Intel + Open MPI works on other, simpler codes.
I am running out of things to try, so I am looking for any clues on how to make it work. Maybe some special compile options? There might be a problem with the code, but it's hard for me to argue that since it works with PGI.
Thanks,
David
1 Solution
David,
This looks like either
a) something in your code stomped on the array descriptor for jb
.or.
b) something stomped on the memory immediately preceding the heap node for the allocation of the array jb
As a diagnostic, try this:
after the allocate, print out the rank, size and loc of jb
after the call to setup1 (before the deallocate of jb), do the same
If the rank, size or loc differ, then the array descriptor got trashed.
If they are the same, then something likely wrote to an address preceding the allocation (e.g. wrote to jb(0) or jb(-1), ...).
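The diagnostic above could be sketched roughly as follows. This is only an illustration under some assumptions: the size 100 and the helper name report are made up, the call to setup1 is left as a comment because its arguments are not available here, and loc() is a vendor extension (Intel and GNU compilers provide it) rather than standard Fortran. The rank is omitted because the rank intrinsic is only available in newer compilers.

```fortran
program check_descriptor
  implicit none
  integer, allocatable :: jb(:)
  integer :: nnzloc, ierr

  nnzloc = 100                       ! hypothetical size, for illustration only
  allocate(jb(nnzloc), stat=ierr)
  if (ierr /= 0) stop 'allocate failed'
  jb = 0

  call report('after allocate', jb)

  ! ... the call to setup1 would go here ...

  call report('before deallocate', jb)

  deallocate(jb, stat=ierr)
  if (ierr /= 0) stop 'deallocate failed'

contains

  ! Print what the array descriptor currently says about the array.
  subroutine report(label, arr)
    character(len=*), intent(in)     :: label
    integer, allocatable, intent(in) :: arr(:)
    ! loc() is a common extension, not part of the Fortran standard
    print '(a,": size=",i0," lbound=",i0," ubound=",i0," loc=",i0)', &
         label, size(arr), lbound(arr,1), ubound(arr,1), loc(arr)
  end subroutine report

end program check_descriptor
```

If the two printed lines differ, the descriptor was overwritten between them. If they match and the deallocate still fails, the heap bookkeeping around the allocation is the more likely victim.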
Note that the fact that the code works on one system does not indicate that the code does not trash memory. It only means the symptom did not show up there.
Jim Dempsey
3 Replies
Jim,
You put me on the right track and I found the problem. I ran valgrind and found an "Invalid write of size 4" in setup1. I went into the code and increased the memory allocated to the problematic array, and it now works.
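For anyone hitting the same kind of error, an invalid write like this usually comes from a routine storing into an array it was handed without knowing how much was really allocated. The sketch below is hypothetical and is not the actual setup1 code: the routine fill, its arguments and the sizes are invented for illustration. It shows one way a routine can check each store against the capacity it is given and report an error instead of writing past the end.

```fortran
program guarded_fill
  implicit none
  integer, allocatable :: jb(:)
  integer :: nnzmax, nfilled, ierr

  nnzmax = 4                         ! hypothetical capacity
  allocate(jb(nnzmax))
  jb = 0

  call fill(jb, nnzmax, 4, nfilled, ierr)
  print '(a,i0,a,i0)', 'request 4: filled=', nfilled, ' ierr=', ierr

  call fill(jb, nnzmax, 5, nfilled, ierr)
  print '(a,i0,a,i0)', 'request 5: filled=', nfilled, ' ierr=', ierr

  deallocate(jb)

contains

  ! Store 1..nwanted into arr, but never beyond its declared capacity.
  subroutine fill(arr, cap, nwanted, nfilled, ierr)
    integer, intent(in)    :: cap, nwanted
    integer, intent(inout) :: arr(cap)   ! explicit-shape dummy, as in older libraries
    integer, intent(out)   :: nfilled, ierr
    integer :: k

    ierr = 0
    nfilled = 0
    do k = 1, nwanted
      if (k > cap) then
        ierr = 1                     ! out of room: report it, do not write
        return
      end if
      arr(k) = k
      nfilled = k
    end do
  end subroutine fill

end program guarded_fill
```

The second call asks for one element more than was allocated and comes back with ierr=1 and only 4 elements filled, rather than silently corrupting the heap. Compiling with run-time bounds checking (-check bounds with the Intel compiler, -fcheck=bounds with gfortran) is another way to catch this class of error, although it cannot see through an explicit-shape dummy that is declared larger than the actual allocation.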
Thanks a lot for your help.
David
David,
Look at your code to make certain that it is correct. Increasing an array size by one element may or may not have been the correct thing to do, even though it made the symptom go away. It is not unusual for a site to have FORTRAN code maintainers who are more familiar with C/C++ than FORTRAN. Even if this programmer is yourself, one must keep in mind that FORTRAN arrays begin at (1) by default, as opposed to [0] in C/C++. When porting a function from C/C++ (or making interoperable calls), it is all too easy to write one element before the allocation. If that memory is only in temporary or transient use, you might never see a symptom, until the stray write disturbs something the program needs later, e.g. the C Runtime Library's hidden heap node information or some other variable used in your program subsequent to the damage.
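A small sketch of the indexing point, with made-up array names and sizes: a Fortran array allocated as (n) has valid indices 1..n, while code ported from C may assume 0..n-1. Declaring the lower bound explicitly keeps the two conventions from being mixed up, and looping from lbound to ubound avoids hard-coding either one.

```fortran
program lower_bounds
  implicit none
  integer, allocatable :: f_style(:), c_style(:)
  integer :: n, i

  n = 5
  allocate(f_style(n))        ! valid indices 1..n (the Fortran default)
  allocate(c_style(0:n-1))    ! valid indices 0..n-1 (C-like, stated explicitly)

  print '(a,i0,a,i0)', 'f_style bounds: ', lbound(f_style,1), '..', ubound(f_style,1)
  print '(a,i0,a,i0)', 'c_style bounds: ', lbound(c_style,1), '..', ubound(c_style,1)

  ! Loop over the declared bounds instead of assuming where they start.
  do i = lbound(c_style,1), ubound(c_style,1)
    c_style(i) = i
  end do

  ! Copy into the 1-based array, shifting the index by one.
  do i = lbound(f_style,1), ubound(f_style,1)
    f_style(i) = c_style(i-1)
  end do

  print '(5(i0,1x))', f_style

  deallocate(f_style, c_style)
end program lower_bounds
```

Writing c_style(0) here is legal only because the lower bound was declared as 0. The same store into f_style(0) would land just before the allocation, which is exactly the kind of write that damages heap bookkeeping.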
Jim Dempsey