Re: Segmentation faults with huge arrays with sources that should actually work

paconrad · ‎08-18-2020

I recently tried to add a print statement to a program that I had written in February.

The program solves regular linear systems via what I call iterative linear projection, but that doesn't matter here.

A print statement shouldn't change the program's stability.

But when I execute the binary, I immediately get a segmentation fault when working with arrays larger than approximately 400x400. I therefore removed the print statement and, contrary to what I had expected, I get segmentation faults whether the print statement is there or not.

Of course, when the program works, it computes arithmetically correct results, as I have verified.

This is really crazy behavior, because I last edited the sources in February and I still have some "old" binaries from that time. These "old" binaries still work without any faults, and with arithmetically correct results, up to array sizes that are limited only by my memory.

At the end of July I completely reinstalled my computer, which runs Ubuntu 20.04, because of an SSD exchange. Considering these circumstances, I assume that something is deeply broken with my compiler.

I compile the source via »ifort -o geometric_linsolve_main geometric_linsolve_main.f90«

Here is my source code:

program linsolve_mainprogram
  implicit none
  
  integer(8) :: N
  real(8) :: sttm_0, edtm_0, mainprogramtime
  real(8), allocatable :: SoLEQ(:,:)
  real(8), allocatable :: sv_0(:,:)
  
  open(1, file='/mnt/ramdisk0/N.unf', form='unformatted')
  read(1) N
  close(1)
  allocate(SoLEQ(N, N+1))
  allocate(sv_0(N+1, 1))
  
  open(100, file='/mnt/ramdisk0/SoLEQ.unf', form='unformatted')
  read(100) SoLEQ
  close(100)
  
  call cpu_time(sttm_0)
  call linsolve_subroutine(SoLEQ, N, sv_0)
  call cpu_time(edtm_0)
  
  open(10, file='/mnt/ramdisk0/sol_vector.unf', form='unformatted')
  write(10) sv_0
  close(10)
  mainprogramtime = edtm_0 - sttm_0
  
  open(2, file='/mnt/ramdisk0/fortrantime.unf', form='unformatted')
  write(2) mainprogramtime
  close(2)
  
end program linsolve_mainprogram


subroutine linsolve_subroutine (SoLEQ, N, sv_0)
  implicit none
  
  integer(8), intent(in) :: N
  real(8), intent(in) :: SoLEQ(N, N+1)
  integer(8) :: i, j, k, q
  integer(8) :: dim_0
  real(8), parameter :: alpha = 1.0, beta = 0.0
  real(8) :: av_0, tol_0, sttm_0, edtm_0, con_tmp
  real(8) :: cfm_0(N, N), cnc_0(N, 1), cfl_0(1, N)
  real(8) :: pos_0(N, N+1), pv_0(N, 1), vec_0(N, N+1)
  real(8) :: scpd_0(1, 1), scpd_1(1, N+1), lvr_0(1, N+1)
  real(8), intent(out) :: sv_0(N+1, 1)

  
  tol_0 = 1e-09 * N
  av_0 = sum(abs(SoLEQ)) / (N**2 + N)
  call random_seed()
  call random_number(pos_0)
  write(*,*) "N =", N
    
  do i = 1, N, 1
    cnc_0(i, 1) = SoLEQ(i, 1)
    pos_0(i, 1) = pos_0(i, 1) - 0.5
    do j = 1, N, 1
      cfm_0(i, j) = SoLEQ(i, j+1)
      pos_0(i, j + 1) = pos_0(i, j + 1) - 0.5
    end do
  end do
  
  do dim_0 = 1, N, 1
    write(*,*) "dim0 =", dim_0
    
    con_tmp =  cnc_0(dim_0, 1)
    do j = 1, N, 1
      cfl_0(1, j) = cfm_0(dim_0, j)
    end do
        
    pv_0(:, 1) = pos_0(:, N+2 - dim_0)
    do j = 1, N+1 - dim_0, 1
      vec_0(:, j) = pos_0(:, j) - pv_0(:, 1)
    end do
    
    scpd_0 = matmul(cfl_0, pv_0)
    scpd_1 = matmul(cfl_0, vec_0)

    do j = 1, N+1 - dim_0, 1
      lvr_0(1, j) = (con_tmp - scpd_0(1, 1)) / scpd_1(1, j)
    end do
 
    do j = 1, N+1 - dim_0, 1
      pos_0(:, j) = pv_0(:, 1) + lvr_0(1, j) * vec_0(:, j)
    end do
  end do
  
  sv_0(1, 1) = 0.0
  do i = 1, N, 1
    sv_0(i+1, 1) = pos_0(i, 1)
  end do
  
end subroutine linsolve_subroutine

Does anyone have experience with segmentation faults? Which programming practices do I need to follow to avoid them? Why do they only occur when handling huge arrays?

andrew_4619 · ‎08-20-2020

Such behaviour is normally caused by a bug in your code. For example, you go past an array limit and clobber (corrupt) some other data. That other data might not be important: it may have been used already, or it may get initialised later and thus corrected. Making any change, e.g. adding a print statement, alters the layout of your code, and you now clobber a different thing. Thus the bug can appear and disappear, but in reality it is always there.

My recommendation is to always aim to have all compile-time and run-time checks enabled when debugging (with ifort on Linux, for example, -check all -warn all -traceback) and to also use standards checking (/stand on Windows, -stand on Linux). Fix any items that get thrown out by that before looking harder; often that fixes the problem.

paconrad · ‎08-20-2020

Thank you for your helpful ideas.

Thinking that there was a bug in my code, I tried to rewrite it completely, which is feasible because there isn't very much of it. I intended to write, compile and execute it step by step in order to figure out when the bug occurs. First, I wrote only the entire I/O main program and the specification section of the subroutine, compiled it and tested it.

Fortunately, this "empty program", which didn't compute anything and just defined variables and arrays and allocated memory for them, did not cause any errors.

So I began examining the first proper computation part, which simply maps the array of random floats in [0, 1] generated by „call random_number({array})“ onto an array of random floats in [-0.5, 0.5].

This little double loop was sufficient to make the issue occur. I felt a little overwhelmed and powerless, because every computation I tried that accessed huge arrays led to a segmentation fault, no matter how simple and trivial it was.

But this morning I remembered that I had already had this issue in January, if I remember correctly (the exact date doesn't matter). There was a simple compiler argument called „-heap-arrays“ which I had used back then at the beginning of the year.

As far as I know, this happens because ifort stores allocatable arrays on the stack by default. Due to the static properties of the stack, the kernel can't dedicate enough space to the array, and when the program accesses addresses outside the sections it is allowed to access, the kernel interrupts it. That's what we normally call a segmentation fault.

If somebody has something to add or to correct concerning my assumption, he or she is welcome to teach me, of course.

Of course, it still works pretty well.

Johannes_Rieke · ‎08-20-2020

Using the -heap-arrays flag puts statically defined arrays (automatic arrays and compiler temporaries) on the heap instead of the stack and avoids the issues you described. Allocatable arrays are always stored on the heap by default, if I'm not wrong; you wrote it the other way round. However, in your subroutine you defined a lot of N, NxN and Nx(N+1) arrays statically. These probably blew your stack.

Anyway, I would encourage you to use modules, and thereby explicit interfaces, for your subroutines. Furthermore, allocating the big arrays in the subroutine avoids the need for -heap-arrays and lets you control the use of heap and stack better. Alternatively, you can add a size threshold for when the heap shall be used (-heap-arrays size, where size is an integer value representing the size of the arrays in kilobytes; arrays smaller than size are put on the stack). If needed, you could also play around with the stack size in the linker settings (Windows OS is different from GNU/Linux). I've seen code where the limit had been set to a very high value (LINKER -> Stack reserve size = 999999999 or /STACK:999999999 on Windows OS).
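As a minimal sketch of that suggestion (a module for the explicit interface, plus a local ALLOCATABLE work array in place of an automatic one), something along these lines should work. The names heap_workspace_demo, centred_mean and dp are made up for illustration and do not come from the original program, and the computation is only a stand-in for the real solver.

```fortran
module heap_workspace_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! illustrative kind parameter
contains
  ! Fills an n x (n+1) work array with random numbers shifted to
  ! [-0.5, 0.5) and returns their mean. The work array is a local
  ! ALLOCATABLE, so it is placed on the heap regardless of n.
  subroutine centred_mean(n, mean, stat)
    integer,  intent(in)  :: n
    real(dp), intent(out) :: mean
    integer,  intent(out) :: stat
    real(dp), allocatable :: pos(:,:)   ! instead of: real(dp) :: pos(n, n+1)
    integer :: i, j

    mean = 0.0_dp
    allocate(pos(n, n+1), stat=stat)
    if (stat /= 0) return               ! let the caller report the failure

    call random_number(pos)
    do j = 1, n + 1
      do i = 1, n
        pos(i, j) = pos(i, j) - 0.5_dp
        mean = mean + pos(i, j)
      end do
    end do
    mean = mean / (real(n, dp) * real(n + 1, dp))
    ! pos is deallocated automatically when the subroutine returns
  end subroutine centred_mean
end module heap_workspace_demo

program heap_workspace_main
  use heap_workspace_demo, only: dp, centred_mean
  implicit none
  real(dp) :: mean
  integer  :: stat

  ! 2000 x 2001 real(8) values are about 32 MB, far more than a typical stack
  call centred_mean(2000, mean, stat)
  if (stat /= 0) then
    write(*,*) 'allocation failed, stat =', stat
    stop 1
  end if
  write(*,*) 'mean of centred values =', mean
end program heap_workspace_main
```

Because the subroutine lives in a module, the compiler checks every call against its interface, and the stat= argument turns an out-of-memory condition into a reportable error rather than a crash.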

Happy coding, Johannes

mecej4 · ‎08-20-2020

Here are a couple of recommendations.

Add the STATUS = 'OLD' clause to the OPEN statements for existing files. Without that, if you did not set the correct path to the files in the OPEN statement, an empty new file will be created, and the subsequent READ statements on that file will fail.
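A small sketch of what that looks like in practice, assuming a file name (no_such_file_N.unf) that is made up here and presumed not to exist: with STATUS='OLD' the OPEN fails with a nonzero iostat, and no empty file is left behind.

```fortran
program status_old_demo
  implicit none
  ! Hypothetical file name, assumed to be absent from the working directory
  character(len=*), parameter :: path = 'no_such_file_N.unf'
  character(len=256) :: msg
  integer :: u, ios
  logical :: exists

  ! STATUS='OLD' requires the file to exist already; ACTION='READ'
  ! additionally documents that it is only read here.
  open(newunit=u, file=path, form='unformatted', status='old', &
       action='read', iostat=ios, iomsg=msg)
  if (ios /= 0) then
    write(*,*) 'OPEN failed: ', trim(msg)
  else
    close(u)
  end if

  ! Confirm that the failed OPEN did not create an empty file
  inquire(file=path, exist=exists)
  write(*,*) 'file exists after OPEN attempt: ', exists
end program status_old_demo
```

The same status='old', iostat= and iomsg= specifiers can be added to the OPEN statements for N.unf and SoLEQ.unf in the posted program.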

If you want more help, please provide all the source and data files, zipped together and attached to your reply.

Too often, the compiler is suspected to be responsible when the user or some other software is at fault.

Bernard · ‎08-20-2020

>>>But, when executing the binary I immediately get a segmentation fault when working with arrays approximately greater than 400x400.>>>

The default user-mode stack allocation per thread on Linux is ~8 MiB. As pointed out by the others, you statically allocated a lot of arrays of primitive type real(8). The total size of those arrays is not known, but you mentioned that arrays greater than 400x400 double precision elements (1,280,000 bytes each) caused a segmentation fault, probably by touching some kind of guard page or overflowing the previous frame partly or completely.
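For a rough estimate, the local arrays declared in the posted subroutine can simply be added up. The sketch below does that for a few values of N and compares the result with an assumed 8 MiB stack limit. It counts only the declared local arrays (cfm_0, cnc_0, cfl_0, pos_0, pv_0, vec_0, scpd_0, scpd_1, lvr_0); any temporaries the compiler creates for expressions such as matmul or abs(SoLEQ) are not included, so the real demand is likely higher.

```fortran
program stack_estimate
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  ! Assumed default stack limit of 8 MiB; the actual limit is system dependent
  integer(int64), parameter :: stack_limit = 8_int64 * 1024_int64 * 1024_int64
  integer(int64) :: n, elems, bytes

  do n = 100_int64, 800_int64, 100_int64
    elems = n*n            &   ! cfm_0(N, N)
          + 2*n*(n + 1)    &   ! pos_0(N, N+1), vec_0(N, N+1)
          + 3*n            &   ! cnc_0(N, 1), cfl_0(1, N), pv_0(N, 1)
          + 2*(n + 1)      &   ! scpd_1(1, N+1), lvr_0(1, N+1)
          + 1                  ! scpd_0(1, 1)
    bytes = 8_int64 * elems    ! real(8) takes 8 bytes per element
    write(*,'(a,i0,a,i0,a,l1)') 'N = ', n, '  bytes = ', bytes, &
         '  exceeds 8 MiB: ', bytes > stack_limit
  end do
end program stack_estimate
```

By this count the declared arrays need roughly 3.9 MB at N = 400 and only pass 8 MiB near N = 590, so if the failure really starts around 400, temporaries and other stack usage presumably make up the difference; that part is an assumption, not something measured here.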

John_Campbell · ‎08-20-2020

Large automatic arrays can be a problem. It appears that there is the potential to overflow the available stack, so rather than some automatic arrays, I would recommend using ALLOCATABLE arrays.

I also don't understand why you would have two files : N.unf and SoLEQ.unf, as they are not independent.

Finally I would recommend some reporting of the reading stage for SoLEQ.unf and possibly reducing the size of the unformatted records ( You have suggested that N is an 8-byte integer ). The following is an alternative approach to this phase of your program, which assumes you can modify the creation of file SoLEQ.unf.

! ...
 integer    :: i, iostat, stat
 integer(8) :: mem

 open (100, file='/mnt/ramdisk0/SoLEQ.unf', form='unformatted')
 read (100, iostat=iostat) N
 if (iostat /= 0) then
   write (*,*) ' FILE ERROR : problem reading N, iostat=',iostat
   stop
 end if
 write (*,*) 'problem size N =',n
!
 allocate (sv_0(N+1, 1), SoLEQ(N, N+1), stat=stat)   ! one stat covers both arrays
 if (stat/=0) then
   write (*,*) ' ALLOCATE ERROR :: could not allocate sv_0(N+1, 1) and SoLEQ(N, N+1) : stat=',stat
   stop
 end if
!
 do i = 1,N+1
   read(100, iostat=iostat) SoLEQ(:,I)
    if (iostat /= 0) then
      write (*,*) ' FILE ERROR : problem reading SoLEQ, iostat=',iostat
      stop
    end if
 end do
 write (*,*) 'SoLEQ recovered from /mnt/ramdisk0/SoLEQ.unf'
!
 close(100)
!
 mem  = ( N*(N+1)*3       &          ! 11 declared arrays bytes
        + N*N    *1       &
        + (N+1)  *3       &
        + N      *3       &
        + 1      *1 ) * 8
 write (*,*) 'Expected memory demand =',mem,' bytes'
! ...