Solved: Overcoming Segmentaion fault error with efficient memory management on a cluster

Yaswanth · ‎11-03-2020

Hello,

I am trying to solve a linear system of size (263169x263169) FORTRAN code using Intel compiler on a cluster. I do not get any error while allocating the memory to such a large one dimensional array (69257922561=263169x263169). I am running into segmentation fault runtime errors while executing the program. I have run the same code on a smaller system and it does not show any errors, so I think it is due to actual memory issues. I am not sure why I would run into this error when I do not get any warning while allocating the memory. I can run my code on up to 16 nodes on cluster each of which has a 256 GB RAM. So the combined RAM should be sufficient for me to solve the problem. I was suggested that distributing the memory onto different nodes might help resolve this error. I am not sure how one would do that.

To illustrate my problem more clearly, I have attached here a baby code in which I allocate the required memory for an array and a do loop in which I access each element of that array. As I mentioned, I do not get any compiling errors but I would end up with segmentation fault run time error.

program test
        implicit none
        integer :: AllocateStatus
        integer :: n = 256, p
        real(8), dimension(:,:), allocatable :: Cgm
        real(8), dimension(:), allocatable :: Cg
        real(16) :: ns
        ns = (2.0_16*n+1.0_16)**2.0_16
        print *, int(ns**2,8)
        allocate (Cg(int(ns**2)), STAT = AllocateStatus)
        if (AllocateStatus/=0) then
                print *, "*** Not enough memory ***"
                stop
        end if

        Cg = 0.0_8
        do p = 1,int(ns**2)
                !print *,p
                Cg(p) = 1.0_8
        end do

        print *, Cg(p)
end program test

Thanks

jimdempseyatthecove · ‎11-06-2020

Note, using your 16 x 256GB cluster would have sufficient memory, but you cannot treat it as a single allocation by one system (rank) in an MPI process. Example

Each rank (one of 16) could manage 16,448 x 263,169 elements of the array (rectangular, eg rows or cols)

Or

Each rank could manage 65,792 x 65,792 square tiles of the array.

Or

some other partitioning that is suitable for your programming needs.

It has not been stated at to what your (model) problem needs to compute, if it is relatively simple, you might find COARRAYS the better choice.

Jim Dempsey

View solution in original post

Alberto_Sykes · ‎11-04-2020

Yaswanth, Thank you for posting in the Intel® Communities Support.

Just to let you know, I will move your thread to the proper department which is "Intel® Fortran Compiler" for them to further assist you with this topic.

Regards,

Albert R.

Intel Customer Support Technician

A Contingent Worker at Intel

Yaswanth · ‎11-04-2020

Thank you, Alberto.

jimdempseyatthecove · ‎11-05-2020

int(ns**2) returns an integer(4) value.

Your allocate statement would require int(n**2,8).

Also, there used to be an issue (may have been fixed), where any allocation that would exceed 2GB would work provided that the type of the allocation size variable(s) were expressed as integer(8). IOW their may have (at one time) been two different types of allocation descriptors, one for using 32-bit indexing, the other for 64-bit indexing (while the base may have been 64-bits, the limits of each dimension may have been 32 or 64 bits based on the descriptor type).

It would be a simple enough experiment for you to perform.

Also note:

integer :: p ! p is 32-bit integer(4)
...
do p=1,int(ns**2) ! iteration count is truncated as well

Jim Dempsey

Yaswanth · ‎11-05-2020

Thank you, Jim.

The allocate command showed memory error when I wrote 'int(n**2,8)' instead of int(n**2). That was a mistake. Does this mean that it is not possible to create an array of size 263169x263169?

My aim is to solve a linear system of this size, in which the matrix is full. Could you please suggest some possible workarounds if it is indeed not possible to create such a big array. I tried to rewrite my code so that I do not create the entire array and instead access each element whenever required within the code. But this takes forever to just go over one iteration loop which is involved in solving the linear system.

jimdempseyatthecove · ‎11-06-2020

263169x263169 =
69,257,922,561 elements x 8 bytes/element =
554,063,380,488 number of bytes require (for this one allocation)

This is much larger than the capacity of the system you listed in your first post.

Furthermore, should you manipulate this array, using implicit looping array expressions, the statement may be encoded by the compiler to use a temporary (on stack!!!) of same size. IOW .gt. 1TB of RAM assuming you have configured for that much stack.

Additionally, most system default settings require a page file to exceed that of the sum of the processes running on the system.

While you could handle this using cluster style programming, or COARRAYS, your problem may potentially be solvable, old school style, via file I/O processing, or via partitioning the data and processing partition(s) by partition(s).

Jim Dempsey

jimdempseyatthecove · ‎11-06-2020

Note, using your 16 x 256GB cluster would have sufficient memory, but you cannot treat it as a single allocation by one system (rank) in an MPI process. Example

Each rank (one of 16) could manage 16,448 x 263,169 elements of the array (rectangular, eg rows or cols)

Or

Each rank could manage 65,792 x 65,792 square tiles of the array.

Or

some other partitioning that is suitable for your programming needs.

It has not been stated at to what your (model) problem needs to compute, if it is relatively simple, you might find COARRAYS the better choice.

Jim Dempsey

Yaswanth · ‎11-06-2020

Thank you very much, Jim. I will try to modify my problem or make approximations to reduce the complexity. In either case, I will try to look at COARRAYS and partitioning the memory required.

Steve_Lionel · ‎11-05-2020

I assume this is on Linux. Linux does something called "lazy allocation" where, on allocate, the virtual address space is created but isn't actually added to your "working set" (my term, coming from VMS) until you touch it. (Some discussion here.)

The amount of RAM available is not relevant to whether or not the allocation succeeds or even the amount of virtual memory available - it affects mainly performance.

The compiler can't diagnose this, and if the ALLOCATE's call to the operating system's allocator succeeds, you won't get a failure from the ALLOCATE. But you can indeed get a segfault if you touch "too much" memory.

Jim's comment on the DO loop upper bound is valid, though I don't expect this would be the proximate cause of the segfault.

Yaswanth · ‎11-05-2020

Thank you, Steve.

Could you also let me know if there are better ways to solve linear systems of large size in FORTRAN if it is not possible to create such an array in the first place.

Steve_Lionel · ‎11-05-2020

If the allocate failed, then your system is not allowing you to create an array that big.

I can't help with the linear systems question, but I know there are others here who may be able to.

Arjen_Markus · ‎11-05-2020

I cannot claim to be an expert on solving linear systems of this size with a full matrix - or avoiding the memory issues that you describe (at least not without feeling thoroughly ashamed ;)). However, you say that you work on a cluster, so it should be possible to split up the NxN matrix into MxN or NxM matrices, where M is a suitable fraction of N and then apply partial Gaussian elimination or some more suitable and less basic iterative procedure on these parts. Communication between the programs that handle the parts could be done via coarrays or via MPI.

Of course whether that leads to an acceptable performance is another matter.

Ulrich_M_ · ‎11-06-2020

It is simply not feasible to solve a dense linear system of this size in useful time, irrespective of memory constraints or programming language. An efficient solver takes about one minute to solve a nxn system with n=10000 on my system. The complexity of solving linear systems is theoretically at least of the order of n^2, and in practice it scales with n^3. Even with infinite memory it would hence take ~17,500 minutes or ~12 days to solve your system. And this holds only without any additional overhead from the huge memory requirement.

You will have to change the problem or make other approximations.

John_Campbell · ‎11-08-2020

Allocating an array of this size would possibly fail if it exceeds the virtual memory address size, but may not. In Windows, it could fail if it exceeds the paging and memory capacity. Typically allocations like these do not fail until the matrix coefficients are being defined.

Irrespective, considering a rank 2 matrix of size 263,169 as a dense matrix is probably not a practical approach.

You would need to solve this by utilising the sparse properties of the matrix. These sparse properties are usually associated with the field of research.

I work in the field of Finite Element Analysis, where the banded or profile of non-zero matrix values is utilised to significantly reduce the computation time by both eliminating storage of zero values and eliminating calculations using zero values to produce a solution in an acceptable time. My recent problems have extended to N=450,000, but require sparsity properties to be practically solved. Equation re-ordering is also a significant issue.

MKL does provide sparse matrix solvers, however the practicality of defining a large set of equations that is not singular could be a challenge.

You need to research how these types of large problems are practically managed to obtain a solution in a reasonable time using available computing resources.

Assembling a large set of linear equations usually implies a direct solver approach. Rather than using a direct solver, itterative solvers can be used, where the full matrix is never assembled. The most appropriate solution approach for your field of research probably has a long history of development.