Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
Novice
293 Views

Overcoming Segmentaion fault error with efficient memory management on a cluster

Jump to solution

Hello,

I am trying to solve a linear system of size (263169x263169) FORTRAN code using Intel compiler on a cluster. I do not get any error while allocating the memory to such a large one dimensional array (69257922561=263169x263169). I am running into segmentation fault runtime errors while executing the program. I have run the same code on a smaller system and it does not show any errors, so I think it is due to actual memory issues. I am not sure why I would run into this error when I do not get any warning while allocating the memory. I can run my code on up to 16 nodes on cluster each of which has a 256 GB RAM. So the combined RAM should be sufficient for me to solve the problem. I was suggested that distributing the memory onto different nodes might help resolve this error. I am not sure how one would do that.

To illustrate my problem more clearly, I have attached here a baby code in which I allocate the required memory for an array and a do loop in which I access each element of that array. As I mentioned, I do not get any compiling errors but I would end up with segmentation fault run time error.

program test
        implicit none
        integer :: AllocateStatus
        integer :: n = 256, p
        real(8), dimension(:,:), allocatable :: Cgm
        real(8), dimension(:), allocatable :: Cg
        real(16) :: ns
        ns = (2.0_16*n+1.0_16)**2.0_16
        print *, int(ns**2,8)
        allocate (Cg(int(ns**2)), STAT = AllocateStatus)
        if (AllocateStatus/=0) then
                print *, "*** Not enough memory ***"
                stop
        end if

        Cg = 0.0_8
        do p = 1,int(ns**2)
                !print *,p
                Cg(p) = 1.0_8
        end do

        print *, Cg(p)
end program test

Thanks

 

0 Kudos

Accepted Solutions
Highlighted
156 Views

Note, using your 16 x 256GB cluster would have sufficient memory, but you cannot treat it as a single allocation by one system (rank) in an MPI process. Example

Each rank (one of 16) could manage 16,448 x 263,169 elements of the array (rectangular, eg rows or cols)

Or

Each rank could manage 65,792 x 65,792 square tiles of the array.

Or

some other partitioning that is suitable for your programming needs.

It has not been stated at to what your (model) problem needs to compute, if it is relatively simple, you might find COARRAYS the better choice.

Jim Dempsey

View solution in original post

0 Kudos
13 Replies
Highlighted
271 Views

Yaswanth, Thank you for posting in the Intel® Communities Support.


Just to let you know, I will move your thread to the proper department which is "Intel® Fortran Compiler" for them to further assist you with this topic.


Regards,

Albert R.


Intel Customer Support Technician

A Contingent Worker at Intel


0 Kudos
Highlighted
Novice
261 Views

Thank you, Alberto.

0 Kudos
Highlighted
227 Views

int(ns**2) returns an integer(4) value.

Your allocate statement would require int(n**2,8).

Also, there used to be an issue (may have been fixed), where any allocation that would exceed 2GB would work provided that the type of the allocation size variable(s) were expressed as integer(8). IOW their may have (at one time) been two different types of allocation descriptors, one for using 32-bit indexing, the other for 64-bit indexing (while the base may have been 64-bits, the limits of each dimension may have been 32 or 64 bits based on the descriptor type).

It would be a simple enough experiment for you to perform.

Also note:

integer :: p ! p is 32-bit integer(4)
...
do p=1,int(ns**2) ! iteration count is truncated as well

Jim Dempsey

Jim Dempsey

0 Kudos
Highlighted
Black Belt Retired Employee
219 Views

I assume this is on Linux. Linux does something called "lazy allocation" where, on allocate, the virtual address space is created but isn't actually added to your "working set" (my term, coming from VMS) until you touch it. (Some discussion here.)

The amount of RAM available is not relevant to whether or not the allocation succeeds or even the amount of virtual memory available - it affects mainly performance.

The compiler can't diagnose this, and if the ALLOCATE's call to the operating system's allocator succeeds, you won't get a failure from the ALLOCATE. But you can indeed get a segfault if you touch "too much" memory.

Jim's comment on the DO loop upper bound is valid, though I don't expect this would be the proximate cause of the segfault.

0 Kudos
Highlighted
Novice
203 Views

Thank you, Jim.

The allocate command showed memory error when I wrote 'int(n**2,8)' instead of int(n**2). That was a mistake. Does this mean that it is not possible to create an array of size 263169x263169?

My aim is to solve a linear system of this size, in which the matrix is full. Could you please suggest some possible workarounds if it is indeed not possible to create such a big array. I tried to rewrite my code so that I do not create the entire array and instead access each element whenever required within the code. But this takes forever to just go over one iteration loop which is involved in solving the linear system.

0 Kudos
Highlighted
Novice
201 Views

Thank you, Steve.

Could you also let me know if there are better ways to solve linear systems of large size in FORTRAN if it is not possible to create such an array in the first place.

0 Kudos
Highlighted
Black Belt Retired Employee
192 Views

If the allocate failed, then your system is not allowing you to create an array that big.

I can't help with the linear systems question, but I know there are others here who may be able to.

0 Kudos
Highlighted
Valued Contributor II
176 Views

I cannot claim to be an expert on solving linear systems of this size with a full matrix - or avoiding the memory issues that you describe (at least not without feeling thoroughly ashamed ;)). However, you say that you work on a cluster, so it should be possible to split up the NxN matrix into MxN or NxM matrices, where M is a suitable fraction of N and then apply partial Gaussian elimination or some more suitable and less basic iterative procedure on these parts. Communication between the programs that handle the parts could be done via coarrays or via MPI.

Of course whether that leads to an acceptable performance is another matter.

0 Kudos
Highlighted
158 Views

263169x263169 =
69,257,922,561 elements x 8 bytes/element =
554,063,380,488 number of bytes require (for this one allocation)

This is much larger than the capacity of the system you listed in your first post.

Furthermore, should you manipulate this array, using implicit looping array expressions, the statement may be encoded by the compiler to use a temporary (on stack!!!) of same size. IOW .gt. 1TB of RAM assuming you have configured for that much stack.

Additionally, most system default settings require a page file to exceed that of the sum of the processes running on the system.

While you could handle this using cluster style programming, or COARRAYS, your problem may potentially be solvable, old school style, via file I/O processing, or via partitioning the data and processing partition(s) by partition(s).

Jim Dempsey

0 Kudos
Highlighted
157 Views

Note, using your 16 x 256GB cluster would have sufficient memory, but you cannot treat it as a single allocation by one system (rank) in an MPI process. Example

Each rank (one of 16) could manage 16,448 x 263,169 elements of the array (rectangular, eg rows or cols)

Or

Each rank could manage 65,792 x 65,792 square tiles of the array.

Or

some other partitioning that is suitable for your programming needs.

It has not been stated at to what your (model) problem needs to compute, if it is relatively simple, you might find COARRAYS the better choice.

Jim Dempsey

View solution in original post

0 Kudos
Highlighted
New Contributor I
144 Views

It is simply not feasible to solve a dense linear system of this size in useful time, irrespective of memory constraints or programming language. An efficient solver takes about one minute to solve a nxn system with n=10000 on my system. The complexity of solving linear systems is theoretically at least of the order of n^2, and in practice it scales with n^3. Even with infinite memory it would hence take ~17,500 minutes or ~12 days to solve your system. And this holds only without any additional overhead from the huge memory requirement.

You will have to change the problem or make other approximations.

 

Highlighted
Novice
135 Views

Thank you very much, Jim. I will try to modify my problem or make approximations to reduce the complexity. In either case, I will try to look at COARRAYS and partitioning the memory required.

0 Kudos
Highlighted
New Contributor II
107 Views

Allocating an array of this size would possibly fail if it exceeds the virtual memory address size, but may not. In Windows, it could fail if it exceeds the paging and memory capacity. Typically allocations like these do not fail until the matrix coefficients are being defined.

Irrespective, considering a rank 2 matrix of size 263,169 as a dense matrix is probably not a practical approach.

You would need to solve this by utilising the sparse properties of the matrix. These sparse properties are usually associated with the field of research.

I work in the field of Finite Element Analysis, where the banded or profile of non-zero matrix values is utilised to significantly reduce the computation time by both eliminating storage of zero values and eliminating calculations using zero values to produce a solution in an acceptable time. My recent problems have extended to N=450,000, but require sparsity properties to be practically solved. Equation re-ordering is also a significant issue.

MKL does provide sparse matrix solvers, however the practicality of defining a large set of equations that is not singular could be a challenge.

You need to research how these types of large problems are practically managed to obtain a solution in a reasonable time using available computing resources.

Assembling a large set of linear equations usually implies a direct solver approach. Rather than using a direct solver, itterative solvers can be used, where the full matrix is never assembled. The most appropriate solution approach for your field of research probably has a long history of development.

0 Kudos