In the following simple code I get SIGSEGV on the "w=u" line. It does not happen for small arrays, but starts with arrays of 2**20 elements and larger.
----------------
program test_eq
  implicit none
  integer, parameter :: m=1024*1024, n=100
  integer :: i, j
  double precision, dimension(:,:), pointer :: u, w, mat
  allocate(u(m,n), w(m,n), mat(n,n))
  write(*,fmt='(a)') 'fill in U'
  forall(i=1:m, j=1:n) u(i,j) = 1.d0/(dlog(dble(i))+j)
  write(*,fmt='(a)') 'fill in U done'
  w = u
  write(*,*) 'w=u ok'
  mat = matmul(transpose(u), w)
  write(*,*) mat(1,1)
end program
-----------------
$ ifort --version
ifort (IFORT) 11.1 20090827
$ uname -a
Linux 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:58:03 UTC 2009 x86_64 GNU/Linux
1 Solution
Quoting - drraug
In the following simple code I get SIGSEGV on the "w=u" line. It does not happen for small arrays, but starts with arrays of 2**20 elements and larger.
Try ulimit -s unlimited; that should help.
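For reference, a minimal sketch of applying the suggested fix in the shell that launches the program (the default soft limit varies by distribution, 8192 KB being common):

```shell
# Remove the shell's stack-size limit, then verify the new setting.
# ifort places large array temporaries (such as the one created for
# the w=u copy) on the stack, so a 1024*1024 x 100 double precision
# temporary easily exceeds the default soft limit.
ulimit -s unlimited
ulimit -s   # should now report: unlimited
```

The change only affects the current shell and processes launched from it, so the program must be run from the same shell session.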
9 Replies
Quoting - Ronald W. Green (Intel)
And take a look at -heap-arrays compiler option, as long as you are not also using -openmp (don't use these 2 together).
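As a sketch of how that option is used (filename illustrative; the optional argument is a size threshold in kilobytes):

```shell
# Allocate all compiler-generated temporary arrays on the heap
# instead of the stack:
ifort -heap-arrays test_eq.f90 -o test_eq

# Or move only temporaries larger than 10 KB to the heap, keeping
# small, cheap temporaries on the stack:
ifort -heap-arrays 10 test_eq.f90 -o test_eq
```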
Quoting - drraug
Thank you, Ronald, but I really do need OpenMP; that is the problem.
I agree it is desirable to be able to use transpose() and matmul() in the form shown, and that the number of automatic temporaries should be minimized. For example, the compiler should recognize that the assigned array mat() is available to assemble the result, rather than building it in a temporary and copying it, if that is still how it is done. I believe certain cases of transpose() as an argument to matmul() are recognized and optimized without another temporary. Given the demonstration that even the simple w=u seemed to create a temporary, it seems unwise to count on it. If an OpenMP-parallel equivalent of matmul is required, or one large enough to risk stack overflow, the MKL library is superior.
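As a sketch of the BLAS route suggested above, the mat=matmul(transpose(u),w) line of the test program can be replaced by a single DGEMM call (link against MKL or any BLAS; the all-ones fill here is only to make the result checkable):

```fortran
program matmul_blas
  implicit none
  integer, parameter :: m = 1024*1024, n = 100
  double precision, allocatable :: u(:,:), w(:,:), mat(:,:)
  allocate(u(m,n), w(m,n), mat(n,n))
  u = 1.d0
  w = u
  ! mat := transpose(u) * w, computed by DGEMM as
  !   C := alpha*op(A)*op(B) + beta*C
  ! The 'T' flag makes DGEMM read u as transposed in place, so no
  ! transposed copy and no large temporary are created, and MKL's
  ! DGEMM is threaded for sizes like these.
  call dgemm('T', 'N', n, n, m, 1.d0, u, m, w, m, 0.d0, mat, n)
  write(*,*) mat(1,1)   ! = m for this all-ones fill
end program
```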
Dear tim18!
Thank you very much for your comment. It seems you are the right person to address my further questions to, and perhaps this can even be done in this thread, since most of them relate to your comment.
(1) Does Intel plan to support OpenMP parallelization of FORALL, MAXLOC, and MATMUL in the near future? For now I see no parallelization and no speedup for these parts of my code.
(2) What alternatives could we use to parallelize these operations on shared memory?
(2a) For MATMUL the obvious alternative is the BLAS functions implemented in MKL. (By the way, does the current version of MKL use parts of the GotoBLAS code?) So there is actually no problem with that.
(2b) A rather strange alternative to FORALL could be an OMP DO / END DO block, which seems to be supported in the current version of Intel Fortran. But is it really a good idea?
(2c) It is still unclear to me how we can speed up operations like MAXLOC(abs(A)). MAXLOC gives no speedup on a multi-core system when used with OMP WORKSHARE directives. The BLAS alternative, the IDAMAX function, also does not seem to use many threads/cores. Also, DGER or a similar FORALL construct for a rank-one matrix update is still not parallelized by Intel+OMP. Can you advise something here?
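Regarding (2b), a minimal sketch of the fill-in loop from the original test program rewritten as an ordinary DO nest under OMP PARALLEL DO, which is the conventional, well-supported way to thread such an initialization (declarations of u, i, j, m, n as in the test program):

```fortran
! Fill u(i,j) = 1/(log(i)+j) with an OpenMP-parallel DO nest
! instead of FORALL. Each column j is independent of the others,
! so the outer loop parallelizes cleanly and the inner loop
! remains contiguous and vectorizable.
!$omp parallel do private(i)
do j = 1, n
   do i = 1, m
      u(i,j) = 1.d0 / (log(dble(i)) + j)
   end do
end do
!$omp end parallel do
```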
Quoting - tim18
In case it may be relevant, the forall in the example is likely to create a temporary array and be less efficient than f77 style. The current compilers attempt to vectorize a single extent single assignment with forall, if preceded by IVDEP directive. Currently, ifort has no OpenMP parallelization of forall().
....
If an OpenMP parallel equivalent of matmul is required, or one of a large enough size to incur danger of stack overflow, MKL library is superior.
This seems to have drifted far from the original question.
OpenMP 2.5 support seems not to be a high priority, and FORALL and MAXLOC aren't ideally suited for parallel.
In my examples at http://sites.google.com/site/tprincesite/levine-callahan-dongarra-vectors you will see some of the OpenMP alternatives, including the rank 2 maxloc implemented with omp critical.
f2008 DOACROSS has been advocated as superior to FORALL, but some initial proposals include translating it to FORALL, so nothing is gained, particularly as it is not often considered an important innovation.
The case where I see FORALL as superior syntax to DO..ENDDO is where a MASK is in use, but it still depends on the compiler to recognize where multiple assignments may be fused into a single loop, a situation where the intent is clear with DO...ENDDO. Have you studied the details of what FORALL requires, and do you not find them somewhat strange?
The most often advocated solution for matmul on matrices large enough to benefit from OpenMP, substituting a BLAS call (e.g. from the MKL library) behind the scenes, has not achieved much favor. In several of the situations where gfortran, for example, can do this, the problem of extra temporary arrays has not been solved.
Vectorizable rank-one operations generally have to be quite large (several thousand elements) to benefit from threading. Some of the MKL BLAS rank-one operations may now detect cases large enough for threading.
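A sketch of the rank-2 MAXLOC(abs(A)) pattern threaded by hand, roughly in the spirit of the omp critical approach in the linked examples (array a(m,n) and all names here are illustrative):

```fortran
! Each thread scans its share of columns for the largest |a(i,j)|,
! then merges its local best into the global result inside a
! critical section. The critical section runs once per thread,
! so its cost is negligible for large m*n.
vmax = -1.d0
!$omp parallel private(i, j, lmax, li, lj)
lmax = -1.d0
!$omp do
do j = 1, n
   do i = 1, m
      if (abs(a(i,j)) > lmax) then
         lmax = abs(a(i,j)); li = i; lj = j
      end if
   end do
end do
!$omp end do
!$omp critical
if (lmax > vmax) then
   vmax = lmax; loc(1) = li; loc(2) = lj
end if
!$omp end critical
!$omp end parallel
```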
Dear tim18!
Thank you very much for your kind reply!
The f2008 version of DOACROSS is spelled DO CONCURRENT. In ifort it generally performs better than FORALL, but not as well as F77-style DO, for multiple-assignment constructs where all are applicable.
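For illustration, the fill-in FORALL from the original test program written as DO CONCURRENT (f2008 syntax, accepted by later ifort releases; declarations as in the test program):

```fortran
! f2008 replacement for
!   forall(i=1:m, j=1:n) u(i,j) = 1.d0/(dlog(dble(i))+j)
! The construct asserts that iterations are independent, giving the
! compiler the same freedom as FORALL with simpler semantics.
do concurrent (i = 1:m, j = 1:n)
   u(i,j) = 1.d0 / (log(dble(i)) + j)
end do
```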