Vishnu · ‎11-28-2018

The following code segfaults, and I'm unable to identify why:

PROGRAM segfault_transpose

    IMPLICIT NONE

    INTEGER, PARAMETER :: runs = 2
    INTEGER, PARAMETER :: matrix_size = 1024

    INTEGER :: j

    REAL, DIMENSION(matrix_size, matrix_size) :: alpha

    DO j = 1, runs
        alpha = TRANSPOSE(alpha)
    END DO

END PROGRAM segfault_transpose

My compile line is:

ifort -O3 -xHost -real-size 64 segfault_transpose.f90

The issue occurs for runs >= 2 and matrix_size >= 1024, together with 64-bit reals. Also, it only happens if I assign the result of the transpose back to the matrix itself.

I am using ifort version 18.0.3

The segfault message is as follows:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              000000000040473D  Unknown               Unknown  Unknown
libpthread-2.28.s  00007F77A49893C0  Unknown               Unknown  Unknown
a.out              000000000040380C  Unknown               Unknown  Unknown
a.out              00000000004037DE  Unknown               Unknown  Unknown
libc-2.28.so       00007F77A47D7223  __libc_start_main     Unknown  Unknown
a.out              00000000004036EE  Unknown               Unknown  Unknown

A backtrace from gdb does not help; it reports the fault at line 1 (!?):

#0  0x0000000000403803 in segfault_transpose () at segfault_transpose.f90:1
#1  0x00000000004037de in main ()
#2  0x00007ffff7c56223 in __libc_start_main () from /usr/lib/libc.so.6
#3  0x00000000004036ee in _start ()

valgrind's output includes the following, but I don't know what it means:

==12271== Invalid write of size 8
==12271==    at 0x403803: MAIN__ (segfault_transpose.f90:1)
==12271==  Address 0x1ffe7ff348 is on thread 1's stack

Can someone help me? I'm at a loss as to what exactly is happening.

P.S. gfortran seems to be fine.

Vishnu · ‎11-28-2018

Sorry, I should've searched the forum before posting; looks like I just have to move the automatic arrays to the heap, as described here:

https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/401108

But this leads me to another question: should I always move the arrays to the heap, or should I increase the stack size instead? Or something in between, by passing a size argument to -heap-arrays?

TimP · ‎11-29-2018

If forum search works properly, you should find Steve Lionel's advice about not using the threshold option for -heap-arrays. The threshold may apply only when the allocation size is known at compile time, leaving variable-size allocations on the stack.

The simple solution is to use heap always until you have the opportunity to find out whether shifting back to stack is the best way to solve a performance issue.

Beginners tend to go overboard with changes in stack size, and it may take some effort to find the best value. Linux defaults tend to be more usable than Windows ones.

Vishnu · ‎11-29-2018

Okay; I fixed it by making large arrays ALLOCATABLE, so that they go to the heap.

Steve_Lionel · ‎11-29-2018

But note that TRANSPOSE will likely create a temporary copy, which goes on the stack (unless -heap-arrays is specified).
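
A rough size check suggests why the threshold sits at matrix_size = 1024 with 64-bit reals. The sketch below only computes the size of one full-matrix temporary; REAL(REAL64) stands in for REAL compiled with -real-size 64, and the program name is made up for illustration. The result, 1024 * 1024 * 8 = 8388608 bytes (8 MiB), equals the 8 MiB stack limit that many Linux systems use by default. That default is an assumption here, so check `ulimit -s` on your own machine. It is at least consistent with valgrind reporting the invalid write on thread 1's stack.

```fortran
PROGRAM temp_size_estimate
    USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INT64, REAL64
    IMPLICIT NONE

    INTEGER, PARAMETER :: matrix_size = 1024
    REAL(REAL64)   :: element   ! stands in for REAL under -real-size 64
    INTEGER(INT64) :: bytes

    ! STORAGE_SIZE returns the element size in bits, so divide by 8 for bytes.
    bytes = INT(matrix_size, INT64)**2 * (STORAGE_SIZE(element) / 8)

    PRINT '(A,I0,A,F0.1,A)', 'TRANSPOSE temporary: ', bytes, ' bytes (', &
        REAL(bytes) / 1024.0**2, ' MiB)'
END PROGRAM temp_size_estimate
```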

Vishnu · ‎11-29-2018

@Steve, thanks for telling me that, because I hit a snag once again at a much larger array size, with the segfault this time occurring at the line containing the TRANSPOSE operation. Is there any way to get it to use allocated memory, like with the other arrays, instead of using a flag?

DataScientist · ‎11-29-2018

@Steve I am curious to know why the compiler needs to create a temporary copy? Is there a way to avoid it? I suppose then that would also be slower than performing the transpose by do-loops.

Steve_Lionel · ‎11-30-2018

The problem is that, since you have alpha on both sides of the assignment, the language requires that the TRANSPOSE be completely evaluated before any assignment is done; thus requiring a temp. If you have different variables then the compiler produces a nice, vectorized sequence without a stack (or other) temp. Of course, you now have your own temp...

One thing you can do is use allocatables and MOVE_ALLOC to prevent an extra copy, like so:

PROGRAM segfault_transpose

    IMPLICIT NONE

    INTEGER, PARAMETER :: runs = 2
    INTEGER, PARAMETER :: matrix_size = 1024

    INTEGER :: j
    
    REAL, ALLOCATABLE, DIMENSION(:,:) :: alpha, beta
    
    ALLOCATE (alpha(matrix_size,matrix_size), beta(matrix_size,matrix_size))
    
    call random_number(alpha)

    DO j = 1, runs
        beta = TRANSPOSE(alpha)
    END DO
    
    CALL MOVE_ALLOC (FROM=beta,TO=ALPHA) 
    ! deallocates alpha, moves allocation from beta to alpha
    ! marks beta as deallocated
    
    PRINT *, alpha(1:10,1:2)
END PROGRAM segfault_transpose

(I added code to prevent the compiler from optimizing the whole thing away.)

When I was first playing with this, I thought that the compiler was optimizing away the deallocation of alpha in the MOVE_ALLOC. What I hadn't noticed at first is that it moved that code out of the main code path and jumped to it only if needed, then jumped back, thus improving instruction cache behavior (if alpha didn't need deallocating).
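
For a square matrix, the do-loop route raised earlier in the thread can avoid the full-size temporary altogether by swapping elements across the diagonal. The following is a minimal sketch, not something taken from the posts above: the program name, the small size n, and the check array are illustrative assumptions, and check exists only to verify the result against TRANSPOSE. The strided access pattern may well be slower than the vectorized two-array version for large matrices, so it is worth timing before adopting it.

```fortran
PROGRAM inplace_transpose
    IMPLICIT NONE

    INTEGER, PARAMETER :: n = 4   ! small size, for illustration only
    INTEGER :: i, j
    REAL    :: tmp
    REAL, ALLOCATABLE, DIMENSION(:,:) :: alpha, check

    ALLOCATE (alpha(n,n), check(n,n))
    CALL RANDOM_NUMBER(alpha)

    ! Reference result, used only to verify the loop below.
    check = TRANSPOSE(alpha)

    ! Swap each element above the diagonal with its mirror image below it.
    ! Only a scalar temporary is needed; the diagonal stays in place.
    DO j = 2, n
        DO i = 1, j - 1
            tmp        = alpha(i,j)
            alpha(i,j) = alpha(j,i)
            alpha(j,i) = tmp
        END DO
    END DO

    PRINT *, 'matches TRANSPOSE: ', ALL(alpha == check)
END PROGRAM inplace_transpose
```

This only works when the matrix is square, since a rectangular transpose changes the shape of the array.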

Vishnu · ‎11-30-2018

@Steve, again, thanks for the explanation. But what if, instead of using MOVE_ALLOC(), I just used the matrix 'beta' from there on forward? I can avoid that operation, then, can I not?

Steve_Lionel · ‎11-30-2018

Sure - that works too. I don't know your application and thought you might be concerned with an extra copy sitting in memory.

Vishnu · ‎11-30-2018

I'd like to avoid the extra memory allocation if possible, because RAM is a little valuable, especially when I scale up my problem to large system sizes.

But I don't think I can, because all of my 'work' is inside that DO loop. In each iteration, I get a new `alpha`, and do stuff like TRANSPOSEing it. And if I have to allocate space for a temporary array, I might as well keep it.

Unless... I do the alloc-deallocs inside the loop:

DO j = 1, runs
    ALLOCATE(alpha(matrix_size,matrix_size))
    CALL RANDOM_NUMBER(alpha)
    ALLOCATE(beta(matrix_size,matrix_size))
    beta = TRANSPOSE(alpha)
    DEALLOCATE(alpha)
    ! use beta for stuff
    DEALLOCATE(beta)
END DO

But there is still a small window where both of them are allocated, and that will be the 'memory-limiting' region. So then I should just have both `alpha` and `beta` allocated outside the loop to avoid the overhead from the allocations.

Steve_Lionel · ‎12-01-2018

I agree - allocate them outside the loop.
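
The layout agreed on here could look roughly like the sketch below: both arrays are allocated once, and each iteration refills alpha and transposes into beta. RANDOM_NUMBER and the SUM printout are placeholders for the real per-iteration work, and the program name is made up for illustration.

```fortran
PROGRAM transpose_reuse_buffers
    IMPLICIT NONE

    INTEGER, PARAMETER :: runs = 2
    INTEGER, PARAMETER :: matrix_size = 1024

    INTEGER :: j

    REAL, ALLOCATABLE, DIMENSION(:,:) :: alpha, beta

    ! Allocate both buffers once, outside the loop, so no iteration
    ! pays for an ALLOCATE/DEALLOCATE pair.
    ALLOCATE (alpha(matrix_size,matrix_size), beta(matrix_size,matrix_size))

    DO j = 1, runs
        CALL RANDOM_NUMBER(alpha)    ! placeholder for computing a new alpha
        beta = TRANSPOSE(alpha)      ! distinct source and destination arrays
        PRINT *, j, SUM(beta)        ! placeholder for work that uses beta
    END DO

    DEALLOCATE (alpha, beta)
END PROGRAM transpose_reuse_buffers
```

Peak memory is the same as with allocation inside the loop, since both arrays have to exist during the transpose either way.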

jimdempseyatthecove · ‎12-01-2018

Vishnu,

Your simplified code illustrates that alpha isn't used after transpose. Could you perhaps simply swap the indexing order

alpha(I,J) to alpha(J,I)

Note, the original alpha could be produced with the indexes the other way around too.

Jim Dempsey

Vishnu · ‎12-01-2018

The above is oversimplified. In my actual code, I do use it after, including in a MATMUL, and an SYEVR. I don't access it by index.

FortranFan · ‎12-01-2018

Vishnu wrote:
The above is oversimplified. In my actual code, I do use it after, including in a MATMUL, and an SYEVR. I don't access it by index.

@Vishnu.

Can you show a minimal working example containing only the matrix calculations (TRANSPOSE, MATMUL, (LAPACK?) SYEVR, etc.) from your actual code, one that works up to a certain problem size and then runs into a segmentation fault? Note you can exclude all of your domain-specific (or proprietary) details and just focus on the matrix operations. That can help other readers make suggestions too; otherwise it ends up wasting other readers' time when they make the effort to offer you input only to read that you find it not useful.

Vishnu · ‎12-01-2018

@FortranFan, the problem is solved now. I am using separately allocated memory.

Segfault upon Transpose