Segmentation Fault with OpenMP Tasks in Subroutines (Intel Fortran 2018 Update 1)

Martin_K_7 · 04-23-2018

I ran into the following problem when using the Intel Fortran 2018 Update 1 compiler. I implemented a block algorithm to compute an out-of-place triangular matrix-matrix product C := alpha * A * B + beta * C, where A is an upper triangular matrix. Since the matrix-matrix product has great potential for parallelization, I did this using OpenMP tasks and task dependencies, ending up with the following code:

SUBROUTINE DTRMM3(M,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
    USE OMP_LIB
    IMPLICIT NONE
    DOUBLE PRECISION ALPHA,BETA
    INTEGER LDA,LDB,LDC,M,N
    DOUBLE PRECISION A(LDA,*),B(LDB,*),C(LDC,*)
    EXTERNAL DGEMM, DTRMM
    INTRINSIC MIN
    INTEGER K,KB,L,LB,J,JB
    !     .. Parameters ..
    DOUBLE PRECISION DONE,DZERO
    PARAMETER (DONE=1.0D+0,DZERO=0.0D+0)
    INTEGER NB
    PARAMETER(NB=256)
    !     .. Local Work...
    DOUBLE PRECISION TMP(NB,NB)

    IF (M.EQ.0 .OR. N.EQ.0) RETURN

    IF (ALPHA.EQ.DZERO) THEN
        DO J = 1,N
            !$omp simd safelen(64)
            DO K = 1,M
                C(K,J) = BETA * C(K,J)
            END DO
            !$omp end simd
        END DO
        RETURN
    END IF

    DO L = 1,N,NB
        LB = MIN(NB,N - L + 1)
        DO K = 1,M,NB
            KB = MIN(NB, M - K + 1)
            !$omp task firstprivate(K,KB,L,LB) depend(inout: C(K:K+KB-1,L:L+LB-1)) shared(C,BETA)
            C(K:K+KB-1, L:L+LB-1) = BETA * C(K:K+KB-1,L:L+LB-1)
            !$omp end task
            DO J = K, M, NB
                JB = MIN(NB, M - J + 1)
                !$omp task firstprivate(K,KB,L,LB, J, JB) private(TMP) &
                !$omp& depend(in:A(K:K+KB-1,J:J+JB-1), B(J:J+JB-1,L:L+LB-1)) depend(inout: C(K:K+KB-1,L:L+LB-1)) &
                !$omp& shared(ALPHA,A,B,LDA,LDB,LDC) default(none)
                IF ( K .EQ. J ) THEN
                    TMP(1:KB,1:LB) = B(K:K+KB-1,L:L+LB-1)
                    CALL DTRMM("L","U","N","U", KB, LB, ALPHA, A(K,K), LDA, TMP, NB)
                    C(K:K+KB-1, L:L+LB-1) = C(K:K+KB-1,L:L+LB-1) + TMP(1:KB,1:LB)
                ELSE
                    CALL DGEMM("N", "N", KB, LB, JB, ALPHA, A(K,J), LDA, B(J,L), LDB, DONE, C(K,L),LDC)
                END IF
                !$omp end task
            END DO

        END DO
    END DO
    RETURN
END SUBROUTINE

I call it from a parallel region as follows:

    !$omp parallel
    !$omp master
    CALL DTRMM3(M, N, ALPHA, A, LDA, B, LDB, BETA, C2, LDC)
    !$omp end master
    !$omp taskwait
    !$omp end parallel

The attached file contains the whole example.
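
For reference, the block update that the task loop in DTRMM3 is intended to perform can be written as follows. This is only a sketch; the notation is introduced here for illustration and is not part of the attached file. C_{K,L}, A_{K,J} and B_{J,L} denote the blocks (NB-by-NB, smaller at the borders) addressed by the loop indices K, J and L, and A is treated as unit upper triangular because DTRMM is called with DIAG = "U".

    C_{K,L} \leftarrow \beta \, C_{K,L} + \alpha \sum_{J \ge K} A_{K,J} \, B_{J,L}

The J = K term is the triangular diagonal block, computed with DTRMM on the task-private copy TMP of B_{K,L} and then added to C_{K,L}. The J > K terms are full blocks accumulated with DGEMM. All tasks that update the same C_{K,L}, including the initial scaling by BETA, are ordered by their depend(inout: ...) clauses.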

I compiled the code using:

    ifort -xHost -O3 dtrmm3_test.f90 -qopenmp -mkl -g

Running it on a 16-core Xeon Silver 4110 machine leads to a segmentation fault:

./a.out 
   512   786     0.00000000D+00   0.00000000D+00   0.00000000D+00  T
   512   786     0.00000000D+00   0.10000000D+01   0.00000000D+00  T
   512   786     0.00000000D+00   0.20000000D+01   0.00000000D+00  T
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred

The first three lines show that the ALPHA = 0.0 path works; the crash occurs only when the task-based part of the algorithm is executed.

Using GCC 7.3 and Netlib BLAS, everything works without error.

OS: CentOS 7.4, Intel Fortran 2018 Update 1, MKL 2018 Update 1