Intel® Fortran Compiler

ifx: OMP: Error #13: Assertion failure at kmp_taskdeps.h

grisuthedragon

I tried to parallelize a modified version of a triangular matrix-matrix multiply which allows 2x2 blocks on the diagonal, i.e. a quasi-triangular matrix. The code is parallelized using OpenMP tasks and dependencies. Assume we want to compute

 

  C <= BETA * C + ALPHA * op(A) * B

 

where A is a quasi upper triangular matrix. Then the core of my implementation looks like:

!$omp parallel
!$omp master
DO L=1, M, NB
   LH = MIN(M, L + NB  - 1)
   LB = LH - L + 1

   DO K = 1, N, NB
      KB = MIN ( NB, N - K + 1)
      KH = K + KB - 1
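      ! Task 1: initialize or scale the block C(L:LH, K:KH) with BETA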
      !$omp task firstprivate(L, LH, K, KH) depend(inout:C(L,K))
      IF ( BETA .EQ. ZERO ) THEN
         C(L:LH, K:KH) = ZERO
      ELSE
         C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
      END IF
      !$omp end task

      DO J = L, M, NB
         JB = MIN(NB, M - J + 1)
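         ! Task 2 (one per J): multiply-accumulate into the same block of C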
         !$omp task firstprivate(LB, KB, L, K, KH, J, JB) depend(inout:C(L,K))
         CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
         !$omp end task
      END DO

      IF ( L.GT.1 ) THEN
         IF (A(L,L-1) .NE. ZERO ) THEN
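            ! Task 3: correction for a 2x2 diagonal block (nonzero subdiagonal of A)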
            !$omp task firstprivate(L,K,KH) depend(inout:C(L,K))
            C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
            !$omp end task
         END IF
      END IF
   END DO
END DO
!$omp end master
!$omp taskwait
!$omp end parallel

Executing the code several times with NB=64 and M = N = 1000, I get the following error message:

OMP: Error #13: Assertion failure at kmp_taskdeps.h(57).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source             
libpthread-2.17.s  00002AAC5FB22630  Unknown               Unknown  Unknown
libc-2.17.so       00002AAC60655387  gsignal               Unknown  Unknown
libc-2.17.so       00002AAC60656A78  abort                 Unknown  Unknown
libiomp5.so        00002AAC5F7A676C  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F78D59B  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F75DEBA  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F7D0E7B  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F7D3F0B  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F7E048F  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F74844A  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F74B146  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F753CCB  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F79AD54  Unknown               Unknown  Unknown
libiomp5.so        00002AAC5F82DC81  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AAC5FB1AEA5  Unknown               Unknown  Unknown
libc-2.17.so       00002AAC6071D9FD  clone                 Unknown  Unknown
Aborted (core dumped)

Looking at the core dump, one gets the following backtrace:

#0  0x00002b90a1ee4387 in raise () from /lib64/libc.so.6
#1  0x00002b90a1ee5a78 in abort () from /lib64/libc.so.6
#2  0x0000000000404b9b in for.signal_handler ()
#3  <signal handler called>
#4  0x00002b90a1ee4387 in raise () from /lib64/libc.so.6
#5  0x00002b90a1ee5a78 in abort () from /lib64/libc.so.6
#6  0x00002b90a103576c in __kmp_abort_process () at ../../src/kmp_runtime.cpp:493
#7  0x00002b90a101c59b in __kmp_fatal (message=...) at ../../src/kmp_i18n.cpp:894
#8  0x00002b90a0feceba in __kmp_debug_assert (msg=0x2423 <error: Cannot access memory at address 0x2423>, file=0x2427 <error: Cannot access memory at address 0x2427>, line=6)
    at ../../src/kmp_debug.cpp:100
#9  0x00002b90a105fe7b in _INTERNAL7434b414::__kmp_node_deref (thread=<optimized out>, node=0x2b90b82b5080) at ../../src/kmp_taskdeps.h:57
#10 _INTERNAL7434b414::__kmp_dephash_free_entries (thread=<optimized out>, h=0x2b90b82b5104) at ../../src/kmp_taskdeps.h:91
#11 _INTERNAL7434b414::__kmp_free_task_and_ancestors (gtid=<optimized out>, taskdata=0x2b90980d21c0, thread=<optimized out>) at ../../src/kmp_tasking.cpp:980
#12 _INTERNAL7434b414::__kmp_task_finish<false> (gtid=9251, task=0x2427, resumed_task=0x6) at ../../src/kmp_tasking.cpp:1194
#13 0x00002b90a1062f0b in _INTERNAL7434b414::__kmp_invoke_task (gtid=9251, task=0x2427, current_task=0x6) at ../../src/kmp_tasking.cpp:1886
#14 0x00002b90a106f48f in _INTERNAL7434b414::__kmp_execute_tasks_template<kmp_flag_64<false, true> > (thread=<optimized out>, gtid=<optimized out>, flag=<optimized out>, final_spin=<optimized out>, 
    thread_finished=<optimized out>, itt_sync_obj=0x0, is_constrained=<optimized out>) at ../../src/kmp_tasking.cpp:3362
#15 __kmp_execute_tasks_64<false, true> (thread=0x2423, gtid=9255, flag=0x6, final_spin=-1, thread_finished=0x0, itt_sync_obj=0x2b90a1efb2cd <vfprintf+19661>, is_constrained=0)
    at ../../src/kmp_tasking.cpp:3475
#16 0x00002b90a0fd744a in kmp_flag_64<false, true>::execute_tasks (this=<optimized out>, this_thr=<optimized out>, gtid=<optimized out>, final_spin=<optimized out>, thread_finished=<optimized out>, 
    itt_sync_obj=<optimized out>, is_constrained=<optimized out>) at ../../src/kmp_wait_release.h:905
#17 _INTERNAL1311483b::__kmp_wait_template<kmp_flag_64<false, true>, true, false, true> (this_thr=0x2423, flag=0x2427, itt_sync_obj=0x6) at ../../src/kmp_wait_release.h:570
#18 0x00002b90a0fda146 in kmp_flag_64<false, true>::wait (this=<optimized out>, this_thr=<optimized out>, final_spin=<optimized out>, itt_sync_obj=<optimized out>) at ../../src/kmp_wait_release.h:912
#19 _INTERNAL1311483b::__kmp_hyper_barrier_release (bt=9251, this_thr=0x2427, gtid=6, tid=-1, propagate_icvs=0, itt_sync_obj=0x2b90a1efb2cd <vfprintf+19661>) at ../../src/kmp_barrier.cpp:1190
#20 0x00002b90a0fe2ccb in __kmp_fork_barrier (gtid=9251, tid=9255) at ../../src/kmp_barrier.cpp:2539
#21 0x00002b90a1029d54 in __kmp_launch_thread (this_thr=0x2423) at ../../src/kmp_runtime.cpp:6220
#22 0x00002b90a10bcc81 in _INTERNAL7b50d17b::__kmp_launch_worker (thr=0x2423) at ../../src/z_Linux_util.cpp:558
#23 0x00002b90a13a9ea5 in start_thread () from /lib64/libpthread.so.0
#24 0x00002b90a1fac9fd in clone () from /lib64/libc.so.6

 

The crash happens after a few hundred calls to the subroutine. I have observed similar behavior with other OpenMP task-parallel codes: they run once without problems, but when called many times, the program crashes.

 

The whole example code is attached, and I compiled it using

ifx  dla_dtrmm.f90 -o dla_dtrmm  -qopenmp   -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64     -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -O3  -xHOST

I tried it on a dual Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz system, but I observed similar behavior on other systems as well.

 

I tried it with ifx (IFX) 2023.1.0 20230320 from the current BASE/HPC kit 2023.1, but it also happens with ifort and with older versions of the toolkits.

15 Replies
Ron_Green
Moderator

Did you unlimit the stack before running?

 

ulimit -s unlimited

 

or set KMP_STACKSIZE to something very large?
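
For example (the values here are only illustrative):

ulimit -s unlimited          # stack of the main thread
export KMP_STACKSIZE=512m    # stack of each OpenMP worker thread
./dla_dtrmm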

grisuthedragon

I tried both, but neither helps. Since the code runs at least once and then returns to the single-threaded main routine, I do not think this is a problem of the stack size of the OpenMP threads.

jimdempseyatthecove
Honored Contributor III

>>Since the code runs at least once and then returns to the single-threaded main routine, I do not think this is a problem of the stack size of the OpenMP threads.

Good point... but keep in mind that the linker stack-size specification applies to the main thread, not to the OpenMP-created threads. To adjust those, use either the environment variable or the callable function for the stack size. While running once indicates a stack issue is not likely, the difference in stack sizes between the main thread and the team threads does not preclude a stack-size issue.

 

What remains is:

- coding errors: a) an index out of bounds and/or b) code using an invalid (stale) pointer
- heap corruption
- heap fragmentation (though the error does not indicate fragmentation)

******

Looking at the error messages from the kmp runtime, there is a strong indication that something is amiss with the condition variables. This can come either from your code or from something else inside the kmp runtime.

As an experiment, restrict the number of OpenMP threads. IOW OMP_NUM_THREADS=2, OMP_MAX_THREADS=2, OMP_NESTED=F
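
For example, in the shell (a sketch using the standard variables):

export OMP_NUM_THREADS=2
export OMP_NESTED=false
./dla_dtrmm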

If that works, then work up the number of threads.

If it is found that a diminished number of threads works, then this is an indication that insufficient iterations in the 2nd task region permit the 3rd task to run before the 2nd task region for that block (as mentioned in my earlier post).

Jim Dempsey

EDIT: 

My supposition can be tested by inserting a TASKWAIT following the 2nd task region loop.

If this works, and if you desire to keep the overlap, then use a different array element for each task region (sketched below). Note: the following assumes NB > 1.

the first task region can use depend(out:C(L,K))

the second task region can use depend(in:C(L,K),out:C(L+1,K))

the third task region can use depend(in:C(L+1,K))
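
A minimal, untested sketch of that chaining (note that a depend clause takes a single dependence type, so in and out go in separate clauses; C(L+1,K) names a distinct element of the block since NB > 1):

!$omp task firstprivate(L, LH, K, KH) depend(out:C(L,K))
C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
!$omp end task

DO J = L, M, NB
   JB = MIN(NB, M - J + 1)
   !$omp task firstprivate(LB, KB, L, K, J, JB) depend(in:C(L,K)) depend(out:C(L+1,K))
   CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
   !$omp end task
END DO

!$omp task firstprivate(L, K, KH) depend(in:C(L+1,K))
C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
!$omp end task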

 

 

grisuthedragon

I tried it with 2 threads and it crashes the same way. From my point of view, and given the fact that the code works with other OpenMP implementations, the error lies in the kmp runtime.

 


If it is found that a diminished number of threads works, then this is an indication that insufficient iterations in the 2nd task region permit the 3rd task to run before the 2nd task region for that block (as mentioned in my earlier post).

Here I would point to the description of `depend(inout)` in the OpenMP standard again (as mentioned already in the other answer). To keep the sequential order of the different tasks, a task with an inout dependency is only executed once all previously created tasks referencing the same item have finished. Thus, the third task (the 2x2 correction) is the last one executed for a given C(L,K). Furthermore, I can guarantee that at least one DGEMM task from the DO J loop exists.

 

I did another experiment, producing wrong numerical results, but it ensures that the third task isn't the problem: commenting out the whole IF (L.GT.1) block leads to the same error. Thus this cannot be the reason.

jimdempseyatthecove
Honored Contributor III

In addition to Ron's suggestion, there may be a bug (oversight) in your code.

Should the DO J= loop have fewer iterations than available threads, the third task (the 2x2 correction), which has the same dependency as the DGEMM tasks in the DO J= loop, may execute based on the dependency being satisfied by the completion of the first task (the BETA scaling).

IOW, the dependency on the DGEMM tasks could get bypassed should the dependency check of the third task get satisfied first.

Jim Dempsey

 

grisuthedragon

Following the OpenMP specification, that should not be a bug:


For the out and inout task-dependence-types, if the storage location of at least one of the list
items matches the storage location of a list item appearing in a depend clause with an in, out,
inout, mutexinoutset, or inoutset task-dependence-type on a construct from which a
sibling task was previously generated, then the generated task will be a dependent task of that
sibling task.

For me, that reads as: tasks referencing the same memory location with an inout dependency will be executed in the order they are created. Otherwise, none of the GEMM parallelization examples would work with a flexible number of threads.
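
A minimal toy example of that reading (my own illustration, not from the attached code): four tasks with inout on the same scalar must run in creation order, so the result is deterministic.

PROGRAM INOUT_ORDER
    IMPLICIT NONE
    INTEGER :: X, I
    X = 1
!$omp parallel
!$omp master
    DO I = 1, 4
        !$omp task firstprivate(I) depend(inout: X)
        X = X * 10 + I   ! the inout chain orders each task after the previous one
        !$omp end task
    END DO
!$omp end master
!$omp end parallel
    ! all tasks are complete at the barrier ending the parallel region
    PRINT *, X   ! always prints 11234; any other task order would give a different value
END PROGRAM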

 

I tried the same code using gfortran and libgomp for OpenMP, and there it runs indefinitely without crashing.

jimdempseyatthecove
Honored Contributor III

Have you tried:

!$omp parallel
!$omp master
DO L=1, M, NB
   LH = MIN(M, L + NB  - 1)
   LB = LH - L + 1

   DO K = 1, N, NB
      KB = MIN ( NB, N - K + 1)
      KH = K + KB - 1
      !$omp task firstprivate(L, LH, K, KH) depend(OUT:C(L,K))
      IF ( BETA .EQ. ZERO ) THEN
         C(L:LH, K:KH) = ZERO
      ELSE
         C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
      END IF
      !$omp end task

      DO J = L, M, NB
         JB = MIN(NB, M - J + 1)
         !$omp task firstprivate(LB, KB, L, K, KH, J, JB) depend(inout:C(L,K))
         CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
         !$omp end task
      END DO

      IF ( L.GT.1 ) THEN
         IF (A(L,L-1) .NE. ZERO ) THEN
            !$omp task firstprivate(L,K,KH) depend(IN:C(L,K))
            C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
            !$omp end task
         END IF
      END IF
   END DO
END DO
!$omp end master
!$omp taskwait
!$omp end parallel

Note the difference in the depend(...) clauses.

I am wondering whether the runtime fails to re-initialize its dependency data structures from one entry into the parallel region to the next entry into the same parallel region (or possibly for a different parallel region using the same dependency locations as this one).

 

One other question not exposed by the information given: is there another parallel region running concurrently with this one? And if so, does it happen to use the same array locations for its dependencies?

 

Jim Dempsey

grisuthedragon

I changed the depend clauses as you suggested, such that we get a "closed" chain from the beginning to the end of all operations on a block. Unfortunately, it crashes in the same manner, though on average it now takes more iterations until this happens.

 

We use the threaded interface of MKL, but without nested parallelism; thus, running DGEMM from inside an OpenMP region should result in a sequentially running operation. In order to check whether the parallel MKL is involved, I replaced "-lmkl_intel_thread" with "-lmkl_sequential", and it crashes as well.
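
For reference, the corresponding sequential-MKL link line (a sketch based on the original command; only the MKL threading layer changes):

ifx dla_dtrmm.f90 -o dla_dtrmm -qopenmp -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -liomp5 -lpthread -lm -ldl -O3 -xHOST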

 

Then I removed the dependency on MKL altogether by using the plain reference DGEMM implementation from Netlib LAPACK, compiled together with my code. Besides being much slower, the code crashes as well.

 

Finally, I got rid of all external codes by changing the DGEMM call to

C(L:LH, K:KH) = 1.5D0*C(L:LH, K:KH)

That is incorrect with respect to the results, but it imitates a read-write access on C. And, surprise, the code crashes again with the same backtrace shown above. Going one step further and removing all code from the tasks, i.e. they do nothing, it still crashes.
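
For reference, the fully stripped inner loop then looks essentially like this (a sketch; the tasks keep only their depend clauses and an empty CONTINUE body):

DO K = 1, N, NB
   KB = MIN(NB, N - K + 1)
   KH = K + KB - 1
   !$omp task firstprivate(L, K) depend(out:C(L,K))
   CONTINUE
   !$omp end task

   DO J = L, M, NB
      !$omp task firstprivate(L, K, J) depend(inout:C(L,K))
      CONTINUE
      !$omp end task
   END DO
END DO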

jimdempseyatthecove
Honored Contributor III

This certainly looks like an internal issue with depend(...).

If you can submit a reproducer here on the forum, the issue will get addressed.

As temporary workaround #1, remove the depends and add a TASKWAIT after the loop.

Workaround #2

!$omp parallel do collapse(2) private(L, K, J, LH, LB, KB, KH, JB) schedule(dynamic, 1)
DO L=1, M, NB
   DO K = 1, N, NB
      LH = MIN(M, L + NB  - 1)
      LB = LH - L + 1

      KB = MIN ( NB, N - K + 1)
      KH = K + KB - 1
      IF ( BETA .EQ. ZERO ) THEN
         C(L:LH, K:KH) = ZERO
      ELSE
         C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
      END IF

      DO J = L, M, NB
         JB = MIN(NB, M - J + 1)
         CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
      END DO

      IF ( L.GT.1 ) THEN
         IF (A(L,L-1) .NE. ZERO ) THEN
            C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
         END IF
      END IF
   END DO
END DO
!$omp end parallel do

(above is untested code)

If the number of iterations L*K is sufficiently larger than the number of threads, then schedule(dynamic,1) should attain almost the same efficiency as the task model you presented.

Jim Dempsey

grisuthedragon

I gave it a try and it delivers comparable performance. But in general, the problem with the depend clauses needs to be addressed, since with an increasing number of CPU cores, designing algorithms on top of OpenMP tasks with dependencies has become more popular.

jimdempseyatthecove
Honored Contributor III

Additionally, in workaround #2 you can wrap the CALL DGEMM in a task and place a TASKWAIT after the J loop, without use of depends. As to whether this ekes out additional performance, you will have to test.

 

Jim Dempsey

Ron_Green
Moderator

There does seem to be a resource exhaustion issue in this example.

I took Jim's code and removed the red herring that is the DGEMM call, replacing it with the simple

  JB = JB

Pretty harmless, and it removes MKL from the scenario. So I can compile simply as

ifx -O2 -xhost -heap-arrays -qopenmp jd2.f90 -o jd2 -g -traceback

Still crashes, intermittently though. I am running with 16 threads on a 66-core server.

 

Let me understand this case.  You run the timing test 10,000 times.

Each test creates, what, 64,000 tasks?  Did I get that M x NB correct for this example?

VMSTAT does show context switching going absolutely bonkers.  Good stressor for a system, eh?

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
14  0      0 123953136 404112 5912272    0    0     0     0 239399 2859665  4  4 93  0  0
 7  0      0 123952888 404112 5912272    0    0     0     0 234665 2845134  4  4 93  0  0
11  0      0 123953648 404112 5912272    0    0     0     0 236803 2835244  4  4 93  0  0
15  0      0 123953648 404112 5912272    0    0     0     0 235972 2850306  4  4 92  0  0
10  0      0 123952896 404112 5912272    0    0     0     0 234449 2847320  4  4 93  0  0
 9  0      0 123953088 404112 5912272    0    0     0   108 233350 2846280  4  4 92  0  0
 4  0      0 123953344 404112 5912272    0    0     0     4 230337 2814868  4  4 92  0  0
 8  0      0 123953344 404112 5912272    0    0     0    24 229254 2814594  4  4 93  0  0
10  0      0 123953344 404112 5912272    0    0     0    16 229803 2812707  4  4 93  0  0
12  0      0 123953592 404112 5912272    0    0     0     0 231430 2822067  3  4 93  0  0
11  0      0 123953592 404112 5912272    0    0     0     0 234669 2844372  4  4 93  0  0
10  0      0 123954096 404112 5912272    0    0     0     0 233591 2853878  4  4 93  0  0
 9  0      0 123953648 404112 5912272    0    0     0     0 232549 2850618  4  4 92  0  0
14  0      0 123953400 404112 5912272    0    0     0     0 236108 2824616  4  4 92  0  0

I notice GOMP seems to throttle back and keeps the context switches much lower. Maybe some OpenMP runtime tuning is needed to keep the tasks from thrashing the OS so badly. Here is VMSTAT with gfortran/gomp:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
17  0      0 123951208 404112 5914376    0    0     0     0 6421 19157 14  0 86  0  0
16  0      0 123951208 404112 5914376    0    0     0     0 6391 19377 14  0 86  0  0
16  0      0 123951208 404112 5914376    0    0     0     0 6485 19470 14  0 86  0  0
16  0      0 123951208 404112 5914376    0    0     0     0 6373 19173 14  0 86  0  0
16  0      0 123951208 404112 5914376    0    0     0     0 6389 19223 14  0 86  0  0

 

The testcase:


PROGRAM MAIN
    IMPLICIT NONE

    INTEGER, PARAMETER :: M = 1000

    DOUBLE PRECISION, ALLOCATABLE :: A(:,:), B(:,:), C(:,:)

    INTEGER :: K, L, RUN
    DOUBLE PRECISION wtime
    EXTERNAL wtime

    DOUBLE PRECISION :: TS, TE2
    INTEGER, PARAMETER :: RUNS = 10000


    ALLOCATE(A(M,M), B(M,M), C(M,M))

    CALL RANDOM_NUMBER(A)
    CALL RANDOM_NUMBER(B)
    CALL RANDOM_NUMBER(C)

    ! Make A quasi upper triangular
    DO K = 1, M-1
        A(K+1:M,K) = 0.0D0
    END DO

    DO K = 1, M-1, 64
        A(K+1, K) = 1.0D0
    END DO


    WRITE (*,*) "Old Code:"
    TS = wtime()

    DO RUN = 1, RUNS
        CALL DLA_DTRMM("N", M, M, 1.0D0, A, M, B, M, 0.0D0, C, M)
    END DO
    TE2 = (wtime() - TS) / DBLE(RUNS)
    WRITE(*,*) "Time per Run:", TE2



    DEALLOCATE(A, B, C)
END PROGRAM MAIN


!
! Returns a wall-time stamp like tic/toc in MATLAB
!
!

function wtime ( )
  implicit none

  integer, parameter :: rk = kind ( 1.0D+00 )

  integer clock_max
  integer clock_rate
  integer clock_reading
  real ( kind = rk ) wtime

  call system_clock ( clock_reading, clock_rate, clock_max )

  wtime = real ( clock_reading, kind = rk ) &
        / real ( clock_rate, kind = rk )

  return
end


! The routine computes
!
!    C <=  BETA * C + ALPHA * op(A) * B
!
! where A is a upper (quasi) triangular matrix. The matrices B and C are general
! ones.
!
! The routine uses OpenMP 4 tasks for acceleration.
!
!
SUBROUTINE DLA_DTRMM(TRANS, M, N, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
    IMPLICIT NONE
    INTEGER, INTENT(IN)  ::  M, N, LDA, LDB, LDC
    DOUBLE PRECISION, INTENT(IN) ::  ALPHA, BETA
    DOUBLE PRECISION, INTENT(IN) ::  A(LDA, *), B(LDB, *)
    DOUBLE PRECISION, INTENT(INOUT) ::  C(LDC,*)
    CHARACTER(1), INTENT(IN) ::  TRANS


    DOUBLE PRECISION ZERO, DONE
    INTEGER L, LH, LB, NB, K, KB, KH, J, JB
    PARAMETER(ZERO = 0.0D0, DONE = 1.0D0)

    IF ( M .LT. 1536 .OR. N .LT. 1536) THEN
        NB = 64
    ELSE IF ( M .GT. 4000 .AND. N.GT.4000) THEN
        NB = 256
    ELSE
        NB = 128
    END IF

!$omp parallel
!$omp master
DO L=1, M, NB
   LH = MIN(M, L + NB  - 1)
   LB = LH - L + 1

   DO K = 1, N, NB
      KB = MIN ( NB, N - K + 1)
      KH = K + KB - 1
      !$omp task firstprivate(L, LH, K, KH) depend(OUT:C(L,K))
      IF ( BETA .EQ. ZERO ) THEN
         C(L:LH, K:KH) = ZERO
      ELSE
         C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
      END IF
      !$omp end task

      DO J = L, M, NB
         JB = MIN(NB, M - J + 1)
         !$omp task firstprivate(LB, KB, L, K, KH, J, JB) depend(inout:C(L,K))
         JB = JB
         !$omp end task
      END DO

      IF ( L.GT.1 ) THEN
         IF (A(L,L-1) .NE. ZERO ) THEN
            !$omp task firstprivate(L,K,KH) depend(IN:C(L,K))
            C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
            !$omp end task
         END IF
      END IF
   END DO
END DO
!$omp end master
!$omp taskwait
!$omp end parallel
END SUBROUTINE
grisuthedragon

Let me understand this case.  You run the timing test 10,000 times.

Each test creates, what, 64,000 tasks?  Did I get that M x NB correct for this example?

 


I increased the number of test runs to 10,000 to force the crash. For a production test I would use far fewer. But we call the routine very often in long-running code, so 10,000 calls to this routine can easily happen.

Ron_Green
Moderator

@grisuthedragon @jimdempseyatthecove 

Try this; it works for me:

export KMP_LIBRARY=SERIAL

 

KMP_LIBRARY selects the OpenMP run-time library execution mode. The values for this variable are serial, turnaround, or throughput. Default: throughput.

Throughput mode: The throughput mode allows the program to yield to other running programs and adjust resource usage to produce efficient execution in a dynamic environment. In a multi-user environment where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads yield to other threads while waiting for more parallel work. After completing the execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep. Until more parallel work becomes available, sleeping allows processor and resources to be used for other work by non-OpenMP threaded code that may execute between parallel regions, or by other applications. The amount of time to wait before sleeping is set either by the KMP_BLOCKTIME environment variable or by the kmp_set_blocktime() function. A small blocktime value may offer better overall performance if your application contains non-OpenMP threaded code that executes between parallel regions. A larger blocktime value may be more appropriate if threads are to be reserved solely for OpenMP execution, but may penalize other concurrently-running OpenMP or threaded applications.

Turnaround mode: The turnaround mode is designed to keep active all processors involved in the parallel computation, which minimizes execution time of a single job. In this mode, the worker threads actively wait for more parallel work, without yielding to other threads (although they are still subject to KMP_BLOCKTIME control). In a dedicated (batch or single-user) parallel environment where all processors are exclusively allocated to the program for its entire run, it is most important to effectively use all processors all of the time. Note: avoid over-allocating system resources. This condition can occur if either too many threads have been specified or too few processors are available at run time. If system resources are over-allocated, this mode will cause poor performance; the throughput mode should be used instead if this occurs.

Serial mode: The serial mode forces parallel applications to run as a single thread.

Intel extension to the OpenMP run-time library API:

void kmp_set_library_throughput()
void kmp_set_library_turnaround()
void kmp_set_library_serial()
void kmp_set_library(int)    1 - serial mode / 2 - turnaround mode / 3 - throughput mode
int kmp_get_library()        1 - serial mode / 2 - turnaround mode / 3 - throughput mode
int kmp_get_blocktime(void)  Returns the number of milliseconds that a thread should wait, after completing the execution of a parallel region, before sleeping, as set either by the KMP_BLOCKTIME environment variable or by kmp_set_blocktime().
void kmp_set_blocktime(int msec)  Sets the number of milliseconds that a thread should wait, after completing the execution of a parallel region, before sleeping. This routine affects the block time setting for the calling thread and any OpenMP team threads formed by the calling thread. It does not affect the block time for any other threads.
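
A minimal sketch of the callable route, assuming ifx's omp_lib module exposes the kmp_* extensions listed above:

PROGRAM SET_MODE
    USE omp_lib          ! Intel's module also declares the kmp_* extensions
    IMPLICIT NONE
    CALL kmp_set_library_serial()                  ! same effect as KMP_LIBRARY=serial
    PRINT *, 'library mode:', kmp_get_library()    ! 1 = serial, 2 = turnaround, 3 = throughput
END PROGRAM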

Runtimes seem to be fast as well. Again, 16 threads on my 66-core server:

ifx/ifort

Old Code:

Time per Run:  7.628200000006473E-004

 

 

grisuthedragon

For a test (such that it most likely does not crash), I reduced the number of runs to 100 and compared the performance (with GEMM from MKL):

 

$ KMP_LIBRARY=serial ./dla_dtrmm                                                                                                                                                               
 Old Code:
 Time per Run:  3.785300000017742E-002
$ KMP_LIBRARY=throughput ./dla_dtrmm                                                                                                                                                           
 Old Code:
 Time per Run:  3.755000000237487E-003
$ KMP_LIBRARY=turnaround ./dla_dtrmm                                                                                                                                                           
 Old Code:
 Time per Run:  3.591999999771361E-003

 

So KMP_LIBRARY=serial gives only 10% of the performance, but it does not crash. The other two modes, throughput and turnaround, will crash if I increase the number of iterations.
