I tried to parallelize a modified version of a triangular matrix-matrix multiply that allows 2x2 blocks on the diagonal, i.e. a quasi-triangular matrix. The code is parallelized using OpenMP tasks and dependencies. Assume we want to compute
C <= BETA * C + ALPHA * op(A) * B
where A is a quasi upper triangular matrix. The core of my implementation looks like:
!$omp parallel
!$omp master
DO L = 1, M, NB
   LH = MIN(M, L + NB - 1)
   LB = LH - L + 1
   DO K = 1, N, NB
      KB = MIN ( NB, N - K + 1)
      KH = K + KB - 1
      !$omp task firstprivate(L, LH, K, KH) depend(inout:C(L,K))
      IF ( BETA .EQ. ZERO ) THEN
         C(L:LH, K:KH) = ZERO
      ELSE
         C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
      END IF
      !$omp end task
      DO J = L, M, NB
         JB = MIN(NB, M - J + 1)
         !$omp task firstprivate(LB, KB, L, K, KH, J, JB) depend(inout:C(L,K))
         CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
         !$omp end task
      END DO
      IF ( L.GT.1 ) THEN
         IF (A(L,L-1) .NE. ZERO ) THEN
            !$omp task firstprivate(L,K,KH) depend(inout:C(L,K))
            C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
            !$omp end task
         END IF
      END IF
   END DO
END DO
!$omp end master
!$omp taskwait
!$omp end parallel
Executing the code with NB=64 and M = N = 1000 several times, I get the following error message:
OMP: Error #13: Assertion failure at kmp_taskdeps.h(57).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
libpthread-2.17.s 00002AAC5FB22630 Unknown Unknown Unknown
libc-2.17.so 00002AAC60655387 gsignal Unknown Unknown
libc-2.17.so 00002AAC60656A78 abort Unknown Unknown
libiomp5.so 00002AAC5F7A676C Unknown Unknown Unknown
libiomp5.so 00002AAC5F78D59B Unknown Unknown Unknown
libiomp5.so 00002AAC5F75DEBA Unknown Unknown Unknown
libiomp5.so 00002AAC5F7D0E7B Unknown Unknown Unknown
libiomp5.so 00002AAC5F7D3F0B Unknown Unknown Unknown
libiomp5.so 00002AAC5F7E048F Unknown Unknown Unknown
libiomp5.so 00002AAC5F74844A Unknown Unknown Unknown
libiomp5.so 00002AAC5F74B146 Unknown Unknown Unknown
libiomp5.so 00002AAC5F753CCB Unknown Unknown Unknown
libiomp5.so 00002AAC5F79AD54 Unknown Unknown Unknown
libiomp5.so 00002AAC5F82DC81 Unknown Unknown Unknown
libpthread-2.17.s 00002AAC5FB1AEA5 Unknown Unknown Unknown
libc-2.17.so 00002AAC6071D9FD clone Unknown Unknown
Aborted (core dumped)
Looking at the core dump one gets the following backtrace:
#0 0x00002b90a1ee4387 in raise () from /lib64/libc.so.6
#1 0x00002b90a1ee5a78 in abort () from /lib64/libc.so.6
#2 0x0000000000404b9b in for.signal_handler ()
#3 <signal handler called>
#4 0x00002b90a1ee4387 in raise () from /lib64/libc.so.6
#5 0x00002b90a1ee5a78 in abort () from /lib64/libc.so.6
#6 0x00002b90a103576c in __kmp_abort_process () at ../../src/kmp_runtime.cpp:493
#7 0x00002b90a101c59b in __kmp_fatal (message=...) at ../../src/kmp_i18n.cpp:894
#8 0x00002b90a0feceba in __kmp_debug_assert (msg=0x2423 <error: Cannot access memory at address 0x2423>, file=0x2427 <error: Cannot access memory at address 0x2427>, line=6)
at ../../src/kmp_debug.cpp:100
#9 0x00002b90a105fe7b in _INTERNAL7434b414::__kmp_node_deref (thread=<optimized out>, node=0x2b90b82b5080) at ../../src/kmp_taskdeps.h:57
#10 _INTERNAL7434b414::__kmp_dephash_free_entries (thread=<optimized out>, h=0x2b90b82b5104) at ../../src/kmp_taskdeps.h:91
#11 _INTERNAL7434b414::__kmp_free_task_and_ancestors (gtid=<optimized out>, taskdata=0x2b90980d21c0, thread=<optimized out>) at ../../src/kmp_tasking.cpp:980
#12 _INTERNAL7434b414::__kmp_task_finish<false> (gtid=9251, task=0x2427, resumed_task=0x6) at ../../src/kmp_tasking.cpp:1194
#13 0x00002b90a1062f0b in _INTERNAL7434b414::__kmp_invoke_task (gtid=9251, task=0x2427, current_task=0x6) at ../../src/kmp_tasking.cpp:1886
#14 0x00002b90a106f48f in _INTERNAL7434b414::__kmp_execute_tasks_template<kmp_flag_64<false, true> > (thread=<optimized out>, gtid=<optimized out>, flag=<optimized out>, final_spin=<optimized out>,
thread_finished=<optimized out>, itt_sync_obj=0x0, is_constrained=<optimized out>) at ../../src/kmp_tasking.cpp:3362
#15 __kmp_execute_tasks_64<false, true> (thread=0x2423, gtid=9255, flag=0x6, final_spin=-1, thread_finished=0x0, itt_sync_obj=0x2b90a1efb2cd <vfprintf+19661>, is_constrained=0)
at ../../src/kmp_tasking.cpp:3475
#16 0x00002b90a0fd744a in kmp_flag_64<false, true>::execute_tasks (this=<optimized out>, this_thr=<optimized out>, gtid=<optimized out>, final_spin=<optimized out>, thread_finished=<optimized out>,
itt_sync_obj=<optimized out>, is_constrained=<optimized out>) at ../../src/kmp_wait_release.h:905
#17 _INTERNAL1311483b::__kmp_wait_template<kmp_flag_64<false, true>, true, false, true> (this_thr=0x2423, flag=0x2427, itt_sync_obj=0x6) at ../../src/kmp_wait_release.h:570
#18 0x00002b90a0fda146 in kmp_flag_64<false, true>::wait (this=<optimized out>, this_thr=<optimized out>, final_spin=<optimized out>, itt_sync_obj=<optimized out>) at ../../src/kmp_wait_release.h:912
#19 _INTERNAL1311483b::__kmp_hyper_barrier_release (bt=9251, this_thr=0x2427, gtid=6, tid=-1, propagate_icvs=0, itt_sync_obj=0x2b90a1efb2cd <vfprintf+19661>) at ../../src/kmp_barrier.cpp:1190
#20 0x00002b90a0fe2ccb in __kmp_fork_barrier (gtid=9251, tid=9255) at ../../src/kmp_barrier.cpp:2539
#21 0x00002b90a1029d54 in __kmp_launch_thread (this_thr=0x2423) at ../../src/kmp_runtime.cpp:6220
#22 0x00002b90a10bcc81 in _INTERNAL7b50d17b::__kmp_launch_worker (thr=0x2423) at ../../src/z_Linux_util.cpp:558
#23 0x00002b90a13a9ea5 in start_thread () from /lib64/libpthread.so.0
#24 0x00002b90a1fac9fd in clone () from /lib64/libc.so.6
The crash happens after a few hundred calls to the subroutine. I have seen similar behavior with other OpenMP task-parallel codes: they run once, but when called many times, the program crashes.
The whole example code is attached; I compiled it using
ifx dla_dtrmm.f90 -o dla_dtrmm -qopenmp -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -O3 -xHOST
I tried it on a dual Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz system, but I observed similar behavior on other systems as well.
I tried it with ifx (IFX) 2023.1.0 20230320 from the current BASE/HPC kit 2023.1, but it also happens with ifort and with older versions of the toolkits.
Did you unlimit stack before running?
ulimit -s unlimited
or set KMP_STACKSIZE to something very large?
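For example (the values below are placeholders to show the syntax only, not recommendations):
export OMP_STACKSIZE=512M
export KMP_STACKSIZE=512M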
I tried both, but neither helps. Since the code runs at least once and then returns to the single-threaded main routine, I do not think this is a problem with the stack size of the OpenMP threads.
>>Since the code runs at least once and then returns to the single-threaded main routine, I do not think this is a problem with the stack size of the OpenMP threads.
Good point... but keep in mind that the linker stack size specification applies to the main thread, not to the OpenMP-created threads. To adjust those, use either the environment variable or the callable function for the stack size. While running once successfully suggests a stack issue is unlikely, the difference in stack sizes between the main thread and the team threads does not completely rule it out.
What remains is:
coding errors: a) an index out of bounds and/or b) code using an invalid (stale) pointer,
or heap corruption,
or heap fragmentation (though the error does not indicate fragmentation).
******
In looking at the error messages from the kmp_runtime system, this is a strong indication that something is amiss with the condition variables. This can come either from your code or from something else inside the kmp_runtime system.
As an experiment, restrict the number of OpenMP threads. IOW OMP_NUM_THREADS=2, OMP_MAX_THREADS=2, OMP_NESTED=F
If that works, then work up the number of threads.
If it is found that a diminished number of threads works, then this is an indication that insufficient iterations in the 2nd task region permit the 3rd task to run before the 2nd task region has finished for that block (as mentioned in my earlier post).
Jim Dempsey
EDIT:
My supposition can be tested by inserting a TASKWAIT following the 2nd task region loop.
If this works, and if you desire to keep the overlap, then use a different array element for each loop, as sketched after this list. Note, assuming that NB is > 1:
the first task region can use depend(out:C(L,K))
the second task region can use depend(in:C(L,K),out:C(L+1,K))
the third task region can use depend(in:C(L+1,K))
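A minimal sketch of that scheme applied to the body of the K loop (untested; since a single depend clause carries only one dependence type, the in/out pair is written as two clauses), assuming NB > 1:
!$omp task firstprivate(L, LH, K, KH) depend(out:C(L,K))
   IF ( BETA .EQ. ZERO ) THEN
      C(L:LH, K:KH) = ZERO
   ELSE
      C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
   END IF
!$omp end task
DO J = L, M, NB
   JB = MIN(NB, M - J + 1)
   !$omp task firstprivate(LB, KB, L, K, KH, J, JB) depend(in:C(L,K)) depend(out:C(L+1,K))
      CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
   !$omp end task
END DO
IF ( L .GT. 1 ) THEN
   IF ( A(L,L-1) .NE. ZERO ) THEN
      !$omp task firstprivate(L, K, KH) depend(in:C(L+1,K))
         C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
      !$omp end task
   END IF
END IF
The out dependence on C(L+1,K) keeps the DGEMM tasks ordered among themselves, and the in dependence of the last task on C(L+1,K) makes it wait for the final DGEMM.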
I tried it with 2 threads and it crashes in the same way. From my point of view, and given that the code works with other OpenMP implementations, the error lies in the kmp_runtime system.
>>If it is found that a diminished number of threads works, then this is an indication that insufficient iterations in the 2nd task region permit the 3rd task to run before the 2nd task region has finished for that block (as mentioned in my earlier post).
Here I would point to the description of `depend(inout)` in the OpenMP standard again (as already mentioned in the other answer). To keep the sequential order of the different tasks, a task with an inout dependency is only executed once all previously created tasks referencing the same item have finished. Thus, the task from L27 is the last one executed for a given C(L,K). Furthermore, I can guarantee that at least one task from L20 exists.
I did another experiment, which produces wrong numerical results but ensures that the task from L27 isn't the problem: commenting out lines 25 to 31 leads to the same error. Thus this cannot be the reason.
In addition to Ron's suggestion, there may be a bug (oversight) in your code.
Should the line 18 DO J= loop have fewer iterations than available threads, the line 27 task, which has the same dependency as the line 18 DO J= loop, may execute based on the dependency being satisfied by the completion of the line 10 task.
IOW the dependency on line 20 could get bypassed should the dependency check on line 27 get satisfied first.
Jim Dempsey
Following the OpenMP specification, that should not be a bug:
"For the out and inout task-dependence-types, if the storage location of at least one of the list items matches the storage location of a list item appearing in a depend clause with an in, out, inout, mutexinoutset, or inoutset task-dependence-type on a construct from which a sibling task was previously generated, then the generated task will be a dependent task of that sibling task."
For me, that reads as: tasks referencing the same memory location with an inout dependency will be executed in the order in which they are created. Otherwise, none of the GEMM parallelization examples would work with a flexible number of threads.
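A tiny, self-contained illustration of that reading (my own sketch, not part of the attached code): three tasks carrying an inout dependence on the same scalar must execute in creation order, so the program below has to print 15.
PROGRAM INOUT_ORDER
   IMPLICIT NONE
   DOUBLE PRECISION :: X
   X = 0.0D0
!$omp parallel
!$omp master
   !$omp task depend(inout:X)
   X = X + 1.0D0          ! must run first
   !$omp end task
   !$omp task depend(inout:X)
   X = X * 10.0D0         ! must run second
   !$omp end task
   !$omp task depend(inout:X)
   X = X + 5.0D0          ! must run third
   !$omp end task
!$omp end master
!$omp end parallel
   ! The implicit barrier at the end of the parallel region guarantees task completion.
   WRITE(*,*) 'X =', X    ! 15.0 = ((0 + 1) * 10) + 5
END PROGRAM INOUT_ORDER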
I tried the same code using gfortran and libgomp for OpenMP, and there it runs indefinitely without crashing.
Have you tried:
!$omp parallel
!$omp master
DO L = 1, M, NB
   LH = MIN(M, L + NB - 1)
   LB = LH - L + 1
   DO K = 1, N, NB
      KB = MIN ( NB, N - K + 1)
      KH = K + KB - 1
      !$omp task firstprivate(L, LH, K, KH) depend(OUT:C(L,K))
      IF ( BETA .EQ. ZERO ) THEN
         C(L:LH, K:KH) = ZERO
      ELSE
         C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
      END IF
      !$omp end task
      DO J = L, M, NB
         JB = MIN(NB, M - J + 1)
         !$omp task firstprivate(LB, KB, L, K, KH, J, JB) depend(inout:C(L,K))
         CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
         !$omp end task
      END DO
      IF ( L.GT.1 ) THEN
         IF (A(L,L-1) .NE. ZERO ) THEN
            !$omp task firstprivate(L,K,KH) depend(IN:C(L,K))
            C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
            !$omp end task
         END IF
      END IF
   END DO
END DO
!$omp end master
!$omp taskwait
!$omp end parallel
Note the difference in the depend(...) clauses.
I am wondering whether the parallel region fails to re-initialize/initialize the dependency data structures from one entry into the parallel region to the next entry into the same parallel region (or possibly a different parallel region using the same dependency locations as this one).
One other question, not answered by the information given:
Is there another parallel region running concurrently with this parallel region? And if so, does it happen to use the same array locations for its dependencies?
Jim Dempsey
I changed the depend clauses as you suggested, so that we get a "closed" chain from the beginning to the end of all operations on a block. Unfortunately, it crashes in the same manner, just on average after more iterations.
We use the parallel interface of MKL, but without nested parallelism, so running dgemm from inside an OpenMP region should result in a sequentially running operation. To check whether the parallel MKL is involved, I exchanged "-lmkl_intel_thread" for "-lmkl_sequential", and the crash happens as well.
Then I removed the dependency on MKL by using the plain DGEMM implementation from LAPACK/Netlib and compiling it together with my code. Besides being much slower, the code crashes as well.
Finally, I got rid of all external code by changing the DGEMM call to
C(L:LH, K:KH) = 1.5D0*C(L:LH, K:KH)
That is incorrect with respect to the results, but it imitates some read-write access on C. And, surprise, the code crashes again with the same backtrace shown above. Going even one step further and removing all code from the tasks, so that they do nothing, it still crashes.
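Roughly, that final stripped-down experiment looks like the following (a sketch; CONTINUE stands in for the removed task bodies, and the depend clauses follow the out/inout/in chain discussed above):
!$omp parallel
!$omp master
DO L = 1, M, NB
   DO K = 1, N, NB
      !$omp task depend(out:C(L,K))
      CONTINUE   ! block initialization removed
      !$omp end task
      DO J = L, M, NB
         !$omp task depend(inout:C(L,K))
         CONTINUE   ! DGEMM update removed
         !$omp end task
      END DO
      !$omp task depend(in:C(L,K))
      CONTINUE   ! subdiagonal correction removed
      !$omp end task
   END DO
END DO
!$omp end master
!$omp taskwait
!$omp end parallel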
This certainly looks like an internal issue with depend(...).
If you can submit a reproducer here on the forum, the issue will get addressed.
As a temporary workaround #1, remove the depend clauses and add a taskwait after the loop.
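One way workaround #1 could look (a sketch of one reading of it, untested: fold all work for an (L,K) block into a single task so that no depend clauses are needed, then wait once after the loop nest):
!$omp parallel
!$omp master
DO L = 1, M, NB
   LH = MIN(M, L + NB - 1)
   LB = LH - L + 1
   DO K = 1, N, NB
      KB = MIN(NB, N - K + 1)
      KH = K + KB - 1
      !$omp task firstprivate(L, LH, LB, K, KB, KH) private(J, JB)
         ! All work on block C(L:LH, K:KH) stays inside one task, so no ordering
         ! between tasks is required; different blocks touch disjoint parts of C.
         IF ( BETA .EQ. ZERO ) THEN
            C(L:LH, K:KH) = ZERO
         ELSE
            C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
         END IF
         DO J = L, M, NB
            JB = MIN(NB, M - J + 1)
            CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
         END DO
         IF ( L .GT. 1 ) THEN
            IF ( A(L,L-1) .NE. ZERO ) C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
         END IF
      !$omp end task
   END DO
END DO
!$omp taskwait
!$omp end master
!$omp end parallel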
Workaround #2
!$omp parallel do collapse(2) private(L, K, J, LH, LB, KB, KH, JB) schedule(dynamic, 1)
DO L = 1, M, NB
   DO K = 1, N, NB
      LH = MIN(M, L + NB - 1)
      LB = LH - L + 1
      KB = MIN ( NB, N - K + 1)
      KH = K + KB - 1
      IF ( BETA .EQ. ZERO ) THEN
         C(L:LH, K:KH) = ZERO
      ELSE
         C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
      END IF
      DO J = L, M, NB
         JB = MIN(NB, M - J + 1)
         CALL DGEMM("N", "N", LB, KB, JB, ALPHA, A(L,J), LDA, B(J,K), LDB, DONE, C(L,K), LDC)
      END DO
      IF ( L.GT.1 ) THEN
         IF (A(L,L-1) .NE. ZERO ) THEN
            C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
         END IF
      END IF
   END DO
END DO
!$omp end parallel do
(above is untested code)
If the number of iterations L*K is sufficiently larger than the number of threads, then the schedule(dynamic,1) should attain almost the same efficiency as the task model you presented.
Jim Dempsey
I gave it a try and it delivers comparable performance. But in general, the problem with the depend clauses needs to be addressed, since with an increasing number of CPU cores, designing algorithms on top of OpenMP tasks with dependencies has become more popular.
Additionally, you can put the CALL DGEMM at line 17 into a task and add a TASKWAIT at line 19, without the use of depends. Whether this ekes out additional performance, you will have to test.
Jim Dempsey
There does seem to be a resource exhaustion issue in this example.
I took Jim's code and removed the red herring that is the DGEMM call, replacing it with a simple
JB = JB
which is pretty harmless and removes MKL from the scenario. So I can compile simply as
ifx -O2 -xhost -heap-arrays -qopenmp jd2.f90 -o jd2 -g -traceback
It still crashes, though intermittently. I am running with 16 threads on a 66-core server.
Let me understand this case. You run the timing test 10,000 times.
Each test creates, what, 64,000 tasks? Did I get that M x NB correct for this example?
VMSTAT does show context switching going absolutely bonkers. Good stressor for a system, eh?
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
14 0 0 123953136 404112 5912272 0 0 0 0 239399 2859665 4 4 93 0 0
7 0 0 123952888 404112 5912272 0 0 0 0 234665 2845134 4 4 93 0 0
11 0 0 123953648 404112 5912272 0 0 0 0 236803 2835244 4 4 93 0 0
15 0 0 123953648 404112 5912272 0 0 0 0 235972 2850306 4 4 92 0 0
10 0 0 123952896 404112 5912272 0 0 0 0 234449 2847320 4 4 93 0 0
9 0 0 123953088 404112 5912272 0 0 0 108 233350 2846280 4 4 92 0 0
4 0 0 123953344 404112 5912272 0 0 0 4 230337 2814868 4 4 92 0 0
8 0 0 123953344 404112 5912272 0 0 0 24 229254 2814594 4 4 93 0 0
10 0 0 123953344 404112 5912272 0 0 0 16 229803 2812707 4 4 93 0 0
12 0 0 123953592 404112 5912272 0 0 0 0 231430 2822067 3 4 93 0 0
11 0 0 123953592 404112 5912272 0 0 0 0 234669 2844372 4 4 93 0 0
10 0 0 123954096 404112 5912272 0 0 0 0 233591 2853878 4 4 93 0 0
9 0 0 123953648 404112 5912272 0 0 0 0 232549 2850618 4 4 92 0 0
14 0 0 123953400 404112 5912272 0 0 0 0 236108 2824616 4 4 92 0 0
I notice GOMP seems to throttle back and keeps the context switches much lower. Maybe some OpenMP RT tuning is needed to keep the tasks from thrashing the OS so badly. Here is VMSTAT with gfortran/gomp:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
17 0 0 123951208 404112 5914376 0 0 0 0 6421 19157 14 0 86 0 0
16 0 0 123951208 404112 5914376 0 0 0 0 6391 19377 14 0 86 0 0
16 0 0 123951208 404112 5914376 0 0 0 0 6485 19470 14 0 86 0 0
16 0 0 123951208 404112 5914376 0 0 0 0 6373 19173 14 0 86 0 0
16 0 0 123951208 404112 5914376 0 0 0 0 6389 19223 14 0 86 0 0
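If the context-switch storm itself needs taming, two knobs that might be worth experimenting with are the wait policy and the block time (untested here, and no claim that they affect the assertion failure):
export OMP_WAIT_POLICY=passive
export KMP_BLOCKTIME=0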
The testcase
PROGRAM MAIN
   IMPLICIT NONE
   INTEGER, PARAMETER :: M = 1000
   DOUBLE PRECISION, ALLOCATABLE :: A(:,:), B(:,:), C(:,:)
   INTEGER :: K, L, RUN
   DOUBLE PRECISION wtime
   EXTERNAL wtime
   DOUBLE PRECISION :: TS, TE2
   INTEGER, PARAMETER :: RUNS = 10000
   ALLOCATE(A(M,M), B(M,M), C(M,M))
   CALL RANDOM_NUMBER(A)
   CALL RANDOM_NUMBER(B)
   CALL RANDOM_NUMBER(C)
   ! Make A quasi upper triangular
   DO K = 1, M-1
      A(K+1:M,K) = 0.0D0
   END DO
   DO K = 1, M-1, 64
      A(K+1, K) = 1.0D0
   END DO
   WRITE (*,*) "Old Code:"
   TS = wtime()
   DO RUN = 1, RUNS
      CALL DLA_DTRMM("N", M, M, 1.0D0, A, M, B, M, 0.0D0, C, M)
   END DO
   TE2 = (wtime() - TS) / DBLE(RUNS)
   WRITE(*,*) "Time per Run:", TE2
   DEALLOCATE(A, B, C)
END PROGRAM MAIN
!
! Returns a wall-time stamp like tic/toc in MATLAB
!
!
function wtime ( )
   implicit none
   integer, parameter :: rk = kind ( 1.0D+00 )
   integer clock_max
   integer clock_rate
   integer clock_reading
   real ( kind = rk ) wtime
   call system_clock ( clock_reading, clock_rate, clock_max )
   wtime = real ( clock_reading, kind = rk ) &
           / real ( clock_rate, kind = rk )
   return
end
! The routine computes
!
! C <= BETA * C + ALPHA * op(A) * B
!
! where A is a upper (quasi) triangular matrix. The matrices B and C are general
! ones.
!
! The routine uses OpenMP 4 tasks for acceleration.
!
!
SUBROUTINE DLA_DTRMM(TRANS, M, N, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: M, N, LDA, LDB, LDC
   DOUBLE PRECISION, INTENT(IN) :: ALPHA, BETA
   DOUBLE PRECISION, INTENT(IN) :: A(LDA, *), B(LDB, *)
   DOUBLE PRECISION, INTENT(INOUT) :: C(LDC,*)
   CHARACTER(1), INTENT(IN) :: TRANS
   DOUBLE PRECISION ZERO, DONE
   INTEGER L, LH, LB, NB, K, KB, KH, J, JB
   PARAMETER(ZERO = 0.0D0, DONE = 1.0D0)
   IF ( M .LT. 1536 .OR. N .LT. 1536) THEN
      NB = 64
   ELSE IF ( M .GT. 4000 .AND. N.GT.4000) THEN
      NB = 256
   ELSE
      NB = 128
   END IF
!$omp parallel
!$omp master
   DO L = 1, M, NB
      LH = MIN(M, L + NB - 1)
      LB = LH - L + 1
      DO K = 1, N, NB
         KB = MIN ( NB, N - K + 1)
         KH = K + KB - 1
         !$omp task firstprivate(L, LH, K, KH) depend(OUT:C(L,K))
         IF ( BETA .EQ. ZERO ) THEN
            C(L:LH, K:KH) = ZERO
         ELSE
            C(L:LH, K:KH) = BETA * C(L:LH, K:KH)
         END IF
         !$omp end task
         DO J = L, M, NB
            JB = MIN(NB, M - J + 1)
            !$omp task firstprivate(LB, KB, L, K, KH, J, JB) depend(inout:C(L,K))
            JB = JB
            !$omp end task
         END DO
         IF ( L.GT.1 ) THEN
            IF (A(L,L-1) .NE. ZERO ) THEN
               !$omp task firstprivate(L,K,KH) depend(IN:C(L,K))
               C(L,K:KH) = (ALPHA*A(L,L-1)) * B(L-1, K:KH) + C(L,K:KH)
               !$omp end task
            END IF
         END IF
      END DO
   END DO
!$omp end master
!$omp taskwait
!$omp end parallel
END SUBROUTINE
>>Let me understand this case. You run the timing test 10,000 times. Each test creates, what, 64,000 tasks? Did I get that M x NB correct for this example?
I increased the number of test runs to 10,000 to force the crash. For a production test I would use far fewer. But we call the routine very often in long-running code, so 10,000 calls to this routine can easily happen.
@grisuthedragon @jimdempseyatthecove
Try this, it works for me:
export KMP_LIBRARY=SERIAL
KMP_LIBRARY selects the OpenMP run-time library execution mode. The values for this variable are serial, turnaround, or throughput. Default: throughput.
Throughput mode: The throughput mode allows the program to yield to other running programs and adjust resource usage to produce efficient execution in a dynamic environment. In a multi-user environment where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads yield to other threads while waiting for more parallel work. After completing the execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep. Until more parallel work becomes available, sleeping allows processor and resources to be used for other work by non-OpenMP threaded code that may execute between parallel regions, or by other applications. The amount of time to wait before sleeping is set either by the KMP_BLOCKTIME environment variable or by the kmp_set_blocktime() function. A small blocktime value may offer better overall performance if your application contains non-OpenMP threaded code that executes between parallel regions. A larger blocktime value may be more appropriate if threads are to be reserved solely for use for OpenMP execution, but may penalize other concurrently-running OpenMP or threaded applications.
Turnaround mode: The turnaround mode is designed to keep active all processors involved in the parallel computation, which minimizes execution time of a single job. In this mode, the worker threads actively wait for more parallel work, without yielding to other threads (although they are still subject to KMP_BLOCKTIME control). In a dedicated (batch or single user) parallel environment where all processors are exclusively allocated to the program for its entire run, it is most important to effectively use all processors all of the time. Note: Avoid over-allocating system resources. The condition can occur if either too many threads have been specified, or if too few processors are available at run time. If system resources are over-allocated, this mode will cause poor performance. The throughput mode should be used instead if this occurs.
Serial mode: The serial mode forces parallel applications to run as a single thread.
Intel extension to OpenMP run-time library API:
...
void kmp_set_library_throughput()
void kmp_set_library_turnaround()
void kmp_set_library_serial()
void kmp_set_library(int)   1 - serial mode / 2 - turnaround mode / 3 - throughput mode
int kmp_get_library()   1 - serial mode / 2 - turnaround mode / 3 - throughput mode
...
In the throughput OpenMP Support Libraries, threads wait for new parallel work at the ends of parallel regions, and then sleep, after a specified period of time. This time interval can be set by the KMP_BLOCKTIME environment variable or by the kmp_set_blocktime() function.
...
int kmp_get_blocktime(void)   Returns the number of milliseconds that a thread should wait, after completing the execution of a parallel region, before sleeping, as set either by the KMP_BLOCKTIME environment variable or by kmp_set_blocktime().
void kmp_set_blocktime(int msec)   Sets the number of milliseconds that a thread should wait, after completing the execution of a parallel region, before sleeping. This routine affects the block time setting for the calling thread and any OpenMP team threads formed by the calling thread. The routine does not affect the block time for any other threads.
...
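Based on the API excerpt above, the same mode can presumably also be selected at run time from Fortran, e.g. (a sketch, assuming the kmp_* entry points listed above are callable from Fortran, as with Intel's runtime; compile and link with -qopenmp):
PROGRAM FORCE_SERIAL_TASKS
   IMPLICIT NONE
   INTEGER, EXTERNAL :: kmp_get_library
   EXTERNAL kmp_set_library_serial
   CALL kmp_set_library_serial()                      ! same effect as KMP_LIBRARY=serial
   WRITE(*,*) 'KMP library mode:', kmp_get_library()  ! 1 = serial, 2 = turnaround, 3 = throughput
END PROGRAM FORCE_SERIAL_TASKS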
Runtimes seem to be fast as well: again, 16 threads on my 66-core server.
ifx/ifort
Old Code:
Time per Run: 7.628200000006473E-004
For a test (so that it most likely does not crash) I reduced the number of runs to 100 and compared the performance (with GEMM from MKL):
$ KMP_LIBRARY=serial ./dla_dtrmm
Old Code:
Time per Run: 3.785300000017742E-002
$ KMP_LIBRARY=throughput ./dla_dtrmm
Old Code:
Time per Run: 3.755000000237487E-003
$ KMP_LIBRARY=turnaround ./dla_dtrmm
Old Code:
Time per Run: 3.591999999771361E-003
So using KMP_LIBRARY=serial gives only about 10% of the performance, but it does not crash. The other two cases, throughput and turnaround, will crash if I increase the number of iterations.