Intel® oneAPI Math Kernel Library

Are zcopy calls serialized system-wide?

Ton_Kostelijk
Beginner
While trying to solve a performance issue, I began to suspect that zcopy calls are serialized system-wide.
To check this, I wrote two small test programs, which are included below.

The results were astonishing.

When I ran two instances of test_zcopy concurrently on a Core 2 Duo, the time per process doubled:

-One process-
test_asgn: 8 sec per process.
test_zcopy: 22 sec per process.

-Two concurrent processes-
test_asgn: 10 sec per process (20 sec. in total)
test_zcopy: 44 sec per process (88 sec. in total)

When I repeated the experiment on a quad-core machine with four concurrent test_zcopy processes, the total execution time quadrupled.

Is this normal / desired behaviour or a serious bug?

I am using Intel Fortran Compiler 10.1.29 and MKL 10.0.5.025, and I tried both the sequential and the multi-threaded version of MKL, with no difference in the results. The link lines were:

ifort /QxT %1 /link /LIBPATH:"C:\Program Files\Intel\MKL\10.0.5.025\ia32\lib" mkl_intel_c.lib mkl_sequential.lib mkl_core.lib

ifort /QxT %1 /link /LIBPATH:"C:\Program Files\Intel\MKL\10.0.5.025\ia32\lib" mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libguide40.lib
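
For reference, the timings above are CPU times. The listings below omit the timing code; a minimal way to measure the per-process CPU time with the standard CPU_TIME intrinsic would be something like:

real t0, t1
call cpu_time(t0)
! ... the copy loop under test ...
call cpu_time(t1)
print *, 'CPU time: ', t1-t0, ' seconds'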

=======================================

program test_zcopy

complex*16, dimension(:,:), allocatable :: a1,a2
integer i

allocate(a1(40,40))
allocate(a2(40,40))

! Fill a1 with (1d0,1d0): the source increment of 0 makes zcopy broadcast the scalar.
call zcopy(40*40,(1d0,1d0),0,a1,1)

! Copy the array back and forth 1,000,000 times with zcopy.
do i=1,1000000
call zcopy(40*40,a1,1,a2,1)
call zcopy(40*40,a2,1,a1,1)
end do

deallocate(a1)
deallocate(a2)

end

=======================================

program test_asgn

complex*16, dimension(:,:), allocatable :: a1,a2
integer i,j,k

allocate(a1(40,40))
allocate(a2(40,40))

! Fill a1 with (1d0,1d0).
do j=1,40
do k=1,40
a1(k,j)=(1d0,1d0)
end do
end do

! Copy the arrays back and forth 1,000,000 times with explicit loops.
do i=1,1000000
do j=1,40
do k=1,40
a2(k,j) = a1(k,j)
end do
end do
do j=1,40
do k=1,40
a1(k,j) = a2(k,j)
end do
end do
end do

deallocate(a1)
deallocate(a2)

end

=======================================

Vladimir_Petrov__Int
New Contributor III

Ton,

First of all, it is not correct to compare the calls to zcopy with your test_asgn code. Intel compiler optimizes the code in test_asgn by fusing two internal loops and eliminating one innermost memory operation, while the calls to zcopy explicitly copy the arrays from one to the other and back again. The memory access pattern and cache utilization are therefore completely different in these two cases.
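
In other words, after fusion the two copy loops in test_asgn behave roughly like this (a sketch of the transformation, not the actual generated code):

do i=1,1000000
do j=1,40
do k=1,40
a2(k,j) = a1(k,j)
! the copy back, a1(k,j) = a2(k,j), is now redundant and can be dropped
end do
end do
end do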

Second, I could not reproduce your double-doubling of the per-process time. On my Core 2 Duo machine the timings are:
-One process-
test_asgn: ~9 sec total
test_zcopy: ~28 sec total
-Two processes-
test_asgn: ~12 sec total
test_zcopy: ~63 sec total
which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache. Your result of 88 seconds with two processes vs 22 seconds with one is very strange to me.

Third, if we reduce the number of iterations in the outermost loop to 10000 and increase the size of the matrix to 400x400 the times become as follows:
-One process-
test_asgn: ~29 sec total
test_zcopy: ~30 sec total
-Two processes-
test_asgn: ~77 sec total
test_zcopy: ~63 sec total
which shows that zcopy is better for parallel computation on reasonably large arrays (those which do not fit in cache).
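
(For scale: a 400x400 complex*16 array is 400*400*16 bytes, i.e. about 2.5 MB, so the two arrays together are roughly 5 MB per process and no longer fit in the L2 cache of a typical Core 2 machine, whereas the 40x40 arrays of the original test occupy only about 25 KB each.)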

Finally, it is somewhat misleading to talk about the "per process" time. It is better to talk about the number of memory operations per second (or bytes per second). In any case, the difference between these two programs on small arrays is explained by the different paths the data flows through.
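
To put rough numbers on the original 40x40 case: each array is 40*40*16 = 25,600 bytes, and each of the two zcopy calls per iteration reads and writes that amount, i.e. roughly 100 KB of traffic per iteration, or about 100 GB over the 1,000,000 iterations. At 22 seconds per process that is on the order of 4-5 GB/s of copy traffic.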

Best regards,
-Vladimir
Ton_Kostelijk
Beginner
Vladimir,

(1) "Intel compiler optimizes the code in test_asgn by fusing two internal loops and eliminating one innermost memory operation"

This is not what I see in the assembly of test_asgn:

[plain]
;;;       do i=i,1000000
;;; do j=1,40
;;; do k=1,40
;;; a2(k,j) = a1(k,j)

mov eax, DWORD PTR TEST_ASGN$A2$0$0 ;18.13
xor edx, edx ;15.7
shl ecx, 4 ;
sub eax, ecx ;
sub eax, DWORD PTR [esp+4] ;
mov ecx, DWORD PTR [esp+16] ;
add eax, DWORD PTR [esp+12] ;
mov DWORD PTR [esp], ebx ;
mov DWORD PTR [esp+24], esi ;
; LOE eax edx ecx
$B1$9: ; Preds $B1$17 $B1$8
xor edi, edi ;16.9
mov esi, eax ;16.9
mov ebx, ecx ;16.9
mov ecx, DWORD PTR [esp+12] ;16.9
mov DWORD PTR [esp+4], eax ;16.9
mov DWORD PTR [esp+20], edx ;16.9
mov edx, DWORD PTR [esp+24] ;16.9
; LOE edx ecx ebx esi edi
$B1$10: ; Preds $B1$12 $B1$9
xor eax, eax ;17.11
; LOE eax edx ecx ebx esi edi
$B1$11: ; Preds $B1$11 $B1$10
movsd xmm0, QWORD PTR [ebx+eax+16] ;18.13
movhpd xmm0, QWORD PTR [ebx+eax+24] ;18.13
movsd QWORD PTR [eax+esi+16], xmm0 ;18.13
movhpd QWORD PTR [eax+esi+24], xmm0 ;18.13
add eax, 16 ;
cmp eax, 640 ;17.11
jb $B1$11 ; Prob 97% ;17.11
; LOE eax edx ecx ebx esi edi
$B1$12: ; Preds $B1$11
add esi, ecx ;16.9
add ebx, edx ;16.9
add edi, 1 ;16.9
cmp edi, 40 ;16.9
jb $B1$10 ; Prob 97% ;16.9
; LOE edx ecx ebx esi edi
$B1$13: ; Preds $B1$12
mov eax, DWORD PTR [esp+4] ;
mov edx, DWORD PTR [esp+20] ;
mov ecx, DWORD PTR [esp+16] ;

;;; end do
;;; end do
;;; do j=1,40

xor edi, edi ;21.9
mov esi, eax ;21.9
mov ebx, ecx ;21.9
mov ecx, DWORD PTR [esp+12] ;21.9
mov DWORD PTR [esp+4], eax ;21.9
mov DWORD PTR [esp+20], edx ;21.9
mov edx, DWORD PTR [esp+24] ;21.9
; LOE edx ecx ebx esi edi
$B1$14: ; Preds $B1$16 $B1$13

;;; do k=1,40

xor eax, eax ;22.11
; LOE eax edx ecx ebx esi edi
$B1$15: ; Preds $B1$15 $B1$14

;;; a1(k,j) = a2(k,j)

movsd xmm0, QWORD PTR [eax+esi+16] ;23.13
movhpd xmm0, QWORD PTR [eax+esi+24] ;23.13
movsd QWORD PTR [ebx+eax+16], xmm0 ;23.13
movhpd QWORD PTR [ebx+eax+24], xmm0 ;23.13
add eax, 16 ;
cmp eax, 640 ;22.11
jb $B1$15 ; Prob 97% ;22.11
; LOE eax edx ecx ebx esi edi
$B1$16: ; Preds $B1$15
add esi, ecx ;
add ebx, edx ;
add edi, 1 ;
cmp edi, 40 ;21.9
jb $B1$14 ; Prob 97% ;21.9
; LOE edx ecx ebx esi edi
$B1$17: ; Preds $B1$16
mov eax, DWORD PTR [esp+4] ;
mov edx, DWORD PTR [esp+20] ;
mov ecx, DWORD PTR [esp+16] ;
add edx, 1 ;15.7
cmp edx, 1000001 ;15.7
jb $B1$9 ; Prob 82% ;15.7
; LOE eax edx ecx
$B1$18: ; Preds $B1$17
mov ebx, DWORD PTR [esp] ;

;;; end do
;;; end do
;;; end do
[/plain]
Here I clearly see the following loop structure (nesting depth indicated by the number of bars):

|
||
|||
|||
||
|
||
|||
|||
||
|

with among others these instructions in the inner loops:

movsd xmm0, QWORD PTR [ebx+eax+16] ;18.13
movhpd xmm0, QWORD PTR [ebx+eax+24] ;18.13
movsd QWORD PTR [eax+esi+16], xmm0 ;18.13
movhpd QWORD PTR [eax+esi+24], xmm0 ;18.13

and

movsd xmm0, QWORD PTR [eax+esi+16] ;23.13
movhpd xmm0, QWORD PTR [eax+esi+24] ;23.13
movsd QWORD PTR [ebx+eax+16], xmm0 ;23.13
movhpd QWORD PTR [ebx+eax+24], xmm0 ;23.13

which show that no loops are fused (fusion is also not mentioned in the optimization report) and that all data is copied every time.


(2a)
"I could not reproduce your double-doubling of the per-process time."

Apparently you did not interpret my timings correctly. What I meant by

"test_zcopy: 44 sec per process (88 sec. in total)"

is that 44 seconds (CPU time) are spent by each process, adding up to 88 seconds CPU time in total (spent by two CPUs running in parallel). Because the two processes run in parallel, the elapsed wall-clock time is 44 seconds.


(2b)
"...which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache."

Why would zcopy use the FSB? All data fits in the L2 cache.
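
(Two 40x40 complex*16 arrays occupy only 2 * 40*40*16 = 51,200 bytes, about 50 KB, far below the several-megabyte L2 of a Core 2 Duo.)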


(3)
"...if we reduce the number of iterations in the outermost loop to 10000 and increase the size of the matrix to 400x400 the times..."

This is irrelevant to my question. Of course a different situation will lead to different numbers.

TimP
Honored Contributor III
Quoting - Ton Kostelijk

(1) "Intel compiler optimizes the code in test_asgn by fusing two internal loops and eliminating one innermost memory operation"

(2b)
"...which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache."

Why would zcopy use the FSB? All data fits in the L2 cache.

Loop fusion behavior will change among compiler releases and with options set. I've been informed there are known bugs in the fusion which may be corrected in an 11.1 release.

The Core 2 Quad has 2 L2 caches. Every time a thread is reassigned from one cache to the other, you must update the new cache via FSB. This will happen frequently on Windows, if you don't take advantage of the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables. One sure thing about Windows is there will not be satisfactory scheduling for Core 2 Quad before Windows 7. By then, there may not be so much need for it.
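
For example, with the threaded MKL, something like

set KMP_AFFINITY=compact

(the exact syntax depends on the OpenMP runtime version) keeps a process's OpenMP threads on cores that share an L2 cache instead of letting them migrate between the two caches.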
Ton_Kostelijk
Beginner
Quoting - tim18
Loop fusion behavior will change among compiler releases and with options set. I've been informed there are known bugs in the fusion which may be corrected in an 11.1 release.

As you can read in my first post, I am using Intel Fortran Compiler version 10.1.29 and MKL version 10.0.5.025.
My point was that loop fusion is not occurring, contrary to what Vladimir claims.

The Core 2 Quad has 2 L2 caches. Every time a thread is reassigned from one cache to the other, you must update the new cache via FSB. This will happen frequently on Windows, if you don't take advantage of the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables. One sure thing about Windows is there will not be satisfactory scheduling for Core 2 Quad before Windows 7. By then, there may not be so much need for it.

As you can read in my first post, the problem occurs on a Core 2 Duo (with only one L2 cache).
Furthermore, I am using the single-threaded version of MKL.

Vladimir_Petrov__Int
New Contributor III
Quoting - Ton Kostelijk

(1) ...which show that no loops are fused... and that all data is copied every time.

(2a) ...44 seconds (CPU time) are spent by each process, adding up to 88 seconds CPU time in total (spent by two CPUs running in parallel)...

(2b) Why would zcopy use the FSB? All data fits in the L2 cache.

(3) This is irrelevant to my question. Of course a different situation will lead to different numbers.

Ton,

(1) What I tried was the 64-bit version of the Intel 10.1 compiler, and it did fuse the loops, while it looks like the 32-bit version does not (at least with your optimization options).

(2a) OK, now we are on the same page.

(2b) One reason why zcopy is FSB limited could be its use of non-temporal stores, which bypass the caches and write straight to memory, while your code stays within the L2 cache, as you observed.

Well, and finally, back to your original question: there is no system-wide serialization of zcopy calls.

Best regards,
-Vladimir