While trying to solve a performance issue, I started to suspect that zcopy calls are being serialized system-wide.
For this reason, I wrote two little test programs, which are included below.
The results were astonishing.
When I ran two instances of test_zcopy on a Core2 Duo, the execution time of each process doubled:
-One process-
test_asgn: 8 sec per process.
test_zcopy: 22 sec per process.
-Two concurrent processes-
test_asgn: 10 sec per process (20 sec. in total)
test_zcopy: 44 sec per process (88 sec. in total)
When I repeated the experiment on a quad-core machine with four concurrent test_zcopy processes, the total execution time quadrupled.
Is this normal / desired behaviour or a serious bug?
I am using Intel Fortran Compiler 10.1.29 and MKL version 10.0.5.025 and tried both the sequential and the multi-threaded version of MKL, without any difference in results.
ifort /QxT %1 /link /LIBPATH:"C:\Program Files\Intel\MKL\10.0.5.025\ia32\lib" mkl_intel_c.lib mkl_sequential.lib mkl_core.lib
ifort /QxT %1 /link /LIBPATH:"C:\Program Files\Intel\MKL\10.0.5.025\ia32\lib" mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libguide40.lib
=======================================
program test_zcopy
complex*16, dimension(:,:), allocatable :: a1,a2
integer i,j,k
allocate(a1(40,40))
allocate(a2(40,40))
call zcopy(40*40,(1d0,1d0),0,a1,1) ! incx=0 fills a1 with the constant (1d0,1d0)
do i=1,1000000
call zcopy(40*40,a1,1,a2,1)
call zcopy(40*40,a2,1,a1,1)
end do
deallocate(a1)
deallocate(a2)
end
=======================================
program test_asgn
complex*16, dimension(:,:), allocatable :: a1,a2
integer i,j,k
allocate(a1(40,40))
allocate(a2(40,40))
do j=1,40
do k=1,40
a1(k,j)=(1d0,1d0)
end do
end do
do i=1,1000000
do j=1,40
do k=1,40
a2(k,j) = a1(k,j)
end do
end do
do j=1,40
do k=1,40
a1(k,j) = a2(k,j)
end do
end do
end do
deallocate(a1)
deallocate(a2)
end
=======================================
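For completeness, here is a variant of test_zcopy that times the loop from inside the program using the standard Fortran 95 cpu_time routine (only a sketch for reference; the program name time_zcopy is just for illustration):
=======================================
program time_zcopy
complex*16, dimension(:,:), allocatable :: a1,a2
real t0,t1
integer i
allocate(a1(40,40))
allocate(a2(40,40))
! incx=0 fills a1 with the constant (1d0,1d0)
call zcopy(40*40,(1d0,1d0),0,a1,1)
call cpu_time(t0)
do i=1,1000000
call zcopy(40*40,a1,1,a2,1)
call zcopy(40*40,a2,1,a1,1)
end do
call cpu_time(t1)
print *, 'CPU time: ', t1-t0, ' seconds'
deallocate(a1)
deallocate(a2)
end
=======================================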
Ton,
First of all, it is not correct to compare the calls to zcopy with your test_asgn code. The Intel compiler optimizes the code in test_asgn by fusing the two internal loop nests and eliminating one innermost memory operation, while the calls to zcopy explicitly copy the arrays from one to the other and back; hence the memory access pattern and cache utilization are completely different in the two cases.
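To illustrate what I mean (a sketch of the transformed code, not actual compiler output):
[plain]
! the two inner loop nests fused into one
do j=1,40
do k=1,40
a2(k,j) = a1(k,j)
! the copy back, a1(k,j) = a2(k,j), would only store the value
! that was just loaded, so that innermost memory operation is dropped
end do
end do
[/plain]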
Second, I could not reproduce your double-doubling of the per-process time. On my Core 2 Duo machine the timings are:
-One process-
test_asgn: ~9 sec total
test_zcopy: ~28 sec total
-Two processes-
test_asgn: ~12 sec total
test_zcopy: ~63 sec total
which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache. Your result of 88 seconds in total for two processes vs 22 seconds for one process is very strange to me.
Third, if we reduce the number of iterations in the outermost loop to 10000 and increase the size of the matrices to 400x400, the times become as follows:
-One process-
test_asgn: ~29 sec total
test_zcopy: ~30 sec total
-Two processes-
test_asgn: ~77 sec total
test_zcopy: ~63 sec total
which proves that zcopy is better for parallel computations on reasonably large arrays (those that do not fit in cache).
Finally, it is somewhat misleading to talk about the "per-process" time; it is better to talk about the number of memory operations per second (or bytes per second). In any case, the difference between these two programs on small arrays is explained by the different paths the data flows through.
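To put numbers on that (my arithmetic, based on Ton's figures): one 40x40 complex*16 array occupies 40*40*16 = 25,600 bytes, so each zcopy call reads and writes 25.6 KB, and the loop makes 2,000,000 calls, roughly 100 GB of traffic per process. 22 seconds then corresponds to about 4.7 GB/s, and 44 seconds to about 2.3 GB/s.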
Best regards,
-Vladimir
Vladimir,
(1) "Intel compiler optimizes the code in test_asgn by fusing two internal loops and eliminating one innermost memory operation"
This is not what I see in the assembly of test_asgn:
|
||
|||
|||
||
|
||
|||
|||
||
|
with among others these instructions in the inner loops:
movsd xmm0, QWORD PTR [ebx+eax+16] ;18.13
movhpd xmm0, QWORD PTR [ebx+eax+24] ;18.13
movsd QWORD PTR [eax+esi+16], xmm0 ;18.13
movhpd QWORD PTR [eax+esi+24], xmm0 ;18.13
and
movsd xmm0, QWORD PTR [eax+esi+16] ;23.13
movhpd xmm0, QWORD PTR [eax+esi+24] ;23.13
movsd QWORD PTR [ebx+eax+16], xmm0 ;23.13
movhpd QWORD PTR [ebx+eax+24], xmm0 ;23.13
which show that no loops are fused (which is also not mentioned in the optimization report) and that all data is copied every time.
(2a) "I could not reproduce your double-doubling of the per-process time."
Apperently you did not understand my timings correctly, what I meant with
"test_zcopy: 44 sec per process (88 sec. in total)"
is that 44 seconds (CPU time) are spent by each process, adding up to 88 seconds CPU time in total (spent by two CPUs running in parallel). Because the two processes run in parallel, the elapsed wall-clock time is 44 seconds.
(2b) "...which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache."
Why would zcopy use the FSB? All data fits in the L2 cache.
(3) "...if we reduce the number of iterations in the outermost loop to 10000 and increase the size of the matrix to 400x400 the times..."
This is irrelevant to my question. Of course a different situation will lead to different numbers.
(1) "Intel compiler optimizes the code in test_asgn by fusing two internal loops and eliminating one innermost memory operation"
This is not what I see in the assembly of test_asgn:
[plain]
;;; do i=1,1000000
;;; do j=1,40
;;; do k=1,40
;;; a2(k,j) = a1(k,j)
mov eax, DWORD PTR TEST_ASGN$A2$0$0 ;18.13
xor edx, edx ;15.7
shl ecx, 4 ;
sub eax, ecx ;
sub eax, DWORD PTR [esp+4] ;
mov ecx, DWORD PTR [esp+16] ;
add eax, DWORD PTR [esp+12] ;
mov DWORD PTR [esp], ebx ;
mov DWORD PTR [esp+24], esi ;
; LOE eax edx ecx
$B1$9: ; Preds $B1$17 $B1$8
xor edi, edi ;16.9
mov esi, eax ;16.9
mov ebx, ecx ;16.9
mov ecx, DWORD PTR [esp+12] ;16.9
mov DWORD PTR [esp+4], eax ;16.9
mov DWORD PTR [esp+20], edx ;16.9
mov edx, DWORD PTR [esp+24] ;16.9
; LOE edx ecx ebx esi edi
$B1$10: ; Preds $B1$12 $B1$9
xor eax, eax ;17.11
; LOE eax edx ecx ebx esi edi
$B1$11: ; Preds $B1$11 $B1$10
movsd xmm0, QWORD PTR [ebx+eax+16] ;18.13
movhpd xmm0, QWORD PTR [ebx+eax+24] ;18.13
movsd QWORD PTR [eax+esi+16], xmm0 ;18.13
movhpd QWORD PTR [eax+esi+24], xmm0 ;18.13
add eax, 16 ;
cmp eax, 640 ;17.11
jb $B1$11 ; Prob 97% ;17.11
; LOE eax edx ecx ebx esi edi
$B1$12: ; Preds $B1$11
add esi, ecx ;16.9
add ebx, edx ;16.9
add edi, 1 ;16.9
cmp edi, 40 ;16.9
jb $B1$10 ; Prob 97% ;16.9
; LOE edx ecx ebx esi edi
$B1$13: ; Preds $B1$12
mov eax, DWORD PTR [esp+4] ;
mov edx, DWORD PTR [esp+20] ;
mov ecx, DWORD PTR [esp+16] ;
;;; end do
;;; end do
;;; do j=1,40
xor edi, edi ;21.9
mov esi, eax ;21.9
mov ebx, ecx ;21.9
mov ecx, DWORD PTR [esp+12] ;21.9
mov DWORD PTR [esp+4], eax ;21.9
mov DWORD PTR [esp+20], edx ;21.9
mov edx, DWORD PTR [esp+24] ;21.9
; LOE edx ecx ebx esi edi
$B1$14: ; Preds $B1$16 $B1$13
;;; do k=1,40
xor eax, eax ;22.11
; LOE eax edx ecx ebx esi edi
$B1$15: ; Preds $B1$15 $B1$14
;;; a1(k,j) = a2(k,j)
movsd xmm0, QWORD PTR [eax+esi+16] ;23.13
movhpd xmm0, QWORD PTR [eax+esi+24] ;23.13
movsd QWORD PTR [ebx+eax+16], xmm0 ;23.13
movhpd QWORD PTR [ebx+eax+24], xmm0 ;23.13
add eax, 16 ;
cmp eax, 640 ;22.11
jb $B1$15 ; Prob 97% ;22.11
; LOE eax edx ecx ebx esi edi
$B1$16: ; Preds $B1$15
add esi, ecx ;
add ebx, edx ;
add edi, 1 ;
cmp edi, 40 ;21.9
jb $B1$14 ; Prob 97% ;21.9
; LOE edx ecx ebx esi edi
$B1$17: ; Preds $B1$16
mov eax, DWORD PTR [esp+4] ;
mov edx, DWORD PTR [esp+20] ;
mov ecx, DWORD PTR [esp+16] ;
add edx, 1 ;15.7
cmp edx, 1000001 ;15.7
jb $B1$9 ; Prob 82% ;15.7
; LOE eax edx ecx
$B1$18: ; Preds $B1$17
mov ebx, DWORD PTR [esp] ;
;;; end do
;;; end do
;;; end do
[/plain]
Here I clearly see the following loop structure:
[plain]
do i
   do j
      do k
      end do
   end do
   do j
      do k
      end do
   end do
end do
[/plain]
with, among others, these instructions in the inner loops:
movsd xmm0, QWORD PTR [ebx+eax+16] ;18.13
movhpd xmm0, QWORD PTR [ebx+eax+24] ;18.13
movsd QWORD PTR [eax+esi+16], xmm0 ;18.13
movhpd QWORD PTR [eax+esi+24], xmm0 ;18.13
and
movsd xmm0, QWORD PTR [eax+esi+16] ;23.13
movhpd xmm0, QWORD PTR [eax+esi+24] ;23.13
movsd QWORD PTR [ebx+eax+16], xmm0 ;23.13
movhpd QWORD PTR [ebx+eax+24], xmm0 ;23.13
These show that no loops are fused (fusion is not mentioned in the optimization report either) and that all data is copied every time.
(2a) "I could not reproduce your double-doubling of the per-process time."
Apparently you did not interpret my timings correctly. What I meant by
"test_zcopy: 44 sec per process (88 sec. in total)"
is that 44 seconds (CPU time) are spent by each process, adding up to 88 seconds CPU time in total (spent by two CPUs running in parallel). Because the two processes run in parallel, the elapsed wall-clock time is 44 seconds.
(2b) "...which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache."
Why would zcopy use the FSB? All data fits in the L2 cache: the two 40x40 complex*16 arrays occupy only 2 x 25,600 bytes = 51.2 KB.
(3) "...if we reduce the number of iterations in the outermost loop to 10000 and increase the size of the matrix to 400x400 the times..."
This is irrelevant to my question. Of course a different situation will lead to different numbers.
Quoting - Ton Kostelijk
(1) "Intel compiler optimizes the code in test_asgn by fusing two internal loops and eliminating one innermost memory operation"
(2b) "...which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache."
Why would zcopy use the FSB? All data fits in the L2 cache.
Loop fusion behavior will change among compiler releases and with options set. I've been informed there are known bugs in the fusion which may be corrected in an 11.1 release.
The Core 2 Quad has 2 L2 caches. Every time a thread is reassigned from one cache to the other, the new cache must be refilled via the FSB. This will happen frequently on Windows if you don't take advantage of the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables. One sure thing about Windows is that there will not be satisfactory scheduling for the Core 2 Quad before Windows 7. By then, there may not be so much need for it.
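For example, with the threaded MKL you can pin the OpenMP threads explicitly (the proclist values below are only illustrative; see the Intel OpenMP runtime documentation for the full syntax):
[plain]
set KMP_AFFINITY=granularity=fine,proclist=[0,1],explicit
[/plain]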
Quoting - tim18
Loop fusion behavior will change among compiler releases and with options set. I've been informed there are known bugs in the fusion which may be corrected in an 11.1 release.
The Core 2 Quad has 2 L2 caches. Every time a thread is reassigned from one cache to the other, you must update the new cache via FSB. This will happen frequently on Windows, if you don't take advantage of the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables. One sure thing about Windows is there will not be satisfactory scheduling for Core 2 Quad before Windows 7. By then, there may not be so much need for it.
Loop fusion behavior will change among compiler releases and with options set. I've been informed there are known bugs in the fusion which may be corrected in an 11.1 release.
As you can read in my first post, I am using Intel Fortran Compiler version 10.1.29 and MKL version 10.0.5.025.
My point was that loop fusion is not occurring, as Vladimir claims.
The Core 2 Quad has 2 L2 caches. Every time a thread is reassigned from one cache to the other, you must update the new cache via FSB. This will happen frequently on Windows, if you don't take advantage of the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables. One sure thing about Windows is there will not be satisfactory scheduling for Core 2 Quad before Windows 7. By then, there may not be so much need for it.
As you can read in my first post, the problem occurs on a Core2 Duo (with only 1 L2 cache).
Furthermore, I am using the single-threaded version of MKL.
Quoting - Ton Kostelijk
Vladimir,
(1) "Intel compiler optimizes the code in test_asgn by fusing two internal loops and eliminating one innermost memory operation"
This is not what I see in the assembly of test_asgn:
which show that no loops are fused (which is also not mentioned in the optimization report) and that all data is copied every time.
(2a) "I could not reproduce your double-doubling of the per-process time."
Apparently you did not interpret my timings correctly. What I meant by
"test_zcopy: 44 sec per process (88 sec. in total)"
is that 44 seconds (CPU time) are spent by each process, adding up to 88 seconds CPU time in total (spent by two CPUs running in parallel). Because the two processes run in parallel, the elapsed wall-clock time is 44 seconds.
(2b) "...which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache."
Why would zcopy use the FSB? All data fits in the L2 cache.
(3) "...if we reduce the number of iterations in the outermost loop to 10000 and increase the size of the matrix to 400x400 the times..."
This is irrelevant to my question. Of course a different situation will lead to different numbers.
(1) "Intel compiler optimizes the code in test_asgn by fusing two internal loops and eliminating one innermost memory operation"
This is not what I see in the assembly of test_asgn:
which show that no loops are fused (which is also not mentioned in the optimization report) and that all data is copied every time.
(2a) "I could not reproduce your double-doubling of the per-process time."
Apperently you did not understand my timings correctly, what I meant with
"test_zcopy: 44 sec per process (88 sec. in total)"
is that 44 seconds (CPU time) are spent by each process, adding up to 88 seconds CPU time in total (spent by two CPUs running in parallel). Because the two processes run in parallel, the elapsed wall-clock time is 44 seconds.
(2b) "...which simply shows that zcopy is FSB limited while test_asgn operates mostly on cache."
Why would zcopy use the FSB? All data fits in the L2 cache.
(3) "...if we reduce the number of iterations in the outermost loop to 10000 and increase the size of the matrix to 400x400 the times..."
This is irrelevant to my question. Of course a different situation will lead to different numbers.
(1) What I tried was the 64-bit version of the Intel 10.1 compiler, and it did fuse the loops, while it looks like the 32-bit version does not (at least with your optimization options).
(2a) OK, now we are on the same page.
(2b) One reason why zcopy is FSB limited could be its use of non-temporal stores, while your compiled code stays within the L2 cache, as you observed.
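For reference, a non-temporal (streaming) store bypasses the cache hierarchy; in assembly it would look like the following illustration (not a disassembly of MKL's actual zcopy):
[plain]
movntpd XMMWORD PTR [eax+esi+16], xmm0 ; streaming store, bypasses the cache
movapd  XMMWORD PTR [eax+esi+16], xmm0 ; ordinary cached store, for comparison
[/plain]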
Finally, back to your original question: there is no system-wide serialization of zcopy calls.
Best regards,
-Vladimir