Intel® Fortran Compiler

AVX question

jimdempseyatthecove
Honored Contributor III
I am experimenting with AVX to improve performance of an application. A simple test program is as follows:

[fortran]
module MOD_AVX
    type TypeYMM
        real(8) :: ymm(0:3)
    end type TypeYMM
end module MOD_AVX

program TestAVX
    use MOD_AVX
    implicit none
    ! Variables
    type (TypeYMM), allocatable :: X(:),Y(:),Z(:)
    type (TypeYMM), allocatable :: uX(:),uY(:),uZ(:)
    type (TypeYMM) :: temp,uXtemp,uYtemp,uZtemp
    type (TypeYMM), allocatable :: ELBSG(:)
    integer :: i
    integer :: NBEADS
    ! Body of TestAVX
    NBEADS = 1000
    allocate(X(0:NBEADS+1),Y(0:NBEADS+1),Z(0:NBEADS+1))
    allocate(uX(1:NBEADS+1),uY(1:NBEADS+1),uZ(1:NBEADS+1))
    allocate(ELBSG(1:NBEADS+1))
    do i=1,NBEADS+1
        uXtemp%ymm = X(i)%ymm - X(i-1)%ymm
        uYtemp%ymm = Y(i)%ymm - Y(i-1)%ymm
        uZtemp%ymm = X(i)%ymm - Z(i-1)%ymm
        temp%ymm = uXtemp%ymm**2 + uYtemp%ymm**2 + uZtemp%ymm**2
        temp%ymm = sqrt(temp%ymm)
        uX(i)%ymm = uXtemp%ymm / temp%ymm
        uY(i)%ymm = uYtemp%ymm / temp%ymm
        uZ(i)%ymm = uZtemp%ymm / temp%ymm
        ELBSG(i)%ymm = temp%ymm
    end do
    print *, 'Hello World'
end program TestAVX
[/fortran]
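(The disassembly below was taken from the compiler's source-annotated assembly listing; something like the following command line should produce an equivalent listing, assuming a Windows ifort of this era, so treat the exact switches as approximate. /QxAVX requests AVX code generation and /FAs writes a .asm listing annotated with source line.column positions.)

[plain]
ifort /O3 /QxAVX /FAs TestAVX.f90
[/plain]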
When looking at the disassembly of the DO loop, the code looks efficient:

[cpp]
.B1.17::                        ; Preds .B1.17 .B1.16
        vmovupd   xmm2, XMMWORD PTR [-32+rbx+r9]              ;39.9
        vmovupd   xmm3, XMMWORD PTR [rbx+r9]                  ;39.9
        vmovupd   xmm5, XMMWORD PTR [rbx+r8]                  ;40.9
;;;     end do
        inc       r10                                         ;48.5
        vinsertf128 ymm4, ymm2, XMMWORD PTR [-16+rbx+r9], 1   ;39.9
        vmovupd   xmm2, XMMWORD PTR [-32+rbx+r8]              ;40.9
        vinsertf128 ymm1, ymm3, XMMWORD PTR [16+rbx+r9], 1    ;39.9
        vsubpd    ymm3, ymm1, ymm4                            ;39.9
        vinsertf128 ymm0, ymm5, XMMWORD PTR [16+rbx+r8], 1    ;40.9
        vinsertf128 ymm4, ymm2, XMMWORD PTR [-16+rbx+r8], 1   ;40.9
        vsubpd    ymm2, ymm0, ymm4                            ;40.9
        vmovupd   xmm0, XMMWORD PTR [-32+rbx+rsi]             ;41.9
        vmulpd    ymm5, ymm3, ymm3                            ;42.30
        vinsertf128 ymm4, ymm0, XMMWORD PTR [-16+rbx+rsi], 1  ;41.9
        vmulpd    ymm0, ymm2, ymm2                            ;42.46
        vsubpd    ymm1, ymm1, ymm4                            ;41.9
        vaddpd    ymm4, ymm5, ymm0                            ;42.34
        vmulpd    ymm5, ymm1, ymm1                            ;42.62
        vaddpd    ymm0, ymm4, ymm5                            ;42.9
        vsqrtpd   ymm0, ymm0                                  ;43.20
        vmovupd   YMMWORD PTR [TESTAVX$TEMP.0.2], ymm0        ;43.9
        vdivpd    ymm4, ymm3, ymm0                            ;44.9
        vdivpd    ymm5, ymm2, ymm0                            ;45.9
        vdivpd    ymm0, ymm1, ymm0                            ;46.9
        mov       rdi, QWORD PTR [TESTAVX$TEMP.0.2]           ;47.9
        mov       QWORD PTR [rbx+r11], rdi                    ;47.9
        mov       rdi, QWORD PTR [TESTAVX$TEMP.0.2+8]         ;47.9
        mov       QWORD PTR [8+rbx+r11], rdi                  ;47.9
        mov       rdi, QWORD PTR [TESTAVX$TEMP.0.2+16]        ;47.9
        mov       QWORD PTR [16+rbx+r11], rdi                 ;47.9
        mov       rdi, QWORD PTR [TESTAVX$TEMP.0.2+24]        ;47.9
        mov       QWORD PTR [24+rbx+r11], rdi                 ;47.9
        vmovupd   XMMWORD PTR [rbx+rcx], xmm4                 ;44.9
        vmovupd   XMMWORD PTR [rbx+rdx], xmm5                 ;45.9
        vmovupd   XMMWORD PTR [rbx+rax], xmm0                 ;46.9
        vextractf128 XMMWORD PTR [16+rbx+rcx], ymm4, 1        ;44.9
        vextractf128 XMMWORD PTR [16+rbx+rdx], ymm5, 1        ;45.9
        vextractf128 XMMWORD PTR [16+rbx+rax], ymm0, 1        ;46.9
        add       rbx, 32                                     ;48.5
        cmp       r10, 1001                                   ;48.5
        jle       .B1.17        ; Prob 82%                    ;48.5
[/cpp]
However, looking more closely at the instructions, we find:

[cpp]
        vmovupd   xmm2, XMMWORD PTR [-32+rbx+r9]              ;39.9
        ...
        vinsertf128 ymm4, ymm2, XMMWORD PTR [-16+rbx+r9], 1   ;39.9
[/cpp]

First question:

Why isn't a 256-bit vmovupd used for the load, instead of a 128-bit load followed (shortly after) by a 128-bit merge into the 256-bit register?

I guess the reason is that this is (slightly) faster when the 256-bit value has a cache-line split at the 128-bit offset.
IOW, the pipeline's fetch will not stall at the break in the cache line. This would not fix the problem of a cache-line split at a 64-bit offset (i.e., a split that occurs between the 1st and 2nd qword).
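A quick, non-portable way to see whether the allocations happen to land on 32-byte boundaries is to print the element addresses; a minimal sketch using ifort's LOC extension on x64 (an assumption, not part of the original test program):

[fortran]
! Rough alignment check (ifort's LOC extension and a 64-bit address size assumed).
! A non-zero remainder means some 32-byte TypeYMM elements in the array
! straddle a 64-byte cache-line boundary.
print *, 'X(1) address mod 32 =', mod(loc(X(1)%ymm(0)), 32_8)
print *, 'Y(1) address mod 32 =', mod(loc(Y(1)%ymm(0)), 32_8)
print *, 'Z(1) address mod 32 =', mod(loc(Z(1)%ymm(0)), 32_8)
[/fortran]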

Second question:

Why isn't "temp" fully registerized?

If I change the source code to:

[fortran]
do i=1,NBEADS+1
    uXtemp%ymm = X(i)%ymm - X(i-1)%ymm
    uYtemp%ymm = Y(i)%ymm - Y(i-1)%ymm
    uZtemp%ymm = X(i)%ymm - Z(i-1)%ymm
    ELBSG(i)%ymm = sqrt(uXtemp%ymm**2 + uYtemp%ymm**2 + uZtemp%ymm**2)
    uX(i)%ymm = uXtemp%ymm / ELBSG(i)%ymm
    uY(i)%ymm = uYtemp%ymm / ELBSG(i)%ymm
    uZ(i)%ymm = uZtemp%ymm / ELBSG(i)%ymm
end do
[/fortran]


The temporary's store to and reload from memory is removed:
[cpp]
.B1.17::                        ; Preds .B1.17 .B1.16
        vmovupd   xmm2, XMMWORD PTR [-32+rbx+r10]             ;39.9
        vmovupd   xmm3, XMMWORD PTR [rbx+r10]                 ;39.9
        vmovupd   xmm5, XMMWORD PTR [rbx+r9]                  ;40.9
;;;     #endif
;;;     end do
        inc       rsi                                         ;55.5
        vinsertf128 ymm4, ymm2, XMMWORD PTR [-16+rbx+r10], 1  ;39.9
        vmovupd   xmm2, XMMWORD PTR [-32+rbx+r9]              ;40.9
        vinsertf128 ymm1, ymm3, XMMWORD PTR [16+rbx+r10], 1   ;39.9
        vsubpd    ymm3, ymm1, ymm4                            ;39.9
        vinsertf128 ymm0, ymm5, XMMWORD PTR [16+rbx+r9], 1    ;40.9
        vinsertf128 ymm4, ymm2, XMMWORD PTR [-16+rbx+r9], 1   ;40.9
        vsubpd    ymm2, ymm0, ymm4                            ;40.9
        vmovupd   xmm0, XMMWORD PTR [-32+rbx+r8]              ;41.9
        vmulpd    ymm5, ymm3, ymm3                            ;50.39
        vinsertf128 ymm4, ymm0, XMMWORD PTR [-16+rbx+r8], 1   ;41.9
        vmulpd    ymm0, ymm2, ymm2                            ;50.55
        vsubpd    ymm1, ymm1, ymm4                            ;41.9
        vaddpd    ymm4, ymm5, ymm0                            ;50.43
        vmulpd    ymm5, ymm1, ymm1                            ;50.71
        vaddpd    ymm0, ymm4, ymm5                            ;50.59
        vsqrtpd   ymm0, ymm0                                  ;50.24
        vmovupd   XMMWORD PTR [rbx+rcx], xmm0                 ;50.9
        vdivpd    ymm4, ymm3, ymm0                            ;51.9
        vdivpd    ymm5, ymm2, ymm0                            ;52.9
        vextractf128 XMMWORD PTR [16+rbx+rcx], ymm0, 1        ;50.9
        vdivpd    ymm0, ymm1, ymm0                            ;53.9
        vmovupd   XMMWORD PTR [rbx+rdx], xmm4                 ;51.9
        vmovupd   XMMWORD PTR [rbx+rax], xmm5                 ;52.9
        vmovupd   XMMWORD PTR [rbx+r11], xmm0                 ;53.9
        vextractf128 XMMWORD PTR [16+rbx+rdx], ymm4, 1        ;51.9
        vextractf128 XMMWORD PTR [16+rbx+rax], ymm5, 1        ;52.9
        vextractf128 XMMWORD PTR [16+rbx+r11], ymm0, 1        ;53.9
        add       rbx, 32                                     ;55.5
        cmp       rsi, 1001                                   ;55.5
        jle       .B1.17        ; Prob 82%                    ;55.5
[/cpp]

Curious

Jim Dempsey
TimP
Honored Contributor III
As you guessed, the misaligned 256-bit move is a lot slower than a pair of 128-bit moves on Sandy Bridge. Ivy Bridge should correct that, but I haven't seen the compiler make any distinction. If it is fixed in the hardware, it is mainly so that those who use AVX-256 intrinsics will no longer be shooting themselves in the foot.
You can use !dir$ attributes align:32 :: array1,array2,... to request 32-byte alignment, in case that makes a difference. In the next release there is also the command-line option -align array32byte.
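A minimal sketch of how that directive could be applied to the arrays in the test program, assuming the ALIGN attribute is honored for allocatable arrays in the compiler version in use (worth verifying):

[fortran]
type (TypeYMM), allocatable :: X(:),Y(:),Z(:)
type (TypeYMM), allocatable :: uX(:),uY(:),uZ(:)
type (TypeYMM), allocatable :: ELBSG(:)
! Request 32-byte alignment of the allocated data so that a TypeYMM
! element never straddles a cache-line boundary.
!dir$ attributes align:32 :: X,Y,Z
!dir$ attributes align:32 :: uX,uY,uZ,ELBSG
[/fortran]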
Even though vsqrtpd and vdivpd are using 256-bit registers, in reality they are split internally, so you wouldn't expect much gain in this sequence over SSE2. That is still so in Ivy Bridge, but the latencies are much less.
jimdempseyatthecove
Honored Contributor III
TimP,

>>Even though vsqrtpd and vdivpd are using 256-bit registers, in reality they are split internally, so you wouldn't expect much gain in this sequence over SSE2.

Except that the 256-bit register reduces register pressure and in turn reduces L1 fetches (assuming the other half would otherwise be in L1). Note that in this routine, after the sqrt, the result is reused (in producing the unit vectors).
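If the divide latency turns out to matter, one variant worth trying (a sketch only, not measured here) is to compute the reciprocal of the length once and replace the three divides with multiplies, accepting a possible last-bit difference in the results:

[fortran]
type (TypeYMM) :: rtemp   ! extra local, declared alongside temp in the specification part

do i=1,NBEADS+1
    uXtemp%ymm = X(i)%ymm - X(i-1)%ymm
    uYtemp%ymm = Y(i)%ymm - Y(i-1)%ymm
    uZtemp%ymm = Z(i)%ymm - Z(i-1)%ymm   ! Z(i) assumed here; the original test program has X(i) on this line
    temp%ymm   = sqrt(uXtemp%ymm**2 + uYtemp%ymm**2 + uZtemp%ymm**2)
    rtemp%ymm  = 1.0d0 / temp%ymm        ! one divide instead of three
    uX(i)%ymm  = uXtemp%ymm * rtemp%ymm
    uY(i)%ymm  = uYtemp%ymm * rtemp%ymm
    uZ(i)%ymm  = uZtemp%ymm * rtemp%ymm
    ELBSG(i)%ymm = temp%ymm
end do
[/fortran]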

In the first set of code (F90/asm), the unit-vector temporaries uXtemp, uYtemp, uZtemp were registerized but temp was not (these are all of type TypeYMM). When I removed temp and rewrote the computation as a more complex expression, where the compiler could generate and use its own temporary, it kept that temporary in a register. This surprised me, but it is something to keep in mind when tuning the code.

Jim Dempsey