In the following code, subroutine foo is an exemplar of a small section of a simulation program. The simulation program uses a large number of small multi-dimensional arrays (3x3, 3x4, ...) that do not lend themselves to efficient vectorization using AVX, AVX2 and AVX512. While some degree of vectorization is attained in places, it is usually limited to 2 or 3 lanes, very seldom 4 lanes (AVX/AVX2), and never 8 lanes (AVX512) of REAL(8) data.
As a solution to this, the data structure is reorganized into cache line wide elements:
real(8) :: array(8,3,4) ! eight of the former array(3,4)
When generating for AVX512, the left-most slice can be manipulated using a single AVX512 register. The compiler optimization (19.0 and later) does an excellent job of reducing the statements into efficient machine code.
On systems with AVX and AVX2, a single register cannot hold the full width of the left-most slice, but it can hold half of the slice, so two registers hold the whole slice.
When optimizing non-looped code, the compiler does in fact generate fully vectorized code (twice, once for each half of the slice).
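As an illustration of the reorganized layout (a minimal sketch, not taken from the simulation program; the names pack_demo, separate, packed and nlanes are hypothetical), eight independent 3x4 arrays can be gathered so that the lane index is left-most and each packed(:,i,j) slice is eight contiguous REAL(8) values, i.e. one 64-byte cache line when the array is suitably aligned:

```fortran
! Sketch only: all names here are hypothetical, not from the original program.
program pack_demo
    implicit none
    integer, parameter :: nlanes = 8      ! 64-byte cache line / 8-byte real(8)
    real(8) :: separate(3, 4, nlanes)     ! old layout: eight independent 3x4 arrays
    real(8) :: packed(nlanes, 3, 4)       ! new layout: lane index is left-most
    integer :: lane

    call random_number(separate)

    ! Gather element (i,j) of every small array into one contiguous slice.
    do lane = 1, nlanes
        packed(lane, :, :) = separate(:, :, lane)
    end do

    ! packed(:,i,j) now holds element (i,j) of all eight former arrays.
    print *, all(packed(3, :, :) == separate(:, :, 3))
end program pack_demo
```

With this layout a statement written on packed(:,i,j) operates on all eight former arrays at once, which is what the slices in the reproducer rely on.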
The problem arises when the statements are enclosed within a loop or a doubly nested loop.
With a single loop level, depending on the code in the loop, the compiler sometimes manages to vectorize; in most cases it does not.
Using !dir$ ivdep helps in some places but not all. Using !dir$ vector always seems to have no effect.
In the following code:
subroutine foo should vectorize on systems with AVX, AVX2, and AVX512 (it does with AVX512).
subroutine foo2 is a workaround that explicitly subdivides slice(1:8) into slice(1:4) and slice(5:8) and performs the same statement(s) twice. This works some of the time but not all of the time. In the example below, it does not work.
subroutine foo3, with a contained function Hack, does attain vectorization on systems with AVX. While it will attain 256-bit wide vectorization on AVX512, it is not as efficient as using the intended code without the Hack.
I hope this reproducer will help the Intel software development team identify an optimization opportunity (and hopefully fix it in an upcoming release).
! Console2.f90
module mod_foo
    integer, parameter :: CacheLineSize = 64
    integer, parameter :: VectorWidth = CacheLineSize / sizeof(0.0d0)
    real(8) :: Output(VectorWidth, 3, 4)
    real(8) :: A(VectorWidth, 3, 4)
    real(8) :: B(VectorWidth, 3, 3)
    real(8) :: C(VectorWidth)
    !dir$ attributes align : CacheLineSize :: Output, A, B, C
end module mod_foo
program tst
    use mod_foo
    call RANDOM_NUMBER(A)
    call RANDOM_NUMBER(B)
    call RANDOM_NUMBER(C)
    call foo(Output, A, B, C)
    ! not interested in output
    ! print is here to assure optimization does not remove all code
    print *,Output
    call foo2(Output, A, B, C)
    ! not interested in output
    ! print is here to assure optimization does not remove all code
    print *,Output
    call foo3(Output, A, B, C)
    ! not interested in output
    ! print is here to assure optimization does not remove all code
    print *,Output
end program
subroutine foo(output, A, B, C)
    integer, parameter :: CacheLineSize = 64
    integer, parameter :: VectorWidth = CacheLineSize / sizeof(0.0d0)
    real(8), intent(out) :: Output(VectorWidth, 3, 4)
    real(8), intent(in) :: A(VectorWidth, 3, 4)
    real(8), intent(in) :: B(VectorWidth, 3, 3)
    real(8), intent(in) :: C(VectorWidth)
    integer :: i, j
    !dir$ assume_aligned output:64, A:64, B:64, C:64
    do j=1,4
        !dir$ ivdep
        !dir$ vector always
        do i=1,3
            Output(:,i,j) = &
                ( A(:,1,j) * B(:,1,i) &
                + A(:,2,j) * B(:,2,i) &
                + A(:,3,j) * B(:,3,i) ) * C(:)
        end do
    end do
end subroutine foo
subroutine foo2(output, A, B, C)
    integer, parameter :: CacheLineSize = 64
    integer, parameter :: VectorWidth = CacheLineSize / sizeof(0.0d0)
    real(8), intent(out) :: Output(VectorWidth, 3, 4)
    real(8), intent(in) :: A(VectorWidth, 3, 4)
    real(8), intent(in) :: B(VectorWidth, 3, 3)
    real(8), intent(in) :: C(VectorWidth)
    integer :: i, j
    !dir$ assume_aligned output:64, A:64, B:64, C:64
    do j=1,4
        !dir$ ivdep
        !dir$ vector always
        do i=1,3
            Output(1:4,i,j) = &
                ( A(1:4,1,j) * B(1:4,1,i) &
                + A(1:4,2,j) * B(1:4,2,i) &
                + A(1:4,3,j) * B(1:4,3,i) ) * C(1:4)
        end do
        !dir$ ivdep
        !dir$ vector always
        do i=1,3
            Output(5:8,i,j) = &
                ( A(5:8,1,j) * B(5:8,1,i) &
                + A(5:8,2,j) * B(5:8,2,i) &
                + A(5:8,3,j) * B(5:8,3,i) ) * C(5:8)
        end do
    end do
end subroutine foo2
subroutine foo3(output, A, B, C)
    integer, parameter :: CacheLineSize = 64
    integer, parameter :: VectorWidth = CacheLineSize / sizeof(0.0d0)
    real(8), intent(out) :: Output(VectorWidth, 3, 4)
    real(8), intent(in) :: A(VectorWidth, 3, 4)
    real(8), intent(in) :: B(VectorWidth, 3, 3)
    real(8), intent(in) :: C(VectorWidth)
    integer :: i, j
    !dir$ assume_aligned output:64, A:64, B:64, C:64
    do j=1,4
        !dir$ ivdep
        !dir$ vector always
        do i=1,3
            Output(1:4,i,j) = &
                Hack( A(1:4,:,j), B(1:4,:,i), C(1:4))
            Output(5:8,i,j) = &
                Hack( A(5:8,:,j), B(5:8,:,i), C(5:8))
        end do
    end do
contains
    function Hack(A, B, C)
        real(8) :: Hack(4), A(4,3), B(4,3), C(4)
        !dir$ assume_aligned Hack(1):64, A:64, B:64, C:64
        Hack(:) = &
            ( A(:,1) * B(:,1) &
            + A(:,2) * B(:,2) &
            + A(:,3) * B(:,3) ) * C(:)
    end function Hack
end subroutine foo3
Jim Dempsey
In the beginning was the Word......
Jim: That statement actually makes more sense than trying to understand your code. It is just way too deep.
JMN
John,
The code wasn't intended for you. It was intended for Intel compiler developers, and possibly FortranFan, Mecej, etc., who are looking for a workaround to a vectorization issue.
For mere mortals like you, it is gobbly **bleep**.
Jim Dempsey
Jim:
Thank you for clarifying that -- I will never make the Titan class.
John
Hi Jim,
Did you try annotating the nested "Hack" function with a vector attribute?
For example:
!dir$ attributes vector : vectorlength(8) :: Hack
P.S.
A quick look at the ifort 19 assembly in Godbolt Compiler Explorer revealed that the "Hack" function was (as expected) inlined at its call site. Unfortunately the machine code (AVX2) uses a mixture of XMM and YMM registers. At no point was I able to force the generation of AVX512 code, even when using the "qopt-zmm-usage=high" compiler option. I presume that either the data layout of the Hack arguments was not optimal for 8-lane code, or the cost model judged that switching on the additional 512-bit circuitry needed for heavy AVX512 code was not worthwhile given the small data size.
The Hack is not required on AVX512 builds.
The point was not to show the call overhead of the Hack, but rather to show that the do j=; do i= loops did not vectorize code that is clearly vectorizable.
An alternative to the Hack is, in the innermost loop, to use associate(component=>slice(:,i,j), ...) for each component in the statement, then use the (1:4) indexes of the components, and then the (5:8) indexes of the components.
When that doesn't work, use associate(component=>slice(1:4,i,j)... statement, end associate, then associate(component=>slice(5:8,i,j)... statement, end associate. While these work, it is a royal PIA.
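For readers following along, the second associate variant might look roughly like the sketch below. This is an illustration only, not part of the posted reproducer: the name foo4 and the associate names (o, a1, b1, c4, ...) are hypothetical, and whether a given compiler version vectorizes it for AVX/AVX2 would need to be checked against the optimization report.

```fortran
! Sketch only: foo4 and the associate names are hypothetical.
subroutine foo4(Output, A, B, C)
    implicit none
    integer, parameter :: VectorWidth = 8   ! 64-byte cache line / 8-byte real(8)
    real(8), intent(out) :: Output(VectorWidth, 3, 4)
    real(8), intent(in)  :: A(VectorWidth, 3, 4)
    real(8), intent(in)  :: B(VectorWidth, 3, 3)
    real(8), intent(in)  :: C(VectorWidth)
    integer :: i, j
    do j = 1, 4
        do i = 1, 3
            ! Lower half of the slice: lanes 1:4
            associate (o  => Output(1:4,i,j), &
                       a1 => A(1:4,1,j), a2 => A(1:4,2,j), a3 => A(1:4,3,j), &
                       b1 => B(1:4,1,i), b2 => B(1:4,2,i), b3 => B(1:4,3,i), &
                       c4 => C(1:4))
                o = (a1*b1 + a2*b2 + a3*b3) * c4
            end associate
            ! Upper half of the slice: lanes 5:8, same statement repeated
            associate (o  => Output(5:8,i,j), &
                       a1 => A(5:8,1,j), a2 => A(5:8,2,j), a3 => A(5:8,3,j), &
                       b1 => B(5:8,1,i), b2 => B(5:8,2,i), b3 => B(5:8,3,i), &
                       c4 => C(5:8))
                o = (a1*b1 + a2*b2 + a3*b3) * c4
            end associate
        end do
    end do
end subroutine foo4
```

The arithmetic is the same as in foo; only the way each half-slice is named changes, which is why the repetition grows quickly as the number of components in a statement increases.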
Jim Dempsey
I suppose that you might have known this information already.
In subroutine 'foo', indexing the arrays by the induction variable "i" did enable AVX2 vectorization. Unfortunately the directive "!dir$ attributes align" did not have any effect: the compiler generates unaligned loads/stores. I wonder why indexing by the sequence (1,2,3) produced scalar AVX2 code while indexing by the induction variable "i" vectorized the code.
Godbolt link