Vectorization of component arrays

Intel_C_Intel · ‎02-04-2006

Hello,

I have a few questions to specialists in vectorization.
I used compilation for Pentium M specific and added vectorization information from BuildLog to comments. Lets have a look at simple but interesting code:

Code:

module Types
    implicit none
    
    type PointerVector
        real, pointer :: Array(:)
    end type
    
    type AllocatableVector
        real, allocatable :: Array(:)
    end type
    
    interface operator(+)
        module procedure SumPointerVectors, SumAllocatableVectors
    end interface operator(+)
    
    interface operator(.plus.)
        module procedure SumPointerVectorsViaPointers
    end interface operator(.plus.)
    
    integer, parameter :: N = 1000000
    
contains
    function SumPointerVectors( vector1, vector2 ) result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )
        
        ! loop was not vectorized: subscript too complex.
        result%Array = vector1%Array + vector2%Array
    end function SumPointerVectors
    
    function SumPointerVectorsViaPointers( vector1, vector2 ) result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        real, pointer :: array(:), array1(:), array2(:)
        
        allocate( result%Array(Size( vector1%Array )) )
        
        array => result%Array
        array1 => vector1%Array
        array2 => vector2%Array
        
        ! LOOP WAS VECTORIZED.
        ! LOOP WAS VECTORIZED.
        array = array1 + array2
    end function SumPointerVectorsViaPointers
    
    function SumAllocatableVectors( vector1, vector2 ) result(result)
        type(AllocatableVector) result
        type(AllocatableVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )
        
        ! loop was not vectorized: subscript too complex.
        result%Array = vector1%Array + vector2%Array
    end function SumAllocatableVectors
end module Types

Code:

program Alloc
    use Types

    implicit none

    type(PointerVector) Vector1Pointer, Vector2Pointer
    type(AllocatableVector) Vector1Allocatable, Vector2Allocatable

    allocate( Vector1Pointer%Array(N), Vector2Pointer%Array(N),&
        Vector1Allocatable%Array(N), Vector2Allocatable%Array(N) )
    
    ! LOOP WAS VECTORIZED.
    Vector1Pointer%Array = 1
    
    ! LOOP WAS VECTORIZED.
    Vector2Pointer%Array = 2
    
    ! LOOP WAS VECTORIZED.
    Vector1Allocatable%Array = 3
    
    ! LOOP WAS VECTORIZED.
    Vector2Allocatable%Array = 4

    Vector1Pointer = Vector1Pointer + Vector2Pointer
    
    Vector1Pointer = Vector1Pointer .plus. Vector2Pointer
    
    ! loop was not vectorized: unsupported loop structure.
    Vector1Allocatable = Vector1Allocatable + Vector2Allocatable

    read *

end program Alloc

1. Why it [technically] impossible to vectorize
result%Array = vector1%Array + vector2%Array
2. Why loop in SumPointerVectorsViaPointers
array = array1 + array2
was vectorized twice while intrinsic arrays would be vectorized only once?
3. To get vect orization I have to use additional pointers. But I cant do the same with allocatable components because components cant have target attribute. Thats why there is no SumAllocatableVectorsViaPointers procedure. Is it possible in Fortran 2003? I just would like to use allocatable components, but dont loose vectorization.
4. Is there another approach to get vectorization in this case?
5. Itanium doesnt have vectorization. Why so if its true? And what about vectorization in IA-64 in the future?
Any answers will be appreciated.

jim_dempsey · ‎02-06-2006

As a work around I suggest writing a subroutine that takes three arguments (the three arrays)and performs the sum. Then in your overloaded functions sepcify the subroutine call is to be inlined. In this manner the compiler optimizations will remove the code that performs the pointer assignments.

I think this will give you what you want until the compiler is fixed. I think this problem is a result of user defined types. I think the compiler assumes all user defined types will require a member offset and thus does not think it can vectorize the code. I've seen this in my code and when using the inlined subroutine call hack this seems to correct the vectorization problem in most cases.

Jim Dempsey

Steven_L_Intel1 · ‎02-06-2006

You got a partial response to your post in comp.lang.fortran regarding your misconception on the TARGET attribute.

The compiler does not try to vectorize loops where the array access is complicated. It would require some special-case code to try to get some advantage from this. Deapite Jim's remark about "fix", there is no evidence of a compiler error here.

It is a fact that arrays that are components of derived types, especially in conjuction with pointer or allocatable, complicate life for the compiler and as such some optimization opportunities may be missed.

Itanium does not have "vectorization" in the traditional sense, but it achieves high performance using other architectural features that are exploited by the compiler.

jim_dempsey · ‎02-07-2006

Stanislaus,

Try the following and repor back on your success or failure:

Code:

module Types
    implicit none
    
    type PointerVector
        real, pointer :: Array(:)
    end type
    
    type AllocatableVector
        real, allocatable :: Array(:)
    end type
    
    interface operator(+)
        module procedure SumPointerVectors, SumAllocatableVectors
    end interface operator(+)
    
    interface operator(.plus.)
        module procedure SumPointerVectorsViaPointers
    end interface operator(.plus.)
    
    integer, parameter :: N = 1000000
    
contains
    subroutine SumArrays(A, B, C)
        real :: A(*), B(*), C(*)
        A = B + C
    end subroutine SumArrays

    function SumPointerVectors( vector1, vector2 ) result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )
!DEC$ ATTRIBUTES FORCEINLINE ::   SumArrays
        call SumArrays(result%Array, vector1%Array, vector2%Array)
    end function SumPointerVectors
    
    function SumPointerVectorsViaPointers( vector1, vector2 ) result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        real, pointer :: array(:), array1(:), array2(:)
        
        allocate( result%Array(Size( vector1%Array )) )
        
        array => result%Array
        array1 => vector1%Array
        array2 => vector2%Array
        
        ! LOOP WAS VECTORIZED.
        ! LOOP WAS VECTORIZED.
        array = array1 + array2
    end function SumPointerVectorsViaPointers
    
    function SumAllocatableVectors( vector1, vector2 ) result(result)
        type(AllocatableVector) result
        type(AllocatableVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )
        
!DEC$ ATTRIBUTES FORCEINLINE ::   SumArrays
        call SumArrays(result%Array, vector1%Array, vector2%Array)
    end function SumAllocatableVectors
end module Types

Intel_C_Intel · ‎02-08-2006

Dear Stanislaus,

Please allow me to comment on the vectorization-related questions. The issues arise because compilers sometimes must make conservative assumptions that are not always directly obvious to humans (occasionally even including compiler engineers :-).

1. Why it [technically] impossible to vectorize
result%Array = vector1%Array + vector2%Array

This construct translatesinternally into memory references with components for which the compiler cannoteasily disprove loop invariance. This disambiguation problem impacts many optimizations, not just vectorization. Technically, runtime disambiguation tests could help to vectorize this code, as done for other cases, but here the testing overhead and code size increase was deemed prohibitive.

2. Why loop in SumPointerVectorsViaPointers
array = array1 + array2
was vectorized twice while intrinsic arrays would be vectorized only once?

Again, the compiler fails to disambiguate the memory references and translates the single implied loop into a setup loop and computation loop to preserve F90 semantics. The vectorizer simply vectorizes both loops.

5. Itanium doesnt have vectorization. Why so if its true? And what about vectorization in IA-64 in the future?

As Steve said, other techniques, like software pipelining, are more important for IA64. Nevertheless, we plan to incorporate some of the vectorization techniques for IA32 and EM64T into the IA64 compilers as well, but this will only deal with integer operations, not floating-point.

Aart Bik
http://www.aartbik.com/

Message Edited by abik on 02-08-2006 10:30 AM

Intel_C_Intel · ‎02-08-2006

Thanks for answers, Jim, Steve, Aart (last post I've just seen, will answer some later).

FORCEINLINE doesn't work: "unresolved _SumArrays"

In SumArrays I have to use : instead of * -- compilers should know upper bound.

INLINE doesn't work.. but I tried to use /Qipo.

Below code is the same but comments are different.

New code without /Qipo:

Code:

module Types
    implicit none
    
    type PointerVector
        real, pointer :: Array(:)
    end type
    
    type AllocatableVector
        real, allocatable :: Array(:)
    end type
    
    interface operator(.viaSumArrays.)
        module procedure SumPointerVectorsAdditionalSumProcedure,&
            SumAllocatableVectorsAdditionalSumProcedure
    end interface operator(.viaSumArrays.)
    
    interface operator(.viaPointers.)
        module procedure SumPointerVectorsViaPointers
    end interface operator(.viaPointers.)
    
    interface operator(+)
        module procedure SimpleSumPointerVectors, SimpleSumAllocatableVectors
    end interface operator(+)
    
    integer, parameter :: N = 1000000
    
contains
    subroutine SumArrays( A, B, C )
        real :: A(:), B(:), C(:)
        
        ! LOOPS WAS VECTORIZED.
        A = B + C
    end subroutine SumArrays

    function SumPointerVectorsAdditionalSumProcedure( vector1, vector2 )&
        result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )

        call SumArrays( result%Array, vector1%Array, vector2%Array )
    end function SumPointerVectorsAdditionalSumProcedure
    
    function SumAllocatableVectorsAdditionalSumProcedure( vector1, vector2 )&
        result(result)
        type(AllocatableVector) result
        type(AllocatableVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )

        call SumArrays( result%Array, vector1%Array, vector2%Array )
    end function SumAllocatableVectorsAdditionalSumProcedure

    function SumPointerVectorsViaPointers( vector1, vector2 ) result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        real, pointer :: array(:), array1(:), array2(:)
        
        allocate( result%Array(Size( vector1%Array )) )
        
        array => result%Array
        array1 => vector1%Array
        array2 => vector2%Array
        
        ! LOOPS WAS VECTORIZED.
        ! LOOPS WAS VECTORIZED.
        array = array1 + array2
    end function SumPointerVectorsViaPointers
    
    function SimpleSumPointerVectors( vector1, vector2 ) result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )
        
        ! loop was not vectorized: unsupported loop structure.
        result%Array = vector1%Array + vector2%Array
    end function SimpleSumPointerVectors
    
    function SimpleSumAllocatableVectors( vector1, vector2 ) result(result)
        type(AllocatableVector) result
        type(AllocatableVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )
        
        ! loop was not vectorized: unsuppor
ted loop structure.
        result%Array = vector1%Array + vector2%Array
    end function SimpleSumAllocatableVectors
end module Types

Code:

program Alloc
    use Types

    implicit none
    
    type(PointerVector) Vector1Pointer, Vector2Pointer
    type(AllocatableVector) Vector1Allocatable, Vector2Allocatable

    allocate( Vector1Pointer%Array(N), Vector2Pointer%Array(N),&
        Vector1Allocatable%Array(N), Vector2Allocatable%Array(N) )

    ! LOOPS WAS VECTORIZED.
    Vector1Pointer%Array = 1
    Vector2Pointer%Array = 2
    Vector1Allocatable%Array = 3
    Vector2Allocatable%Array = 4

    Vector1Pointer = Vector1Pointer .viaSumArrays. Vector2Pointer
    
    Vector1Pointer = Vector1Pointer .viaPointers. Vector2Pointer
    
    Vector1Pointer = Vector1Pointer + Vector2Pointer
    
    ! loop was not vectorized: unsupported loop structure.
    Vector1Allocatable = Vector1Allocatable .viaSumArrays. Vector2Allocatable
    
    ! loop was not vectorized: unsupported loop structure.
    Vector1Allocatable = Vector1Allocatable + Vector2Allocatable

    read *

end program Alloc

And with /Qipo:

Code:

module Types
    implicit none
    
    type PointerVector
        real, pointer :: Array(:)
    end type
    
    type AllocatableVector
        real, allocatable :: Array(:)
    end type
    
    interface operator(.viaSumArrays.)
        module procedure SumPointerVectorsAdditionalSumProcedure,&
            SumAllocatableVectorsAdditionalSumProcedure
    end interface operator(.viaSumArrays.)
    
    interface operator(.viaPointers.)
        module procedure SumPointerVectorsViaPointers
    end interface operator(.viaPointers.)
    
    interface operator(+)
        module procedure SimpleSumPointerVectors, SimpleSumAllocatableVectors
    end interface operator(+)
    
    integer, parameter :: N = 1000000
    
contains
    subroutine SumArrays( A, B, C )
        real :: A(:), B(:), C(:)
        
        A = B + C
    end subroutine SumArrays

    function SumPointerVectorsAdditionalSumProcedure( vector1, vector2 )&
        result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )

        ! LOOPS WAS VECTORIZED.
        call SumArrays( result%Array, vector1%Array, vector2%Array )
    end function SumPointerVectorsAdditionalSumProcedure
    
    function SumAllocatableVectorsAdditionalSumProcedure( vector1, vector2 )&
        result(result)
        type(AllocatableVector) result
        type(AllocatableVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )

        ! LOOPS WAS VECTORIZED.
        call SumArrays( result%Array, vector1%Array, vector2%Array )
    end function SumAllocatableVectorsAdditionalSumProcedure

    function SumPointerVectorsViaPointers( vector1, vector2 ) result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        real, pointer :: array(:), array1(:), array2(:)
        
        allocate( result%Array(Size( vector1%Array )) )
        
        array => result%Array
        array1 => vector1%Array
        array2
 => vector2%Array
        
        array = array1 + array2
    end function SumPointerVectorsViaPointers
    
    function SimpleSumPointerVectors( vector1, vector2 ) result(result)
        type(PointerVector) result
        type(PointerVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )
        
        result%Array = vector1%Array + vector2%Array
    end function SimpleSumPointerVectors
    
    function SimpleSumAllocatableVectors( vector1, vector2 ) result(result)
        type(AllocatableVector) result
        type(AllocatableVector), intent(in) :: vector1, vector2
        
        allocate( result%Array(Size( vector1%Array )) )
        
        result%Array = vector1%Array + vector2%Array
    end function SimpleSumAllocatableVectors
end module Types

Code:

program Alloc
    use Types

    implicit none
    
    type(PointerVector) Vector1Pointer, Vector2Pointer
    type(AllocatableVector) Vector1Allocatable, Vector2Allocatable

    allocate( Vector1Pointer%Array(N), Vector2Pointer%Array(N),&
        Vector1Allocatable%Array(N), Vector2Allocatable%Array(N) )

    ! LOOPS WAS VECTORIZED.
    Vector1Pointer%Array = 1
    Vector2Pointer%Array = 2
    Vector1Allocatable%Array = 3
    Vector2Allocatable%Array = 4

    Vector1Pointer = Vector1Pointer .viaSumArrays. Vector2Pointer
    
    ! LOOPS WAS VECTORIZED.
    ! LOOPS WAS VECTORIZED.
    Vector1Pointer = Vector1Pointer .viaPointers. Vector2Pointer
    
    ! LOOPS WAS VECTORIZED.
    Vector1Pointer = Vector1Pointer + Vector2Pointer
    
    ! loop was not vectorized: unsupported loop structure.
    Vector1Allocatable = Vector1Allocatable .viaSumArrays. Vector2Allocatable
    
    ! LOOPS WAS VECTORIZED.
    ! loop was not vectorized: unsupported loop structure.
    Vector1Allocatable = Vector1Allocatable + Vector2Allocatable

    read *

end program Alloc

So, what we have:

Code:

No /Qipo
Pointer:
viaSumArrays: LOOPS WAS VECTORIZED. but only once.
viaPointers:  LOOPS WAS VECTORIZED.
              LOOPS WAS VECTORIZED.
+:            loop was not vectorized: unsupported loop structure.

Allocatable:
viaSumArrays: LOOPS WAS VECTORIZED. but only once.
              loop was not vectorized: unsupported loop structure. (in program) ???
+:            loop was not vectorized: unsupported loop structure.
              loop was not vectorized: unsupported loop structure. (in program)

/Qipo
Pointer:
viaSumArrays: LOOPS WAS VECTORIZED. (in program) but only once.
viaPointers:  LOOPS WAS VECTORIZED. (in program)
              LOOPS WAS VECTORIZED. (in program)
+:            LOOPS WAS VECTORIZED. (in program) !!! but only once.

Allocatable:
viaSumArrays: LOOPS WAS VECTORIZED. but only once.
              loop was not vectorized: unsupported loop structure. (in program) ???
+:            LOOPS WAS VECTORIZED. (in program) !!! but only once.
              loop was not vectorized: (in program) unsupported loop structure. ???

Interesting, isn't it?

SumArrays works always but vectorization is only once.
ViaPointers (for pointers only) works always, vectorization is twice.

+ hav e vectorization only with /Qipo, in program.

Allocatables always have "loop was not vectorized: unsupported loop structure" even when theywere vectorized (instead when SumArrays is used). Why so? (pointers are clear)

/Qipo gives onevectorization (in each case). So, compiler can do this if it wants. :) Looks like /Qipo can do vectorization (butonce). Why /Qipo gives vectorization in program instead SumArrays? Will /Qipo always save us? I mean, give vectorization.

Why viaPointers have two vectorizations? I think assignment (for viaPointers)takes additional vectorization. No?

But what I should do with Thirteen Diagonal Simetric Matrix? If I will use Sum/Product/SubtractArrays program would become very hard readable. But if I will use viaPointers (13 aditional pointers) code would become clearer + twice vectorization for each forall-string. Example without anything.

Code:

module TDSMatrixVector
    implicit none
    
    integer, parameter :: N = 100000
    
    type Vector
		integer :: Length = 0
		real, pointer :: Array(:)
	end type Vector
	
    
    ! TDSMatrix -- Thirteen Diagonal Simetric Matrix.
    
    type TDSMatrixDimension
		integer :: Size = 0
		integer :: Sec = 0
	end type TDSMatrixDimension
	
	type TDSMatrix
		type(TDSMatrixDimension) :: Dim
		
		real, pointer :: alfa(:) => Null()
		real, pointer :: beta(:) => Null()
		real, pointer :: gamma(:) => Null()
		real, pointer :: delta(:) => Null()
		real, pointer :: epsilon(:) => Null()
		real, pointer :: dzeta(:) => Null()
		real, pointer :: eta(:) => Null()
	end type TDSMatrix
	
	interface operator(*)
		module procedure ProductOfTDSMatrixVector
	end interface operator(*)

contains
    function ProductOfTDSMatrixVector( TDSMat, vect ) result(result)
		type(Vector) result
		type(TDSMatrix), intent(in) :: TDSMat
		type(Vector), intent(in) :: vect
		
		integer i
		
		if( vect%Length > 0 ) then
			result%Length = vect%Length
			allocate(result%Array(result%Length))
			
			forall( i = 1:vect%Length )&
				result%Array(i) = TDSMat%alfa(i) * vect%Array(i)
			
			forall( i = 1:vect%Length - 1 )&
				result%Array(i) = result%Array(i) + TDSMat%beta(i)*vect%Array(i+1)
			
			forall( i = 2:vect%Length )&
				result%Array(i) = result%Array(i) +&
					TDSMat%beta(i-1)*vect%Array(i-1)
			
			forall( i = 1:vect%Length - 2 )&
				result%Array(i) = result%Array(i) +&
					TDSMat%gamma(i)*vect%Array(i+2)
			
			forall( i = 3:vect%Length )&
				result%Array(i) = result%Array(i) +&
					TDSMat%gamma(i-2)*vect%Array(i-2)
			
			forall( i = 1:TDSMat%Dim%Sec + 1 )&
				result%Array(i) = result%Array(i) +&
					TDSMat%delta(i)*vect%Array(i+TDSMat%Dim%Size-TDSMat%Dim%Sec-1)
			
			forall( i = TDSMat%Dim%Size - TDSMat%Dim%Sec:vect%Length )&
				result%Array(i) = result%Array(i) +&
					TDSMat%delta(i-TDSMat%Dim%Size+TDSMat%Dim%Sec+1)*&
					vect%Array(i-TDSMat%Dim%Size+TDSMat%Dim%Sec+1)
			
			forall( i = 1:TDSMat%Dim%Sec )&
				result%Array(i) = result%Array(i) +&
					TDSMat%epsilon(i)*vect%Array(i+TDSMat%Dim%Size-TDSMat%Dim%Sec)
			
			forall( i = TDSMat%Dim%Size - TDSMat%Dim%Sec + 1:vect%Length )&
				result%Array(i) = result%Array(i) +&
					TDSMat%epsilon(i-TDSMat%Dim%Size+TDSMat%Dim%Sec)*&
					vect%Array
(i-TDSMat%Dim%Size+TDSMat%Dim%Sec)
			
			forall( i = 1:TDSMat%Dim%Sec - 1 )&
				result%Array(i) = result%Array(i) +&
					TDSMat%dzeta(i)*vect%Array(i+TDSMat%Dim%Size-TDSMat%Dim%Sec+1)
			
			forall( i = TDSMat%Dim%Size - TDSMat%Dim%Sec + 2:vect%Length )&
				result%Array(i) = result%Array(i) +&
					TDSMat%dzeta(i-TDSMat%Dim%Size+TDSMat%Dim%Sec-1)*&
					vect%Array(i-TDSMat%Dim%Size+TDSMat%Dim%Sec-1)

			forall( i = 1:2*TDSMat%Dim%Sec - TDSMat%Dim%Size )&
				result%Array(i) = result%Array(i) +&
					TDSMat%eta(i)*vect%Array(i+2*(TDSMat%Dim%Size-TDSMat%Dim%Sec))
			
			forall( i = 1 + 2*(TDSMat%Dim%Size - TDSMat%Dim%Sec):vect%Length )&
				result%Array(i) = result%Array(i) +&
					TDSMat%eta(i-2*(TDSMat%Dim%Size-TDSMat%Dim%Sec))*&
					vect%Array(i-2*(TDSMat%Dim%Size-TDSMat%Dim%Sec))
		end if
	end function ProductOfTDSMatrixVector	
end module TDSMatrixVector

Code:

program ComplexAlloc
    use TDSMatrixVector

    implicit none

    type(TDSMatrix) TDS
    type(Vector) vect
    
    allocate( vect%Array(N), TDS%alfa(N), TDS%beta(N), TDS%gamma(N), TDS%delta(N),&
        TDS%epsilon(N), TDS%dzeta(N), TDS%eta(N) )
    
    vect = TDS * vect

end program ComplexAlloc

So, what use:

1. Sum/Product/Subtract -- hard read, one vectorization, always.

2. /Qipo -- nothing to do,one vectorization,there is no surance. For TDS-example it doesn't works, for example.

3. Via aditional pointers -- aditional variables, two vectorizations, always.

4. Leave as is and live happy without vectorization :(.

Intel_C_Intel · ‎02-08-2006

Thank you very much for your explanations, Aart.

1. Why it [technically] impossible to vectorize
result%Array = vector1%Array + vector2%Array

This construct translatesinternally into memory references with components for which the compiler cannoteasily disprove loop invariance. This disambiguation problem impacts many optimizations, not just vectorization. Technically, runtime disambiguation tests could help to vectorize this code, as done for other cases, but here the testing overhead and code size increase was deemed prohibitive.

But as we can see in prevuous my post /Qipo allows do that (see SimplePlus-Procedures for (+) in /Qipo mode).

Why compiler does work with components but not with pointers as they are (like they would be alone)? I mean, compiler can take from, sayvector1, address of Array component and work with it like it isn't the component.

2. Why loop in SumPointerVectorsViaPointers
array = array1 + array2
was vectorized twice while intrinsic arrays would be vectorized only once?

Again, the compiler fails to disambiguate the memory references and translates the single implied loop into a setup loop and computation loop to preserve F90 semantics. The vectorizer simply vectorizes both loops.

So, it would work longer than while it would be one vectorization (for intrinsic arrays)?

5. Itanium doesnt have vectorization. Why so if its true? And what about vectorization in IA-64 in the future?

As Steve said, other techniques, like software pipelining, are more important for IA64. Nevertheless, we plan to incorporate some of the vectorization techniques for IA32 and EM64T into the IA64 compilers as well, but this will only deal with integer operations, not floating-point.

Should I write vectorization-oriented programs anyway (usually I have access to processors with SSEx)? But in future all my work wiil become unnecessary (IA64 would use vectorization).

This example unsettled me a little bit.

Code:

program Vectorization

    implicit none

    integer, parameter :: N = 100000
    real, pointer :: VectorP1(:), VectorP2(:)
    real, allocatable :: VectorA1(:), VectorA2(:)
    real Vector1(N), Vector2(N)
    
    allocate( VectorP1(N), VectorP2(N), VectorA1(N), VectorA2(N) )
    
    ! LOOP WAS VECTORIZED.
    ! LOOP WAS VECTORIZED.
    VectorP1 = VectorP1 + VectorP2
    
    ! loop was not vectorized: existence of vector dependence.
    VectorA1 = VectorA1 + VectorA2

    ! LOOP WAS VECTORIZED.
    Vector1 = Vector1 + Vector2

end program Vectorization

Intel_C_Intel · ‎02-08-2006

Dear Stanislaus,

Failure to vectorize in most cases you encountered is due to the use of dope vectors where pretty much all components are symbolic (base, lowerbound, stride, etc.). When more context is visible (like under Qipo), constants can be propagated into the symbolic components, which enables better analysis and, hence, optimization. Directives, like the one below, can often workaround issues like this.

!DIR$ ivdep
VectorA1 = VectorA1 + VectorA2

Please be assured that I completely understand how frustrating this is and how our customers sometimes would like to live without vectorization. Nevertheless, I continue my attempts to improve vectorization and even encountered a few cases in your examples that at first glance- seem open for improvement. Give me some time to ponder over those.

Aart Bik
http://www.aartbik.com/

jim_dempsey · ‎02-10-2006

Stanislaus,

Is the object of the coding exercize to make the code look neet and clean or to make the code lean and mean (fast)?

You are creating an overloaded operator functions. The code using the overloaded operators will never see the hacks needed to impliment the lean (fast) functions. Often it is a good practice to use contitional compilation such that the two coding method appear adjacent and selected via a compile time variable. You can enable/disable during testing and then use the fast code for production.

Jim Dempsey

Intel_C_Intel · ‎02-10-2006

Of course I need a fast code. But in this case I can write all on ASM, can't I? I have a lot of very serious code with a lot of operands in one string and so on. If I will write hard readable code I will make a lot of mistakes wich would require much time for detecting them. Time for developing can take very much time so I will loose, not won. It doesn't mean that I wouldn't use primitive procedures. Actually for simple operations they quite good. May be they should be used also for fully working code after all chekings and debuggings. I didn't decided yet but I like your aproach because using pointers not idealtoo and requires two vectorizations. Yak. I do not know. That's why I have created this topic.

What about operations -- I do not use them actually. For this case I have procedures like ProductTDSMatrixVector( result, TDSMatrix, vector ), Soit doesn'tcreate temporary memory. It would be better, I tnink, to have operations like += -= *= /=. For this kind of operations we would declare procedures withh three operands like ProductTDSMatrixVector. But last operation may be explain why we do not have them...

Thanks for discussion.

Intel_C_Intel · ‎02-11-2006

Hello,

In software development there are probably as many different programming styles as there are programmers -the following is my point of view only,and also possiblya little bit off-topic :-).

Writing simple straight forward code, for example in Fortran 77, has a high degree of readability, and the code may usually be easy to understand both for the programmer, co-programmers, and also easy to optimize for a compiler.The drawback is thatthis approach may result in much more code to read, than both Fortran 90 and C++, where code can be re-used and the code is much more compact and powerful than Fortran 77. However, as I see it, it is sometimes difficult to read, and also detect bugs in, code that is developed to be as compact as possible. In particular in codes that are implementations of some sort of mathematics or physics formulas, it is convenient that the code in itself look more like the written formulas in the Microsoft Word document which most commonly is the case when using Fortran 77. Another point is that one particular pageof the code(filling one computer screen) should be as easy as possible to read and understand without so much prior knowledge about the rest of the code. For example if you write Y(:) = A(:,I)*X(:) + B(:), it is not possible to know what is the size of the vectors at hand, readability of the program is decreased and the probability of bugs is larger, as one has to go to another place in the program to determine the size of the vectors,vectors may not have the same size?The following is much better Y(1:N) = A(1:N,I)*X(1:N) + B(1:N), because then you know that the size of the vectors is equal to N, you have to find the value of N only to understand what is going on. If one is using new inventions like templats and operator overloading the code becomes even more unreadable without any prior knowledge of the coding standard applied or the rest of the code in the program. Onereason for writing very compact, generic and powerful code is frequently said to be that the programmer should not "worry too much" about what is going on at a deeper level in the program. This sometimes makes me a little anoyed, as both understanding what is really going on and also detecting bugs becomes more difficult - the readability of the code is actually reduced, despite that it appears to be easier to read.

Personally, if I have the choice between high readability code and efficient code, I usually go for the high readability version and instead expect that future compilers will be better at optimizing what I think is simple. Recently I have seen that Intel compilers withcompilers options/Qipo and /O3 have started to agree more and more with my point of view -a clear straigh forward and simple Fortran 77 approach, easy to understand both for compilers and human beings. Surely this approach is not applicable to all areas in software engineering, but in physics and mathematics I have found that it works quite well.

Best Regards,

Lars Petter Endresen