SIMD cross product

David_DiLaura1 · ‎02-23-2016

Colleagues,

I need to form the cross products of a single vector with a long string of other vectors (all vectors 3D). I have not been able to see how to vectorize this process in Fortran. C++ allows insertion of assembler, so something can be hand-crafted in that compiler. But it's not portable. Any suggestions?

David

JVanB · ‎02-23-2016

Fortran allows the programmer to write out steps using vector syntax that the compiler could, in principle, translate directly into efficient SIMD code. However, I have never seen ifort perform such a translation. Can you perhaps be a bit more specific as to the KINDs of your data and how they are laid out in memory?

David_DiLaura1 · ‎02-23-2016

All the vectors involved use 4-byte floats. The layout in memory can be changed to make vectorization possible. But, for example, in other vectorized operations in the same section of code, the data is already laid out Vector(1:N, 1:3) rather than Vector(1:3,1:N). That is all the x coordinates are in contiguous memory. As with the y and z-coordinates.

jimdempseyatthecove · ‎02-23-2016

Can you explain in a little more detail what your in's and out's are. From my read (which may be incorrect). Pseudo code

LongVectorOf3PointVectors
Crossing3PointVector

For Each 3PointVectorReference v in LongVectorOf3PointVectors
v = v cross Crossing3PointVector
end for

IOW you seem to be asking for an in situ cross. Is this correct?

Or, is the output going to a different LongVectorOf3PointVectors?

You might want to make a variation on this:

! in some module   
 ! AVX four-up double vector
    type TypeYMM
        SEQUENCE
        real(8) :: v(0:3)
    end type TypeYMM
...
RECURSIVE SUBROUTINE CROSS_ymm_ymm_ymm(VA,VB,VC)
    use MOD_ALL ! contains types
    type(TypeYMM) :: VA(3), VB(3), VC(3)
    type(TypeYMM), automatic :: VS(3)

    if((loc(VC) .eq. loc(VA)) .or. (loc(VC) .eq. loc(VB))) then
        ! COMPUTE THE CROSS PRODUCT
        !DEC$ VECTOR ALWAYS
        VS(1)%v = VA(2)%v*VB(3)%v - VA(3)%v*VB(2)%v
        !DEC$ VECTOR ALWAYS
        VS(2)%v = VA(3)%v*VB(1)%v - VA(1)%v*VB(3)%v
        !DEC$ VECTOR ALWAYS
        VS(3)%v = VA(1)%v*VB(2)%v - VA(2)%v*VB(1)%v

        ! TRANSFER THE CROSS PRODUCT RESULT TO THE OUTPUT VECTOR
        !DEC$ VECTOR ALWAYS
        VC(1)%v = VS(1)%v
        !DEC$ VECTOR ALWAYS
        VC(2)%v = VS(2)%v
        !DEC$ VECTOR ALWAYS
        VC(3)%v = VS(3)%v
    else
        ! COMPUTE THE CROSS PRODUCT without temps
        !DEC$ VECTOR ALWAYS
        VC(1)%v = VA(2)%v*VB(3)%v - VA(3)%v*VB(2)%v
        !DEC$ VECTOR ALWAYS
        VC(2)%v = VA(3)%v*VB(1)%v - VA(1)%v*VB(3)%v
        !DEC$ VECTOR ALWAYS
        VC(3)%v = VA(1)%v*VB(2)%v - VA(2)%v*VB(1)%v
    endif
END SUBROUTINE CROSS_ymm_ymm_ymm

Jim Dempsey

JVanB · ‎02-23-2016

The obvious thing to do is

cross_product = reshape([single(2)*vector(:,3)-single(3)*vector(:,2),single(3)*vector(:,1)-single(1)*vector(:,3),single(1)*vector(:,2)-single(2)*vector(:,1)],shape(vector))

which gives the compiler maximum information and freedom to vectorize. However, don't be surprised if given sufficient rope, the compiler manages to hang itself. Are you targeting systems with xmm, ymm, or zmm registers?

andrew_4619 · ‎02-23-2016

In a bored moment I ran some quick cross product tests. Vectorisation ( for real(4) ) did speed it up but even without I could do 600,000 cross products in less than 1ms on a basic laptop (I5 processor). Is it worth worrying about?