Autovectorizing complex*16 data type

saratoga · ‎11-19-2009

I'm having some trouble understanding why a seemingly simple loop doesn't get vectorized:

COMPLEX*16 VV,VH,HV,HH

...

DO 400 N=NMIN,NMAX

! some setup

DV1N=M*DV1(N)
DV2N=DV2(N)

CT11=DCMPLX(TR11(M1,N,NN),TI11(M1,N,NN))
CT22=DCMPLX(TR22(M1,N,NN),TI22(M1,N,NN))
CT12=DCMPLX(TR12(M1,N,NN),TI12(M1,N,NN))
CT21=DCMPLX(TR21(M1,N,NN),TI21(M1,N,NN))

CN1=CAL(N,NN)*FC
CN2=CAL(N,NN)*FS

D11=DV1N*DV1NN
D12=DV1N*DV2NN
D21=DV2N*DV1NN
D22=DV2N*DV2NN

! does not vectorize!

VV=VV+(CT11*D11+CT21*D21+CT12*D12+CT22*D22)*CN1

VH=VH+(CT11*D12+CT21*D22+CT12*D11+CT22*D21)*CN2

HV=HV-(CT11*D21+CT21*D11+CT12*D22+CT22*D12)*CN2

HH=HH+(CT11*D22+CT21*D12+CT12*D21+CT22*D11)*CN1

400 CONTINUE

Compiling with:

ifort -msse4.1 -O3 -vec-report3

Gives:

vec2.f(214): (col. 16) remark: loop was not vectorized: unsupported data type.

Line 214 is the "DO 400 N=NMIN,NMAX" line. Unfortunately there's no other explanation of what the unsupported data type might be, and searching on Google turned up nothing to help me. This loop looks trivially parallel to me; it's literally just multiplying data from a few arrays and accumulating into 4 complex numbers. I'm just not sure how to expose that parallelism to the compiler.

jimdempseyatthecove · ‎11-19-2009

REAL*4 and REAL*8 operations are supported in hardware by the SSE instruction set. REAL*16 is not supported by the hardware (SSE); it is performed by software emulation instead and is not vectorizable. COMPLEX*16 is a composite of two REAL*16 variables so would not be vectorizable using SSE instructions.

Jim Dempsey

TimP · ‎11-19-2009

Quoting - jimdempseyatthecove

COMPLEX*16 is a composite of two REAL*16 variables so would not be vectorizable using SSE instructions.

complex*16 traditionally is composed from double precision components (a legacy extension which never was standard Fortran). Likewise, dcmplx is a legacy extension not supported by the standard, nor by all compilers. We'd have to see more of the context to comment on the lack of vectorization, including all the declarations and alignments. A compilable example would be much more useful.
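For anyone who wants to check how their own compiler treats the type, a small free-form test along these lines should do it. This is only a sketch: the program and variable names are made up for illustration, and STORAGE_SIZE is a Fortran 2008 intrinsic that older compilers may not provide.

```fortran
program ckind
  implicit none
  complex*16       :: z   ! the legacy declaration under discussion
  double precision :: d

  z = dcmplx(1.0d0, 2.0d0)
  d = 0.0d0

  ! KIND of a complex value is the kind of each of its components.
  print *, 'kind(z) =', kind(z)
  print *, 'kind(d) =', kind(d)
  print *, 'components are double precision: ', kind(z) == kind(d)

  ! STORAGE_SIZE reports the size in bits of one scalar of each type.
  print *, 'bits in z =', storage_size(z)
  print *, 'bits in d =', storage_size(d)
end program ckind
```

If the components are double precision, the two kinds compare equal and the complex value occupies twice the storage of one double.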

Steven_L_Intel1 · ‎11-23-2009

I don't pretend to be an expert in this, but my understanding is that SSE4 has instructions to aid in vectorizing single-precision complex, but not double-precision complex. Given that a single value of the latter would fill an SSE register, this is not too surprising.

Tim is right - complex*16 is double-precision, not quad-precision. complex(16) would be quad-precision. The * notation indicates the total length in bytes of the datatype, not the size of each component.
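Since one double-precision complex value already fills an SSE register, one thing that may be worth trying is accumulating the real and imaginary parts in separate double-precision scalars, so the loop becomes two ordinary real reductions. The sketch below is not the original code: the arrays tr, ti, d and cn are hypothetical stand-ins for TR11/TI11, D11 and CN1, and it assumes the D and CN factors are real, which the posted snippet does not show. Whether a given compiler actually vectorizes the second loop still has to be confirmed with the vectorization report.

```fortran
program split_accum
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000          ! hypothetical loop length
  ! Hypothetical stand-ins for TR11, TI11, D11 and CN1 (assumed real).
  real(dp)    :: tr(n), ti(n), d(n), cn(n)
  complex(dp) :: vv
  real(dp)    :: vv_re, vv_im
  integer     :: i

  ! Fill the arrays with arbitrary test data.
  do i = 1, n
     tr(i) = 1.0_dp / real(i, dp)
     ti(i) = 0.5_dp * real(i, dp)
     d(i)  = 2.0_dp
     cn(i) = 0.25_dp
  end do

  ! Reference: accumulate directly into a double-precision complex scalar.
  vv = (0.0_dp, 0.0_dp)
  do i = 1, n
     vv = vv + cmplx(tr(i), ti(i), kind=dp) * d(i) * cn(i)
  end do

  ! Alternative: keep real and imaginary parts in separate real sums.
  vv_re = 0.0_dp
  vv_im = 0.0_dp
  do i = 1, n
     vv_re = vv_re + tr(i) * d(i) * cn(i)
     vv_im = vv_im + ti(i) * d(i) * cn(i)
  end do

  print *, 'complex accumulation: ', vv
  print *, 'split accumulation:   ', cmplx(vv_re, vv_im, kind=dp)
  print *, 'difference:           ', abs(vv - cmplx(vv_re, vv_im, kind=dp))
end program split_accum
```

The same idea would apply to each of VV, VH, HV and HH in the original loop, at the cost of carrying eight real accumulators instead of four complex ones and recombining them after the loop.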