Single-precision Complex Optimisation

Ben3 · ‎05-03-2012

Hi,

I'm having trouble with ifort's optimisation of single-precision complex arithmetic. A single complex number is handled well (it adds via movsd/addps/movsd), but arrays get the same per-element sequence, which wastes the upper half of the xmm registers.

For example, this (trivial) function
[fortran]
pure function array_plus_array ( left, right ) result ( r )
    complex(kind=C_FLOAT), dimension(2), intent(in) :: left
    complex(kind=C_FLOAT), dimension(2), intent(in) :: right
    complex(kind=C_FLOAT), dimension(2) :: r

    r = left + right
end function array_plus_array
[/fortran]
compiles (using /QxHost /O3 on an i7 920) into
[plain]
; parameter 1: rcx
; parameter 2: rdx
; parameter 3: r8
        mov       r9, QWORD PTR [rcx]
        movsd     xmm1, QWORD PTR [rdx]
        movsd     xmm0, QWORD PTR [r8]
        addps     xmm1, xmm0
        movsd     QWORD PTR [r9], xmm1
        movsd     xmm1, QWORD PTR [8+rdx]
        movsd     xmm0, QWORD PTR [8+r8]
        addps     xmm1, xmm0
        movsd     QWORD PTR [8+r9], xmm1
        mov       rax, rcx
        ret
[/plain]
I've trimmed it to show the important part, namely that it's still working on each complex number separately.

A much better routine would be along the lines of
[plain]
        mov       r9, QWORD PTR [rcx]
        movups    xmm1, XMMWORD PTR [rdx]
        movups    xmm0, XMMWORD PTR [r8]
        addps     xmm1, xmm0
        movups    XMMWORD PTR [r9], xmm1
        mov       rax, rcx
        ret
[/plain]
This roughly halves the number of instructions (one load/add/store group instead of two). While not important in this example, I'm working with arrays of ~150 million elements or more, so it adds up quickly.

Is there a way to get better optimisation?

As a potential solution, I have C routines written using intrinsics that generate better assembly. If I compile both the C and Fortran using /Qipo, will it be able to optimise across the mixed languages?

Cheers,
Ben

EDIT: Sorry, my version of ifort is 12.1.0.233, build 20110811 (Intel 64).
EDIT: Fixed suggested assembly (changed movpd to movups).
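
For reference, a C routine using SSE intrinsics of the kind mentioned in the post above might look roughly like the sketch below. This is not Ben's actual code: the function name `add_complex_arrays` and the `USE_SSE` macro are hypothetical, and the sketch assumes each complex number is stored as an interleaved (real, imaginary) pair of floats, which is how a Fortran complex(kind=C_FLOAT) array is laid out. It uses unaligned loads and stores (the movups form), so it makes no alignment assumption, and it falls back to plain scalar code when SSE is not available.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  /* _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#define USE_SSE 1       /* hypothetical macro, for this sketch only */
#else
#define USE_SSE 0
#endif

/* Hypothetical helper: r = left + right for n single-precision complex
 * numbers, each stored as an interleaved (re, im) pair, i.e. 2*n floats. */
void add_complex_arrays(float *r, const float *left, const float *right,
                        size_t n)
{
    size_t nfloats = 2 * n;
    size_t i = 0;

#if USE_SSE
    /* Two complex numbers (four floats) per 128-bit register. */
    for (; i + 4 <= nfloats; i += 4) {
        __m128 a = _mm_loadu_ps(left + i);   /* unaligned load  */
        __m128 b = _mm_loadu_ps(right + i);  /* unaligned load  */
        _mm_storeu_ps(r + i, _mm_add_ps(a, b));
    }
#endif

    /* Scalar tail: one leftover complex number when n is odd
     * (or the whole array when SSE is unavailable). */
    for (; i < nfloats; ++i)
        r[i] = left[i] + right[i];
}
```

Called with n = 2, the SSE loop runs exactly once, which corresponds to the single movups/addps/movups group suggested above.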

Steven_L_Intel1 · ‎05-03-2012

IPO will optimize across the languages. I'm not sufficiently familiar with the instruction set to judge your suggested code but I will pass it on to the developers for their comments.

TimP · ‎05-03-2012

In the example of arrays of length 2, it may be that the compiler would say "seems inefficient" for complex even though it might recognize data alignment. !dir$ vector aligned could over-rule "seems inefficient." There is no possibility of determining alignment at compile time in the case presented here.
While not applicable to this case, the option -complex-limited-range is likely to be needed to get vector speedup (at the expense of the limited range) for sequences including complex abs, divide, and sqrt.
Also not applicable to this case, for double precision complex, even though the compiler optimizes with simd instructions, it will report vectorization only for AVX (and MIC) compilations, given that the 128-bit simd instructions implement only a single real/imaginary operand pair.

Ben3 · ‎05-03-2012

Thanks guys.

Yes, this case is rather trivial. However, I normally use derived types containing an array, like this
[fortran]
type :: testType
    complex(kind=C_FLOAT), dimension(4, 4, 3, 3) :: sc
end type testType
[/fortran]
and have helper functions that look like
[fortran]
elemental type(testType) function add_testType ( left, right ) result ( r )
    type(testType), intent(in) :: left
    type(testType), intent(in) :: right

    r%sc = left%sc + right%sc
end function add_testType
[/fortran]
I get the same assembly, just completely unrolled (one set of movsd/addps/movsd for each of the 144 elements). With vec-report turned on, the compiler reports this
[plain]
D:Codetestingprogram.f90(14): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
D:Codetestingprogram.f90(14): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
D:Codetestingprogram.f90(14): (col. 5) remark: loop was not vectorized: not inner loop.
[/plain]
If I replace the dimension(4, 4, 3, 3) with dimension(144), it does indeed vectorise it (though not completely unrolled, and it looks like it's checking for alignment). Is there any way to force all variables of this type to be 16-byte aligned, rather than using directives for each one? I'm guessing it's getting hung up on the multiple small dimensions, even though the elements are contiguous in memory.

Cheers,
Ben

TimP · ‎05-03-2012

The only ways the compiler could take advantage of alignment would be by the directive (where you take responsibility that the operands are aligned) or by interprocedural analysis, with the caller declaring the arrays and callee in a single invocation of ifort.
I don't know whether specifying a sequence attribute would help out; without that or the inter-procedural analysis, the compiler can't assume it.
I submitted a premier.com report myself recently on a case where the compiler doesn't optimize well with 4 subscripts, some of them short sizes. You are certainly entitled to file such a report where you believe the compiler should do better.
ifort doesn't go out of its way to unroll loops fully; I suppose the alignment and sequence questions are more important here.

Steven_L_Intel1 · ‎05-03-2012

You can tell the compiler to assume that the operands are aligned as follows

!DIR$ ASSUME_ALIGNED left:16,right:16

but I didn't find that helped. There are various directives such as !DIR$ SIMD which may be of help.

You can't attach alignment to a type, but I think you'll tend to get at least 16-byte alignment anyway.

TimP · ‎05-03-2012

Under normal circumstances, X64 gives 16-byte alignment, unless you pass a section in the middle of an array or a COMMON block or derived type with misalignment. So the compiler can't assume alignment for separate compilation; it has to see the connection to the declaration.
!dir$ simd includes the effect of !dir$ vector always (the compiler doesn't make "seems inefficient" analyses) but doesn't assert alignment.
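
The point about array sections can be checked with a few lines of C. The sketch below is only an illustration: `is_aligned16` is a hypothetical helper name, and the data is assumed to be single-precision complex stored as interleaved float pairs (8 bytes per element).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary,
 * 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

If `buf` is a 16-byte-aligned array of floats, `is_aligned16(buf)` holds, but `is_aligned16(buf + 2)` does not: the second complex element starts 8 bytes in. That is the situation a callee sees when it is handed a section starting in the middle of an array, which is why the compiler cannot assume alignment without seeing the declaration.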

ZlamalJakub · ‎05-04-2012

Under normal circumstances, you will get 16-byte alignment ...

When I used the Win32 API function LocalAlloc for memory allocation, it sometimes returned memory that was only 8-byte aligned, and I had problems with optimized code. So do not count on 16-byte alignment from this function.

TimP · ‎05-06-2012

We were discussing X64, which has better default alignments than win32. Besides, when you go to API or C++ programming, you expose more problems than you see in portable Fortran. In win32, you often require special functions such as _aligned_malloc. The early version of that, _mm_malloc, didn't translate consistently to X64.
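
When memory does come from C rather than from Fortran, one way to avoid depending on the allocator's default alignment is to request it explicitly. The sketch below assumes a C11 runtime that provides `aligned_alloc` (the Microsoft C runtime does not; there `_aligned_malloc` and `_aligned_free` play the same role). The helper name `alloc_aligned16` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate at least nbytes on a 16-byte boundary.
 * Returns NULL on failure; release the memory with free(). */
void *alloc_aligned16(size_t nbytes)
{
    size_t rounded;

    /* Refuse sizes that would overflow when rounded up. */
    if (nbytes > SIZE_MAX - 15u)
        return NULL;

    /* C11 aligned_alloc expects the size to be a multiple of the
     * alignment, so round up to the next multiple of 16. */
    rounded = (nbytes + 15u) & ~(size_t)15u;
    if (rounded == 0)
        rounded = 16;

    return aligned_alloc(16, rounded);
}
```

A buffer obtained this way can safely be passed to code that was compiled with an alignment assertion, as long as the callee is handed the start of the buffer and not an offset into it.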