Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Array alignment

kt_mic
Beginner
943 Views
Hello,

I am somewhat unsure about the effects of array alignment. Assume e.g. that a 2D double precision array is the argument of a subprogram, e.g.

CALL SUB(NDA,A,......)

SUBROUTINE SUB(NDA,A,....
DOUBLE PRECISION A(NDA,*),....

Does the compiler have any way of telling whether elements of A are 16-byte aligned or only 8-byte aligned? 16-byte alignment permits aligned packed (SSE2) double-precision instructions, whereas 8-byte alignment only permits scalar operations.

It appears that on declaration arrays are aligned on 16-byte boundaries, and consequently, making NDA even would imply that the elements in the first row of A are also on 16-byte boundaries. But is there any way I can tell the compiler that NDA is even? Or does it make a difference if A is given fixed dimensions, e.g. in a COMMON area?

Michael
8 Replies
TimP
Honored Contributor III
When you declare a local array, or an array starting at a 16-byte displacement from the beginning of a COMMON, ifort will make it 16-byte aligned. When you pass an array as a function argument, the compiler has no way to know the alignment, except possibly when the function is inlined into the subroutine where the array is declared.
Ifort directives such as
!dir$ vector aligned
and
!dir$ vector nontemporal
tell the compiler to assume that the arrays in the designated loop are all 16-byte aligned, and request vectorization without regard to the default rules about efficiency. This will produce a run-time failure if the assertion is violated.
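A minimal sketch of the loop-oriented directive (the subroutine and array names are illustrative, not taken from this thread):

```fortran
      SUBROUTINE AXPYA(N, A, X, Y)
C     Illustrative sketch: the directive asserts that A, X and Y are all
C     16-byte aligned in this loop, so ifort can emit aligned packed
C     loads and stores. If any of the arrays is in fact unaligned, the
C     program fails at run time.
      INTEGER N, K
      DOUBLE PRECISION A(*), X(*), Y(*)
!DIR$ VECTOR ALIGNED
      DO K = 1, N
         Y(K) = Y(K) + A(K)*X(K)
      ENDDO
      END
```

The directive applies only to the loop that immediately follows it, so it must be repeated for each loop where the assumption holds.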
Intel_C_Intel
Employee

Dear Michael,

The Intel vectorizer also performs inter-procedural alignment analysis, where values related to alignment are propagated through the call graph. Consequently, if a subroutine is always called with aligned arguments and appropriate dimensions, the loop will be optimized accordingly, provided the compiler has sufficient compilation scope (e.g. under /Qipo). If subroutines are compiled in isolation, or if the analysis fails, the compiler usually still vectorizes the loop, but with some more overhead to force alignment at runtime. Hints like !DIR$ VECTOR ALIGNED (loop-oriented) and, a new feature, !DIR$ ASSUME_ALIGNED a:16 (data-oriented) are useful to eliminate that overhead when the programmer is certain about incoming alignment properties.
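A sketch of the data-oriented hint (argument names and the loop body are illustrative):

```fortran
      SUBROUTINE SUB(NDA, N, A)
C     Illustrative sketch: ASSUME_ALIGNED tells the compiler that the
C     dummy argument A starts on a 16-byte boundary. Unlike the
C     loop-oriented directive, this property is attached to the data,
C     so every subsequent loop over A can use aligned accesses without
C     a per-loop directive.
      INTEGER NDA, N, I, J
      DOUBLE PRECISION A(NDA,*)
!DIR$ ASSUME_ALIGNED A:16
      DO J = 1, N
         DO I = 1, N
            A(I,J) = 2.0D0 * A(I,J)
         ENDDO
      ENDDO
      END
```

Note that even with A itself 16-byte aligned, each column starts on a 16-byte boundary only when the leading dimension NDA is even, as discussed elsewhere in this thread.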

Hope this is insightful.

Aart Bik
http://www.aartbik.com/

kt_mic
Beginner
Thanks for the replies. Both are useful.

I am trying to optimize matrix operations like triangular decomposition and symmetric triangular decomposition (LDL^T) with fairly small matrices, size typically 12-24, which in turn have to be performed repeatedly. MKL works wonders for large matrices, but the overhead is probably large for small matrices, and the vectorizer's default of 8 elements means that substantial time is spent in the tail loop.

It appears from Vtune analysis that branch prediction is critical, and I am looking for optimal ways of blocking to reduce loop counts. Here, manual unrolling of outer loops seems quite effective!


Michael
TimP
Honored Contributor III
For such short loops, setting -O1 may improve performance, by reducing the work done in remainder loops.
kt_mic
Beginner
This is my (current) specific problem:

s(1)=v(1)*a(ic,j)+v(2)*a(ic+1,j)+
& v(3)*a(ic+2,j)+v(4)*a(ic+3,j)
a(j,i) = a(j,i)-s(1)

s(2)=v(1)*a(ic,j+1)+v(2)*a(ic+1,j+1)+
& v(3)*a(ic+2,j+1)+v(4)*a(ic+3,j+1)
a(j+1,i)=a(j+1,i)-s(2)

In this code block it is known that v(1) and a(ic,j) are located on a 16-byte boundary, and that the same is true of a(ic,j+1) (since the leading dimension of a, as well as ic, is even). Yet I can see from the assembler listing that I get a list of single-word operations (movps, addps, mulps).

Is there any way of getting the compiler to recognize the structure, short of hand-patching the assembler listing?

I have tried to implement the calculations as a function call with a 4-element dot product, and there I could get doubleword operations (but only if I wrote the product as a loop, rather than as an explicit sequence); the net result, however, was increased computing time!

Michael
TimP
Honored Contributor III
Those are parallel (packed) operations.
kt_mic
Beginner
I am not quite sure I get the implications. Suppose a 2D array argument to a subprogram is declared

DOUBLE PRECISION A(LDA,*)

where LDA is an argument of the subprogram. In the
subprogram, I have loops like

DO K=1,N
ASUM = ASUM + A(K,I)*X(K)
ENDDO

Clearly, if LDA is even the elements of the first row of A will be 16-byte aligned, and access can be performed accordingly. When LDA is an argument the compiler cannot decide on its own whether LDA is even or odd. My question is how I can tell the compiler that LDA is even, so that the first row is 16-byte aligned.

The compiler seems to be able to make the right decision when A is placed in a COMMON area or when the leading dimension is given explicitly, but that seems rather clumsy.
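The COMMON arrangement described above might look like this (a sketch; the names and dimensions are illustrative):

```fortran
C     Illustrative sketch: with A placed at the start of a COMMON block
C     and an even leading dimension known at compile time, the compiler
C     can prove that A is 16-byte aligned and that every column of A
C     starts on a 16-byte boundary.
      INTEGER LDA, NMAX
      PARAMETER (LDA = 24, NMAX = 24)
      DOUBLE PRECISION A(LDA, NMAX)
      COMMON /WORK/ A
```

With LDA a dummy argument instead, neither the alignment of A nor the evenness of LDA is visible to the compiler when the subprogram is compiled in isolation.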


Michael
Intel_C_Intel
Employee

Dear Michael,

As stated earlier, if the function is compiled in isolation (so that the compiler cannot deduce any property of the parameters), Intel's vectorizer generates low-overhead runtime code that will enforce aligned access patterns through A and X. *Only* if you feel the overhead is prohibitive, simply use

!DIR$ VECTOR ALIGNED
DO K=1,N
ASUM = ASUM + A(K,I)*X(K)
ENDDO

to obtain more compact code that simply assumes access patterns through A and X are aligned.

When the function is compiled with more context, the compiler may detect properties on LDA, A, and X that could avoid (some of) the runtime overhead, even without the directive, but that depends on your application and optimization level. Feel free to send me a test case (aart.bik@intel.com), so I can check if the alignment analysis does a good job on your code.

Aart Bik
http://www.aartbik.com/
