I am somewhat unsure about the effects of array alignment. Assume, for example, that a 2D double-precision array is the argument of a subprogram:
CALL SUB(NDA, A, ...)
SUBROUTINE SUB(NDA, A, ...)
DOUBLE PRECISION A(NDA,*), ...
Does the compiler have any way of telling whether elements of A are 16-byte aligned or only 8-byte aligned? 16-byte alignment permits aligned packed double-precision (SSE2) instructions, whereas 8-byte alignment permits only scalar operations.
It appears that on declaration arrays are aligned on 16-byte boundaries; consequently, making NDA even would imply that the elements of the first row of A are also on 16-byte boundaries. But is there any way I can tell the compiler that NDA is even? Or does it make a difference if A is given fixed dimensions, e.g. in a COMMON block?
Michael
Ifort directives such as
!dir$ vector aligned
and
!dir$ vector nontemporal
tell the compiler to assume that all arrays in the designated loop are 16-byte aligned, and request vectorization without regard to the default efficiency heuristics. This will produce a run-time failure if the assertion is violated.
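For comparison, the closest C analog to !dir$ vector aligned is GCC/Clang's __builtin_assume_aligned: likewise an unchecked promise by the programmer, not something the compiler verifies. A hedged sketch (function name is illustrative):

```c
#include <stddef.h>

/* Promise the compiler that x and y start on 16-byte boundaries, so
 * it may emit aligned packed loads in the loop. As with the ifort
 * directive, violating the promise at run time is undefined behavior. */
double dot(const double *restrict x, const double *restrict y, size_t n) {
    const double *xa = __builtin_assume_aligned(x, 16);
    const double *ya = __builtin_assume_aligned(y, 16);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += xa[i] * ya[i];
    return s;
}
```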
Dear Michael,
The Intel vectorizer also performs interprocedural alignment analysis, in which values related to alignment are propagated through the call graph. Consequently, if a subroutine is always called with aligned arguments and suitable dimensions, the loop will be optimized accordingly, provided the compiler has sufficient compilation scope (e.g. under /Qipo). If subroutines are compiled in isolation, or if the analysis fails, the compiler usually still vectorizes the loop, but with some extra overhead to force alignment at run time. Hints like !DIR$ VECTOR ALIGNED (loop-oriented) and, as a new feature, !DIR$ ASSUME_ALIGNED a:16 (data-oriented) are useful for eliminating that overhead when the programmer is certain about the incoming alignment properties.
Hope this is insightful.
Aart Bik
http://www.aartbik.com/
I am trying to optimize matrix operations such as triangular decomposition and symmetric triangular decomposition (LDL^T) with fairly small matrices, typically of size 12-24, which in turn have to be performed repeatedly. MKL works wonders for large matrices, but its overhead is probably large for small ones, and the vectorizer's default of 8 elements means that substantial time is spent in the tail loop.
It appears from Vtune analysis that branch prediction is critical, and I am looking for optimal ways of blocking to reduce loop counts. Here, manual unrolling of outer loops seems quite effective!
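The kind of outer-loop unrolling described can be sketched in C: computing two dot products per pass reuses each loaded v(k) for both accumulators and halves the number of loop-control branches relative to two separate loops (illustrative sketch, not the original code):

```c
#include <stddef.h>

/* s[0] = dot(v, col0), s[1] = dot(v, col1), two columns per pass:
 * v[k] is loaded once and feeds both accumulators, so the loop body
 * does more work per branch than two independent dot-product loops. */
void dot2(const double *v, const double *col0, const double *col1,
          size_t n, double s[2]) {
    double s0 = 0.0, s1 = 0.0;
    for (size_t k = 0; k < n; ++k) {
        double vk = v[k];
        s0 += vk * col0[k];
        s1 += vk * col1[k];
    }
    s[0] = s0;
    s[1] = s1;
}
```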
Michael
s(1)=v(1)*a(ic,j)+v(2)*a(ic+1,j)+
& v(3)*a(ic+2,j)+v(4)*a(ic+3,j)
a(j,i) = a(j,i)-s(1)
s(2)=v(1)*a(ic,j+1)+v(2)*a(ic+1,j+1)+
& v(3)*a(ic+2,j+1)+v(4)*a(ic+3,j+1)
a(j+1,i)=a(j+1,i)-s(2)
In this code block it is known that v(1) and a(ic,j) are located on a 16-byte boundary, and that the same is the case for a(ic,j+1) (since the leading dimension of a, as well as ic, is even). Yet I can see from the assembler listing that I get a list of scalar operations (movsd, addsd, mulsd). Is there any way of getting the compiler to recognize the structure, short of hand-patching the assembler listing?
I have tried to implement the calculations as a function call with a 4-element dot product, and there I could get packed operations (but only if I wrote the product as a loop rather than as an explicit sequence); the net result, however, was increased computing time!
Michael
DOUBLE PRECISION A (LDA,*)
where LDA is an argument of the subprogram. In the
subprogram, I have loops like
DO K=1,N
ASUM = ASUM + A(K,I)*X(K)
ENDDO
Clearly, if LDA is even, the elements of the first row of A will be 16-byte aligned, and access can be performed accordingly. When LDA is an argument, the compiler cannot decide on its own whether LDA is even or odd. My question is how I can tell the compiler that LDA is even, so that the first row is 16-byte aligned.
The compiler seems to be able to make the right decision when A is placed in a COMMON area or when the leading dimension is given explicitly, but that seems rather clumsy.
Michael
Dear Michael,
As stated earlier, if the function is compiled in isolation (so that the compiler cannot deduce any properties of the parameters), Intel's vectorizer generates low-overhead runtime code that will enforce aligned access patterns through A and X. *Only* if you feel the overhead is prohibitive, simply use
!DIR$ VECTOR ALIGNED
DO K=1,N
ASUM = ASUM + A(K,I)*X(K)
ENDDO
to obtain more compact code that simply assumes access patterns through A and X are aligned.
When the function is compiled with more context, the compiler may detect properties on LDA, A, and X that could avoid (some of) the runtime overhead, even without the directive, but that depends on your application and optimization level. Feel free to send me a test case (aart.bik@intel.com), so I can check if the alignment analysis does a good job on your code.
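The runtime scheme described here can be pictured in C: peel scalar iterations until the pointer reaches a 16-byte boundary, then run the main loop on the aligned remainder. This is an illustrative sketch of the idea, not the compiler's actual generated code:

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product with a peel loop: handle leading elements one at a time
 * until a+k is 16-byte aligned (at most one iteration for 8-byte
 * doubles), then run the main loop with aligned accesses through a,
 * where the compiler may use aligned packed loads. */
double asum_peel(const double *a, const double *x, size_t n) {
    double s = 0.0;
    size_t k = 0;
    while (k < n && ((uintptr_t)(a + k) % 16) != 0) {
        s += a[k] * x[k];
        ++k;
    }
    for (; k < n; ++k)
        s += a[k] * x[k];
    return s;
}
```

The peel and tail iterations are exactly the overhead that !DIR$ VECTOR ALIGNED removes when alignment is already guaranteed.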
Aart Bik
http://www.aartbik.com/