Dear Ergin,
If you can kindly show me a code fragment, I may be able to explain/resolve the problem. By the way, note that a brief introduction to all vectorization switches of the Intel compilers can be found in the on-line IDS article:
http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm
Aart
short int X1
int X3
#pragma vector always
for(k=0;k< N;k+=2)
{
X3
X3[k+1] += X1
}
Compile using:
icc -O3 -xN -vec_report2 -o test test.c
and the result for corresponding loop:
test.c(30) : (col. 3) remark: loop was not vectorized: operator unsuited for vectorization.
Dear Ergin,
In order to maximize SIMD parallelism, the Intel compiler always tries to vectorize a loop in the narrowest data type that still preserves semantics. For example, a loop like
char a[100];
char b[100];
for (i = 0; i < 100; i++) {
a[i] = b[i] + 1;
}
will be vectorized using the 8-way SIMD parallel instruction paddb with narrower char (8-bit) precision, even though C semantics strictly require the addition to be done in the wider int (32-bit) precision. For your example, the Intel compiler would like to vectorize the loop with int precision but, alas, SSE/SSE2/SSE3 does not support a packed dword multiply (viz. the instruction pmullw has no pmulld equivalent). Hence the "operator unsuited" complaint in the vectorization diagnostics.
Changing the data type of X3[] into short enables vectorization of this loop, but I am afraid that is not a workable solution for you. As written, I also see little opportunity to extract much other effective SIMD parallelism from this loop. I am sorry I could not be of more help.
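To illustrate when narrowing "still preserves semantics", here is a minimal scalar sketch in C. The helper names are hypothetical, uint8_t is used instead of char to keep the arithmetic fully defined, and each "narrow" function only models what one 8-bit SIMD lane would compute; it is not the compiler's actual analysis.

```c
#include <stdint.h>

/* C semantics for a[i] = b[i] + 1: promote to int, add, truncate on store. */
uint8_t add_one_promoted(uint8_t b) {
    int wide = (int)b + 1;
    return (uint8_t)wide;
}

/* Model of one paddb lane: the addition itself wraps modulo 256. */
uint8_t add_one_narrow(uint8_t b) {
    return (uint8_t)((b + 1u) & 0xFFu);
}

/* Counterexample: an average needs the ninth bit of the intermediate sum. */
uint8_t avg_promoted(uint8_t a, uint8_t b) {
    return (uint8_t)(((int)a + (int)b) >> 1);
}

/* The same average done entirely in 8 bits loses the carry out of bit 7. */
uint8_t avg_narrow(uint8_t a, uint8_t b) {
    uint8_t sum = (uint8_t)(a + b);
    return (uint8_t)(sum >> 1);
}

/* Returns 1 if narrow and promoted addition agree for all 256 inputs. */
int add_one_always_agrees(void) {
    for (int v = 0; v < 256; v++) {
        if (add_one_promoted((uint8_t)v) != add_one_narrow((uint8_t)v)) {
            return 0;
        }
    }
    return 1;
}
```

Addition survives the narrowing because the truncated store discards exactly the bits the narrow add never computed. The average does not, since the shift pulls a discarded bit back into the result.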
Aart
Dear Ergin,
To give a hopefully clearer answer, let me copy a few lines of Table 5.3 from Chapter 5 in The Software Vectorization Handbook (http://www.intel.com/intelpress/sum_vmmx.htm):
Operator  Packed bytes  words   dwords  qwords  floats  doubles
---------------------------------------------------------------
+         paddb         paddw   paddd   paddq   addps   addpd
-         psubb         psubw   psubd   psubq   subps   subpd
*         -             pmullw  -       -       mulps   mulpd
As you can see, SIMD support for addition and subtraction is orthogonal (enabling vectorization of the C data types char, short, int, long long, float, and double) while multiplication is only supported for some (enabling vectorization of data types short, float, and double).
The idiomatic int <- short x short operation performed in your example can, in principle, still be vectorized by combining the results of pmullw and either pmulhuw (unsigned) or pmulhw (signed) with some unpack instructions. The Intel compiler recognizes this idiom in certain cases, but, due to the associated overhead, I never even bothered recognizing the idiom in the context of a non-unit stride (as in your example), which is why the operation is rejected (rather than reporting that vectorization is possible but deemed inefficient, as is really the case).
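As a rough illustration of that idiom, the following C sketch models a single lane: the low half of the product as pmullw would produce it, the high half as pmulhw would, and the interleave that the unpack instructions perform. The function names are hypothetical and this is a scalar model only, not SIMD code and not what the compiler emits. It assumes the usual two's-complement conversion from uint32_t to int32_t.

```c
#include <stdint.h>

/* Low 16 bits of the 32-bit product, as pmullw yields per lane. */
uint16_t mul_lo16(int16_t a, int16_t b) {
    uint32_t prod = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(prod & 0xFFFFu);
}

/* High 16 bits of the signed 32-bit product, as pmulhw yields per lane. */
uint16_t mul_hi16(int16_t a, int16_t b) {
    uint32_t prod = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(prod >> 16);
}

/* Pair a low word with its high word, as the word-to-dword unpack
   instructions do when interleaving two result vectors. */
int32_t combine_halves(uint16_t lo, uint16_t hi) {
    return (int32_t)(((uint32_t)hi << 16) | (uint32_t)lo);
}

/* int <- short x short, assembled from the two 16-bit halves. */
int32_t widening_mul(int16_t a, int16_t b) {
    return combine_halves(mul_lo16(a, b), mul_hi16(a, b));
}
```

For example, widening_mul(300, 300) yields 90000, which does not fit in 16 bits. This is why pmullw alone is not enough. The extra multiply and the two unpacks per vector are the overhead mentioned above.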
Note that SSE3 introduced support for complex operations (which are more along the lines of your example if it were coded for single- or double-precision FP complex numbers). Two new instructions, addsubps and addsubpd, alternate subtraction and addition within the SIMD instruction so that your two statements in the loop:
for ( ; k < N; k+=2) {
X3
X3[k+1] += .. + ..;
effectively can be implemented as unit-stride operations again.
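A small scalar model may make the role of addsubps clearer. The sketch below assumes complex numbers stored as interleaved (re, im) pairs of floats. The helper names are hypothetical, and the loops stand in for what would be single SIMD instructions plus shuffles in real SSE3 code.

```c
/* Scalar model of addsubps: even lanes subtract, odd lanes add. */
void addsub4(const float a[4], const float b[4], float out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
    }
}

/* Multiply two pairs of interleaved complex numbers (re, im, re, im).
   (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
void complex_mul2(const float x[4], const float y[4], float out[4]) {
    float t1[4];
    float t2[4];
    for (int p = 0; p < 4; p += 2) {
        t1[p]     = x[p] * y[p];          /* a * c */
        t1[p + 1] = x[p] * y[p + 1];      /* a * d */
        t2[p]     = x[p + 1] * y[p + 1];  /* b * d */
        t2[p + 1] = x[p + 1] * y[p];      /* b * c */
    }
    /* One alternating subtract/add produces both real and imaginary parts
       in their final, contiguous positions. */
    addsub4(t1, t2, out);
}
```

Because the subtraction and the addition land in adjacent lanes of one result vector, the store back to memory is contiguous, which is the unit-stride property referred to above.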
Aart Bik
http://www.aartbik.com
Message Edited by abik on 10-21-2004 11:35 AM
#define N 1024
float X1
void doit() {
int k;
#pragma vector always
for(k=0;k
X3[k+1] += X1
}
}
ergin.c
ergin.c(6) : (col. 1) remark: LOOP WAS VECTORIZED.
#define N 512 // 512x2=1024
float _Complex X1
void doit() {
int k;
for(k=0;k
}
}
ergin2.c
ergin2.c(5) : (col. 1) remark: LOOP WAS VECTORIZED.
Message Edited by abik on 10-22-2004 01:08 PM
Dear Ergin,
Intel's vectorizer currently does not extract the complex arithmetic from the ergin.c version (recognizing the complex multiplication on adjacent locations is a feature I am pondering, however). As written, the loop is therefore vectorized with non-unit strides, which is why the vectorizer's efficiency heuristics currently (and correctly) reject vectorization of the loop by default. If #pragma vector always is used to override this decision, slowdown may indeed result. The ergin2.c example, on the other hand, will vectorize by default and exhibit great speedup.
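For readers comparing the two formulations, here is a hedged sketch of what they might look like side by side. The listings in this thread are truncated, so the loop bodies below are an assumption (a complex multiply-accumulate), and the function and parameter names are hypothetical. Only the contrast between stride-2 float indexing and C99 complex elements is taken from the discussion.

```c
#include <complex.h>

/* Interleaved style (in the spirit of ergin.c): stride-2 loop over floats,
   with real parts at even indices and imaginary parts at odd indices.
   Loop body is an assumed complex multiply-accumulate. */
void accumulate_interleaved(float *x3, const float *x1, const float *x2, int n) {
    for (int k = 0; k < n; k += 2) {
        x3[k]     += x1[k] * x2[k]     - x1[k + 1] * x2[k + 1];
        x3[k + 1] += x1[k] * x2[k + 1] + x1[k + 1] * x2[k];
    }
}

/* Complex style (in the spirit of ergin2.c): unit-stride loop over
   C99 complex elements, half as many iterations for the same data. */
void accumulate_complex(float complex *x3, const float complex *x1,
                        const float complex *x2, int n) {
    for (int k = 0; k < n; k++) {
        x3[k] += x1[k] * x2[k];
    }
}
```

Both functions compute the same values for finite inputs. The second states the complex arithmetic explicitly, so the compiler does not have to rediscover it from index patterns.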
You seem rather hasty to conclude that auto-vectorization is useless for high performance computing. Many knowledgeable customers that understand the capabilities as well as the limitations of automatic vectorization have been able to use Intel's vectorizer to exploit multimedia extensions with no or few changes in their source code and without resorting to assembly programming. I am sorry I could not make you a fan as well.
The book I mentioned will be useful for assembly programmers too, since it describes the multimedia instruction set in detail and discusses many ways to convert sequential constructs into SIMD form.
Message Edited by abik on 10-29-2004 05:50 PM