Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

C++ 8.1 Linux, complex vectors!

ergin
Beginner
667 Views
When I try to compile routines that multiply two complex integer vectors, the compiler refuses to vectorize them, saying the loop is unsuitable.
Has anybody had a similar problem and been able to tweak the compiler?
I know this should be possible, since FFT uses it a lot!
11 Replies
Intel_C_Intel
Employee

Dear Ergin,

If you can kindly show me a code fragment, I may be able to explain/resolve the problem. By the way, note that a brief introduction to all vectorization switches of the Intel compilers can be found in the on-line IDS article:

http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm

Aart

ergin
Beginner
Here is the sample code (N=1024):

short int X1[N], X2[N];
int X3[N];

#pragma vector always
for (k = 0; k < N; k += 2) {
    X3[k]   += X1[k]*X2[k]   - X1[k+1]*X2[k+1];
    X3[k+1] += X1[k]*X2[k+1] + X1[k+1]*X2[k];
}
Compile using:
icc -O3 -xN -vec_report2 -o test test.c

and the result for corresponding loop:
test.c(30) : (col. 3) remark: loop was not vectorized: operator unsuited for vectorization.
Intel_C_Intel
Employee

Dear Ergin,

In order to maximize SIMD parallelism, the Intel compiler always tries to vectorize a loop in the narrowest data type that still preserves semantics. For example, a loop like

char a[100];
char b[100];
for (i = 0; i < 100; i++) {
    a[i] = b[i] + 1;
}

will be vectorized using the 8-way SIMD parallel instruction paddb with narrower char (8-bit) precision, even though C semantics strictly require the addition to be done in the wider int (32-bit) precision. For your example, the Intel compiler would like to vectorize the loop with int precision but, alas, SSE/SSE2/SSE3 does not support a packed dwords multiply (viz. the instruction pmullw has no pmulld equivalent). Hence, the operator unsuited complaint in the vectorization diagnostics.

Changing the data type of X3[] into short enables vectorization of this loop, but I am afraid that is not a workable solution for you. As written, I also see little opportunity to extract much other effective SIMD parallelism from this loop. I am sorry I could not be of more help.
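For illustration (this sketch is not from the original thread), the all-short variant described above would look as follows; note that with 16-bit operands each product is truncated to 16 bits, so the results only match the int-accumulator version while the values stay small:

```c
#include <assert.h>

#define N 1024

/* All-short variant: with every operand 16-bit, the multiply maps onto
   pmullw. Beware: each product is truncated to 16 bits on the store back
   into the short accumulator. */
short X1[N], X2[N], X3[N];

void complex_mul_short(void) {
    int k;
    for (k = 0; k < N; k += 2) {
        X3[k]   += X1[k]*X2[k]   - X1[k+1]*X2[k+1];
        X3[k+1] += X1[k]*X2[k+1] + X1[k+1]*X2[k];
    }
}
```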

Aart

ergin
Beginner
Dear Aart,
From your answer it is not very clear which data types are vectorizable
under addition and multiplication.
In other words, which types of vectors can be complex-multiplied in SIMD parallel?
Intel_C_Intel
Employee

Dear Ergin,

To give a hopefully clearer answer, let me copy a few lines of Table 5.3 from Chapter 5 in The Software Vectorization Handbook (http://www.intel.com/intelpress/sum_vmmx.htm):

Operator  Packed bytes  words    dwords   qwords   floats   doubles
-------------------------------------------------------------------
   +      paddb         paddw    paddd    paddq    addps    addpd
   -      psubb         psubw    psubd    psubq    subps    subpd
   *      -             pmullw   -        -        mulps    mulpd

As you can see, SIMD support for addition and subtraction is orthogonal (enabling vectorization of the C data types char, short, int, long long, float, and double), while multiplication is only supported for some (enabling vectorization of the data types short, float, and double).

The idiomatic int <- short x short operation performed in your example can, in principle, still be vectorized by combining the results of pmullw and either pmulhuw (unsigned) or pmulhw (signed) with some unpack instructions. The Intel compiler recognizes this idiom in certain cases but, due to the associated overhead, I never even bothered recognizing the idiom in the context of a non-unit stride (as in your example), which is why the operation is rejected outright rather than reported as vectorizable but deemed inefficient, which is really the case.
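A scalar model of that idiom (my sketch, not compiler output) may make it concrete: pmullw produces the low 16 bits of each 16x16 product, pmulhw the signed high 16 bits, and the unpack step interleaves one low/high pair back into the full 32-bit product:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the SSE2 widening-multiply idiom: 'lo' plays the role
   of one pmullw lane, 'hi' of one pmulhw lane, and the final expression
   models the punpcklwd-style recombination into a 32-bit product. */
int32_t widen_mul(int16_t a, int16_t b) {
    int32_t  full = (int32_t)a * (int32_t)b;        /* reference product */
    uint16_t lo   = (uint16_t)full;                 /* pmullw lane       */
    int16_t  hi   = (int16_t)(full >> 16);          /* pmulhw lane       */
    /* interleave hi:lo back into the full signed 32-bit product */
    return (int32_t)(((uint32_t)(uint16_t)hi << 16) | lo);
}
```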

Note that SSE3 introduced support for complex operations (which are more along the lines of your example if it were coded for single- or double-precision FP complex numbers). Two new instructions, addsubps and addsubpd, can alternate subtraction and addition within one SIMD instruction, so that the two statements in your loop:

for ( ; k < N; k += 2) {
    X3[k]   += .. - ..;
    X3[k+1] += .. + ..;
}

effectively can be implemented as unit-stride operations again.
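To sketch why alternating subtract/add covers both statements at once (a scalar model in plain C, my illustration rather than the instruction sequence the compiler actually emits): multiply the duplicated real part by the (re,im) pair, multiply the duplicated imaginary part by the swapped pair, then subtract in the even lane and add in the odd lane:

```c
#include <assert.h>

/* Scalar model of an addsubps-style complex multiply on one interleaved
   (re,im) pair: t1 = re(a)*b, t2 = im(a)*swap(b); the addsub step
   subtracts in lane 0 and adds in lane 1. */
void cmul_addsub(const float a[2], const float b[2], float out[2]) {
    float t1[2] = { a[0]*b[0], a[0]*b[1] };   /* re(a) duplicated * b       */
    float t2[2] = { a[1]*b[1], a[1]*b[0] };   /* im(a) duplicated * swap(b) */
    out[0] = t1[0] - t2[0];                   /* even lane: subtract        */
    out[1] = t1[1] + t2[1];                   /* odd lane: add              */
}
```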

Aart Bik
http://www.aartbik.com

Message Edited by abik on 10-21-2004 11:35 AM

ergin
Beginner
Dear Aart,
According to your comment, changing the data type to float should vectorize the loop. However, I still get the same compiler report: "operator unsuited for vectorization".
TimP
Honored Contributor III
Aart was talking about the -xP option of recent compilers, which uses SSE3 instructions to vectorize complex float, if all the operands and destinations are that type. I don't know whether the compiler recognizes complex, without the use of C99 or equivalent explicit declaration. This code would run only on a Prescott CPU.
ergin
Beginner
Again, the Intel IPPS FFT performance level suggests that complex vector multiplication is vectorizable even with SSE2.
From your comments I conclude that the Intel C++ compiler cannot vectorize this simple code.
Intel_C_Intel
Employee
cat ergin.c
#define N 1024
float X1[N], X2[N], X3[N];
void doit() {
  int k;
  #pragma vector always
  for (k = 0; k < N; k += 2) {
    X3[k]   += X1[k]*X2[k]   - X1[k+1]*X2[k+1];
    X3[k+1] += X1[k]*X2[k+1] + X1[k+1]*X2[k];
  }
}
icl -QxN -nologo -c ergin.c
ergin.c
ergin.c(6) : (col. 1) remark: LOOP WAS VECTORIZED.
cat ergin2.c
#define N 512 // 512x2=1024
float _Complex X1[N], X2[N], X3[N];
void doit() {
  int k;
  for (k = 0; k < N; k++) {
    X3[k] += X1[k]*X2[k];
  }
}
icl -QxP -Qc99 -nologo -c ergin2.c
ergin2.c
ergin2.c(5) : (col. 1) remark: LOOP WAS VECTORIZED.
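For readers who want to convince themselves that the two formulations compute the same result, here is a small portable comparison (my sketch; it assumes any C99 compiler, not the icl command lines above):

```c
#include <assert.h>
#include <complex.h>

#define N 4

/* Explicit interleaved (re,im) form, as in ergin.c */
static void cmul_explicit(const float *x1, const float *x2, float *x3) {
    int k;
    for (k = 0; k < 2*N; k += 2) {
        x3[k]   += x1[k]*x2[k]   - x1[k+1]*x2[k+1];
        x3[k+1] += x1[k]*x2[k+1] + x1[k+1]*x2[k];
    }
}

/* C99 _Complex form, as in ergin2.c */
static void cmul_c99(const float _Complex *x1, const float _Complex *x2,
                     float _Complex *x3) {
    int k;
    for (k = 0; k < N; k++)
        x3[k] += x1[k]*x2[k];
}
```

With small integer inputs the two versions agree exactly, since every intermediate product is representable in single precision.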

Message Edited by abik on 10-22-2004 01:08 PM

ergin
Beginner
Dear Aart,
I was able to get the vectorization report as well.
Something peculiar has happened: the vectorized executable is 5 times slower than the regular one.
Again, my reference is the IPP FFT, which achieves a much higher performance level.
After consulting Intel engineers, I now know that their routine was coded in assembly; that is why it is much faster and properly vectorized.
In the very end I conclude that auto-vectorization is useless for high performance computing.
The book you mention above, does it explain how to vectorize using assembly?
Thanks,
Ergin
Intel_C_Intel
Employee

Dear Ergin,

Intel's vectorizer currently does not extract the complex arithmetic from the ergin.c version (recognizing the complex multiplication on adjacent locations is a feature I am pondering, however). As written, the loop is therefore vectorized with non-unit strides, which is why the vectorizer's efficiency heuristics currently (and correctly) reject vectorization of the loop by default. If #pragma vector always is used to override this decision, a slowdown may indeed result. The ergin2.c example, on the other hand, will vectorize by default and exhibit a great speedup.
You seem rather hasty to conclude that auto-vectorization is useless for high performance computing. Many knowledgeable customers who understand the capabilities as well as the limitations of automatic vectorization have been able to use Intel's vectorizer to exploit the multimedia extensions with few or no changes to their source code and without resorting to assembly programming. I am sorry I could not make you a fan as well.
The book I mentioned will be useful for assembly programmers too, since it describes the multimedia instruction set in detail and discusses many ways to convert sequential constructs into SIMD form.

Aart
http://www.aartbik.com

Message Edited by abik on 10-29-2004 05:50 PM
