Intel® Fortran Compiler

Limits of Intra-Vectorization

scottrwth
Beginner
Dear All,

I have obtained some useful results with intra-register
vectorization. I already get an overall speedup of about a
factor of 2 in many parts of the program.

The burning question arises: what is the maximum number of
vector registers?

If it can be as large as 200 or 500, this would be most useful.

Does the number of these registers depend on the actual
code being vectorized?

Any feedback would be most appreciated.

best wishes

Tony
6 Replies
TimP
Honored Contributor III
The same xmm registers (16 of them directly program-accessible in 64-bit mode) are used for vector and scalar SSE2. If you're interested, you may be able to find documentation for certain processor steppings on how many additional registers are available to hardware register renaming. There are enough of them that other limits matter more for your program; the most successful historical vector machines had only 8 physical registers.
As you can see from the vectorization reports, or by examining the generated code, the ifort vectorizer does a thorough analysis of how complicated loops should be split up ("distributed") to fit the hardware resources. In some cases it helps to avoid the use of local scalars which would be required in more than one distributed loop, by declaring temporary arrays instead, just as in the good old vector days.
The limited number of Write Combine buffers on P4 and Xeon usually forces loop distribution sooner than register pressure does.
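For instance, a small sketch of the kind of rewrite I mean (the subroutine, loop body, and names are invented for illustration, not taken from your code):

      subroutine demo(a, b, c, e, tmp, n)
      integer n, i
      double precision a(n), b(n), c(n), e(n), tmp(n)
c     With a local scalar instead of tmp(i), e.g.
c        t    = a(i)*b(i)
c        c(i) = t + 1.0d0
c        e(i) = t*t - 1.0d0
c     the scalar t would be required in every piece of a distributed
c     loop.  With a temporary array the vectorizer is free to split
c     the loop wherever the register file (or the write-combine
c     buffers) demand it.
      do 20 i = 1, n
         tmp(i) = a(i)*b(i)
         c(i)   = tmp(i) + 1.0d0
         e(i)   = tmp(i)*tmp(i) - 1.0d0
   20 continue
      end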
scottrwth
Beginner
For the test routine:

      program test
      integer, parameter :: n=100000000
      integer i
      double precision a(n), b, c
      double precision vor, nach, cputime, timesum
      c = 5.0d0
      vor = cputime()
      DO 10 i = 1, n
         b = dexp(-dble(i))
         c = b + 1.0d0
         a(i) = dsin(dble(i)*b + c)
   10 CONTINUE
      nach = cputime()
      timesum = (nach - vor)
      write(6,*) ' CPU time: ', vor, ' sec'
      write(6,*) ' CPU time: ', nach, ' sec'
      write(6,*) ' CPU time: ', timesum, ' sec'
      stop
      end

with the timer given by the following C routine:

#include <time.h>

double cputime_() {

    clock_t start;
    double cpu;

    start = clock();
    /* cpu = ((double) start) / CLOCKS_PER_SEC; */
    cpu = ((double) start) * 1.e-6;  /* CLOCKS_PER_SEC is 1000000 on Linux */
    return cpu;
}

Results:

ifort -O3 testvec.f machc.o
[scott@alpha src]$ ./a.out
CPU time: 0.000000000000000E+000 sec
CPU time: 9.57000000000000 sec
CPU time: 9.57000000000000 sec

and vectorization gives me:

ifort -xP -O3 testvec.f machc.o
testvec.f(8) : (col. 9) remark: LOOP WAS VECTORIZED.
[scott@alpha3 src]$ ./a.out
CPU time: 0.000000000000000E+000 sec
CPU time: 5.06000000000000 sec
CPU time: 5.06000000000000 sec

The vectorization was successful but only gave a factor of
about 2 speedup here (9.57 s down to 5.06 s).

Am I limited to only 2 intra-registers?

best wishes

Tony
TimP
Honored Contributor III
If you would like to disassemble the code for exp() and sin() in the svml library and verify that it uses more than 2 registers, you're welcome to do so. The speedup, of course, is related to the degree of parallelism and the efficiency of memory access, not to whether more registers are used in the vector case. Since you are using all 128 bits of each register in the vectorized case, rather than 64, that correlates reasonably well with the speedup you see.
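If you would rather not dig through the library, compiling your own source to assembly and counting packed versus scalar double-precision instructions gives a rough picture; something along these lines (the grep patterns are only one crude way to count):

ifort -xP -O3 -S testvec.f
grep -c 'addpd\|mulpd' testvec.s    # packed (two-wide) double-precision arithmetic
grep -c 'addsd\|mulsd' testvec.s    # scalar double-precision arithmetic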
Intel_C_Intel
Employee

Dear Tony,

I fail to see the correlation you assume between the number of vector registers and the obtained speedup. A penalty is only paid when the compiler has to spill vector registers, which is not the case for your benchmark. Incidentally, your benchmark puts more emphasis on memory performance than on computational performance (and, as you can see, intra-register vectorization can even help improve memory performance). If you modify your benchmark to operate many times on an array of, for instance, only 1024 elements, you will be surprised by the resulting speedup! Such a benchmark would be more representative of a computationally bound application.
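A sketch of the kind of modification I mean (the repetition count is arbitrary, and the final write is only there so the compiler cannot discard the work; a clever compiler may still hoist some of the repeated computation):

c     sketch only: the same kernel, repeated many times over a small
c     array so it stays in cache and is bound by computation, not memory
      program testsmall
      integer, parameter :: n = 1024, reps = 100000
      integer i, j
      double precision a(n), b, c
      double precision vor, nach, cputime
      vor = cputime()
      do 20 j = 1, reps
         do 10 i = 1, n
            b = dexp(-dble(i))
            c = b + 1.0d0
            a(i) = dsin(dble(i)*b + c)
   10    continue
   20 continue
      nach = cputime()
c     print an element so the work cannot be optimized away
      write(6,*) ' CPU time: ', nach - vor, ' sec  a(1)=', a(1)
      end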

Aart Bik
http://www.aartbik.com/

scottrwth
Beginner
Many many thanks for your messages,

Please enlighten me as to what example would illustrate the
answer to my question: just how many vector registers can
I actually use?

One email mentioned 16 xmm registers directly program-accessible
in 64-bit mode.

Call me dumb, but in the end it's the speedup that matters. What
can I do? What do I have to do to make a do-loop of the form:

do 10 i=1,n

...

10 continue

run as fast as possible for n independent iterations?

best wishes and thanks in advance

Tony
Intel_C_Intel
Employee

Dear Tony,

IA-32 supports eight 128-bit registers (xmm0-xmm7) and EM64T supports sixteen 128-bit registers (xmm0-xmm15). For double precision, the vector length is two, which generally also defines an upper bound on the speedup that can be expected. In some cases, such as vectorizing math functions with SVML, super-linear speedup is achievable. To demonstrate this, here are the execution times of your test on my 3.4 GHz P4, for your original code (t1) and for a modified version that executes the loop 100000 times over an array of length 1000 (t2). Clearly, the impact of intra-register vectorization is much more profound on the computationally bound kernel.

      -O2 (s)   -xP (s)   speedup
t1      6.56      3.81      1.7
t2     12.07      3.50      3.4

When vectorizing a loop that operates on a double-precision array like:

DO I = 1, N
.. A(I) ..
ENDDO

with multimedia instructions, one effectively strip-mines the loop as follows (note that the vector length 2 is much smaller than vector lengths supported by traditional vector processors):

DO I = 1, N, 2
xmm0 = A(I:I+1:1)
..
ENDDO
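
Spelled out in plain Fortran (just a sketch with a made-up loop body), the transformation amounts to something like the following, including the scalar remainder loop that picks up a leftover iteration when N is odd:

      subroutine scale2(a, b, s, n)
      integer n, i
      double precision a(n), b(n), s
c     main strip-mined loop: iterations i and i+1 map onto one packed
c     (VL=2) double-precision instruction sequence
      do 10 i = 1, n - mod(n, 2), 2
         a(i)   = b(i)   * s
         a(i+1) = b(i+1) * s
   10 continue
c     remainder loop: a leftover iteration (odd n) runs as scalar code
      do 20 i = n - mod(n, 2) + 1, n
         a(i) = b(i) * s
   20 continue
      end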

Note that SSE/SSE2/SSE3 supports the following packed data types:

packed bytes  : VL = 16
packed shorts : VL = 8
packed ints   : VL = 4
packed quads  : VL = 2
packed SP     : VL = 4
packed DP     : VL = 2

Here the vector length VL provides a rough first estimate of the obtainable speedup. Hope this enlightens some more (otherwise you will have to buy my book :-)

Aart Bik
http://www.aartbik.com/
