I have managed to obtain some useful results with
intra-register vectorization. Already, I get a factor of
2 speedup overall for many parts of the program.
The burning question arises: what is the maximum number of
vector registers?
If it can be as large as 200 or 500, this would be most useful.
Does the number of these registers depend on the actual
code being vectorized?
Any feedback would be most appreciated.
best wishes
Tony
As you can see from the PARTIAL reports, or by examining the generated code, the ifort vectorizer does a thorough analysis to see how complicated loops should be split up ("distributed") to fit the hardware resources. In some cases, it helps to avoid local scalars that would be required in more than one distributed loop by declaring temporary arrays instead, just as in the good old vector days.
The limited number of Write Combine buffers on P4 and Xeon usually forces loop distribution sooner than register pressure does.
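As a rough illustration of the scalar-to-temporary-array idea above, here is a minimal sketch in C (the language of the timer routine used in this thread) rather than Fortran. The function names `fused` and `distributed` and the arithmetic are hypothetical placeholders, and nothing here claims that ifort would actually distribute this particular loop.

```c
#include <stddef.h>

/* Fused form: the local scalar t is produced once and consumed by two
   statements. If the loop were split, t would be needed in both pieces. */
void fused(const double *x, double *y, double *z, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t = x[i] + 1.0;   /* local scalar shared by both statements */
        y[i] = t * 2.0;
        z[i] = t * x[i];
    }
}

/* Hand-distributed form: t is promoted to the temporary array tmp[], so
   each of the three loops is self-contained and can be handled separately. */
void distributed(const double *x, double *y, double *z,
                 double *tmp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = x[i] + 1.0;
    for (size_t i = 0; i < n; i++)
        y[i] = tmp[i] * 2.0;
    for (size_t i = 0; i < n; i++)
        z[i] = tmp[i] * x[i];
}
```

Both functions compute the same results; the second trades extra memory traffic through `tmp` for smaller, independent loops.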
        program test
        integer,parameter :: n=100000000
        integer i
        double precision a(n),b,c
        double precision vor,nach,cputime,timesum
        c=5.0d0
        vor=cputime()
        DO 10 i=1,n
           b = dexp(-dble(i))
           c = b+1.0d0
           a(i) = dsin(dble(i)*b+c)
   10   CONTINUE
        nach=cputime()
        timesum=(nach-vor)
        write(6,*) ' CPU time: ',vor,' sec'
        write(6,*) ' CPU time: ',nach,' sec'
        write(6,*) ' CPU time: ',timesum,' sec'
        stop
        end
with the timer given by the .c routine:
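#include <time.h>

/* Returns the CPU time used so far, in seconds. The trailing underscore is
   meant to match the external name generated for the Fortran call cputime(). */
double cputime_() {
    clock_t start;
    double cpu;
    start = clock();
    /* cpu = ((double) start) / CLOCKS_PER_SEC; */
    cpu = ((double) start) * 1.e-6;   /* assumes CLOCKS_PER_SEC == 1000000 */
    return cpu;
}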
Results:
ifort -O3 testvec.f machc.o
[scott@alpha src]$ ./a.out
CPU time: 0.000000000000000E+000 sec
CPU time: 9.57000000000000 sec
CPU time: 9.57000000000000 sec
and vectorization gives me:
ifort -xP -O3 testvec.f machc.o
testvec.f(8) : (col. 9) remark: LOOP WAS VECTORIZED.
[scott@alpha3 src]$ ./a.out
CPU time: 0.000000000000000E+000 sec
CPU time: 5.06000000000000 sec
CPU time: 5.06000000000000 sec
The vectorization was successful but only gave a speedup factor
of about 1.9 here (9.57 s down to 5.06 s).
Am I limited to only 2 intra-registers?
best wishes
Tony
Dear Tony,
I fail to see your assumed correlation between the number of vector registers and the obtained speedup. A penalty is only paid when the compiler has to spill vector registers, which is not the case for your benchmark. Incidentally, your benchmark tests memory performance more than computational performance (and as you can see, intra-register vectorization can even help improve memory performance). If you modify your benchmark to operate many times on an array of, for instance, only 1024 elements, you will be surprised by the resulting speedup! Such a benchmark would be more representative of a compute-bound application.
Aart Bik
http://www.aartbik.com/
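The modified benchmark suggested above (many passes over a small array that stays in cache) might be sketched in C as follows. This is only an illustrative outline: the function name `time_repeated_kernel` is hypothetical, and the loop body is a placeholder expression, not the `dexp`/`dsin` kernel from the original test.

```c
#include <stddef.h>
#include <time.h>

/* Runs `reps` passes over the small array a[0..n-1] and returns the CPU
   time consumed, in seconds. With n around 1024 the array fits in cache,
   so the timing reflects computation rather than memory bandwidth. */
double time_repeated_kernel(double *a, size_t n, long reps)
{
    clock_t start = clock();
    for (long r = 0; r < reps; r++) {
        for (size_t i = 0; i < n; i++) {
            double x = (double)(i + 1);
            /* Placeholder arithmetic; adding r makes each pass differ. */
            a[i] = x / (x + 1.0) + (double)r;
        }
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling it with, say, `n = 1024` and a large `reps`, once with and once without vectorization enabled, gives the kind of comparison described above.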
Please enlighten me as to what example would be illustrative in terms
of finding (the answer to my question) just how many vector
registers I can have.
One email mentioned 16 xmm registers directly program-accessible
in 64-bit mode.
I'm dumb; in the end, it's the speedup that matters. What
can I do? What is it that I have to do to get a do-loop
of the form:
      do 10 i=1,n
      ...
   10 continue
to be as fast as possible for n independent processes?
best wishes and thanks in advance
Tony
Dear Tony,
IA32 supports eight 128-bit registers (xmm0-xmm7) and EM64T supports sixteen 128-bit registers (xmm0-xmm15). For double precision, the vector length is two, which generally also defines an upper bound on the speedup that can be expected. In some cases, like vectorizing math functions with SVML, super-linear speedup is achievable. To demonstrate this, here are the execution times on my 3.4GHz P4 for your original code (t1) and for a modified version that executes the loop 100000 times for an array length of 1000 (t2). Clearly, the impact of intra-register vectorization is much more pronounced on the computationally bound kernel.
        O2      xP      speedup
t1      6.56    3.81    1.7
t2     12.07    3.50    3.4
When vectorizing a loop that operates on a double-precision array like:
      DO I = 1, N
         .. A(I) ..
      ENDDO
with multimedia instructions, one effectively strip-mines the loop as follows (note that the vector length 2 is much smaller than vector lengths supported by traditional vector processors):
      DO I = 1, N, 2
         xmm0 = A(I:I+1:1)
         ..
      ENDDO
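The strip-mining transformation above can be sketched in plain C as follows. This is an assumption-laden illustration, not compiler output: the function name `scale_strip_mined` and the scaling operation are hypothetical, and the remainder loop for odd lengths is an addition that the pseudocode above leaves out.

```c
#include <stddef.h>

enum { VL = 2 };   /* doubles per 128-bit xmm register */

/* Multiplies every element of a[0..n-1] by s, VL elements per trip. */
void scale_strip_mined(double *a, size_t n, double s)
{
    size_t i = 0;

    /* Main loop: each trip handles one VL-wide chunk, like A(I:I+1). */
    for (; i + VL <= n; i += VL) {
        double v[VL];                    /* stands in for xmm0 */
        for (size_t k = 0; k < VL; k++)
            v[k] = a[i + k];             /* packed load */
        for (size_t k = 0; k < VL; k++)
            v[k] *= s;                   /* packed multiply */
        for (size_t k = 0; k < VL; k++)
            a[i + k] = v[k];             /* packed store */
    }

    /* Remainder loop: leftover elements when n is not a multiple of VL. */
    for (; i < n; i++)
        a[i] *= s;
}
```

With `VL` fixed at 2, each trip of the main loop corresponds to one packed double-precision operation, which is why the trip count, and ideally the run time, is roughly halved.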
Note that SSE/SSE2/SSE3 supports the following packed data types:
packed bytes  : VL=16
packed shorts : VL=8
packed ints   : VL=4
packed quads  : VL=2
packed SP     : VL=4
packed DP     : VL=2
Here the vector length VL provides a rough first estimate of the obtainable speedup. Hope this enlightens some more (otherwise you will have to buy my book :-).
Aart Bik
http://www.aartbik.com/
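The vector lengths listed above follow from dividing the 128-bit (16-byte) register width by the element size, and the speedups in the timing table are simply ratios of run times. A small C sketch of both calculations, with hypothetical helper names `vector_length` and `speedup`:

```c
#include <stddef.h>

enum { XMM_BYTES = 16 };   /* one 128-bit xmm register */

/* Number of elements of the given size that fit in one xmm register. */
size_t vector_length(size_t elem_bytes)
{
    return XMM_BYTES / elem_bytes;
}

/* Measured speedup: scalar run time divided by vectorized run time. */
double speedup(double t_scalar, double t_vector)
{
    return t_scalar / t_vector;
}
```

For example, `vector_length(sizeof(double))` is 2 on platforms with 8-byte doubles, while `speedup(12.07, 3.50)` for the t2 row comes out above that estimate, which is the super-linear case attributed to SVML.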
