Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

double loop optim

finjulhich
Beginner
1,092 Views

hi,

-----------
#include <iostream>
#include <omp.h>

int main()
{
    const double p0 = 0.51;
    const double p1 = 0.49;
    const int N = 100000;
    __declspec(align(16)) double opt[N];
    for (int i=0; i<N; i++) opt[i] = i;

    const double t0 = omp_get_wtime();
    for (int ts=N-1; ts>0; ts--)
        for (int i=0; i<ts; i++)
            opt[i] = p0*opt[i] + p1*opt[i+1];
    const double t1 = omp_get_wtime();

    std::cout << t1-t0 << std::endl;
}
-----------

I use the 10.0.020 compiler and build it with
icpc -xT -O3 -openmp -o 2 2.cpp
I have a Mac Pro with a Core 2-based Xeon (SSSE3), 4 MB L2 cache.

ps: the loops are vectorized
2.cpp(12): (col. 2) remark: LOOP WAS VECTORIZED.
2.cpp(17): (col. 3) remark: LOOP WAS VECTORIZED.

It takes 4.5s

Are there any suggestions to speed up more, perhaps rearranging some bits, or using intrinsics?

ps: the use of omp is purely to get the time function.

I looked at the source-annotated assembly... In the comments, there are references like #14.20. Are these section 14 item 20 of some document somewhere?

rds,

0 Kudos
24 Replies
TimP
Honored Contributor III
954 Views
Those comments in the assembly are source line and column references.
Your code sample is recursive in the outer loop (each iteration of the outer loop has to wait for data to be produced from the previous iteration), so it seems unlikely any further speedup could be found. Speed may be limited by the rate at which data can move to and from registers, taking into account that you have an unaligned vector access.
You may be interested in examining which tradeoffs the compiler has made in order to deal with cache line boundary crossing. As you specified alignment, I assume you have arranged for the compiler to use aligned moves for store and one of the loads. A trick there is to recognize that movaps does the same thing as movapd, but sometimes faster, due to saving 1 byte of code. One of the techniques which the compiler sometimes uses is full cache line unrolling, so that unaligned parallel loads can be used for the operands which don't split cache lines, and the pair of operands which cross the boundary can be loaded individually.
The gcc method of using the individual loads for the unaligned operand would be superior to exclusive use of unaligned parallel loads. This then shows that you must allow 4 clock cycles per loop iteration pair for data movement, even with full pipelining disregarding the recursion stall cycles. So that very limited analysis could explain a mere 50% performance increase from vectorization, in case that is what you see.
The incentive for undertaking such detailed work is limited, given that the Penryn CPUs are to be launched to market next week, and they should greatly reduce the cache line split penalty.
Another issue of detail is to try to fit your unrolled loop code (if not full cache line unrolled) into 4 16-byte aligned chunks. As you have no jumps inside the loop, this should be accomplished easily by making the top of the loop code body 16 byte aligned. Making it 32-byte aligned could improve efficiency of hardware prefetch. If icpc didn't do this, you can try it by editing the assembly (or checking a current gcc).
0 Kudos
levicki
Valued Contributor I
954 Views

You'll be happy to know that ICC 10.1.011 for Windows doesn't vectorize this code. Great regression testing Intel, way to go.

0 Kudos
finjulhich
Beginner
954 Views

Thanks,
The gain from vectorization only was 1.5s (from 5.8s to 4.3s = 26%).

The source-annotated assembly

L_B1.3:
;;;
;;; for (int ts=N-1; ts>0; ts--)
;;; for (int i=0; i<ts; i++)
;;; opt[i] = p0*opt[i] + p1*opt[i+1];
movaps _2il0floatpacket.3(%rip), %xmm3 #13.13
xorl %edx, %edx #11.2
movl $99999, %eax #
movaps _2il0floatpacket.4(%rip), %xmm2 #13.25
movsd _2il0floatpacket.5(%rip), %xmm1 #
movsd _2il0floatpacket.6(%rip), %xmm0 #
L_B1.4:
testl %eax, %eax #12.19
jle L_B1.13 # Prob 50%
L_B1.5:
movslq %eax, %rcx #12.3
cmpq $8, %rcx #12.3
jl L_B1.15 # Prob 10%
L_B1.6:
movl %ecx, %edi #12.3
movl %edi, %esi #12.3
andl $7, %esi #12.3
subl %esi, %edi #12.3
movslq %edi, %r8 #12.3
xorl %edi, %edi #
movq %r8, %rsi #
shlq $3, %rsi
L_B1.7:
movaps (%rsp,%rdi), %xmm5 #13.16
movsd 8(%rsp,%rdi), %xmm4 #13.28
movhpd 16(%rsp,%rdi), %xmm4 #13.28
movaps 16(%rsp,%rdi), %xmm7 #13.16
movsd 24(%rsp,%rdi), %xmm6 #13.28
movhpd 32(%rsp,%rdi), %xmm6 #13.28
movaps 32(%rsp,%rdi), %xmm9 #13.16
movsd 40(%rsp,%rdi), %xmm8 #13.28
movhpd 48(%rsp,%rdi), %xmm8 #13.28
movaps 48(%rsp,%rdi), %xmm11 #13.16
movsd 56(%rsp,%rdi), %xmm10 #13.28
movhpd 64(%rsp,%rdi), %xmm10 #13.28
mulpd %xmm3, %xmm5 #13.16
mulpd %xmm2, %xmm4 #13.28
addpd %xmm4, %xmm5 #13.28
movaps %xmm5, (%rsp,%rdi) #13.4
mulpd %xmm3, %xmm7 #13.16
mulpd %xmm2, %xmm6 #13.28
addpd %xmm6, %xmm7 #13.28
movaps %xmm7, 16(%rsp,%rdi) #13.4
mulpd %xmm3, %xmm9 #13.16
mulpd %xmm2, %xmm8 #13.28
addpd %xmm8, %xmm9 #13.28
movaps %xmm9, 32(%rsp,%rdi) #13.4
mulpd %xmm3, %xmm11 #13.16
mulpd %xmm2, %xmm10 #13.28
addpd %xmm10, %xmm11 #13.28
movaps %xmm11, 48(%rsp,%rdi) #13.4
addq $64, %rdi #12.3
cmpq %rsi, %rdi #12.3
jl L_B1.7 # Prob 99% #12.3
L_B1.9: # Preds L_B1.7 L_B1.15
movq %r8, %rdi #
shlq $3, %rdi #
movq %rcx, %rsi #
shlq $3, %rsi #
cmpq %rcx, %r8 #12.3
jge L_B1.13 # Prob 1%
L_B1.11:
movsd (%rsp,%rdi), %xmm5 #13.16
movsd 8(%rsp,%rdi), %xmm4 #13.28
mulsd %xmm1, %xmm5 #13.16
mulsd %xmm0, %xmm4 #13.28
addsd %xmm4, %xmm5 #13.28
movsd %xmm5, (%rsp,%rdi) #13.4
addq $8, %rdi #12.3
cmpq %rsi, %rdi #12.3
jl L_B1.11 # Prob 99% #12.3
L_B1.13:
addl $-1, %eax #11.2
addl $1, %edx #11.2
cmpl $99999, %edx #11.2
jb L_B1.4 # Prob 99%
L_B1.14:
xorl %eax, %eax #14.1
addq $800008, %rsp #14.1
L____tag_value__main.12: #
ret #14.1
L____tag_value__main.13: #
# LOE
L_B1.15: # Preds L_B1.5 # Infreq
xorl %r8d, %r8d #12.3
jmp L_B1.9 # Prob 100% #12.3
.align 1
L____tag_value__main.14: #
# mark_end;
.section __DATA,__data
# -- End _main
.section __DATA,__data
.align 4
.align 4

Is movaps _2il0floatpacket.3(%rip), %xmm3 the top of the loop body? Do I have to place an .align 16 directive just before that?

Thanks tim,

0 Kudos
jimdempseyatthecove
Honored Contributor III
954 Views

The L_B1.7: is the main loop. Your .align would go in front of that line.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
954 Views

Finjulhic,

Experiment with

for (int ts=N-1; ts>0; ts--)
{
for (int i=0; i<ts; i++) opt1[i] = p1*opt[i+1];          // target 2nd array opt1
for (int i=0; i<ts; i++) opt[i]  = p0*opt[i] + opt1[i];  // combine opt1 with opt
}

Although this appears to be twice the work, it may run faster.
I assume opt[ts] is initialized to 0.0 or some innocuous number.

The other method, using one statement, requires code to shuffle the results; this method does not. It does require additional memory for the opt1 array.

Jim Dempsey


0 Kudos
finjulhich
Beginner
954 Views

with this,
const double t0 = omp_get_wtime();
for (int ts=N-1; ts>0; ts--) {
for (int i=0; i<ts; i++) opt1[i] = p1*opt[i+1];
for (int i=0; i<ts; i++) opt[i] = p0*opt[i] + opt1[i];
}
const double t1 = omp_get_wtime();
3.cpp(17): (col. 3) remark: FUSED LOOP WAS VECTORIZED.
>>> line 17 is the 1st inner loop: for (int i=0; i<ts; i++) opt1[i] = p1*opt[i+1];

4.23s instead of 4.33s
2.3% gain

I'll try now with aligning to loop code.

0 Kudos
finjulhich
Beginner
954 Views

Putting .align 16 at the top of the loop says:

2.cpp:48:Alignment too large: 15. assumed.
With .align 15 it obviously takes way too long.

I tried aligning the loop code on an 8-byte boundary; it's 5.2s instead of 4.3s.

perhaps this is as fast as one can go on this processor config?

0 Kudos
TimP
Honored Contributor III
954 Views
Check your alignment directives. If you are using a linux assembler, you might follow the patterns used by gcc, where
.p2align 4,,7
.p2align 3
means to use up to 7 bytes of padding in order to get 16-byte alignment, followed by unconditional 8-byte alignment. If your alignment directive is taken to mean 2^15 (2 raised to power 15) byte alignment, it will not do what you think. "p2" reminds us that this facility was instituted for Pentium II, so people who use assembler have had some opportunity to get used to it.
Try making an object file and viewing the code there (for gnu binutils, objdump -S) so you can see how the align is implemented.
It looks like your loop body may be too large for alignment to help much. The most likely reason for alignment to help would be if it avoids an awkward split of the last instruction (jl L_B1.7) across a 16-byte boundary.
The largest loop alignment which is likely to help is 32-byte alignment, but the platform doesn't give adequate support for this. It's worth bearing in mind when you test.
0 Kudos
jimdempseyatthecove
Honored Contributor III
954 Views

Finjulhic,

If you go with the double inner loop method, each loop can be aligned separately.

Also, if ts gets large to the point where both arrays opt and opt1 won't fit in the processor cache, then consider fragmenting the work. From your other posted code examples I would assume ts would have been reduced to an amount of memory that fits in cache.

Jim

0 Kudos
levicki
Valued Contributor I
954 Views

But why does 10.1.011 for windows fail to vectorize this at all?

0 Kudos
jimdempseyatthecove
Honored Contributor III
954 Views

>>But why does 10.1.011 for windows fail to vectorize this at all?

Premier Support should be able to tell you.

Another potential workaround for the two inner loops (opt and opt1) is to perform each loop in increments of 2 using two statements (explicitly stating pairs). Then after each loop, test whether ts was odd and if so perform the last step. Forcing and declaring the opt arrays aligned would help too.

The compiler should have been able to make this determination. The "cost" to you is a handful of statements.
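
A sketch of what that might look like, reusing the names from the two-loop version posted earlier (this is only a guess at the intended shape, not something tested against 10.1.011; the trailing if after each loop handles an odd trip count):

__declspec(align(16)) double opt[N], opt1[N];

for (int ts = N-1; ts > 0; ts--) {
    int i = 0;
    for (; i + 1 < ts; i += 2) {              // explicit pairs
        opt1[i]   = p1*opt[i+1];
        opt1[i+1] = p1*opt[i+2];
    }
    if (i < ts)                               // ts odd: one element left over
        opt1[i] = p1*opt[i+1];

    i = 0;
    for (; i + 1 < ts; i += 2) {
        opt[i]   = p0*opt[i]   + opt1[i];
        opt[i+1] = p0*opt[i+1] + opt1[i+1];
    }
    if (i < ts)
        opt[i] = p0*opt[i] + opt1[i];
}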

You can wait for a fix, if that is what you want.

Jim Dempsey

0 Kudos
JenniferJ
Moderator
954 Views

There's a bug in the Windows compiler related to optimization with EH (exception handling).

If you comment out the last code line (std::cout), it should vectorize.

0 Kudos
levicki
Valued Contributor I
954 Views

Jennifer, can that bug affect anything else? Would the code vectorize if you disable SEH?

0 Kudos
JenniferJ
Moderator
954 Views

It only affects loop vectorization with code containing EH. It happens with some cases like this, not always.

Disabling SEH doesn't help. But moving code around does. For this case, moving the "std::cout" line into a function works.
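
A minimal sketch of that workaround (print_time is a hypothetical helper name; the timed loops in main stay exactly as before):

#include <iostream>

// Keeping the EH-bearing stream call out of the function that contains
// the loops is what allows the vectorizer to do its job.
static void print_time(double dt)
{
    std::cout << dt << std::endl;
}

// In main(), the last line becomes:
//     print_time(t1 - t0);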

0 Kudos
levicki
Valued Contributor I
954 Views

Would using printf() instead of std::cout help? In other words, is it related to using std:: namespace somehow?

0 Kudos
finjulhich
Beginner
954 Views

Yes, using printf() instead helps, at least with 10.0 on Windows.
On Linux/Mac, there's no bug.

rds,

0 Kudos
finjulhich
Beginner
954 Views

hi,

Aligning the 2 inner loops didn't yield much compared to the case without explicit alignment.

With the 2-inner-loops version, the generated assembly looks better than in the previous case.

movaps loads from opt[] and opt1[] (8 doubles, because the cache line size is 64 bytes) into xmm registers, some unaligned moves (shifted by 1) from opt[] into other xmm registers, addition of xmm registers (after multiplication), then aligned stores back to opt[] and opt1[], and a jump to the beginning label of the loop.

All examples run with sizeof(opt) + sizeof(opt1) < 2 MB, so inside the L2 cache (so I haven't tested with fragmentation).

4.2s... I doubt much more gain can be obtained?

I'm curious whether Penryn's SSE4 dot-product instruction could bring any benefit here?
In the one-inner-loop case, p0*opt[i] + p1*opt[i+1] looks neatly like a dot product.
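
For reference, a rough SSE2 intrinsics rendering of that single-statement inner loop, essentially what the vectorizer already emits (one aligned and one unaligned load per pair, mulpd/addpd, aligned store); the helper name step_once and the scalar tail for odd trip counts are just illustrative:

#include <emmintrin.h>   // SSE2 intrinsics

// One time step of the recurrence, two elements at a time.
// Assumes opt points to 16-byte aligned storage of at least ts+1 doubles.
static void step_once(double* opt, int ts, double p0, double p1)
{
    const __m128d vp0 = _mm_set1_pd(p0);
    const __m128d vp1 = _mm_set1_pd(p1);
    int i = 0;
    for (; i + 1 < ts; i += 2) {
        __m128d a = _mm_load_pd(&opt[i]);        // opt[i], opt[i+1]   (aligned)
        __m128d b = _mm_loadu_pd(&opt[i + 1]);   // opt[i+1], opt[i+2] (unaligned)
        _mm_store_pd(&opt[i],
                     _mm_add_pd(_mm_mul_pd(vp0, a), _mm_mul_pd(vp1, b)));
    }
    for (; i < ts; ++i)                          // scalar tail for odd ts
        opt[i] = p0*opt[i] + p1*opt[i+1];
}

Note that DPPD produces one scalar result per instruction, so it would compute only one opt[i] at a time; it is not an obvious win over the packed mulpd/addpd form, which produces two.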

0 Kudos
TimP
Honored Contributor III
954 Views
If you want the compiler to consider special optimizations for short fixed loop lengths, such as 1, you need the
#pragma loop count (1,n2,...)
which expresses your preferences, as well as setting -xS. You may not like the generated code, if it does in fact generate separate versions for various cases.
On Penryn, optimized SSE2 dot product code may be best for long loops.
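As an illustration of where such a hint would sit (a sketch only: the pragma spelling is the one quoted above and the trip-count values are made up for the example; check the icpc documentation for the exact form in your version):

for (int ts = N-1; ts > 0; ts--) {
    // Trip-count hint for the vectorizer; the last iterations of ts make
    // this inner loop very short, hence the small values in the list.
    #pragma loop count (1, 2, 50000)
    for (int i = 0; i < ts; i++)
        opt[i] = p0*opt[i] + p1*opt[i+1];
}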
0 Kudos
jimdempseyatthecove
Honored Contributor III
954 Views

Finjulhich,

The optimizer may have done this for you but from the ASM code you posted a while back it looked like the optimizer burned some instructions to avoid what looked like an unnecessary load from memory. Try making the inner loop go in steps of 2

opt[i] = p0*opt[i] + p1*opt[i+1];
opt[i+1] = p0*opt[i+1] + p1*opt[i+2];

The ASM code you posted looked like it tried to move opt[i+1] from register to register. In this case I think what looks like a redundant read might be offset by what may end up as always writing in aligned pairs of doubles.

You might be at the case of diminishing returns and would be better off focusing your efforts elsewhere.

Jim Dempsey

0 Kudos
levicki
Valued Contributor I
779 Views

"Optimizer" is generating awfull code for this loop:

	const int N = 100000;
	__declspec(align(16)) double opt[N];
	for (int i = 0; i < N; i++) {
		opt[i] = i;
	}

Take a look at this HORRIBLE asm code:

	xor		eax, eax
	movdqa		xmm1, oword ptr ds:oword_41F1C0 ; constant 2 2 2 2
	movdqa		xmm0, oword ptr ds:oword_41F1D0 ; constant 3 2 1 0
loc_40104E:
	cvtdq2pd	xmm2, xmm0
	paddd		xmm0, xmm1
	movaps		oword ptr [eax+427580h], xmm2
	cvtdq2pd	xmm3, xmm0
	paddd		xmm0, xmm1
	movaps		oword ptr [eax+427590h], xmm3
	cvtdq2pd	xmm4, xmm0
	paddd		xmm0, xmm1
	movaps		oword ptr [eax+4275A0h], xmm4
	cvtdq2pd	xmm5, xmm0
	paddd		xmm0, xmm1
	movaps		oword ptr [eax+4275B0h], xmm5
	add		eax, 40h
	cmp		eax, 0C3500h
	jb		short loc_40104E

The compiler is using cvtdq2pd so abundantly, as if it had ZERO clock cycles of latency!!! I understand that the Core architecture has lower latency so it may not be a problem, but NetBurst has 8 clocks of latency and 3 clocks of throughput for CVTDQ2PD. The question is: is it really necessary?

Why not something like this?

		xor	eax, eax
		movapd	xmm0, xmmword ptr [c1] ; constant 0.0 1.0
		movapd	xmm1, xmmword ptr [c1 + 16] ; constant 2.0 3.0
		movapd	xmm2, xmmword ptr [c1 + 32] ; constant 4.0 5.0
		movapd	xmm3, xmmword ptr [c1 + 48] ; constant 6.0 7.0
		movapd	xmm7, xmmword ptr [c1 + 64] ; constant 8.0 8.0 (stride)
fill:
		movapd	[opt + eax + 0], xmm0
		movapd	[opt + eax + 16], xmm1
		movapd	[opt + eax + 32], xmm2
		movapd	[opt + eax + 48], xmm3
		addpd	xmm0, xmm7
		addpd	xmm1, xmm7
		addpd	xmm2, xmm7
		addpd	xmm3, xmm7
		add	eax, 64
		cmp	eax, 800000 ; N * 8
		jb	fill

For me it works faster: 0.0368 s vs. the compiler's 0.0371 s. If you change movapd to movntpd in the loop body and insert an sfence after the jb, you get 0.0355 s.
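
In C++ terms, the hand-written version above amounts to carrying the running values as doubles and adding a stride, instead of converting an integer counter with cvtdq2pd every iteration; a rough equivalent sketch (whether the compiler actually keeps d and the stride in xmm registers is not guaranteed):

// Fill opt[0..N-1] with 0.0, 1.0, 2.0, ... without int-to-double
// conversions in the loop body. N is assumed even here.
static void fill_ramp(double* opt, int N)
{
    double d = 0.0;
    for (int i = 0; i < N; i += 2) {
        opt[i]     = d;
        opt[i + 1] = d + 1.0;
        d += 2.0;
    }
}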

0 Kudos
Reply