Software Archive
Read-only legacy content

loop alignment

david_livshin1
Beginner

Hi,

Inserting "leal 0(%eax),%eax" (a form of NOP) anywhere (at least, I tried many places) inside the body of the loop


___dcox86_wl_3_:

movsd as1+32040(,%edx,8),%xmm7
addl $8,%edx
movsd as1+31984(,%edx,8),%xmm6
cmpl %ecx,%edx
movsd as1+31992(,%edx,8),%xmm5
movsd as1+32000(,%edx,8),%xmm4
movsd as1+32008(,%edx,8),%xmm3
movsd as1+32016(,%edx,8),%xmm2
movsd as1+32024(,%edx,8),%xmm1
movsd as1+32032(,%edx,8),%xmm0
subsd as1+31968(,%edx,8),%xmm7
subsd as1+31976(,%edx,8),%xmm6
subsd as1+31984(,%edx,8),%xmm5
subsd as1+31992(,%edx,8),%xmm4
subsd as1+32000(,%edx,8),%xmm3
subsd as1+32008(,%edx,8),%xmm2
movsd %xmm7,as1+23960(,%edx,8)
movsd %xmm6,as1+23968(,%edx,8)
subsd as1+32016(,%edx,8),%xmm1
movsd %xmm5,as1+23976(,%edx,8)
subsd as1+32024(,%edx,8),%xmm0
movsd %xmm4,as1+23984(,%edx,8)
movsd %xmm3,as1+23992(,%edx,8)
movsd %xmm2,as1+24000(,%edx,8)
movsd %xmm1,as1+24008(,%edx,8)
movsd %xmm0,as1+24016(,%edx,8)

jne ___dcox86_wl_3_

makes it run faster by more than 20% (the test was done on a Pentium 4).

Any idea why?

Nothing happens if "leal 0(%eax),%eax" is inserted right before the loop.

Thank you in advance,

David

------

David Livshin

http://www.dalsoft.com


TimP
Honored Contributor III
Such situations may be easier to analyze if measures are taken to align the top of the loop, as your title hints. One of the more effective methods is the conditional alignment directive emitted by gnu compilers and supported by gnu ld, which pads with NOP equivalents only when 7 bytes or fewer of NOPs are needed.
You might get some insight into the remaining effects by running an event-collecting profiler like Intel VTune. Needless to say, it would take a real expert to enumerate all the likely causes of the effect you observed.
One possibility that comes to mind is that some stalls trigger a retry a fixed number of cycles later. In that case, inserting a shorter delay, sufficient to avoid the stall, could increase performance. So you would look with an event profiler to see whether you can identify a stall that becomes less frequent with the padding.
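For reference, a minimal sketch in GNU as of that conditional alignment, applied to the posted loop (the max-skip value of 7 matches the behaviour described above; treat the exact parameters as an example):

	.p2align 4,,7			# align to 2^4 = 16 bytes, but only pad
					# if 7 or fewer bytes of NOPs are needed
___dcox86_wl_3_:
	movsd	as1+32040(,%edx,8),%xmm7
	# ... rest of the loop body as posted ...
	jne	___dcox86_wl_3_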

xorpd
Beginner
___dcox86_wl_3_:
movapd xmm0, [rcx+rdx]
shufpd xmm2, xmm0, 1
movapd xmm1, xmm0
subpd xmm0, xmm2
movapd [rax+rdx], xmm0
movapd xmm0, [rcx+rdx+16]
shufpd xmm1, xmm0, 1
movapd xmm2, xmm0
subpd xmm0, xmm1
movapd [rax+rdx+16], xmm0
add rdx, 32
jns ___dcox86_wl_3_
david_livshin1
Beginner
Nice code (although 64-bit and not in gnu as), but what about "movapd" alignment?

If the alignment is not known (as in the case I posted, which is the gcc 4.2.0-generated loop of kernel 12 from the Livermore Loops benchmark:

/*
*******************************************************************
* Kernel 12 -- first difference
*******************************************************************
*/

parameters (12);

do
{
    for ( k=0 ; k<n ; k++ )
    {
        x[k] = y[k+1] - y[k];
    }

    endloop (12);
}
while (count < loop);



), the code, when packing is attempted, is:

___dcox86_wl_3_:
movsd as1+32040(,%edx,8),%xmm2
addl $8,%edx
movhpd as1+31984(,%edx,8),%xmm2
cmpl %ecx,%edx
movsd as1+31968(,%edx,8),%xmm5
movhpd as1+31976(,%edx,8),%xmm5
movsd as1+31992(,%edx,8),%xmm3
movhpd as1+32000(,%edx,8),%xmm3
movsd as1+31984(,%edx,8),%xmm6
movhpd as1+31992(,%edx,8),%xmm6
movsd as1+32008(,%edx,8),%xmm4
movhpd as1+32016(,%edx,8),%xmm4
subpd %xmm5,%xmm2
movsd as1+32000(,%edx,8),%xmm7
movhpd as1+32008(,%edx,8),%xmm7
movsd as1+32024(,%edx,8),%xmm1
movsd %xmm2,as1+23960(,%edx,8)
movhpd as1+32032(,%edx,8),%xmm1
subpd %xmm6,%xmm3
movsd as1+32016(,%edx,8),%xmm0
movhpd %xmm2,as1+23968(,%edx,8)
movhpd as1+32024(,%edx,8),%xmm0
movsd %xmm3,as1+23976(,%edx,8)
movhpd %xmm3,as1+23984(,%edx,8)
subpd %xmm7,%xmm4
movsd %xmm4,as1+23992(,%edx,8)
movhpd %xmm4,as1+24000(,%edx,8)
subpd %xmm0,%xmm1
movsd %xmm1,as1+24008(,%edx,8)
movhpd %xmm1,as1+24016(,%edx,8)
jne ___dcox86_wl_3_

which is not nearly as efficient as the one you posted appears to be.

----
David Livshin

http://www.dalsoft.com
TimP
Honored Contributor III
Compilers normally insert a single conditional scalar remainder iteration (before and after the vector loop) to align the destination, so that vector code such as this (from icc) can be used:

..B1.133: # Preds ..B1.133 ..B1.132
movaps 32040+space1_(%rdi), %xmm1
movsd 32032+space1_(%rdi), %xmm0
movhpd 32040+space1_(%rdi), %xmm0
movaps 32056+space1_(%rdi), %xmm3
movsd 32048+space1_(%rdi), %xmm2
movhpd 32056+space1_(%rdi), %xmm2
movaps 32072+space1_(%rdi), %xmm5
movsd 32064+space1_(%rdi), %xmm4
movhpd 32072+space1_(%rdi), %xmm4
movaps 32088+space1_(%rdi), %xmm7
movsd 32080+space1_(%rdi), %xmm6
movhpd 32088+space1_(%rdi), %xmm6
subpd %xmm0, %xmm1
movaps %xmm1, 24024+space1_(%rdi)
subpd %xmm2, %xmm3
movaps %xmm3, 24040+space1_(%rdi)
subpd %xmm4, %xmm5
movaps %xmm5, 24056+space1_(%rdi)
subpd %xmm6, %xmm7
movaps %xmm7, 24072+space1_(%rdi)
addq $64, %rdi
cmpq %rsi, %rdi
jl ..B1.133 # Prob 99%

Current CPUs do work better avoiding unaligned loads by using half-register loads. The job certainly can be done with fewer instructions.
xorpd
Beginner

Nice code (although 64-bit and not in gnu as), but what about "movapd" alignment?

Of course a second loop must be coded to take into account the possibility that source and destination may be misaligned relative to one another:

___dcox86_wl_4:
movapd xmm0, [rcx+rdx]
shufpd xmm2, xmm0, 1
subpd xmm2, xmm1
movapd xmm1, xmm0
movapd [rax+rdx], xmm2
movapd xmm2, [rcx+rdx+16]
shufpd xmm0, xmm2, 1
subpd xmm0, xmm1
movapd xmm1, xmm2
movapd [rax+rdx+16], xmm0
add rdx, 32
js ___dcox86_wl_4 ; Should have been js in original code, too.

The prolog must be capable of detecting this relative misalignment and selecting the appropriate inner loop, as well as setting up registers for induction variable elimination and picking off the first destination element if the destination array is misaligned. The epilog must be ready to pick off a stray destination element as well if necessary.
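For illustration, a hypothetical prolog fragment (AT&T syntax; the label and register roles follow the loops above and are assumptions, not actual prolog code) that selects the inner loop from the relative alignment of the two base pointers:

	movq	%rcx, %r8		# %rcx = source base, %rax = destination base
	xorq	%rax, %r8
	testq	$8, %r8			# do the bases differ by 8 modulo 16?
	jnz	use_shifted_loop	# hypothetical label: run the ___dcox86_wl_4 variant
	# ... otherwise fall through and set up the ___dcox86_wl_3_ variant ...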

Multiple loops to handle the same task depending on alignment are to be expected when SIMD operations are used. In code to replace memcpy() with movapd instructions (memcpy.asm), I needed 16 versions of the inner loop because palignr only takes an immediate shift count. Results: the worst case was under 1200 clocks to copy 8000 bytes, and the easiest case (already aligned) was 670 clocks. I think I could have gotten the worst case under 1000 clocks by improving the prolog and epilog code and by running the loops in descending rather than ascending order. As it stands, that example compares favorably with rep movsq, which takes over 1000 clocks for the most favorable alignment, over 3000 clocks for slightly unfavorable alignment, and over 10000 clocks for the hated alignments.
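To illustrate why an immediate-only shift count forces one loop per relative alignment, here is a rough sketch (AT&T syntax, SSSE3; the register roles and the 8-byte case are assumptions, not the actual memcpy.asm code) of an inner loop specialized for a source that is 8 bytes less aligned than the destination:

	# %rsi = 16-byte aligned address 8 bytes below the source data,
	# %rdi = 16-byte aligned destination, %rcx = number of 16-byte blocks
	movdqa	(%rsi), %xmm0		# prime with the block holding the first 8 bytes
copy_loop_8:				# hypothetical label
	movdqa	16(%rsi), %xmm1		# next aligned source block
	movdqa	%xmm1, %xmm2		# keep it for the following iteration
	palignr	$8, %xmm0, %xmm1	# splice the 16 wanted bytes out of the two blocks
	movdqa	%xmm1, (%rdi)		# aligned store
	movdqa	%xmm2, %xmm0		# carry the newer block forward
	addq	$16, %rsi
	addq	$16, %rdi
	subq	$1, %rcx
	jnz	copy_loop_8

A different relative alignment needs the same loop with a different immediate in palignr, hence the separate copies of the loop.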

TimP
Honored Contributor III
In the case originally cited, there are 2 overlapping sources, one of which must be aligned relative to the destination. Context, not shown, but visible to the compiler, actually determines which is aligned with the destination, so 2 versions aren't needed.
xorpd
Beginner

Current CPUs do work better avoiding unaligned loads by using half-register loads.

I thought that I might take the time to stigmatize the above as total rubbish. In my memcpy() example cited above, I get from 640 to 790 clocks, including prolog, epilog, and the function call, to copy 8000 bytes when source and destination have relative alignment 8 mod 16; if half-register loads were used, the best that could have been obtained would have been 1000 clocks, due to the 1000 instructions issued to port 2.

In the present case as well, the icc code couldn't possibly retire 16 bytes of destination in fewer than 3 clocks because of the 3 loads required to do so, but there doesn't seem to be anything stopping a Core microarchitecture processor from retiring those same 16 bytes in slightly under 2 clocks. Some improvement might also be seen on a P4, and I would like to see the original poster implement our differing opinions in his format so that he could determine which code is faster.

On at least Core microarchitecture machines, one is normally better off with fewer loads, instead using dedicated data-swizzling operations on the fly so as to avoid scalar operations or data reloads to the extent possible.

david_livshin1
Beginner
In the case originally cited, there are 2 overlapping sources, one of which must be aligned relative to the destination. Context, not shown, but visible to the compiler, actually determines which is aligned with the destination, so 2 versions aren't needed.

The source loop

for ( k=0 ; k<n ; k++ )
{
    x[k] = y[k+1] - y[k];
}

was unrolled by the compiler (gcc 4.2.0), which, in order to ensure that the loop count (n) is a multiple of the unroll count of 4, prefaced it with conditional code that makes it impossible to determine the alignment of the memory references, e.g. in

movsd as1+32040(,%edx,8),%xmm7

at the entry to the loop %edx is in the range from 0 to 3, so the alignment of as1+32040(,%edx,8) is not clear (even though the alignment of as1, and therefore of as1+32040, could be determined).
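For what it is worth, a hypothetical run-time test (GNU as, AT&T syntax; the label is an assumption) that an optimizer could emit when the alignment cannot be proven statically:

	leal	as1+32040(,%edx,8), %eax	# effective address of the first load
	testl	$15, %eax			# is it 16-byte aligned?
	jnz	unaligned_version		# hypothetical label: keep the movsd version
	# ... otherwise a packed (movapd) version could be used ...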

Also, I don't understand why the source should "be aligned relative to the destination".

Back to my original posting.
I am writing an x86 assembly code optimizer (see http://www.dalsoft.com) and need to understand the situation I described in order to find a solution and translate it into C++. Perhaps someone from Intel Software Network Support could provide me with documentation and/or an algorithm and/or a heuristic that might help me solve my problem. Are there other forums where it might be appropriate to post my question?

------

David Livshin

http://www.dalsoft.com


Intel_Software_Netw1

David,

Posting your questions to this forum is fine, although you might also be interested in the Intel VTune Performance Analyzer for Windows* or Linux* forums if you'd like to try Tim18's suggestions, and the Intel C++ Compiler forum might contain some useful information for you on C++ in general.

Of course, there are also the Intel 64 and IA-32 Architectures Software Developer's Manuals, specifically Volumes 3A and 3B, the System Programming Guide, and the Intel 64 and IA-32 Architectures Optimization Reference Manual.

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Contact us

TimP
Honored Contributor III
In order to use aligned parallel stores, and aligned parallel loads for the appropriate operand, the compiler needs to find out which operand is aligned relative to the destination. In the case in question, the arrays are defined in a struct, and the vectorizing compiler sees that y[k+1] and x are relatively aligned. Compile-time tests determine whether to execute just one loop iteration for alignment, as well as which source operand is aligned. If the alignments were not known at compile time, this would be done with run-time tests.
As xorpd noted, the aligned operand can be read by aligned loads; the unaligned operand, being the same except for the 8-byte offset, could be set up from the aligned one by mov and shufpd instructions. This still involves half-register operations, in my view.
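As a small illustration of that mov/shufpd construction (AT&T syntax; the register assignments are assumed), suppose %xmm0 = {y[k], y[k+1]} and %xmm1 = {y[k+2], y[k+3]} come from 16-byte aligned movapd loads:

	movapd	%xmm0, %xmm2		# copy the aligned pair {y[k], y[k+1]}
	shufpd	$1, %xmm1, %xmm2	# %xmm2 = {y[k+1], y[k+2]}, the 8-byte-offset operand
	subpd	%xmm0, %xmm2		# %xmm2 = {y[k+1]-y[k], y[k+2]-y[k+1]}
	# an aligned movapd can then store %xmm2 to x[k] and x[k+1]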
I note that gnu compilers don't manage to vectorize this loop in my copy of the Livermore Kernels, although gfortran 4.3 vectorizes a fair amount of LFK. I doubt that a preference for the scheme gcc uses to adjust for non-vector unrolling is the reason for not vectorizing; gcc knows an appropriate method for vectorization remainder loops, involving remainders both before and after the vectorized loop body. More likely, this loop exhibits special requirements which come up relatively rarely.
Hardware vendors have recognized the desirability of minimizing the performance penalty of unaligned full-width loads; if that happens, gcc might be able to treat vector loops more like scalar ones.
levicki
Valued Contributor I

May I suggest this little experiment:

	jmp	loop_start
	align	16
loop_start:
	...
	jne	loop_start

If that also helps, then it is the alignment. If not, then most likely, by including the delay, you encounter fewer replays if you are running the above code on a NetBurst architecture CPU.

Moreover, if you have more than 100 iterations, this form of the loop would probably be faster:

	mov	edx, [count]
	jmp	loop_start
	align	16
loop_start:
	test	edx, edx
	jz	loop_exit
	...
	sub	edx, n		; n = number of elements
				; processed in one loop pass
	jmp	loop_start
loop_exit:
	...

Or if you want to be able to use edx as an array index:

	mov	ecx, [count]
	xor	edx, edx
	jmp	loop_start
	align	16
loop_start:
	cmp	edx, ecx
	jz	loop_exit
	...
	add	edx, n		; n = number of elements
				; processed in one loop pass
	jmp	loop_start
david_livshin
Beginner

May I suggest this little experiment:

	jmp	loop_start
	align	16
loop_start:
	...
	jne	loop_start

No, your modification doesn't affect the execution time. Actually, I tried many modifications of the code before submitting my original message, and the only difference was when a NOP was inserted inside the loop; interestingly, anywhere inside the loop.

The exact execution times on a 2.8GHz Pentium 4 under Linux:

Original loop: 5.52 seconds
With NOP inside the loop: 4.4 seconds

-- 
David Livshin

http://www.dalsoft.com


jimdempseyatthecove
Honored Contributor III

David,

What may be happening, and this is only a guess on my part, is that the P4 does speculative loads and execution; i.e. the loop as coded without the NOP is situated such that the P4 is executing the (or part of the) instruction(s) following the jne during each iteration of the loop. This may be a case where the padding from the NOP inside the loop causes the look-ahead (speculative load and/or execution) to work less (or with less introduced latency).

What happens if you place a NOP (or a series of NOPs) following the loop jne? If this affects the run time at all, then it would be a strong indicator that the look-ahead is causing the effect. You may need a few NOPs to fill the pipeline.

An alternative test would be something like

jne loop_start
jmp short further
NOP;NOP;NOP;NOP;NOP
further:

The idea is to make the speculative load/execution give up by seeing the jmp.

Now, if you find this is the case, you might also be able to determine a generalized rule for ending your loops:

jne loop_start
NOP
align(4)

jne loop_start
jmp over
NOP
align(16)
over:

Or something like that.

Jim Dempsey

david_livshin
Beginner

What happens if you place a NOP (or a series of NOPs) following the loop jne? If this affects the run time at all, then it would be a strong indicator that the look-ahead is causing the effect. You may need a few NOPs to fill the pipeline.

This modification doesn't affect execution time.

My current "theory" about the behavior of this loop is based on the observation that every iteration of the loop introduces a cache miss. Perhaps the introduction of an instruction that doesn't use the cache (like the NOP I was using) somehow relieves the queue of instructions that cannot be issued due to a cache miss (if such a queue exists at all) and which need to be issued later (when the data from the cache becomes available), and perhaps such a retry is expensive. Unfortunately I couldn't find enough information to verify my assumptions.

-- 
David Livshin

http://www.dalsoft.com



