Dot Products and overhead of Address increments....

k_sarnath · ‎06-02-2010

All,

I was just trying out to optimize the "Dot Product" operation of 2 vectors. Both the vectors are laid out in aligned memory locations as arrays.

I did an assembly implementation only to realize that repeated additions are causing resource stalls (at least thats what I infer)

For example, consider this:

[bash].
.
.
pxor       xmm7, xmm7       ; Result

movapd     xmm0, [esi]
mulpd      xmm0, [edi]
movapd     xmm1, [esi+16]   ; Uses INTEGER_ADD port
mulpd      xmm1, [edi+16]   ; Uses INTEGER_ADD port
addpd      xmm7, xmm0       ; Uses SIMD ADD port
movapd     xmm2, [esi+32]   ; Uses INTEGER ADD port
mulpd      xmm2, [edi+32]   ; Uses INTEGER ADD port
addpd      xmm7, xmm1       ; Uses SIMD ADD port
addpd      xmm7, xmm2       ; Uses SIMD ADD port
.
.
.[/bash]

To me, the repeated use of integer unit seems to cause lot of stalls in the resulting code -- This results in poor SSE performance. Can some1 throw some light on this?

Is this the right way to write SIMD code for Dot Products (Let us keep the DPPS instruction away for a while -- I want to understand SIMD correctly.....)

Kindly advice,

Thanks for any help,
Best Regards,
Sarnath

TimP · ‎06-02-2010

This is close enough to the way gcc auto-vectorizes. Avoidable resource stalls, as you said, are due to repeated addition to the same register (xmm7), not to "repeated use of integer unit." There's no point in so much unrolling when using a single register accumulator. If you would look at how icc does it, you would see that 2 xmm registers are used to accumulate partial sums (when the SIMD width is 4 or less), so as to overcome the latency, a practice referred to as riffling. 20 years ago, it was referred to as interleave. That degree of unrolling requires a fairly long dot product (100 or so) to pay off.
There's not much useful role for dpps in a dot product long enough for unrolling.

k_sarnath · ‎06-02-2010

Hi Tim18,

Thanks for the repsonse. The 2-way accumulator logic is good.

Even in plain 'C' - I was able to get significant performance (12 seconds to 9 seconds -- without touching assembly or SSE- MSVC)...

But with SSE (+ the double accumulator) -- I was able to reach 7 seconds... Thats all. I thought I would reach 4 or 5 seconds with SSE and that did not happen -- which is quite disappointing for me. In another kernel that we are developing, we are able to get 2.4x performance with "SSE" on doubles -- which is very good....(hand optimized assembly..).

Here is an excerpt of our Dot Product implementation.

[plain]		pxor xmm7, xmm7
		pxor xmm6, xmm6

        mov     ecx, 4
 start_dp:
		movapd  xmm0, [esi]
		movapd  xmm1, [esi + 16]
		mulpd   xmm0, [edi]
		movapd  xmm2, [esi + 32]
		mulpd   xmm1, [edi + 16]
		movapd  xmm3, [esi + 48]
		addpd   xmm7, xmm0
		mulpd   xmm2, [edi + 32]
		addpd   xmm6, xmm1
		mulpd   xmm3, [edi + 48]
		addpd   xmm7, xmm2
		add		esi, 64
		add		edi, 64
		addpd  xmm6, xmm3
        dec     ecx
        jnz     start_dp

; May be, we should use horizatal instructions for below
; However, the following code is more portable
		addpd   xmm7, xmm6
		movddup xmm0, xmm7
		movhlps xmm7, xmm7
		addsd   xmm0, xmm7
		lea		esi, retval
		movsd	[esi], xmm0
[/plain]

Thats why I posted here.

Don't you think the "additions" in address calculation contend for the "Integer Add" port and hence consume execution resources possibly resulting "Resource Stalls" ??? If not, Can you explain why those instructions do "not" consume execution resources for address calculation ?

Appreciate your time on this,
Best Regards,
Sarnath

TimP · ‎06-03-2010

Those address calculations should easily fit in the sequence without falling on the critical path.
I would expect considerations you haven't addressed to be more important. Did you affinitize the operation to a single core? How are you managing cache? If you are testing products of length 32, as shown here, I already mentioned you won't get up to full speed, even if you are repeating the operations on the same cache lines. If you want to see asymptotic speed, you may have to use entire 4KB pages.
The slow operations with sequential data dependencies at the end probably are implicated in your inability to get linear scaling. You might even see the differences among various choices of instruction sequence; horizontal add is probably slower on Intel CPUs, possibly faster on AMD.
I'm no expert on the various CPU architectures; I suspect that use of dec and jnz together may be OK on recent CPUs but slow on earlier ones. For a long time, we were advised never to use dec or inc, replacing them with add or sub with immediate operand. You've probably resolved those questions already for the CPUs which interest you.

k_sarnath · ‎06-03-2010

Hello Tim18,

We do tie the app to single core (SetThreadAffinityMask(GetCurrentThread(), 1) in windows).

As far as cache management -- We are just experimenting with linear arrays which are cache aligned. I am aware of BLAS-3 level, block-matrices, cache-blocking etc. This experiment is to find how efficient one can implement "dot product" in Intel CPUs.

So, the problem statement starts with 2 linear cache-aligned arrays whose dot product need to be calculated.

32 doubles occupy 256 bytes == 4 L1 cache lines x 2 Arrays == 8 L1 cache lines -- Can you explain why this is bad? I don't quite understand the 4K page thing that u r talking about. Can you explain that too please? Hardware prefetcher will be pleased with it... No?

+

I am surprised that the continuous "add" muops won't be a problem..... Do you say that because ofhigher "throughput" of the "add" muops? Even if therez a 1 cycle delay between successive add operations -- it will almost half-my performance... (assuming other muops flow freely). No? I am usinga Core2Duo laptop.

+

The reason for "dec-jnz" combo--> "loop instruction" was performing very badly... We noticed a signficiant performance difference between VC6 and VC8 generated code and that was because of the "loop" instruction.... I will look at the "add/sub" instruction as you suggest. (one more add..... oops... :-) )

Appreciate your time,

Thank you,

Best Regards,
Sarnath

TimP · ‎06-03-2010

If you are running the benchmark multiple times on the same cache, you probably can ignore any cache timing issues. If you run on cold cache, you will have to satisfy at least 2 cache misses not taken care of by hardware prefetch, before hardware prefetch can start up. So it could take a very long loop to approach asymptotic speed of something close to 5 clock cycles per unrolled loop.
Those integer micro-ops should easily be able to issue on the same cycles as earlier floating point operations, taking advantage of out-of-order.
dec is the same as an add instruction, with additional requirements on setting flag registers. There may be a difference even between early Core 2 Duo laptops without SSE4.1 (like mine) or recent ones (such as I bought for my wife) as to whether the count and jnz instruction should be adjacent for fusion or separated as far as possible.

Brijender_B_Intel · ‎06-03-2010

I am not surehow big is your data set and whether you are having any cache block issues.You can easily find it through vtune.If you are having any cache evictions or blocks, you may want to allocate one of the input/output at a distance of one or two cachelines i.e. add a size of CACHELINE (64) to the allocation.

neni · ‎06-03-2010

I don't think the adds are your problem here (you can use IACA to see how far from optimal you are)
but anyway, the way to get of all these loads is
before the loop:
lea esi,[esi+4*64]
lea edi,[edi+4*64]
movecx,-4*64

in loop body use esi+ecx whenever using esi (same for edi)

at the loop end put:

add ecx,64

jnz loop_head

you don't need any additional increments

k_sarnath · ‎06-03-2010

Tim18,

32 was just an indicative number. The loop can be quite big as well. So that should make the hardware prefetcher happy at least.... Thanks for bringing up that point!

Good to see that integer muops would flow through easily.... As you say, they may even complete one after another (with temporary regs) in a more pipeline way ... probably overlapped with the memory loads...

I need one more advice. Consider an instruction like "mov eax, 18" -- Are immediate operands fetched aspart of "instruction decode" phase? OR Do they get converted as a "Load muop from I-cache?"

@Brijender -- While doing a dot product, the best data structure to have is a cache-aligned linear array. You cannot really "block" it (since its not 2D). So, I am sure thatthis code must be cache-friendly because of linear increase in addresses (for long loops - the hardware prefetcher should help)

@Neni - I did try the register variant with no big performance jump.....

THanks for all your time,
Best Regards,
Sarnath

k_sarnath · ‎06-03-2010

Vow! Thanks for bringing up the IACA thing... I did not know about it earlier!!! You made my day! Thanks a LOT!

TimP · ‎06-04-2010

A few cycles could be saved at the end by using a single accumulator, but adding the new results to each other inside the loop body before adding to the accumulated result from the previous unrolled iteration:
float sum = 0;
for(i = 0; i < n-1; i += 2) sum += a * b + a[i+1] * b[i+1];
if(i < n) sum += a + b;