I was trying to optimize the dot product of two vectors. Both vectors are laid out as arrays at aligned memory locations.
I did an assembly implementation, only to realize that repeated additions are causing resource stalls (at least that's what I infer).
For example, consider this:
[bash]
. . .
pxor   xmm7, xmm7       ; Result
movapd xmm0, [esi]
mulpd  xmm0, [edi]
movapd xmm1, [esi+16]   ; Uses INTEGER ADD port
mulpd  xmm1, [edi+16]   ; Uses INTEGER ADD port
addpd  xmm7, xmm0       ; Uses SIMD ADD port
movapd xmm2, [esi+32]   ; Uses INTEGER ADD port
mulpd  xmm2, [edi+32]   ; Uses INTEGER ADD port
addpd  xmm7, xmm1       ; Uses SIMD ADD port
addpd  xmm7, xmm2       ; Uses SIMD ADD port
. . .
[/bash]
To me, the repeated use of the integer unit seems to cause a lot of stalls, which results in poor SSE performance. Can someone throw some light on this?
Is this the right way to write SIMD code for dot products? (Let us keep the DPPS instruction aside for a while -- I want to understand SIMD correctly.)
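For reference, here is the same single-accumulator pattern sketched with C intrinsics (assuming SSE2, even n, and 16-byte-aligned arrays; the function name is just illustrative):
[plain]
#include <emmintrin.h>  /* SSE2 */

/* Sketch only: one accumulator, so every addpd depends on the
   previous one -- the same dependency chain as the assembly above. */
static double dot_one_acc(const double *a, const double *b, int n)
{
    __m128d acc = _mm_setzero_pd();              /* pxor xmm7, xmm7 */
    for (int k = 0; k < n; k += 2) {
        __m128d p = _mm_mul_pd(_mm_load_pd(a + k),
                               _mm_load_pd(b + k));
        acc = _mm_add_pd(acc, p);                /* serial addpd chain */
    }
    __m128d hi = _mm_unpackhi_pd(acc, acc);      /* fold high half */
    return _mm_cvtsd_f64(_mm_add_sd(acc, hi));
}
[/plain]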
Kindly advise,
Thanks for any help,
Best Regards,
Sarnath
There's not much of a useful role for dpps in a dot product long enough to be worth unrolling.
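To illustrate with a sketch (assuming SSE4.1; for doubles the analogous instruction is dppd, i.e. _mm_dp_pd): using the dot-product instruction inside the loop forces a horizontal reduction every two elements, each chained on the previous one, whereas mulpd/addpd keeps packed partial sums and defers the reduction to the end.
[plain]
#include <smmintrin.h>  /* SSE4.1 */

/* Sketch only: one horizontal dppd reduction per 2 elements,
   serialized on the accumulate -- generally slower than keeping
   packed partial sums and reducing once after the loop. */
static double dot_dppd(const double *a, const double *b, int n)
{
    __m128d acc = _mm_setzero_pd();
    for (int k = 0; k < n; k += 2)
        acc = _mm_add_sd(acc, _mm_dp_pd(_mm_load_pd(a + k),
                                        _mm_load_pd(b + k), 0x31));
    return _mm_cvtsd_f64(acc);
}
[/plain]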
Thanks for the response. The two-way accumulator logic is good.
Even in plain C, I was able to get a significant improvement (12 seconds down to 9 seconds, without touching assembly or SSE, using MSVC).
But with SSE (plus the two-way accumulator) I was only able to reach 7 seconds. I thought I would reach 4 or 5 seconds with SSE, and that did not happen, which is quite disappointing. In another kernel that we are developing, we get 2.4x performance with SSE on doubles (hand-optimized assembly), which is very good.
Here is an excerpt of our Dot Product implementation.
[plain]
    pxor   xmm7, xmm7
    pxor   xmm6, xmm6
    mov    ecx, 4
start_dp:
    movapd xmm0, [esi]
    movapd xmm1, [esi + 16]
    mulpd  xmm0, [edi]
    movapd xmm2, [esi + 32]
    mulpd  xmm1, [edi + 16]
    movapd xmm3, [esi + 48]
    addpd  xmm7, xmm0
    mulpd  xmm2, [edi + 32]
    addpd  xmm6, xmm1
    mulpd  xmm3, [edi + 48]
    addpd  xmm7, xmm2
    add    esi, 64
    add    edi, 64
    addpd  xmm6, xmm3
    dec    ecx
    jnz    start_dp

    ; Maybe we should use horizontal instructions below;
    ; however, the following code is more portable.
    addpd  xmm7, xmm6
    movddup xmm0, xmm7
    movhlps xmm7, xmm7
    addsd  xmm0, xmm7
    lea    esi, retval
    movsd  [esi], xmm0
[/plain]
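The same two-accumulator idea, sketched in C intrinsics (assuming n is a multiple of 4 and 16-byte-aligned inputs; illustrative only):
[plain]
#include <emmintrin.h>  /* SSE2 */

/* Sketch only: two independent accumulators (xmm7/xmm6 above)
   break the addpd dependency chain; reduce once at the end. */
static double dot_two_acc(const double *a, const double *b, int n)
{
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    for (int k = 0; k < n; k += 4) {
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_load_pd(a + k),
                                           _mm_load_pd(b + k)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_load_pd(a + k + 2),
                                           _mm_load_pd(b + k + 2)));
    }
    __m128d acc = _mm_add_pd(acc0, acc1);        /* addpd xmm7, xmm6 */
    __m128d hi  = _mm_unpackhi_pd(acc, acc);
    return _mm_cvtsd_f64(_mm_add_sd(acc, hi));
}
[/plain]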
That's why I posted here.
Don't you think the additions in the address calculations contend for the integer ADD port and hence consume execution resources, possibly resulting in resource stalls? If not, can you explain why those instructions do not consume execution resources for address calculation?
Appreciate your time on this,
Best Regards,
Sarnath
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I would expect considerations you haven't addressed to be more important. Did you affinitize the operation to a single core? How are you managing cache? If you are testing products of length 32, as shown here, I already mentioned you won't get up to full speed, even if you are repeating the operations on the same cache lines. If you want to see asymptotic speed, you may have to use entire 4KB pages.
The slow operations with sequential data dependencies at the end are probably implicated in your inability to get linear scaling. You might even see differences among various choices of instruction sequence; horizontal add is probably slower on Intel CPUs, and possibly faster on AMD.
I'm no expert on the various CPU architectures; I suspect that use of dec and jnz together may be OK on recent CPUs but slow on earlier ones. For a long time, we were advised never to use dec or inc, replacing them with add or sub with immediate operand. You've probably resolved those questions already for the CPUs which interest you.
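As a rough sketch of how to control for those factors (dot() here is just a placeholder for whichever variant you are testing; buffer and iteration sizes are illustrative):
[plain]
#include <windows.h>
#include <malloc.h>
#include <stdio.h>

extern double dot(const double *a, const double *b, int n); /* variant under test */

int main(void)
{
    enum { N = 4096 / sizeof(double) };  /* one full 4 KB page per array */
    double *a = (double *)_aligned_malloc(N * sizeof(double), 4096);
    double *b = (double *)_aligned_malloc(N * sizeof(double), 4096);
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    SetThreadAffinityMask(GetCurrentThread(), 1);  /* pin to one core */

    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    double sum = 0.0;
    for (int r = 0; r < 1000000; r++)
        sum += dot(a, b, N);             /* repeat to amortize timer cost */
    QueryPerformanceCounter(&t1);
    printf("%.3f s (checksum %g)\n",
           (double)(t1.QuadPart - t0.QuadPart) / (double)f.QuadPart, sum);

    _aligned_free(a);
    _aligned_free(b);
    return 0;
}
[/plain]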
We do tie the app to a single core (SetThreadAffinityMask(GetCurrentThread(), 1) on Windows).
As far as cache management goes, we are just experimenting with linear arrays that are cache aligned. I am aware of BLAS level 3, block matrices, cache blocking, etc. This experiment is to find out how efficiently one can implement a dot product on Intel CPUs.
So the problem statement starts with two linear, cache-aligned arrays whose dot product needs to be calculated.
32 doubles occupy 256 bytes == 4 L1 cache lines x 2 arrays == 8 L1 cache lines -- can you explain why this is bad? I don't quite understand the 4K-page thing that you are talking about; can you explain that too, please? The hardware prefetcher should be pleased with it, no?
I am surprised that the continuous add muops won't be a problem. Do you say that because of the higher throughput of the add muops? Even if there's a one-cycle delay between successive add operations, it would almost halve my performance (assuming the other muops flow freely), no? I am using a Core 2 Duo laptop.
The reason for the dec/jnz combo: the loop instruction was performing very badly. We noticed a significant performance difference between VC6- and VC8-generated code, and that was because of the loop instruction. I will look at the add/sub instructions as you suggest (one more add... oops... :-)).
Appreciate your time,
Thank you,
Best Regards,
Sarnath
Those integer micro-ops should easily be able to issue on the same cycles as earlier floating-point operations, taking advantage of out-of-order execution.
dec is the same as an add instruction, with additional requirements on setting flag registers. There may be a difference even between early Core 2 Duo laptops without SSE4.1 (like mine) and recent ones (such as the one I bought for my wife) as to whether the count and jnz instructions should be adjacent for fusion or separated as far as possible.
I am not sure how big your data set is or whether you are having any cache conflict issues. You can easily find out through VTune. If you are seeing any cache evictions or conflicts, you may want to allocate one of the inputs/outputs at a distance of one or two cache lines, i.e. add one CACHELINE (64 bytes) to the allocation.
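Something along these lines, as a sketch (CACHELINE assumed to be 64 bytes; n is the element count):
[plain]
#include <malloc.h>
#include <stddef.h>

#define CACHELINE 64

/* Sketch only: shift one array by a cache line so the two inputs do
   not map to the same cache sets at the same offsets. Keep the raw
   pointer around for freeing. */
static void offset_example(int n)
{
    size_t bytes = (size_t)n * sizeof(double);
    double *a     = (double *)_aligned_malloc(bytes, CACHELINE);
    char   *b_raw = (char *)_aligned_malloc(bytes + CACHELINE, CACHELINE);
    double *b     = (double *)(b_raw + CACHELINE); /* shifted one line;
                                                      still 16-byte aligned,
                                                      so movapd is fine */
    /* ... fill a and b, compute the dot product ... */
    _aligned_free(a);
    _aligned_free(b_raw);
}
[/plain]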
I don't think the adds are your problem here (you can use IACA to see how far from optimal you are), but anyway, the way to get rid of all those extra pointer adds is:
before the loop:
lea esi, [esi + 4*64]
lea edi, [edi + 4*64]
mov ecx, -4*64
in the loop body, use [esi + ecx] wherever you use [esi] (same for edi);
at the loop end, put:
add ecx, 64
jnz loop_head
You don't need any additional increments.
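In C, the same end-biased trick looks roughly like this (a sketch; assumes even n):
[plain]
#include <stddef.h>

/* Sketch only: point one element past the end and count a negative
   index up to zero -- the single "add ecx, 64" above both advances
   the index and sets the flags for jnz. */
static double dot_neg_index(const double *a, const double *b, int n)
{
    const double *ae = a + n, *be = b + n;  /* one past the end */
    double sum = 0.0;
    for (ptrdiff_t k = -n; k != 0; k += 2)
        sum += ae[k] * be[k] + ae[k + 1] * be[k + 1];
    return sum;
}
[/plain]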
Tim18,
32 was just an indicative number; the loop can be quite big as well, so that should make the hardware prefetcher happy at least. Thanks for bringing up that point!
Good to see that the integer muops would flow through easily. As you say, they may even complete one after another (with temporary registers) in a more pipelined way, probably overlapped with the memory loads.
I need one more piece of advice. Consider an instruction like "mov eax, 18" -- are immediate operands fetched as part of the instruction-decode phase, or do they get converted into a load muop from the I-cache?
@Brijender -- While doing a dot product, the best data structure to have is a cache-aligned linear array. You cannot really "block" it (since it's not 2D). So I am sure this code must be cache-friendly because of the linearly increasing addresses (for long loops, the hardware prefetcher should help).
@Neni -- I did try the register variant, with no big performance jump.
Thanks for all your time,
Best Regards,
Sarnath
int i;
float sum = 0;
for (i = 0; i < n-1; i += 2)
    sum += a[i] * b[i] + a[i+1] * b[i+1];
if (i < n)
    sum += a[i] * b[i];
