[bash]
. . .
        pxor   xmm7, xmm7        ; Result
        movapd xmm0, [esi]
        mulpd  xmm0, [edi]
        movapd xmm1, [esi+16]    ; Uses INTEGER ADD port
        mulpd  xmm1, [edi+16]    ; Uses INTEGER ADD port
        addpd  xmm7, xmm0        ; Uses SIMD ADD port
        movapd xmm2, [esi+32]    ; Uses INTEGER ADD port
        mulpd  xmm2, [edi+32]    ; Uses INTEGER ADD port
        addpd  xmm7, xmm1        ; Uses SIMD ADD port
        addpd  xmm7, xmm2        ; Uses SIMD ADD port
. . .
[/bash]
[plain]
        pxor   xmm7, xmm7
        pxor   xmm6, xmm6
        mov    ecx, 4
start_dp:
        movapd xmm0, [esi]
        movapd xmm1, [esi + 16]
        mulpd  xmm0, [edi]
        movapd xmm2, [esi + 32]
        mulpd  xmm1, [edi + 16]
        movapd xmm3, [esi + 48]
        addpd  xmm7, xmm0
        mulpd  xmm2, [edi + 32]
        addpd  xmm6, xmm1
        mulpd  xmm3, [edi + 48]
        addpd  xmm7, xmm2
        add    esi, 64
        add    edi, 64
        addpd  xmm6, xmm3
        dec    ecx
        jnz    start_dp
        ; Maybe we should use horizontal instructions below;
        ; however, the following code is more portable
        addpd  xmm7, xmm6
        movddup xmm0, xmm7
        movhlps xmm7, xmm7
        addsd  xmm0, xmm7
        lea    esi, retval
        movsd  [esi], xmm0
[/plain]
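For readers more comfortable in C, here is a plain-C sketch of the same idea: two independent accumulators (mirroring xmm7/xmm6 above) so that consecutive adds do not form one long dependency chain. The function and variable names are my own, not from the original post:

```c
#include <assert.h>
#include <stddef.h>

/* Two-accumulator dot product: s0 and s1 are independent chains,
 * merged only once at the end, just like xmm7/xmm6 in the asm. */
static double dot_two_acc(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0;          /* two accumulator chains */
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {    /* unrolled by two */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                          /* odd tail element */
        s0 += a[i] * b[i];
    return s0 + s1;                     /* merge the chains */
}
```

A decent compiler will keep the two partial sums in separate registers, which is exactly what breaks the serial addpd chain in the assembly version.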
I am not sure how big your data set is or whether you are having any cache-blocking issues. You can easily find out with VTune. If you are seeing cache evictions or conflicts, you may want to allocate one of the inputs/outputs at a distance of one or two cache lines, i.e. add one CACHELINE (64 bytes) to the allocation size.
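To make the padding suggestion concrete, here is one way it could be done in C. This is a sketch under my own assumptions (names and the one-line-of-padding choice are mine); a real version would also need to keep the raw pointer around for free():

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE 64

/* Allocate n_doubles doubles, aligned to a cache line and then shifted
 * forward by pad_lines extra cache lines, so two input streams do not
 * land on the same cache sets. The raw pointer is leaked here for
 * brevity; production code must save it for free(). */
static double *alloc_padded(size_t n_doubles, size_t pad_lines)
{
    char *raw = malloc(n_doubles * sizeof(double) +
                       (pad_lines + 1) * CACHELINE);
    if (!raw)
        return NULL;
    /* round up to a cache-line boundary, then skip pad_lines lines */
    uintptr_t p = ((uintptr_t)raw + CACHELINE - 1) &
                  ~(uintptr_t)(CACHELINE - 1);
    return (double *)(p + pad_lines * CACHELINE);
}
```

Allocating one buffer with `alloc_padded(n, 0)` and the other with `alloc_padded(n, 1)` staggers them by one cache line, which is the kind of offset being suggested above.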
I don't think the adds are your problem here (you can use IACA to see how far from optimal you are),
but anyway, the way to get rid of all these loads is:
before the loop:
in the loop body, use esi+ecx whenever using esi (same for edi);
at the loop end, put:
You don't need any additional increments.
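If I'm reading this right, it is the classic negative-index loop idiom. The code snippets that went with "before the loop" and "at the loop end" are missing from the post, so the following C sketch is my own reconstruction of the technique, not Neni's exact code:

```c
#include <assert.h>
#include <stddef.h>

/* Negative-index idiom: advance both base pointers past the end of
 * the arrays once, then let a single negative offset count up to
 * zero. One counter update per iteration serves as the pointer bump
 * for BOTH arrays and as the loop-exit test -- this is what
 * "use esi+ecx" and "no additional increments" refer to. */
static double dot_negidx(const double *a, const double *b, ptrdiff_t n)
{
    double sum = 0.0;
    a += n;                               /* "before the loop": bases past the end */
    b += n;
    for (ptrdiff_t i = -n; i != 0; i++)   /* i plays the role of ecx */
        sum += a[i] * b[i];               /* base + negative index, like [esi+ecx] */
    return sum;
}
```

In assembly this replaces the separate `add esi, 64` / `add edi, 64` / `dec ecx` with a single `add ecx, stride` whose zero flag also terminates the loop.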
32 was just an indicative number; the loop can be quite big as well. So that should keep the hardware prefetcher happy at least... Thanks for bringing up that point!
Good to see that the integer muops would flow through easily... As you say, they may even complete one after another (with temporary registers) in a more pipelined way, probably overlapped with the memory loads...
I need one more piece of advice. Consider an instruction like "mov eax, 18" -- are immediate operands fetched as part of the "instruction decode" phase, or do they get converted into a "load muop" from the I-cache?
@Brijender -- While doing a dot product, the best data structure to have is a cache-aligned linear array. You cannot really "block" it (since it's not 2-D). So I am fairly sure this code must be cache-friendly because of the linear increase in addresses (for long loops, the hardware prefetcher should help).
@Neni - I did try the register variant with no big performance jump.....
Thanks for all your time,