What is the fastest assembler code for this task?

Peter_F_1 · ‎03-29-2013

void do(double *const pData, const std::size_t iColPos, const std::size_t _iMax,
        const unsigned int *const pFactorPositions, const std::size_t iOffset, const double d)
{
    for (std::size_t iSource = iColPos; iSource < _iMax; ++iSource)
        pData[pFactorPositions[iSource + iOffset]] -= pData[iSource]*d;
}

I have to state that the LHS and RHS indexes are always unique -- no LHS index ever repeats, and no LHS index is ever identical to an RHS index.

Here is what g++ produces:

    81088: 42 8d 14 00        lea    (%rax,%r8,1),%edx
    8108c: 89 c6              mov    %eax,%esi
    8108e: ff c0              inc    %eax
    81090: 66 0f 12 0c f1     movlpd (%rcx,%rsi,8),%xmm1
    81095: 39 c7              cmp    %eax,%edi
    81097: 41 8b 14 91        mov    (%r9,%rdx,4),%edx
    8109b: f2 0f 59 ca        mulsd  %xmm2,%xmm1
    8109f: 48 8d 14 d1        lea    (%rcx,%rdx,8),%rdx
    810a3: 66 0f 12 02        movlpd (%rdx),%xmm0
    810a7: f2 0f 5c c1        subsd  %xmm1,%xmm0
    810ab: f2 0f 11 02        movsd  %xmm0,(%rdx)
    810af: 77 d7              ja     81088

I've tried to force unrolling of this loop, but the performance is the same. Would it matter if different registers were used in every step of the unrolled loop?
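To make the unrolling question concrete, here is a minimal sketch of what a manual unroll by two with separate temporaries could look like. The function names `scatterSubtract` and `scatterSubtractUnrolled2` are placeholders, not from the original code. The unrolled variant loads both source values before either store, so it is only equivalent to the plain loop under the uniqueness guarantee stated above (no LHS index repeats, and no LHS index equals an RHS index). Whether it runs any faster is not claimed here.

```cpp
#include <cassert>
#include <cstddef>

// Plain form of the posted loop (the function name is a placeholder).
inline void scatterSubtract(double* const pData, const std::size_t iColPos,
                            const std::size_t iMax,
                            const unsigned int* const pFactorPositions,
                            const std::size_t iOffset, const double d) {
    for (std::size_t i = iColPos; i < iMax; ++i)
        pData[pFactorPositions[i + iOffset]] -= pData[i] * d;
}

// Manual unroll by two with independent temporaries.
// ASSUMPTION: target (LHS) indexes are unique and never coincide with a
// source (RHS) index; otherwise hoisting both loads above the stores
// would change the result.
inline void scatterSubtractUnrolled2(double* const pData, const std::size_t iColPos,
                                     const std::size_t iMax,
                                     const unsigned int* const pFactorPositions,
                                     const std::size_t iOffset, const double d) {
    std::size_t i = iColPos;
    for (; i + 1 < iMax; i += 2) {
        const std::size_t l0 = pFactorPositions[i + iOffset];
        const std::size_t l1 = pFactorPositions[i + 1 + iOffset];
        assert(l0 != l1);     // targets must be distinct
        assert(l0 != i + 1);  // first store must not feed the second load
        const double s0 = pData[i] * d;      // two independent products
        const double s1 = pData[i + 1] * d;
        pData[l0] -= s0;
        pData[l1] -= s1;
    }
    if (i < iMax)  // leftover element when the trip count is odd
        pData[pFactorPositions[i + iOffset]] -= pData[i] * d;
}
```

Both functions take the same arguments as the posted routine, so either can be dropped into a benchmark and compared directly.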

Patrick_F_Intel1 · ‎03-29-2013

Hello Foelsche,

What the assembly code looks like may not matter very much. It depends on the size of the arrays and on how random the accesses to the pData array become because of the indirect indexing through the pFactorPositions array.

How big are the arrays involved? The worst case is when pData is very large (larger than the last-level cache) and pData[pFactorPositions[index]] results in more or less random memory accesses. Then it won't matter what the assembly looks like; you'll mostly be waiting on memory.

Pat

Peter_F_1 · ‎03-29-2013

Pat, the vector pFactorPositions is sorted -- meaning the addresses for the LHS are always increasing and may jump by any number or none. Peter

Peter_F_1 · ‎03-29-2013

Pat, the array size should be limited to fewer than 1000 elements -- sometimes only one or two elements. Of course the routine will be inlined... Peter

Patrick_F_Intel1 · ‎03-29-2013

Assuming that the pData and pFactorPositions arrays are only of size 1000 elements, they should all fit in cache. If you want to see whether the indirect addressing is causing havoc, you could try setting pFactorPositions to sequential values.

But I'm guessing that you are already getting one or more instructions executed per clock tick.
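One way to run that experiment is sketched below: build a sequential position vector, then time the same kernel with the real positions and with the sequential ones. This is only a rough sketch. The names `scatterSubtract`, `makeSequentialPositions` and `timeScatter` are made up for illustration, the kernel is called with iColPos = 0 and iOffset = 0 for simplicity, and wall-clock timing of such a short loop is noisy, so many repeats are needed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Same update as the posted loop (the function name is a placeholder).
inline void scatterSubtract(double* pData, std::size_t iColPos, std::size_t iMax,
                            const unsigned int* pFactorPositions,
                            std::size_t iOffset, double d) {
    for (std::size_t i = iColPos; i < iMax; ++i)
        pData[pFactorPositions[i + iOffset]] -= pData[i] * d;
}

// Positions first, first+1, ...: the most cache-friendly target pattern.
inline std::vector<unsigned int> makeSequentialPositions(std::size_t n,
                                                         unsigned int first) {
    std::vector<unsigned int> pos(n);
    std::iota(pos.begin(), pos.end(), first);
    return pos;
}

// Average time per kernel call in nanoseconds over `repeats` calls.
// `data` is taken by value so the caller's array is left untouched.
inline double timeScatter(std::vector<double> data,
                          const std::vector<unsigned int>& pos,
                          std::size_t n, double d, int repeats) {
    assert(n > 0 && n <= pos.size() && repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        scatterSubtract(data.data(), 0, n, pos.data(), 0, d);
    const auto stop = std::chrono::steady_clock::now();
    volatile double sink = data[pos[0]];  // keep the result observable
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / repeats;
}
```

Calling `timeScatter` once with the real pFactorPositions contents and once with `makeSequentialPositions(n, n)` gives two numbers to compare; if they are close, the indirect addressing is probably not the bottleneck.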

Do you have any CPI or IPC stats for the loop?

Bernard · ‎03-29-2013

As Pat said, it possibly does not matter what the assembly looks like. There is a dependency on the indirect array indexing. Because of this, data prefetching cannot expect spatial locality in the data.

Sorry, my mistake: the source array access pData[iSource] does have spatial locality because its index increases linearly.