What is the fastest assembler code for this task?

Peter_F_1
Beginner
void do(double *const pData, const std::size_t iColPos, const std::size_t _iMax,
        const unsigned int *const pFactorPositions, const std::size_t iOffset, const double d)
{
    for (std::size_t iSource = iColPos; iSource < _iMax; ++iSource)
        pData[pFactorPositions[iSource + iOffset]] -= pData[iSource]*d;
}

I should add that the LHS and RHS indexes are always unique: no LHS index is ever repeated, and none is ever identical to an RHS index.

Here is what g++ produces:

81088: 42 8d 14 00       lea    (%rax,%r8,1),%edx
8108c: 89 c6             mov    %eax,%esi
8108e: ff c0             inc    %eax
81090: 66 0f 12 0c f1    movlpd (%rcx,%rsi,8),%xmm1
81095: 39 c7             cmp    %eax,%edi
81097: 41 8b 14 91       mov    (%r9,%rdx,4),%edx
8109b: f2 0f 59 ca       mulsd  %xmm2,%xmm1
8109f: 48 8d 14 d1       lea    (%rcx,%rdx,8),%rdx
810a3: 66 0f 12 02       movlpd (%rdx),%xmm0
810a7: f2 0f 5c c1       subsd  %xmm1,%xmm0
810ab: f2 0f 11 02       movsd  %xmm0,(%rdx)
810af: 77 d7             ja     81088

I've tried to force unrolling of this loop, but the performance is the same. Would it matter if different registers were used in every step of the unrolled loop?
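For anyone who wants to experiment with the unrolling question, here is a minimal C++ sketch, not a tuned implementation. The names scatterSub and scatterSubUnrolled2 are hypothetical (the original function name `do` is a C++ keyword and will not compile). The unrolled variant uses separate temporaries for each step so the compiler is free to assign different registers, and it moves both loads ahead of both stores. That reordering is only valid under the guarantee stated in the post, namely that no LHS index repeats and none equals an RHS index.

```cpp
#include <cassert>
#include <cstddef>

// Reference version of the loop from the post, under a hypothetical name.
inline void scatterSub(double *const pData, const std::size_t iColPos,
                       const std::size_t iMax,
                       const unsigned int *const pFactorPositions,
                       const std::size_t iOffset, const double d)
{
    for (std::size_t iSource = iColPos; iSource < iMax; ++iSource)
        pData[pFactorPositions[iSource + iOffset]] -= pData[iSource] * d;
}

// Hypothetical 2x unrolled variant with independent temporaries per step.
// ASSUMPTION (from the post): target indexes are unique and never coincide
// with a source index, so hoisting the loads above the stores is safe.
inline void scatterSubUnrolled2(double *const pData, const std::size_t iColPos,
                                const std::size_t iMax,
                                const unsigned int *const pFactorPositions,
                                const std::size_t iOffset, const double d)
{
    std::size_t i = iColPos;
    for (; i + 1 < iMax; i += 2) {
        const unsigned int t0 = pFactorPositions[i + iOffset];
        const unsigned int t1 = pFactorPositions[i + 1 + iOffset];
        assert(t0 != t1);                    // debug check of the uniqueness assumption
        const double s0 = pData[i] * d;      // step 0 product
        const double s1 = pData[i + 1] * d;  // step 1 product, independent of step 0
        pData[t0] -= s0;
        pData[t1] -= s1;
    }
    if (i < iMax)                            // leftover element when the count is odd
        pData[pFactorPositions[i + iOffset]] -= pData[i] * d;
}
```

Whether this runs any faster than the compiler's own output depends on the hardware and the compiler flags; timing both versions on the real data is the only reliable way to tell.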
5 Replies
Patrick_F_Intel1
Employee

Hello Foelsche,

Depending on the size of the arrays and on how random the accesses to the pData array become because of the indirect indexing through the pFactorPositions array, it may not matter very much what the assembly code looks like.

How big are the arrays involved? The worst case is when pData is very large (larger than the last-level cache) and pData[pFactorPositions[index]] results in more or less random memory accesses. Then it won't matter what the assembly looks like; you'll mostly be waiting on memory.

Pat

Peter_F_1
Beginner
Pat, the vector pFactorPositions is sorted, which means the addresses for the LHS are always increasing and may jump by any amount or by none. Peter
Peter_F_1
Beginner
Pat, the array size should be limited to fewer than 1000 elements, and sometimes there are only one or two elements. Of course the routine will be inlined... Peter
Patrick_F_Intel1
Employee

Assuming that the pData and pFactorPositions arrays each hold only about 1000 elements, they should all fit in cache. If you want to see whether the indirect addressing is causing havoc, you could try setting pFactorPositions to sequential values.

But I'm guessing that you are getting 1 or more instructions executed per clock tick.
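The sequential-values experiment could be sketched roughly as follows in C++. This is only an illustration under stated assumptions: makePositions and timeKernel are hypothetical helper names, the constant stride stands in for the real sorted position pattern, and the factor 1e-9 is arbitrary. The loop from the original post is written out inside timeKernel, with iColPos and iOffset both taken as 0.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds n sorted, unique target positions starting at `first`.
// stride == 1 gives the sequential pattern; a larger stride is a stand-in
// for the real (sorted, jumping) pattern.
std::vector<unsigned int> makePositions(std::size_t n, unsigned int first,
                                        unsigned int stride)
{
    std::vector<unsigned int> p(n);
    for (std::size_t i = 0; i < n; ++i)
        p[i] = first + static_cast<unsigned int>(i) * stride;
    return p;
}

// Hypothetical helper: runs the loop `repeats` times on a private copy of the
// data and returns the average wall-clock nanoseconds per pass.
double timeKernel(std::vector<double> data, const std::vector<unsigned int> &pos,
                  std::size_t n, int repeats)
{
    assert(n > 0 && repeats > 0);
    assert(pos.size() >= n && data.size() > pos[n - 1]);
    const double d = 1e-9;  // arbitrary factor, chosen only for illustration
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t iSource = 0; iSource < n; ++iSource)
            data[pos[iSource]] -= data[iSource] * d;  // loop from the post
    const auto stop = std::chrono::steady_clock::now();
    volatile double sink = data[pos[0]];  // keep the result observable
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / repeats;
}
```

Calling timeKernel once with makePositions(n, n, 1) and once with the real positions (or a larger stride) on the same data gives two numbers to compare. If they are close, the indirect addressing is probably not what limits the loop. With arrays this small, a large repeat count is needed to get above timer resolution.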

Do you have any CPI or IPC stats for the loop?

Bernard
Valued Contributor I

As Pat said, it possibly does not matter what the assembly looks like. There is a dependency on the indirect array indexing. Because of this, data prefetching cannot rely on spatial locality of the data.

Sorry, my mistake: the source array pData[iSource] does have spatial locality because of the linearly increasing index.
