It may not matter very much what the assembly code looks like, depending on the size of the arrays and how random the accesses to the pDatat array become because of the indirect indexing through the pFactorPositions array.
How big are the arrays involved? The worst case is when pDatat is very large (larger than the last-level cache) and pDatat[pFactorPositions[index]] results in more or less random memory accesses. Then it won't matter what the assembly looks like; you'll mostly be waiting on memory.
Assuming that the pData, pFactorPositions, and pDatat arrays are only 1000 elements each, they should all fit in cache. If you want to see whether the indirect addressing is causing havoc, you could try filling pFactorPositions with sequential values.
But I'm guessing that you are getting one or more instructions executed per clock tick.
Do you have any CPI or IPC stats for the loop?
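The experiment with sequential pFactorPositions values could be sketched roughly like this. It is only an illustration, not your actual loop: the function names, the float element type, and the 32-bit index type are all assumptions, and the summing gather stands in for whatever the real loop does with pDatat[pFactorPositions[index]].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Fill idx with 0, 1, ..., n-1: the cache-friendly baseline. */
void fill_sequential(uint32_t *idx, size_t n)
{
    assert(n <= UINT32_MAX);            /* indices must fit in 32 bits */
    for (size_t i = 0; i < n; ++i)
        idx[i] = (uint32_t)i;
}

/* Fisher-Yates shuffle driven by a small 64-bit LCG, so the result does
 * not depend on the platform's RAND_MAX. Same index values, random order. */
void shuffle(uint32_t *idx, size_t n, uint64_t seed)
{
    uint64_t state = seed;
    for (size_t i = n; i > 1; --i) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        size_t j = (size_t)((state >> 33) % i);
        uint32_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Stand-in for the access pattern under discussion: data[idx[i]]. */
double gather_sum(const float *data, const uint32_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += data[idx[i]];
    return sum;
}

/* CPU time in seconds for reps gather passes. The accumulated sum is
 * written out so the compiler cannot discard the loop. */
double time_gather(const float *data, const uint32_t *idx, size_t n,
                   int reps, double *sum_out)
{
    double sum = 0.0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        sum += gather_sum(data, idx, n);
    clock_t t1 = clock();
    *sum_out = sum;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Time the gather once after fill_sequential and once after shuffle, first with n = 1000 and then with n large enough that the data array exceeds the last-level cache. If the two timings are close at n = 1000 but far apart at the large size, the indirect indexing is hurting you only when the data no longer fits in cache. clock() is coarse, so use enough reps for the run to last at least a few hundred milliseconds.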
As Pat said, it possibly does not matter what the assembly looks like. There is a dependency on the indirect array indexing. Because of this, data prefetching cannot count on spatial locality in the data.
Sorry, my mistake: the source array pData[iSource] does have spatial locality because its index increases linearly.