You should try to get rid of the _mm_loadu_si128 instruction. On some architectures it can have a significant impact on performance. Compute the sum of the first N bytes with scalar code so that the SSE code is applied only to aligned data, using _mm_load_si128. Then compute the remainder sum and add the three partial sums together.
You can probably reduce the number of unpack instructions by doing 16-bit adds before unpacking to 32 bits.
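A sketch combining both suggestions: a scalar prologue handles the first N bytes until the pointer is 16-byte aligned, the main loop uses aligned loads and accumulates in 16-bit lanes (one unpack pair per load instead of three), and a scalar epilogue sums the tail. The function name, the 128-iteration flush threshold, and the loop structure are my own choices, not from the thread:

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

unsigned long sum_bytes(const unsigned char *p, size_t len)
{
    unsigned long sum = 0;

    /* prologue: scalar sum until p is 16-byte aligned */
    while (len > 0 && ((uintptr_t)p & 15) != 0) {
        sum += *p++;
        len--;
    }

    const __m128i zero = _mm_setzero_si128();
    __m128i acc32 = _mm_setzero_si128();
    while (len >= 16) {
        /* accumulate in 16-bit lanes; flush every 128 loads because each
           lane gains at most 2 * 255 per load, and 128 * 510 < 65536 */
        __m128i acc16 = _mm_setzero_si128();
        size_t chunk = len / 16 < 128 ? len / 16 : 128;
        for (size_t i = 0; i < chunk; i++) {
            __m128i v = _mm_load_si128((const __m128i *)p); /* aligned load */
            acc16 = _mm_add_epi16(acc16, _mm_unpacklo_epi8(v, zero));
            acc16 = _mm_add_epi16(acc16, _mm_unpackhi_epi8(v, zero));
            p += 16;
        }
        len -= chunk * 16;
        /* widen the eight 16-bit partial sums to 32 bits and accumulate */
        acc32 = _mm_add_epi32(acc32, _mm_unpacklo_epi16(acc16, zero));
        acc32 = _mm_add_epi32(acc32, _mm_unpackhi_epi16(acc16, zero));
    }

    /* horizontal sum of the four 32-bit lanes */
    acc32 = _mm_add_epi32(acc32, _mm_srli_si128(acc32, 8));
    acc32 = _mm_add_epi32(acc32, _mm_srli_si128(acc32, 4));
    sum += (unsigned long)(unsigned int)_mm_cvtsi128_si32(acc32);

    /* epilogue: remaining tail bytes */
    while (len-- > 0)
        sum += *p++;

    return sum;
}
```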
How can I estimate the performance boost when SSE intrinsics are used?
It depends on your compiler. If your code is already vectorized by the compiler, a 20% gain from using intrinsics is quite good. => Post the ASM dump of the code generated by your compiler and tell us more about your target CPU (for example, on Nehalem alignment is no big deal, but it can hurt older CPUs like the Pentium 4).
for example, with Intel C++ 11.1 this code:
unsigned long Jogging (const unsigned char *FixedImg, unsigned long ImageHeight, unsigned long ImageWidth)
{
  unsigned long SumC = 0;
  const unsigned char *pSrc = FixedImg;
  for (unsigned long i = 0; i < ImageHeight * ImageWidth; i++)
    SumC += *pSrc++;
  return SumC;
}
is vectorized, ASM of the core loop is below:
.B8.4: ; Preds .B8.4 .B8.3
movd xmm2, DWORD PTR [eax+esi] ;254.58
punpcklbw xmm2, xmm0 ;254.58
punpcklwd xmm2, xmm0 ;254.58
paddd xmm1, xmm2 ;254.58
add eax, 4 ;254.3
cmp eax, edx ;254.3
jb .B8.4 ; Prob 82% ;254.3
psadbw with zero to sum 8 horizontal bytes to a word (x8)
reduce with paddw until you have 8 signed words
on 4 of those, shift left by 16, OR with the other 4, and apply pmaddwd with 1,1 to get 4 dwords,
which you can sum with paddd
the final stage is 2 phaddd
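The pmaddwd step above works because multiplying every 16-bit lane by 1 and adding adjacent products collapses 8 words into 4 dword pair-sums in a single instruction. A minimal sketch of just that step (`widen_pairs` is a hypothetical name):

```c
#include <emmintrin.h>
#include <assert.h>

/* Sum adjacent 16-bit pairs into 32-bit lanes using PMADDWD with a
   vector of ones: each pair (a, b) becomes a*1 + b*1 in one dword. */
__m128i widen_pairs(__m128i words)
{
    const __m128i ones = _mm_set1_epi16(1);
    return _mm_madd_epi16(words, ones); /* pmaddwd */
}
```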
Using 16-bit addition is a good idea. A friend suggested it as well.
When I first wrote the intrinsics program, I tried to avoid memory loads. On x86 processors, a memory load has latency even on a cache hit. But x86 processors have an out-of-order execution core that can hide this latency, so in most cases what matters is issuing as few instructions as possible.
I have a question about aligned memory loads. In the current instruction set, it appears that only 128-bit loads come in two versions, one for unaligned and one for aligned memory. The 64-bit load instruction movq doesn't care about alignment.
The processor in my laptop is a Core 2 Duo T7500. My current C compiler is from VS 2008.
The assembly code generated is below.
for (i = 0; i < ImageHeight * ImageWidth; i++)
    iresult0 += *pSrc++;
00401066 movzx eax,byte ptr [ecx]
00401069 add dword ptr [iresult0],eax
0040106C inc ecx
0040106D sub dword ptr [ebp-10h],1
00401071 jne main+66h (401066h)
Though only one byte is processed per iteration, its execution speed is pretty fast.
the VS 2008 code looks so horrible that it's very strange your version is only 20% faster; maybe you are memory-bandwidth bound (i.e. your image doesn't fit in the L2 cache)?
now, as hinted by another poster, the best is probably to use PSADBW (_mm_sad_epu8) to achieve 16 additions in a single instruction, then accumulate the packed results with PADDD (_mm_add_epi32)
at the end you'll get something like 0 | partialsum2 | 0 | partialsum1; then just add element 2 to element 0
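Put together, a minimal PSADBW loop might look like the following; `sum_bytes_sad` is a hypothetical name, and it assumes a 16-byte-aligned buffer whose length is a multiple of 16:

```c
#include <emmintrin.h>
#include <stddef.h>
#include <assert.h>

/* Each _mm_sad_epu8 against zero sums 8 bytes into the low word of each
   64-bit half, producing 0 | s2 | 0 | s1; accumulate those with PADDD,
   then add dword element 2 to dword element 0 at the end. */
unsigned long sum_bytes_sad(const unsigned char *p, size_t len)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(p + i)); /* aligned */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(v, zero));
    }
    unsigned int lo = (unsigned int)_mm_cvtsi128_si32(acc);                    /* elt 0 */
    unsigned int hi = (unsigned int)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8)); /* elt 2 */
    return (unsigned long)lo + hi;
}
```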
the best for your experiments is probably to work first with a special test image: allocated with 16-byte alignment so that you can use aligned moves, and not too big so that it fits in the L2 cache; then, when your speedups are OK, generalize to the unaligned case
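For the aligned test image, one option is `_mm_malloc`/`_mm_free`; this helper and its name are illustrative, and the sizes you pick should keep the image within L2:

```c
#include <emmintrin.h> /* pulls in _mm_malloc / _mm_free on common compilers */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Allocate a width x height byte image on a 16-byte boundary so the
   kernel under test can use _mm_load_si128. Caller frees with _mm_free. */
unsigned char *alloc_test_image(size_t width, size_t height)
{
    unsigned char *img = (unsigned char *)_mm_malloc(width * height, 16);
    if (img)
        memset(img, 0, width * height);
    return img;
}
```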