Performance boost is not as expected using SSE intrinsics

joggingsonggmail_com · ‎01-18-2010

Hi all,
To boost performance, I chose to program with SSE intrinsics. After measuring the execution time, I find that the improvement is not significant, only 20%.

The original program computes the sum sequentially, and the SSE intrinsics program computes it with four additions in parallel. Several instructions are needed to put the data into the proper format, but the wide load instruction reduces the number of memory loads. I expected the execution time to drop to 1/3 or less, but the measured result is very disappointing.

How can I estimate the performance boost when SSE intrinsics are used? Does programming with SSE intrinsics require programmers to know the details of the processor architecture well?

The following is my code:
/* Scalar version: one byte per iteration */
Start = GetCycleCount();
SumC = 0;
pSrc = FixedImg;
for (i = 0; i < ImageHeight*ImageWidth; i++)
{
    SumC += *pSrc++;
}
Stop = GetCycleCount();
printf("Sum C cycle: %d\n", (Stop - Start));
printf("SumC: %d\n", SumC);

/* SSE version: 16 bytes per iteration, widened to four 32-bit lanes */
Start = GetCycleCount();
SumSSE = 0;
{
    __m128i Sum, Dat1, Dat2, Dat3;
    __m128i vzero;

    pSrc = FixedImg;
    vzero = _mm_setzero_si128();
    Sum = _mm_setzero_si128();
    for (i = 0; i < ImageHeight*ImageWidth/16; i++)
    {
        Dat1 = _mm_loadu_si128((__m128i*)(pSrc));
        /* low 8 bytes -> 8 words -> 2 x 4 dwords */
        Dat2 = _mm_unpacklo_epi8(Dat1, vzero);
        Dat3 = _mm_unpacklo_epi16(Dat2, vzero);
        Sum = _mm_add_epi32(Sum, Dat3);
        Dat3 = _mm_unpackhi_epi16(Dat2, vzero);
        Sum = _mm_add_epi32(Sum, Dat3);
        /* high 8 bytes -> 8 words -> 2 x 4 dwords */
        Dat2 = _mm_unpackhi_epi8(Dat1, vzero);
        Dat3 = _mm_unpacklo_epi16(Dat2, vzero);
        Sum = _mm_add_epi32(Sum, Dat3);
        Dat3 = _mm_unpackhi_epi16(Dat2, vzero);
        Sum = _mm_add_epi32(Sum, Dat3);
        pSrc += 16;
    }
    /* horizontal reduction of the four 32-bit lanes */
    Dat1 = _mm_unpacklo_epi32(Sum, vzero);
    Dat2 = _mm_unpackhi_epi32(Sum, vzero);
    Sum = _mm_add_epi64(Dat1, Dat2);
    Dat1 = _mm_unpacklo_epi64(Sum, vzero);
    Dat2 = _mm_unpackhi_epi64(Sum, vzero);
    Sum = _mm_add_epi64(Dat1, Dat2);

    SumSSE = _mm_cvtsi128_si32(Sum);
}
Stop = GetCycleCount();
printf("Sum SSE cycle: %d\n", (Stop - Start));
printf("SumSSE: %d\n", SumSSE);

Best Regards
Jogging

matthieu_darbois · ‎01-19-2010

Hi,

You should try to get rid of the _mm_loadu_si128 instruction. On some architectures, it can have a significant impact on performance. Compute the sum of the first N bytes separately so that the SSE code is applied only to aligned data using _mm_load_si128. Then compute the sum of the remaining bytes and add the three partial sums.
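
A minimal sketch of that head / aligned body / tail split, for illustration only. The function name sum_bytes_aligned and its signature are assumptions, not code from this thread; the aligned body simply reuses the widening scheme from the original post.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n bytes starting at p (any alignment). */
unsigned int sum_bytes_aligned(const unsigned char *p, size_t n)
{
    const __m128i vzero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    unsigned int head = 0, tail = 0;
    unsigned int lanes[4];

    /* 1) Scalar head: advance until p sits on a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15) != 0) {
        head += *p++;
        n--;
    }

    /* 2) Aligned body: _mm_load_si128 is now safe. */
    while (n >= 16) {
        __m128i d  = _mm_load_si128((const __m128i *)p);
        __m128i lo = _mm_unpacklo_epi8(d, vzero);
        __m128i hi = _mm_unpackhi_epi8(d, vzero);
        acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(lo, vzero));
        acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(lo, vzero));
        acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(hi, vzero));
        acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(hi, vzero));
        p += 16;
        n -= 16;
    }

    /* 3) Scalar tail: fewer than 16 bytes remain. */
    while (n > 0) {
        tail += *p++;
        n--;
    }

    /* Add the three partial sums (the vector one has four lanes). */
    _mm_storeu_si128((__m128i *)lanes, acc);
    return head + lanes[0] + lanes[1] + lanes[2] + lanes[3] + tail;
}
```

A call such as sum_bytes_aligned(FixedImg, ImageHeight*ImageWidth) would then also cover image sizes that are not a multiple of 16.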

You can probably reduce the number of unpack instructions by doing 16-bit adds before unpacking to 32 bits.
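
One way the 16-bit idea could look, as a sketch under assumptions: the name sum_blocks_16bit, the block-count parameter n16 and the chunk size of 128 are mine, not from the thread. Each 16-byte block adds at most 2*255 = 510 to every 16-bit lane, so 128 blocks (128*510 = 65280) cannot overflow a lane, and the widening to 32 bits is only needed once per chunk.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: sums n16 blocks of 16 bytes starting at p. */
unsigned int sum_blocks_16bit(const unsigned char *p, size_t n16)
{
    const __m128i vzero = _mm_setzero_si128();
    __m128i acc32 = _mm_setzero_si128();
    unsigned int lanes[4];

    while (n16 > 0) {
        /* At most 128 blocks per chunk keeps every 16-bit lane <= 65280. */
        size_t chunk = n16 < 128 ? n16 : 128;
        __m128i acc16 = _mm_setzero_si128();
        size_t i;

        for (i = 0; i < chunk; i++) {
            __m128i d = _mm_loadu_si128((const __m128i *)p);
            /* Only two unpacks and two adds per 16 bytes. */
            acc16 = _mm_add_epi16(acc16, _mm_unpacklo_epi8(d, vzero));
            acc16 = _mm_add_epi16(acc16, _mm_unpackhi_epi8(d, vzero));
            p += 16;
        }

        /* Widen the eight 16-bit lanes to 32 bits once per chunk. */
        acc32 = _mm_add_epi32(acc32, _mm_unpacklo_epi16(acc16, vzero));
        acc32 = _mm_add_epi32(acc32, _mm_unpackhi_epi16(acc16, vzero));
        n16 -= chunk;
    }

    _mm_storeu_si128((__m128i *)lanes, acc32);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Compared with the loop in the original post, the inner loop drops from six unpacks and four adds per 16 bytes to two of each.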

Matthieu

bronxzv · ‎01-19-2010

How can I estimate the performance boost when
SSE intrinsics is used?

It depends on your compiler. If the compiler already vectorizes your code, a 20% gain from intrinsics is quite good. Post the ASM dump of the code generated by your compiler and tell us more about your target CPU (for example, on Nehalem alignment is no big deal, but it can hurt older CPUs like Pentium 4).

For example, with Intel C++ 11.1, this code:

unsigned long Jogging (const unsigned char *FixedImg, unsigned long ImageHeight, unsigned long ImageWidth)
{
    unsigned long SumC = 0;
    const unsigned char *pSrc = FixedImg;
    for (unsigned long i=0; i<ImageHeight*ImageWidth; i++) SumC += *pSrc++;
    return SumC;
}

is vectorized; the ASM of the core loop is below:

.B8.4:                                          ; Preds .B8.4 .B8.3
$LN323:
        movd      xmm2, DWORD PTR [eax+esi]     ;254.58
        punpcklbw xmm2, xmm0                    ;254.58
        punpcklwd xmm2, xmm0                    ;254.58
        paddd     xmm1, xmm2                    ;254.58
$LN325:
        add       eax, 4                        ;254.3
        cmp       eax, edx                      ;254.3
        jb        .B8.4         ; Prob 82%      ;254.3

neni · ‎01-20-2010

try using

psadbw with zero to sum 8 horiz bytes to words X8 xn

reduce with paddw until you have 8 signed words

on 4 of those, shift left by 16, OR with the other 4, and apply pmaddwd with 1,1 to get 4 dwords,

which you can sum with paddd

final stage is 2 phaddd
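
A small sketch of the last two stages only (words to dwords, then dwords to a single value), under assumptions: the helper name hsum_epi16 is made up, and the two phaddd are swapped for SSE2 shuffle+add pairs so the code does not need SSSE3. pmaddwd (_mm_madd_epi16) treats the words as signed, so each word has to stay at or below 32767 for the result to be the plain sum.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: adds the eight signed 16-bit lanes of w. */
int hsum_epi16(__m128i w)
{
    /* pmaddwd against all-ones: each word * 1, adjacent pairs added,
       so 8 words become 4 dwords. */
    __m128i d = _mm_madd_epi16(w, _mm_set1_epi16(1));

    /* Swap the 64-bit halves and add: lanes 0 and 1 now hold d0+d2, d1+d3. */
    d = _mm_add_epi32(d, _mm_shuffle_epi32(d, _MM_SHUFFLE(1, 0, 3, 2)));

    /* Swap neighbouring dwords and add: lane 0 holds the full sum. */
    d = _mm_add_epi32(d, _mm_shuffle_epi32(d, _MM_SHUFFLE(2, 3, 0, 1)));

    return _mm_cvtsi128_si32(d);
}
```

Since this runs once after the main loop, its cost should matter little next to the per-block work.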

joggingsonggmail_com · ‎01-20-2010

Thanks, Matthieu.
Using 16-bit additions is a good idea. A friend suggested it too.
When I first wrote the intrinsics program, I tried to avoid memory loads. On x86 processors, a memory load has latency even on a cache hit. But x86 processors have an out-of-order execution core that can hide this latency, so in most cases the goal is simply to use as few instructions as possible.

I have a question about aligned memory loads. In the current instruction set, it appears that only the 128-bit load comes in two versions, one for unaligned and one for aligned memory. The 64-bit load instruction movq doesn't care about alignment.

Best Regards
Jogging

joggingsonggmail_com · ‎01-20-2010

Thanks.
The processor in my laptop is a Core 2 Duo T7500. My current C compiler is the one from VS 2008.
The generated assembly code is below.

for( i=0; i {
iresult0 += *pSrc++;
00401066 movzx eax,byte ptr [ecx]
00401069 add dword ptr [iresult0],eax
0040106C inc ecx
0040106D sub dword ptr [ebp-10h],1
00401071 jne main+66h (401066h)
}

Though one byte is processed each time, its execution speed is pretty fast.

Regards

Jogging

bronxzv · ‎01-20-2010

The VS 2008 code looks so horrible that it's very strange that your version is only 20% faster. Maybe you are memory bandwidth bound (i.e. your image doesn't fit in the L2 cache)?

Now, as hinted by another poster, the best is probably to use PSADBW (_mm_sad_epu8) to achieve 16 additions in a single instruction, then accumulate the packed results with PADDD (_mm_add_epi32).

At the end you'll get something like: 0 | partialsum2 | 0 | partialsum1; then just add element 2 to element 0.
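
Roughly, that could be sketched as follows. The helper name sum_blocks_sad and its parameters are assumptions for illustration; the pointer is assumed to be 16-byte aligned already and the length is given as a count of 16-byte blocks.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: p must be 16-byte aligned; sums n16 blocks of 16 bytes. */
unsigned int sum_blocks_sad(const unsigned char *p, size_t n16)
{
    const __m128i vzero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    size_t i;

    for (i = 0; i < n16; i++) {
        __m128i d = _mm_load_si128((const __m128i *)(p + 16 * i));
        /* PSADBW against zero: the sum of each group of 8 bytes lands in
           the low word of its 64-bit half, the other bits are zero. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(d, vzero));
    }

    /* acc is now 0 | partialsum2 | 0 | partialsum1.
       Shift right by 8 bytes to bring element 2 down to element 0, then add. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    return (unsigned int)_mm_cvtsi128_si32(acc);
}
```

The loop body is one load, one PSADBW and one add per 16 bytes, with no unpack instructions at all.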

The best for your experiments is probably to work first with a special test image: allocated with 16B alignment so that you can use aligned moves, and not too big so that it fits in the L2 cache. Then, when your speedups are OK, generalize to the unaligned case.
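
A possible way to set up such a test image, as a sketch only. The helper name alloc_test_image and the fill pattern are made up for illustration; _mm_malloc and _mm_free are the aligned allocation helpers that GCC, Clang and MSVC make available through the SSE headers.

```c
#include <emmintrin.h>  /* also brings in _mm_malloc / _mm_free */
#include <stddef.h>

/* Hypothetical helper: allocates a width*height byte image on a 16-byte
   boundary and fills it with a repeating 0..255 pattern.
   Returns NULL on failure; release the buffer with _mm_free(). */
unsigned char *alloc_test_image(size_t width, size_t height)
{
    size_t n = width * height;
    size_t i;
    unsigned char *img = (unsigned char *)_mm_malloc(n, 16);

    if (img == NULL)
        return NULL;

    for (i = 0; i < n; i++)
        img[i] = (unsigned char)(i & 0xFF);

    return img;
}
```

Something like alloc_test_image(256, 256) gives a 64 KB buffer, which should sit comfortably in the L2 cache of most desktop and laptop CPUs, though the exact cache size of the target machine is worth checking.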