Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

SSE2 slower than MMX/SSE(1)

dessa
Beginner
1,303 Views
OK, here are two functions:
one is SSE (P3/MMX, et cetera), the other is SSE2.
The SSE2 one is performing slower. Any advice why?

I've compiled with and without profile-guided compilation; the results are the same. SSE2 is slower.

SSE2 function:

int SSE2_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, result;
    __m128i e = _mm_setzero_si128();
    for (i = 0; i < 16; i++) {
        __m128i unaligned = _mm_loadu_si128((__m128i*)pSrc);
        e = _mm_add_epi16(e, _mm_sad_epu8(((__m128i*)pDst)[0], unaligned));
        pDst += w_Dst;
        pSrc += w_Src;
    }
    e = _mm_srli_si128(e, 8);
    e = _mm_add_epi32(e, e);
    result = _mm_cvtsi128_si32(e);
    _mm_empty();
    return result;
}

SSE/MMX function:

int MMX_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, result;
    __m64 e0 = _mm_setzero_si64(), e1 = _mm_setzero_si64();
    for (i = 0; i < 16; i++) {
        e0 = _mm_add_pi32(e0, _mm_sad_pu8(((__m64*)pDst)[0], ((__m64*)pSrc)[0]));
        e1 = _mm_add_pi32(e1, _mm_sad_pu8(((__m64*)pDst)[1], ((__m64*)pSrc)[1]));
        pDst += w_Dst;
        pSrc += w_Src;
    }
    e1 = _mm_add_pi32(e1, e0);
    result = _m_to_int(e1);
    _mm_empty();
    return result;
}

Thanks,
Alex Telitsine
Streambox Inc.
13 Replies
dessa
Beginner
Just in case, here is what the plain C code does:

int C_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, err = 0;
    for (i = 0; i < 16; i++) {
        err += ABS(pDst[ 0] - pSrc[ 0]);
        err += ABS(pDst[ 1] - pSrc[ 1]);
        err += ABS(pDst[ 2] - pSrc[ 2]);
        err += ABS(pDst[ 3] - pSrc[ 3]);
        err += ABS(pDst[ 4] - pSrc[ 4]);
        err += ABS(pDst[ 5] - pSrc[ 5]);
        err += ABS(pDst[ 6] - pSrc[ 6]);
        err += ABS(pDst[ 7] - pSrc[ 7]);
        err += ABS(pDst[ 8] - pSrc[ 8]);
        err += ABS(pDst[ 9] - pSrc[ 9]);
        err += ABS(pDst[10] - pSrc[10]);
        err += ABS(pDst[11] - pSrc[11]);
        err += ABS(pDst[12] - pSrc[12]);
        err += ABS(pDst[13] - pSrc[13]);
        err += ABS(pDst[14] - pSrc[14]);
        err += ABS(pDst[15] - pSrc[15]);
        pDst += w_Dst;
        pSrc += w_Src;
    }
    return err;
}

and, for the Intel platform, there are the following definitions:

#ifdef __ICL // intel compiler
#define _X86_COMPATABLE_CPU_
#define RESTRICT restrict // -Qrestrict option should be ON
#define ABS(i) abs(i)
#ifndef INLINE
#define INLINE _inline // inline is always available on MMX
#endif
#define CAN_RW_UNALIGNED
#endif
.......
#ifndef BYTE
typedef unsigned char BYTE;
#endif

TimP
Honored Contributor III
If you are incurring cache line splits in your SSE2 code, reduced performance is to be expected. I don't see that PGO is likely to do anything for this.
dessa
Beginner
Well, according to the SSE2 application notes, I should get a 1.26x speedup over the SSE/MMX code:

ftp://download.intel.com/design/perftool/cbts/appnotes/sse2/w_me_alg.pdf

Regarding cache-line splits, the application note says that an unaligned load should still be faster than loading two 64-bit values.
Could it be that the application note is not "entirely true"?
Anyway, I'll try to see what's going on in VTune today.
dessa
Beginner
OK, the problem was in the last lines of the SSE2 code:
only half of the result was in use, which caused the motion search to perform more block comparisons:
e = _mm_srli_si128(e, 8);
e = _mm_add_epi32 (e, e);
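
For reference, the corrected reduction (the same pattern used in the final version below) keeps the low half and adds the high 64-bit SAD to it, instead of doubling the shifted value:

e = _mm_add_epi32(e, _mm_srli_si128(e, 8)); /* low 64-bit lane += high 64-bit lane */
result = _mm_cvtsi128_si32(e);              /* low 32 bits now hold the full SAD   */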

Unrolling the loop and interleaving the e0/e1 accumulators gave a little improvement as well.

Overall, SSE2 gave a 1.27x improvement in clock ticks:
SSE: 1969 clocks, SSE2: 1542 clocks.

Below is the final version of the SSE2 code:

int SSE2_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int result;
    __m128i e0, e1;

    e0 = _mm_sad_epu8(*(__m128i*)(pDst + ( 0)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 0)*w_Src)));
    e1 = _mm_sad_epu8(*(__m128i*)(pDst + ( 1)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 1)*w_Src)));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 2)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 2)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 3)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 3)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 4)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 4)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 5)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 5)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 6)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 6)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 7)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 7)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 8)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 8)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 9)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 9)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (10)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (10)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (11)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (11)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (12)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (12)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (13)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (13)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (14)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (14)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (15)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (15)*w_Src))));
    e0 = _mm_add_epi32(e0, e1);                    /* combine the two accumulators       */
    e0 = _mm_add_epi32(e0, _mm_srli_si128(e0, 8)); /* add the high 64-bit SAD to the low */
    result = _mm_cvtsi128_si32(e0);
    _mm_empty();
    return result;
}

Alex



kiran_N_
Beginner

>> Overall, SSE2 gave a 1.27x improvement in clock ticks:
>> SSE: 1969 clocks, SSE2: 1542 clocks.

How did you arrive at the number of clock cycles?

Thanks in advance.

Regards,
kiran

Bernard
Valued Contributor I

kiran N. wrote:

>> Overall, SSE2 gave a 1.27x improvement in clock ticks:
>> SSE: 1969 clocks, SSE2: 1542 clocks.
>>
>> How did you arrive at the number of clock cycles?

Probably by using _asm rdtsc or the __rdtsc() intrinsic.
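
A minimal timing sketch along those lines (the warm-up call, the volatile sink, and the _mm_lfence() serialization are my own assumptions, not something given in this thread):

#include <intrin.h>   /* __rdtsc() and _mm_lfence() with the Microsoft / Intel compilers */

/* Hypothetical harness: measures one 16x16 SAD call in clock ticks. */
unsigned __int64 time_sad(BYTE* pSrc, BYTE* pDst, int w_Src, int w_Dst)
{
    unsigned __int64 t0, t1;
    volatile int sink;

    sink = SSE2_Copy16x16NA_E(pSrc, pDst, w_Src, w_Dst);  /* warm up the caches */

    _mm_lfence();                                         /* keep rdtsc from reordering past earlier work */
    t0 = __rdtsc();
    sink = SSE2_Copy16x16NA_E(pSrc, pDst, w_Src, w_Dst);
    _mm_lfence();
    t1 = __rdtsc();

    (void)sink;
    return t1 - t0;                                       /* elapsed clock ticks */
}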

TimP
Honored Contributor III

alex-telitsine wrote:


ftp://download.intel.com/design/perftool/cbts/appnotes/sse2/w_me_alg.pdf

Regarding cache-line splits, the application note says that an unaligned load should still be faster than loading two 64-bit values.
Could it be that the application note is not "entirely true"?

On the Westmere CPU, I got consistently better results by splitting 128-bit loads (which some compilers did automatically when working from source code) in a case where 50% of the loads were unaligned.  I could never verify the often-stated recommendation to build with the SSE4.2 option for Westmere, given that the earlier architecture options frequently use specific strategies for unaligned loads.  This difference may have gone away for 128-bit loads on Sandy Bridge, but there the usual splitting of 256-bit unaligned loads is quite important.  Even on Ivy Bridge, where a late fix went in to reduce the penalty for 256-bit unaligned loads, my SSE2 intrinsics run faster than the AVX ones for this reason (with the AVX transition penalties handled either by AVX-128 translation or an explicit vzeroupper).  That has changed with the Haswell CPUs.

It's difficult to infer this from VTune, unless you can count cache line splits and correlate them with stalls on instructions which consume the results of memory loads.

Anyway, the assumption that seems to be made here, namely that the specific CPU architecture or stepping can be ignored when considering the effect of misalignment, is not a good one.
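
A sketch of what such a split of an unaligned 16-byte load might look like for the rows in this thread (the helper name and the particular instruction choice are my own illustration, not code from this discussion):

#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: load 16 unaligned bytes as two 8-byte halves,
   avoiding a single cache-line-split 128-bit load. */
static __m128i loadu_split_epi64(const BYTE* p)
{
    __m128i lo = _mm_loadl_epi64((const __m128i*)(p));      /* bytes 0..7  */
    __m128i hi = _mm_loadl_epi64((const __m128i*)(p + 8));  /* bytes 8..15 */
    return _mm_unpacklo_epi64(lo, hi);                      /* recombine into one 128-bit value */
}

The result can be fed to _mm_sad_epu8 exactly like the value returned by _mm_loadu_si128; whether it is actually faster depends on the CPU generation, as noted above.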

Bernard
Valued Contributor I

>>> OK, here are two functions: one is SSE (P3/MMX, et cetera), the other is SSE2. The SSE2 one is performing slower. Any advice why? >>>

Try posting your question on the ISA forum.

Bernard
Valued Contributor I

>>> Overall, SSE2 gave a 1.27x improvement in clock ticks:
SSE: 1969 clocks, SSE2: 1542 clocks >>>

In this case the SSE2 version does not rely on any for-loop logic, so without the overhead of the compiled loop instructions the code executes faster. Bear in mind that your code mixes SIMD vector operations with scalar integer operations; Haswell can probably schedule the loop instructions to execute on Port 6, freeing resources for the SIMD ALUs. This also raises an interesting question about the low-level implementation of the integer ALU versus the SIMD ALU: I would like to know whether scalar integer code is executed by the same ALU as SIMD integer code.

kiran_N_
Beginner

Thanks for the info. I was looking for this for a long time. :)

Bernard
Valued Contributor I

>>> Thanks for the info. I was looking for this for a long time. :) >>>

You are welcome.

emmanuel_attia
Beginner

Try replacing _mm_loadu_si128 with _mm_lddqu_si128; you might get better performance in your case (purely empirical).

And I guess _mm_empty() is unnecessary if you no longer use MMX in that SSE2 function.
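
For example, the load inside the original inner loop would become something like the following (just an illustration of the swap, not something I have measured here):

#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 */

__m128i unaligned = _mm_lddqu_si128((__m128i const*)pSrc);           /* was _mm_loadu_si128 */
e = _mm_add_epi16(e, _mm_sad_epu8(((__m128i*)pDst)[0], unaligned));
/* ...and the trailing _mm_empty() can simply be dropped, since no MMX state is used. */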

dessa
Beginner

Wow, that's a blast from the past, a 2003 email? Yes, _mm_lddqu_si128 is used for SSE3 and higher code :-)
