one is SSE (P3/MMX, etc.); the other is SSE2.
The SSE2 version performs slower. Any advice why?
I've compiled with and without profile-guided compilation; the results are the same: SSE2 is slower.
SSE2 function:
int SSE2_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, result;
    __m128i e = _mm_setzero_si128();
    for (i = 0; i < 16; i++) {
        __m128i unaligned = _mm_loadu_si128((__m128i*)pSrc);
        e = _mm_add_epi16(e, _mm_sad_epu8(((__m128i*)pDst)[0], unaligned));
        pDst += w_Dst;
        pSrc += w_Src;
    }
    e = _mm_srli_si128(e, 8);
    e = _mm_add_epi32(e, e);
    result = _mm_cvtsi128_si32(e);
    _mm_empty();
    return result;
}
SSE/MMX function:
int MMX_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, result;
    __m64 e0 = _mm_setzero_si64(), e1 = _mm_setzero_si64();
    for (i = 0; i < 16; i++) {
        e0 = _mm_add_pi32(e0, _mm_sad_pu8(((__m64*)pDst)[0], ((__m64*)pSrc)[0]));
        e1 = _mm_add_pi32(e1, _mm_sad_pu8(((__m64*)pDst)[1], ((__m64*)pSrc)[1]));
        pDst += w_Dst;
        pSrc += w_Src;
    }
    e1 = _mm_add_pi32(e1, e0);
    result = _m_to_int(e1);
    _mm_empty();
    return result;
}
Thanks,
Alex Telitsine
Streambox Inc.
int C_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, err = 0;
    for (i = 0; i < 16; i++) {
        err += ABS(pDst[ 0] - pSrc[ 0]);
        err += ABS(pDst[ 1] - pSrc[ 1]);
        err += ABS(pDst[ 2] - pSrc[ 2]);
        err += ABS(pDst[ 3] - pSrc[ 3]);
        err += ABS(pDst[ 4] - pSrc[ 4]);
        err += ABS(pDst[ 5] - pSrc[ 5]);
        err += ABS(pDst[ 6] - pSrc[ 6]);
        err += ABS(pDst[ 7] - pSrc[ 7]);
        err += ABS(pDst[ 8] - pSrc[ 8]);
        err += ABS(pDst[ 9] - pSrc[ 9]);
        err += ABS(pDst[10] - pSrc[10]);
        err += ABS(pDst[11] - pSrc[11]);
        err += ABS(pDst[12] - pSrc[12]);
        err += ABS(pDst[13] - pSrc[13]);
        err += ABS(pDst[14] - pSrc[14]);
        err += ABS(pDst[15] - pSrc[15]);
        pDst += w_Dst;
        pSrc += w_Src;
    }
    return err;
}
and, for the Intel platform, there are the following definitions:
#ifdef __ICL // intel compiler
#define _X86_COMPATABLE_CPU_
#define RESTRICT restrict // -Qrestrict option should be ON
#define ABS(i) abs(i)
#ifndef INLINE
#define INLINE _inline // inline is always available on MMX
#endif
#define CAN_RW_UNALIGNED
#endif
.......
#ifndef BYTE
typedef unsigned char BYTE;
#endif
ftp://download.intel.com/design/perftool/cbts/appnotes/sse2/w_me_alg.pdf
Regarding cache-line splits, the application note says that an unaligned load should still be faster than loading two 64-bit values.
Could it be that the application note is not "entirely true"?
Anyway, I'll try to see what's going on in VTune today.
The problem was that only half of the result was in use, and it caused the motion search to perform more block comparisons:
e = _mm_srli_si128(e, 8);
e = _mm_add_epi32(e, e);
Unrolling the loop and interleaving e0/e1 inside it gave a little improvement as well.
Overall, SSE2 gave a 1.27x improvement in clock ticks: SSE - 1969 clocks, SSE2 - 1542 clocks.
Below is the final version of the SSE2 code:
int SSE2_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int result;
    __m128i e0, e1;
    e0 = _mm_sad_epu8(*(__m128i*)(pDst + ( 0)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 0)*w_Src)));
    e1 = _mm_sad_epu8(*(__m128i*)(pDst + ( 1)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 1)*w_Src)));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 2)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 2)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 3)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 3)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 4)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 4)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 5)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 5)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 6)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 6)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 7)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 7)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 8)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 8)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 9)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 9)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (10)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (10)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (11)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (11)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (12)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (12)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (13)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (13)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (14)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (14)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (15)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (15)*w_Src))));
    e0 = _mm_add_epi32(e0, e1);
    e0 = _mm_add_epi32(e0, _mm_srli_si128(e0, 8));
    result = _mm_cvtsi128_si32(e0);
    _mm_empty();
    return result;
}
Alex
>> Overall SSE2 gave a 1.27x improvement in clock ticks:
>> SSE-1969 clocks, SSE2-1542 clocks.
How did you arrive at the number of clock cycles?
Thanks in advance
Regards,
kiran
kiran N. wrote:
>> Overall SSE2 gave a 1.27x improvement in clock ticks:
>> SSE-1969 clocks, SSE2-1542 clocks.
>> How did you arrive at the number of clock cycles?
Probably by using the _asm rdtsc instruction or the __rdtsc() intrinsic.
alex-telitsine wrote:
>> ftp://download.intel.com/design/perftool/cbts/appnotes/sse2/w_me_alg.pdf
>> Regarding cache-line splits, the application note says that an unaligned load should still be faster than loading two 64-bit values.
>> Could it be that the application note is not "entirely true"?
On the Westmere CPU, I got consistently better results by splitting 128-bit loads, which some compilers did automatically when using source code, in a case where 50% of the loads were unaligned. I could never verify the often-stated recommendation to build with the SSE4.2 option for Westmere, when the earlier architecture options frequently use specific strategies for unaligned loads.
This difference may have gone away for 128-bit loads on Sandy Bridge, but there the usual splitting of 256-bit unaligned loads is quite important. Even on Ivy Bridge, where a late fix went in to reduce the penalty for 256-bit unaligned loads, my SSE2 intrinsics run faster than the AVX ones for this reason (with the AVX transition penalties handled either by AVX-128 translation or an explicit vzeroupper). That has changed with the Haswell CPUs.
It's difficult to infer this from VTune, unless you can count cache line splits and correlate them with stalls on instructions which consume the results of memory loads.
Anyway, the assumption which seems to be made here that the specific CPU architecture or stepping can be ignored when considering the effect of misalignment is not a good one.
>> OK, here are two functions: one is SSE (P3/MMX, etc.); the other is SSE2.
>> SSE2 is performing slower. Any advice why?
Try posting your question on the ISA forum.
>> Overall SSE2 gave a 1.27x improvement in clock ticks:
>> SSE-1969 clocks, SSE2-1542 clocks.
In this case the SSE2 version does not rely on any for-loop logic, so without the overhead of the compiled loop instructions the code executes faster. Bear in mind that your code mixes SIMD vector operations with scalar integer operations, although Haswell can probably schedule the loop instructions on Port 6, freeing resources for the SIMD ALU. An interesting question also arises here about the low-level implementation of the integer ALU vs. the SIMD ALU: I would like to know whether scalar integer code is executed by the same ALU as SIMD integer code.
Thanks for the info, I was looking for it for a long time :)
>> Thanks for the info, I was looking for it for a long time :)
You are welcome.
Try replacing _mm_loadu_si128 with _mm_lddqu_si128; you might get better performance in your case (purely empirical).
And I guess _mm_empty() is useless if you don't use MMX anymore in that SSE2 function.
Wow, that's a blast from the past, a 2003 email? Yes, _mm_lddqu_si128 is used for SSE3-and-higher code :-)