Hi, I have a problem.
When I call the ippiYCbCr420ToYCbCr422_8u_P2C2R function, it is slower than my own SSE4.1 assembly code.
This is my code:
IppiSize nsize;
nsize.width = par->dst_width;
nsize.height = par->dst_height;
ippiYCbCr420ToYCbCr422_8u_P2C2R( m_pmfxOutSurface->Data.Y,m_pmfxOutSurface->Data.Pitch, m_pmfxOutSurface->Data.UV,
m_pmfxOutSurface->Data.Pitch, par->data[0], par->linesize[0], nsize );
Thanks for your help.
Bill
Hi Chen.
I have three questions:
What version of IPP do you use? (See the ippiGetLibVersion() function.)
Could you please provide the size (IppiSize nsize) of your image?
Also, how do you measure performance? The typical method is:
int NLOOPS = 100000;
start = rdtsc;
ippFunc();
for (n = 0; n < NLOOPS; n++) {
    ippFunc();
}
stop = rdtsc;
perf = (stop - start) / NLOOPS;
print perf;
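The measurement loop above can be sketched in portable C++ roughly as follows. This is only an illustration: it swaps rdtsc for std::chrono::steady_clock, runs one untimed warm-up call, and the names average_ns_per_call and clocks_per_pixel are hypothetical helpers, not part of IPP. The CPU frequency passed to clocks_per_pixel is an assumption the caller has to supply.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average wall-clock nanoseconds per call of `func`.
// One untimed warm-up call runs first so that cold caches and lazy
// initialization do not distort the average.
template <typename Func>
double average_ns_per_call(Func&& func, int loops)
{
    assert(loops > 0);
    func();  // warm-up, not timed
    const auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < loops; ++n) {
        func();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / loops;
}

// Converts a per-call time into CPU clocks per pixel for a given image size.
// `cpu_hz` is an assumed nominal core frequency supplied by the caller.
inline double clocks_per_pixel(double ns_per_call, double cpu_hz,
                               int width, int height)
{
    assert(width > 0 && height > 0);
    return ns_per_call * 1e-9 * cpu_hz / (static_cast<double>(width) * height);
}
```

Passing a lambda that wraps the conversion call, together with the image size, gives a clocks-per-pixel figure that is easier to compare between implementations than a raw tick count.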
A small reproducer from you would also help us give you the best answer.
Thanks for your feedback.
Hi Andrey,
1. My IPP version is "ippCore 2018.0.0 (r56444)".
2. The image size is 1920x1080.
3. I use this code to measure performance:
DWORD t1, t2;
t1 = clock();
for( int i = 0; i < 100; i++ )
{
    IppiSize nsize;
    nsize.width = 1920;
    nsize.height = 1080;
    ippiYCbCr420ToYCbCr422_8u_P2C2R( m_pmfxOutSurface->Data.Y, m_pmfxOutSurface->Data.Pitch,
                                     m_pmfxOutSurface->Data.UV, m_pmfxOutSurface->Data.Pitch,
                                     par->data[0], par->linesize[0], nsize );
    // fnCGV_Color.NV12_to_YUY2(); // this is my SSE4.1 assembly code
}
t2 = clock();
char Msg[256];
sprintf( Msg, " Timer = %d", t2 - t1 );
OutputDebugString( Msg );
Thanks for your help.
Bill
Hi Bill.
It is single-threaded code. Performance is 0.38 clocks per pixel, but I cannot say whether that is fast or slow compared with your code.
Hi Andrey,
My computer is an "Intel(R) Core(TM) i5-4590 CPU @ 3.3GHz, 8 GB RAM, Windows 8.1 x64".
When I run ippiYCbCr420ToYCbCr422_8u_P2C2R, the result is about 13 clock() ticks per call (1300 / 100),
but when I run my SSE4.1 assembly code, the result is about 1.7 clock() ticks per call (170 / 100).
Both tests are single-threaded.
This is my SSE4.1 assembly code:
void CGV_Color::NV12_to_YUY2_X1Y1_sse41( void )
{
    int i, j;
    int nHeight, nWidth;
    int nPitchSrcY, nPitchSrcUV, nPitchDstYUY2;
    unsigned char *ptSrcY, *ptSrcUV, *ptDstYUY2_0, *ptDstYUY2_1;
    __m128i y0, y1, uv0;

    if( bFlip )
    {
        nHeight = nDstHeight >> 1;
        nWidth = nDstWidth >> 4;
        ptSrcY = lpY;
        ptSrcUV = lpUV;
        nPitchSrcY = nPitchY;
        nPitchSrcUV = nPitchUV;
        ptDstYUY2_0 = lpYUY2 + ( nDstHeight - 1 ) * nPitchYUY2;
        ptDstYUY2_1 = lpYUY2 + ( nDstHeight - 2 ) * nPitchYUY2;
        nPitchDstYUY2 = -nPitchYUY2;
    }
    else
    {
        nHeight = nDstHeight >> 1;
        nWidth = nDstWidth >> 4;
        ptSrcY = lpY;
        ptSrcUV = lpUV;
        nPitchSrcY = nPitchY;
        nPitchSrcUV = nPitchUV;
        ptDstYUY2_0 = lpYUY2;
        ptDstYUY2_1 = lpYUY2 + nPitchYUY2;
        nPitchDstYUY2 = nPitchYUY2;
    }

    if( ((int)ptDstYUY2_0 %16 == 0 ) && ( nPitchDstYUY2%16 == 0 ) )
    {
        for( j = 0; j < nHeight; j++ )
        {
            _mm_mfence();
            for( i = 0; i < nWidth; i++ )
            {
                ///// UV /////
                uv0 = _mm_stream_load_si128( (__m128i*)( ptSrcUV ) );
                ///// Y /////
                y0 = _mm_stream_load_si128( (__m128i*)( ptSrcY ) );
                y1 = _mm_stream_load_si128( (__m128i*)( ptSrcY + nPitchY ) );
                /////
                _mm_stream_si128( (__m128i*)( ptDstYUY2_0 ), _mm_unpacklo_epi8( y0, uv0 ) );
                _mm_stream_si128( (__m128i*)( ptDstYUY2_0 + 16 ), _mm_unpackhi_epi8( y0, uv0 ) );
                _mm_stream_si128( (__m128i*)( ptDstYUY2_1 ), _mm_unpacklo_epi8( y1, uv0 ) );
                _mm_stream_si128( (__m128i*)( ptDstYUY2_1 + 16 ), _mm_unpackhi_epi8( y1, uv0 ) );
                /////
                ptSrcY += 16;
                ptSrcUV += 16;
                ptDstYUY2_0 += 32;
                ptDstYUY2_1 += 32;
            }
            ptSrcY += ( nPitchY << 1 ) - ( nWidth << 4 );
            ptSrcUV += ( nPitchSrcUV ) - ( nWidth << 4 );
            ptDstYUY2_0 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 5 );
            ptDstYUY2_1 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 5 );
        }
    }
    else
    {
        for( j = 0; j < nHeight; j++ )
        {
            _mm_mfence();
            for( i = 0; i < nWidth; i++ )
            {
                ///// UV /////
                uv0 = _mm_stream_load_si128( (__m128i*)( ptSrcUV ) );
                ///// Y /////
                y0 = _mm_stream_load_si128( (__m128i*)( ptSrcY ) );
                y1 = _mm_stream_load_si128( (__m128i*)( ptSrcY + nPitchY ) );
                /////
                _mm_storeu_si128( (__m128i*)( ptDstYUY2_0 ), _mm_unpacklo_epi8( y0, uv0 ) );
                _mm_storeu_si128( (__m128i*)( ptDstYUY2_0 + 16 ), _mm_unpackhi_epi8( y0, uv0 ) );
                _mm_storeu_si128( (__m128i*)( ptDstYUY2_1 ), _mm_unpacklo_epi8( y1, uv0 ) );
                _mm_storeu_si128( (__m128i*)( ptDstYUY2_1 + 16 ), _mm_unpackhi_epi8( y1, uv0 ) );
                /////
                ptSrcY += 16;
                ptSrcUV += 16;
                ptDstYUY2_0 += 32;
                ptDstYUY2_1 += 32;
            }
            ptSrcY += ( nPitchY << 1 ) - ( nWidth << 4 );
            ptSrcUV += ( nPitchSrcUV ) - ( nWidth << 4 );
            ptDstYUY2_0 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 5 );
            ptDstYUY2_1 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 5 );
        }
    }

    int n = ( nDstWidth >> 4 ) << 4; // (dstWidth/16)*16
    int remain = nDstWidth - n;
    if( remain )
    {
        if( ( nPitchY%16 == 0 ) && ( remain%8 == 0 ) )
        {
            if( bFlip )
            {
                nWidth = remain >> 3;
                ptSrcY = lpY + n;
                ptSrcUV = lpUV + n;
                ptDstYUY2_0 = lpYUY2 + ( nDstHeight - 1 ) * nPitchYUY2 + ( n << 1 );
                ptDstYUY2_1 = ptDstYUY2_0 + nPitchDstYUY2;
            }
            else
            {
                nWidth = remain >> 3;
                ptSrcY = lpY + n;
                ptSrcUV = lpUV + n;
                ptDstYUY2_0 = lpYUY2 + ( n << 1 );
                ptDstYUY2_1 = ptDstYUY2_0 + nPitchDstYUY2;
            }
            for( j = 0; j < nHeight; j++ )
            {
                _mm_mfence();
                for( i = 0; i < nWidth; i++ )
                {
                    ///// UV /////
                    uv0 = _mm_stream_load_si128( (__m128i*)( ptSrcUV ) );
                    ///// Y /////
                    y0 = _mm_stream_load_si128( (__m128i*)( ptSrcY ) );
                    y1 = _mm_stream_load_si128( (__m128i*)( ptSrcY + nPitchY ) );
                    /////
                    _mm_storeu_si128( (__m128i*)( ptDstYUY2_0 ), _mm_unpacklo_epi8( y0, uv0 ) );
                    _mm_storeu_si128( (__m128i*)( ptDstYUY2_1 ), _mm_unpacklo_epi8( y1, uv0 ) );
                    /////
                    ptSrcY += 8;
                    ptSrcUV += 8;
                    ptDstYUY2_0 += 16;
                    ptDstYUY2_1 += 16;
                }
                ptSrcY += ( nPitchY << 1 ) - ( nWidth << 3 );
                ptSrcUV += ( nPitchSrcUV ) - ( nWidth << 3 );
                ptDstYUY2_0 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 4 );
                ptDstYUY2_1 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 4 );
            }
        }
        else if( ( nPitchY%16 == 0 ) && ( remain%4 == 0 ) )
        {
            if( bFlip )
            {
                nWidth = remain >> 2;
                ptSrcY = lpY + n;
                ptSrcUV = lpUV + n;
                ptDstYUY2_0 = lpYUY2 + ( nDstHeight - 1 ) * nPitchYUY2 + ( n << 1 );
                ptDstYUY2_1 = ptDstYUY2_0 + nPitchDstYUY2;
            }
            else
            {
                nWidth = remain >> 2;
                ptSrcY = lpY + n;
                ptSrcUV = lpUV + n;
                ptDstYUY2_0 = lpYUY2 + ( n << 1 );
                ptDstYUY2_1 = ptDstYUY2_0 + nPitchDstYUY2;
            }
            for( j = 0; j < nHeight; j++ )
            {
                _mm_mfence();
                for( i = 0; i < nWidth; i++ )
                {
                    ///// UV /////
                    uv0 = _mm_stream_load_si128( (__m128i*)( ptSrcUV ) );
                    ///// Y /////
                    y0 = _mm_stream_load_si128( (__m128i*)( ptSrcY ) );
                    y1 = _mm_stream_load_si128( (__m128i*)( ptSrcY + nPitchY ) );
                    /////
                    _mm_storel_epi64( (__m128i*)( ptDstYUY2_0 ), _mm_unpacklo_epi8( y0, uv0 ) );
                    _mm_storel_epi64( (__m128i*)( ptDstYUY2_1 ), _mm_unpacklo_epi8( y1, uv0 ) );
                    /////
                    ptSrcY += 4;
                    ptSrcUV += 4;
                    ptDstYUY2_0 += 8;
                    ptDstYUY2_1 += 8;
                }
                ptSrcY += ( nPitchY << 1 ) - ( nWidth << 2 );
                ptSrcUV += ( nPitchSrcUV ) - ( nWidth << 2 );
                ptDstYUY2_0 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 3 );
                ptDstYUY2_1 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 3 );
            }
        }
        else
        {
            lpY += n;
            lpUV += n;
            lpYUY2 += ( n << 1 );
            nDstWidth = remain;
            this->NV12_to_YUY2_X1Y1_c();
        }
    }
}
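For cross-checking the output of the SSE path, a plain scalar version of the same interleave can serve as a reference. The sketch below is an assumption about what such a reference might look like; it is not the NV12_to_YUY2_X1Y1_c fallback from the post (which is not shown), and the name nv12_to_yuy2_ref and its parameters are hypothetical. It assumes even width and height, pitches in bytes, and the same byte order as the unpack pattern above (Y0 U Y1 V).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference: NV12 (Y plane + interleaved UV plane) to
// packed YUY2 (Y0 U Y1 V). Each UV row is shared by two consecutive Y rows,
// which mirrors how the SSE loop reuses uv0 for y0 and y1.
void nv12_to_yuy2_ref(const std::uint8_t* y, std::ptrdiff_t y_pitch,
                      const std::uint8_t* uv, std::ptrdiff_t uv_pitch,
                      std::uint8_t* dst, std::ptrdiff_t dst_pitch,
                      int width, int height)
{
    assert(width > 0 && height > 0);
    assert(width % 2 == 0 && height % 2 == 0);  // 4:2:0 needs even dimensions
    for (int row = 0; row < height; ++row) {
        const std::uint8_t* y_row = y + row * y_pitch;
        const std::uint8_t* uv_row = uv + (row / 2) * uv_pitch;
        std::uint8_t* out = dst + row * dst_pitch;
        for (int x = 0; x < width; x += 2) {
            out[2 * x + 0] = y_row[x];       // Y0
            out[2 * x + 1] = uv_row[x];      // U
            out[2 * x + 2] = y_row[x + 1];   // Y1
            out[2 * x + 3] = uv_row[x + 1];  // V
        }
    }
}
```

Because the pitches are signed, a flipped destination can be expressed by pointing dst at the last row and passing a negative dst_pitch, similar to what the bFlip branch does.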
Thanks.
Bill
Hi Bill.
Thanks for your reproducer. IPP has similar optimized code, but we don't use "stream" instructions in these functions because they are non-temporal. In most applications IPP functions are used in a pipeline, where the next function uses the output of the previous function as its own input. In this situation it is preferable for the output to be in cache. Also, non-temporal instructions have a negative effect for small image sizes, because the data ends up out of the cache.
If you are interested in this area, I recommend getting the cache size (ippGetCacheSizeB) and experimenting with combinations of: total image size (src + dst) relative to L1, L2, and L3; _mm_stream_si128 versus _mm_store_si128; cold versus hot cache; and 1 call versus 1000 calls, to see how the performance of your code changes.
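As a starting point for that experiment, the total size (src + dst) of one conversion can be computed and compared against a cache size. The sketch below is only illustrative: WorkingSet, nv12_to_yuy2_working_set, and fits_in_cache are hypothetical names, rows are assumed to be tightly packed (real pitches may be larger), and the cache size is a value the caller has to supply.

```cpp
#include <cassert>
#include <cstddef>

// Bytes touched by one NV12 -> YUY2 conversion, assuming tightly packed rows
// (pitch == width for Y and UV, pitch == 2 * width for YUY2).
struct WorkingSet {
    std::size_t src_bytes;  // Y plane + interleaved UV plane
    std::size_t dst_bytes;  // packed YUY2
    std::size_t total() const { return src_bytes + dst_bytes; }
};

inline WorkingSet nv12_to_yuy2_working_set(std::size_t width, std::size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    const std::size_t y_bytes = width * height;
    const std::size_t uv_bytes = width * (height / 2);
    return WorkingSet{y_bytes + uv_bytes, 2 * width * height};
}

// True when source and destination together fit in a cache of `cache_bytes`.
// The cache size is an assumption supplied by the caller.
inline bool fits_in_cache(const WorkingSet& ws, std::size_t cache_bytes)
{
    return ws.total() <= cache_bytes;
}
```

For a 1920x1080 frame this gives 3,110,400 source bytes plus 4,147,200 destination bytes, 7,257,600 bytes in total, which would not fit in an assumed 6 MiB last-level cache. That is the regime where non-temporal stores are most likely to look good in a benchmark.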
Thanks.
Hi Andrey,
Thank you very much for your help. By the way, I also have another problem.
When I run my SSE4.1 assembly code continuously for about one week, it suddenly crashes.
Do you have any suggestions?
Thanks.
Bill
Hi.
Check for reads or stores outside the allocated memory buffer, and check the 16-byte alignment of both the buffer addresses and the step between lines. Both _mm_stream_load_si128 and _mm_store_si128 require a 16-byte-aligned address. If the address is not 16-byte aligned, an exception is raised. For example, you can download Intel SDE and run your application under it. This tool will show you the exact asm instruction that reads from an address that is not 16-byte aligned. It is very easy to install and run.
Also, just for information: _mm_stream_load_si128 requires a special memory attribute. For details see, for example, https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075.
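The alignment checks described above could be done with a small helper like the one below before entering an aligned load/store loop. This is a sketch with hypothetical names (is_aligned, rows_are_aligned16). It converts the pointer to std::uintptr_t instead of int, so no address bits are dropped on 64-bit builds. In the posted code it may also be worth checking the source side: only the destination pointer and pitch are tested before the aligned path, and the remainder loops appear to advance the source pointers by 4 or 8 bytes while still issuing 16-byte _mm_stream_load_si128 loads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `p` is aligned to `alignment` bytes (must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment = 16)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical pre-check for a loop that uses 16-byte aligned loads or
// stores on every row: the base pointer and the byte pitch must both be
// multiples of 16, otherwise rows after the first start unaligned.
inline bool rows_are_aligned16(const void* base, std::ptrdiff_t pitch)
{
    return is_aligned(base, 16) && (pitch % 16 == 0);
}
```

A caller would run this check on every plane that is accessed with aligned instructions (Y, UV, and the destination) and fall back to unaligned loads and stores when any check fails.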
Thanks.
