"ippiYCbCr420ToYCbCr422_8u_P2C2R" is Slow

CBill2 · ‎05-23-2018

Hi, I have a problem.

When I call the "ippiYCbCr420ToYCbCr422_8u_P2C2R" function, it is slower than my own SSE4.1 assembly code.

This is my code:

IppiSize nsize;
nsize.width  = par->dst_width;
nsize.height = par->dst_height;
ippiYCbCr420ToYCbCr422_8u_P2C2R( m_pmfxOutSurface->Data.Y, m_pmfxOutSurface->Data.Pitch,
                                 m_pmfxOutSurface->Data.UV, m_pmfxOutSurface->Data.Pitch,
                                 par->data[0], par->linesize[0], nsize );

Thanks for your help.

Bill

Andrey_B_Intel · ‎05-24-2018

Hi Chen.

I have three questions:

Which version of IPP do you use? (See the ippiGetLibVersion() function.)

Could you please provide the dimensions (IppiSize nsize) of your image?

Also, how do you measure performance? The typical method is:

int NLOOPS = 100000;
ippFunc();                        // warm-up call, not timed
start = rdtsc;
for (n = 0; n < NLOOPS; n++) {
    ippFunc();
}
stop = rdtsc;
perf = (stop - start) / NLOOPS;   // average clocks per call
print perf

A small reproducer from you would also be great, so that we can give you the best help.

Thanks for your feedback.

CBill2 · ‎05-24-2018

Hi Andrey,

1. My IPP version is " ippCore 2018.0.0 (r56444) "

2. Image size is 1920x1080

3. I use this code to measure perf

DWORD t1, t2;
t1 = clock();
for( int i = 0; i < 100; i++ )
{
    IppiSize nsize;
    nsize.width  = 1920;
    nsize.height = 1080;
    ippiYCbCr420ToYCbCr422_8u_P2C2R( m_pmfxOutSurface->Data.Y, m_pmfxOutSurface->Data.Pitch,
                                     m_pmfxOutSurface->Data.UV, m_pmfxOutSurface->Data.Pitch,
                                     par->data[0], par->linesize[0], nsize );
    // fnCGV_Color.NV12_to_YUY2(); // this is my SSE4.1 assembly code
}
t2 = clock();

char Msg[256];
sprintf( Msg, " Timer = %d", t2 - t1 );
OutputDebugString( Msg );

Thanks for your help.

Bill

Andrey_B_Intel · ‎05-28-2018

Hi Bill.

I've run the IPP ps tests of the SSE4.2 code:

CPU,Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 12x2.5 GHz, Max cache size 30720 K

Library,ippCC SSE4.2 (y8),

function,Parm1,Parm2,Parm3,Parm4,Parm5,Parm6,Parm7,Parm8,Comment,Clocks,per,Time (usec),MFlops

ippiYCbCr420ToYCbCr422,8u,P2C2R,1920x1072,-,-,-,-,-,nLps=10,0.388,px,319,-

This is single-threaded code. Performance is 0.388 clocks per pixel, but I cannot say whether that is fast or slow compared with your code.

Could you please provide the perf numbers on your machine?

Thanks.

CBill2 · ‎05-28-2018

Hi Andrey,

My computer is an "Intel(R) Core(TM) i5-4590 CPU @ 3.3GHz, 8G RAM, Windows 8.1 x64".

When I run "ippiYCbCr420ToYCbCr422_8u_P2C2R", the perf is about 13 (1300 / 100) clocks per call,

but when I run my SSE4.1 assembly code, the perf is about 1.7 (170 / 100) clocks per call.

All of these tests are single-threaded.

This is my SSE4.1 assembly code:

void CGV_Color::NV12_to_YUY2_X1Y1_sse41( void )
{
	int				i, j;
	int				nHeight, nWidth;
 	int				nPitchSrcY, nPitchSrcUV, nPitchDstYUY2;
 	unsigned char	*ptSrcY, *ptSrcUV, *ptDstYUY2_0, *ptDstYUY2_1;
  	__m128i			y0, y1, uv0;

	if( bFlip )  
	{
		nHeight       = nDstHeight >> 1;
		nWidth        = nDstWidth  >> 4;
		ptSrcY	      = lpY;
 		ptSrcUV		  = lpUV;
		nPitchSrcY    = nPitchY;
		nPitchSrcUV   = nPitchUV;
 		ptDstYUY2_0	  = lpYUY2 + ( nDstHeight - 1 ) * nPitchYUY2;
 		ptDstYUY2_1	  = lpYUY2 + ( nDstHeight - 2 ) * nPitchYUY2;
		nPitchDstYUY2 = -nPitchYUY2;
	}
	else
	{
		nHeight       = nDstHeight >> 1;
		nWidth        = nDstWidth  >> 4;
		ptSrcY	      = lpY;
 		ptSrcUV		  = lpUV;
		nPitchSrcY    = nPitchY;
		nPitchSrcUV   = nPitchUV;
 		ptDstYUY2_0	  = lpYUY2;
 		ptDstYUY2_1	  = lpYUY2 + nPitchYUY2;
		nPitchDstYUY2 = nPitchYUY2;
 	}

	if( ((int)ptDstYUY2_0 %16 == 0 ) && ( nPitchDstYUY2%16 == 0 ) )
	{
		for( j = 0; j < nHeight; j++ )
		{ 
			_mm_mfence();

			for( i = 0; i < nWidth; i++ )
			{
				///// UV /////
				uv0 = _mm_stream_load_si128( (__m128i*)( ptSrcUV ) );

				///// Y /////
				y0  = _mm_stream_load_si128( (__m128i*)( ptSrcY ) );
				y1  = _mm_stream_load_si128( (__m128i*)( ptSrcY + nPitchY ) );

				///// 
				_mm_stream_si128( (__m128i*)( ptDstYUY2_0 ), _mm_unpacklo_epi8( y0, uv0 ) );
				_mm_stream_si128( (__m128i*)( ptDstYUY2_0 + 16 ), _mm_unpackhi_epi8( y0, uv0 ) );
				_mm_stream_si128( (__m128i*)( ptDstYUY2_1 ), _mm_unpacklo_epi8( y1, uv0 ) );
				_mm_stream_si128( (__m128i*)( ptDstYUY2_1 + 16 ), _mm_unpackhi_epi8( y1, uv0 ) );

				/////
				ptSrcY += 16;
				ptSrcUV += 16;
				ptDstYUY2_0 += 32;
				ptDstYUY2_1 += 32;
			}

			ptSrcY	  += ( nPitchY << 1 ) - ( nWidth << 4 );
			ptSrcUV	  += ( nPitchSrcUV ) - ( nWidth << 4 );
			ptDstYUY2_0 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 5 );
			ptDstYUY2_1 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 5 );
		}
	}
	else
	{
		for( j = 0; j < nHeight; j++ )
		{ 
			_mm_mfence();

			for( i = 0; i < nWidth; i++ )
			{
				///// UV /////
				uv0 = _mm_stream_load_si128( (__m128i*)( ptSrcUV ) );

				///// Y /////
				y0  = _mm_stream_load_si128( (__m128i*)( ptSrcY ) );
				y1  = _mm_stream_load_si128( (__m128i*)( ptSrcY + nPitchY ) );

				///// 
				_mm_storeu_si128( (__m128i*)( ptDstYUY2_0 ), _mm_unpacklo_epi8( y0, uv0 ) );
				_mm_storeu_si128( (__m128i*)( ptDstYUY2_0 + 16 ), _mm_unpackhi_epi8( y0, uv0 ) );
				_mm_storeu_si128( (__m128i*)( ptDstYUY2_1 ), _mm_unpacklo_epi8( y1, uv0 ) );
				_mm_storeu_si128( (__m128i*)( ptDstYUY2_1 + 16 ), _mm_unpackhi_epi8( y1, uv0 ) );

				/////
				ptSrcY += 16;
				ptSrcUV += 16;
				ptDstYUY2_0 += 32;
				ptDstYUY2_1 += 32;
			}

			ptSrcY	  += ( nPitchY << 1 ) - ( nWidth << 4 );
			ptSrcUV	  += ( nPitchSrcUV ) - ( nWidth << 4 );
			ptDstYUY2_0 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 5 );
			ptDstYUY2_1 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 5 );
		}
	}

 	int n = ( nDstWidth >> 4 ) << 4;  // (dstWidth/16)*16
 	int remain = nDstWidth - n;  
	if( remain ) 
	{
		if( ( nPitchY%16 == 0 ) && ( remain%8 == 0 ) ) 
		{
			if( bFlip )  
			{
				nWidth        = remain  >> 3;
				ptSrcY	      = lpY + n;
 				ptSrcUV		  = lpUV + n;
 				ptDstYUY2_0	  = lpYUY2 + ( nDstHeight - 1 ) * nPitchYUY2 + ( n << 1 );
 				ptDstYUY2_1	  = ptDstYUY2_0 + nPitchDstYUY2;
			}
			else
			{
				nWidth        = remain  >> 3;
				ptSrcY	      = lpY + n;
 				ptSrcUV		  = lpUV + n;
 				ptDstYUY2_0	  = lpYUY2 + ( n << 1 );
 				ptDstYUY2_1	  = ptDstYUY2_0 + nPitchDstYUY2;
 			}

			for( j = 0; j < nHeight; j++ )
			{ 
				_mm_mfence();

				for( i = 0; i < nWidth; i++ )
				{
					///// UV /////
					uv0 = _mm_stream_load_si128( (__m128i*)( ptSrcUV ) );

					///// Y /////
					y0  = _mm_stream_load_si128( (__m128i*)( ptSrcY ) );
					y1  = _mm_stream_load_si128( (__m128i*)( ptSrcY + nPitchY ) );

					///// 
					_mm_storeu_si128( (__m128i*)( ptDstYUY2_0 ), _mm_unpacklo_epi8( y0, uv0 ) );
					_mm_storeu_si128( (__m128i*)( ptDstYUY2_1 ), _mm_unpacklo_epi8( y1, uv0 ) );

					/////
					ptSrcY += 8;
					ptSrcUV += 8;
					ptDstYUY2_0 += 16;
					ptDstYUY2_1 += 16;
				}

				ptSrcY	  += ( nPitchY << 1 ) - ( nWidth << 3 );
				ptSrcUV	  += ( nPitchSrcUV ) - ( nWidth << 3 );
				ptDstYUY2_0 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 4 );
				ptDstYUY2_1 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 4 );
			}
		}
		else if( ( nPitchY%16 == 0 ) && ( remain%4 == 0 ) )
		{
			if( bFlip )  
			{
				nWidth        = remain  >> 2;
				ptSrcY	      = lpY + n;
 				ptSrcUV		  = lpUV + n;
 				ptDstYUY2_0	  = lpYUY2 + ( nDstHeight - 1 ) * nPitchYUY2 + ( n << 1 );
 				ptDstYUY2_1	  = ptDstYUY2_0 + nPitchDstYUY2;
			}
			else
			{
				nWidth        = remain  >> 2;
				ptSrcY	      = lpY + n;
 				ptSrcUV		  = lpUV + n;
 				ptDstYUY2_0	  = lpYUY2 + ( n << 1 );
 				ptDstYUY2_1	  = ptDstYUY2_0 + nPitchDstYUY2;
 			}

			for( j = 0; j < nHeight; j++ )
			{ 
				_mm_mfence();

				for( i = 0; i < nWidth; i++ )
				{
					///// UV /////
					uv0 = _mm_stream_load_si128( (__m128i*)( ptSrcUV ) );

					///// Y /////
					y0  = _mm_stream_load_si128( (__m128i*)( ptSrcY ) );
					y1  = _mm_stream_load_si128( (__m128i*)( ptSrcY + nPitchY ) );

					///// 
					_mm_storel_epi64( (__m128i*)( ptDstYUY2_0 ), _mm_unpacklo_epi8( y0, uv0 ) );
					_mm_storel_epi64( (__m128i*)( ptDstYUY2_1 ), _mm_unpacklo_epi8( y1, uv0 ) );

					/////
					ptSrcY += 4;
					ptSrcUV += 4;
					ptDstYUY2_0 += 8;
					ptDstYUY2_1 += 8;
				}

				ptSrcY	  += ( nPitchY << 1 ) - ( nWidth << 2 );
				ptSrcUV	  += ( nPitchSrcUV ) - ( nWidth << 2 );
				ptDstYUY2_0 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 3 );
				ptDstYUY2_1 += ( nPitchDstYUY2 << 1 ) - ( nWidth << 3 );
			}
		}
		else
		{
			lpY    += n;
			lpUV   += n;
			lpYUY2 += ( n << 1 ); 
			nDstWidth = remain;
			this->NV12_to_YUY2_X1Y1_c();
		}
	}
}

Thanks.

Bill

Andrey_B_Intel · ‎05-29-2018

Hi Bill.

Thanks for your reproducer. IPP has similar optimized code, but we don't use "stream" instructions in these functions because they are non-temporal. In most applications, IPP functions are used in a pipeline in which the next function takes the output of the previous function as its input. In that situation it is preferable for the output to be in cache. Non-temporal instructions also have a negative effect for small image sizes, because the output data end up outside the cache.

If you are interested in this area, I recommend getting the cache size (ippGetCacheSizeB) and experimenting with different combinations to see how the performance of your code changes: total image size (src+dst) relative to L1, L2, and L3; _mm_stream_si128 versus _mm_store_si128; cold versus hot cache; and 1 call versus 1000 calls.

Thanks.

CBill2 · ‎05-29-2018

Hi Andrey,

Thank you very much for your help. By the way, I also have another problem.

When I run my SSE4.1 assembly code continuously for about one week, it suddenly crashes.

Do you have any suggestions?

Thanks.

Bill

Andrey_B_Intel · ‎05-30-2018

Hi.

Check for reads or stores outside the allocated memory buffer, and check the 16-byte alignment of both the pointers and the step between lines. Both _mm_stream_load_si128 and _mm_store_si128 require a 16-byte-aligned address; an address that is not 16-byte aligned raises an exception. For example, you can download Intel SDE and run your app under it. This tool will show you the exact asm instruction that reads from an address that is not 16-byte aligned. It is very easy to install and run.

Also, just for information, _mm_stream_load_si128 requires a special memory attribute. For details see, for example, https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075.

Thanks.