Speeding up H264 encoding with GPU

chrisdo — Tue, 17 Feb 2009 08:34:34 GMT

Hello,

I am currently trying to modify IPP samples for H264 encoding for speeding it up using CUDA.
My goal is to make Motion Estimation running on the GPU.

For that, I saw that each MacroBlock of a frame (actually a slice) is visited for ME and then encoding in the function H264CoreEncoder_Compress_Slice(). So the first part of my job is to do ME for all MBs first, and then to do encoding.
My problem is there, I'm stuck doing it : when I separate ME and encoding, the final encoded video is not as expected (it's all grey...).

Here is my code :

[cpp]Status H264ENC_MAKE_NAME(H264CoreEncoder_Compress_Slice_GPU)(
    void* state,
    H264SliceType *curr_slice,
	bool is_first_mb)
{
	H264CoreEncoderType* core_enc = (H264CoreEncoderType *)state;
	H264CurrentMacroblockDescriptorType &cur_mb = curr_slice->m_cur_mb;
	H264BsRealType* pBitstream = (H264BsRealType *)curr_slice->m_pbitstream;
	Ipp32s slice_num = curr_slice->m_slice_number;
	EnumSliceType slice_type = curr_slice->m_slice_type;
	Ipp8u uUsePCM = 0;

	Ipp8u *pStartBits;
	Ipp32u uStartBitOffset;

	Ipp32u uRecompressMB;
	Ipp8u  iLastQP;
	Ipp32u uSaved_Skip_Run;

	Ipp8u bSeenFirstMB = false;

	Status status = UMC_OK;

	Ipp32u uNumMBs = core_enc->m_HeightInMBs * core_enc->m_WidthInMBs;
	Ipp32u uFirstMB = core_enc->m_field_index * uNumMBs;

	H264CurrentMacroblockDescriptorType* mbTab = new H264CurrentMacroblockDescriptorType[ uNumMBs ];
	for( unsigned int idx = 0 ; idx < uNumMBs; idx++ )
	{
		mbTab[ idx ].LocalMacroblockInfo = new H264MacroblockLocalInfo;
		mbTab[ idx ].LocalMacroblockPairInfo = new  H264MacroblockLocalInfo;
		mbTab[ idx ].GlobalMacroblockInfo = new H264MacroblockGlobalInfo;
		mbTab[ idx ].GlobalMacroblockPairInfo = new H264MacroblockGlobalInfo;
		mbTab[ idx ].MacroblockCoeffsInfo = new H264MacroblockCoeffsInfo;
		mbTab[ idx ].intra_types = new T_AIMode;
		mbTab[ idx ].MVs[0] = new H264MacroblockMVs;
		mbTab[ idx ].MVs[1] = new H264MacroblockMVs;
		mbTab[ idx ].MVs[2] = new H264MacroblockMVs;
		mbTab[ idx ].MVs[3] = new H264MacroblockMVs;
		mbTab[ idx ].RefIdxs[0] = new H264MacroblockRefIdxs;
		mbTab[ idx ].RefIdxs[1] = new H264MacroblockRefIdxs;
	}

	Ipp32s MBYAdjust = 0;
	if (core_enc->m_field_index)
	{
		MBYAdjust  = core_enc->m_HeightInMBs;
	}

	curr_slice->m_InitialOffset = core_enc->m_InitialOffsets[core_enc->m_pCurrentFrame->m_bottom_field_flag[core_enc->m_field_index]];
	curr_slice->m_is_cur_mb_field = core_enc->m_pCurrentFrame->m_PictureStructureForDec < FRM_STRUCTURE;
	curr_slice->m_is_cur_mb_bottom_field = core_enc->m_pCurrentFrame->m_bottom_field_flag[core_enc->m_field_index] == 1;

	curr_slice->m_use_transform_for_intra_decision = 1;

	// loop over all MBs in the picture
// MB Motion Estimation is done here
	for (Ipp32u uMB = uFirstMB; uMB < uFirstMB + uNumMBs; uMB++)
	{
		// Is this MB in the current slice?  If not, move on...
		if (core_enc->m_pCurrentFrame->m_mbinfo.mbs[uMB].slice_id != slice_num) 
		{
			continue;
		} 
		else if (!bSeenFirstMB) 
		{
			// Reset xpos and ypos in framedata struct
			// This is necessary because the same slice may be recoded multiple times.

			// reset intra MB counter per slice
			curr_slice->m_Intra_MB_Counter = 0;
			curr_slice->m_MB_Counter = 0;

			// Fill in the first mb in slice field in the slice header.
			curr_slice->m_first_mb_in_slice = is_first_mb ? 0 : uMB - uFirstMB;

			// Fill in the current deblocking filter parameters.
			curr_slice->m_slice_alpha_c0_offset = (Ipp8s)core_enc->m_info.deblocking_filter_alpha;
			curr_slice->m_slice_beta_offset = (Ipp8s)core_enc->m_info.deblocking_filter_beta;
			curr_slice->m_disable_deblocking_filter_idc =  core_enc->m_info.deblocking_filter_idc;
			curr_slice->m_cabac_init_idc = core_enc->m_info.cabac_init_idc;

			// Write a slice header
			H264ENC_MAKE_NAME(H264BsReal_PutSliceHeader)(
				pBitstream,
				core_enc->m_SliceHeader,
				core_enc->m_PicParamSet,
				core_enc->m_SeqParamSet,
				core_enc->m_PicClass,
				curr_slice);
			bSeenFirstMB = true;

			// Fill in the correct value for m_iLastXmittedQP, used to correctly code
			// the per MB QP Delta
			curr_slice->m_iLastXmittedQP = core_enc->m_PicParamSet.pic_init_qp + curr_slice->m_slice_qp_delta;
			Ipp32s SliceQPy = curr_slice->m_iLastXmittedQP;

			if (core_enc->m_info.entropy_coding_mode)
			{
				if (slice_type==INTRASLICE)
					H264ENC_MAKE_NAME(H264BsReal_InitializeContextVariablesIntra_CABAC)(
					pBitstream,
					SliceQPy);
				else
					H264ENC_MAKE_NAME(H264BsReal_InitializeContextVariablesInter_CABAC)(
					pBitstream,
					SliceQPy,
					curr_slice->m_cabac_init_idc);
			}

			// Initialize the MB skip run counter
			curr_slice->m_uSkipRun = 0;
		}

		cur_mb.lambda = lambda_sq[curr_slice->m_iLastXmittedQP];
		cur_mb.uMB = uMB;
		cur_mb.chroma_format_idc = core_enc->m_PicParamSet.chroma_format_idc;
		cur_mb.mbPtr = core_enc->m_pCurrentFrame->m_pYPlane + core_enc->m_pMBOffsets[uMB].uLumaOffset[core_enc->m_is_cur_pic_afrm][curr_slice->m_is_cur_mb_field];
		cur_mb.mbPitchPixels =  core_enc->m_pCurrentFrame->m_pitchPixels << curr_slice->m_is_cur_mb_field;
		cur_mb.uMBx = uMB % core_enc->m_WidthInMBs;
		cur_mb.uMBy = uMB / core_enc->m_WidthInMBs - MBYAdjust;
		H264ENC_MAKE_NAME(H264CoreEncoder_UpdateCurrentMBInfo)(state, curr_slice);
		cur_mb.lumaQP = getLumaQP(cur_mb.LocalMacroblockInfo->QP, core_enc->m_PicParamSet.bit_depth_luma);
		cur_mb.lumaQP51 = getLumaQP51(cur_mb.LocalMacroblockInfo->QP, core_enc->m_PicParamSet.bit_depth_luma);
		cur_mb.chromaQP = getChromaQP(cur_mb.LocalMacroblockInfo->QP, core_enc->m_PicParamSet.chroma_qp_index_offset, core_enc->m_SeqParamSet.bit_depth_chroma);
		pSetMB8x8TSFlag(curr_slice->m_cur_mb.GlobalMacroblockInfo, 0);
		curr_slice->m_MB_Counter++;
		H264BsBase_GetState(&pBitstream->m_base, &pStartBits, &uStartBitOffset);
		iLastQP = curr_slice->m_iLastXmittedQP;
		uSaved_Skip_Run = curr_slice->m_uSkipRun;   // To restore it if we recompress
		uUsePCM = 0;    // Don't use the PCM mode initially.
		do 
		{    // this is to recompress MBs that are too big.
			H264ENC_MAKE_NAME(H264CoreEncoder_MB_Decision)(state, curr_slice, uMB);

			mbTab[ uMB-uFirstMB ].uMB = cur_mb.uMB;
			mbTab[ uMB-uFirstMB ].uMBpair = cur_mb.uMBpair;
			mbTab[ uMB-uFirstMB ].uMBx = cur_mb.uMBx;
			mbTab[ uMB-uFirstMB ].uMBy = cur_mb.uMBy;
			mbTab[ uMB-uFirstMB ].mbPtr = cur_mb.mbPtr;
			mbTab[ uMB-uFirstMB ].mbPitchPixels = cur_mb.mbPitchPixels;
			mbTab[ uMB-uFirstMB ].lambda = cur_mb.lambda;
			mbTab[ uMB-uFirstMB ].chroma_format_idc = cur_mb.chroma_format_idc;
			mbTab[ uMB-uFirstMB ].lumaQP = cur_mb.lumaQP;
			mbTab[ uMB-uFirstMB ].lumaQP51 = cur_mb.lumaQP51;
			mbTab[ uMB-uFirstMB ].chromaQP = cur_mb.chromaQP;
			memcpy( mbTab[ uMB-uFirstMB ].LocalMacroblockInfo, cur_mb.LocalMacroblockInfo, sizeof(H264MacroblockLocalInfo) );
			memcpy( mbTab[ uMB-uFirstMB ].LocalMacroblockPairInfo, cur_mb.LocalMacroblockPairInfo, sizeof(H264MacroblockLocalInfo) );
			memcpy( mbTab[ uMB-uFirstMB ].GlobalMacroblockInfo, cur_mb.GlobalMacroblockInfo, sizeof(H264MacroblockGlobalInfo) );
			memcpy( mbTab[ uMB-uFirstMB ].GlobalMacroblockPairInfo, cur_mb.GlobalMacroblockPairInfo, sizeof(H264MacroblockGlobalInfo) );
			memcpy( mbTab[ uMB-uFirstMB ].MacroblockCoeffsInfo, cur_mb.MacroblockCoeffsInfo, sizeof(H264MacroblockCoeffsInfo) );
			mbTab[ uMB-uFirstMB ].m_uIntraCBP4x4 = cur_mb.m_uIntraCBP4x4;
			memcpy( mbTab[ uMB-uFirstMB ].m_iNumCoeffs4x4, cur_mb.m_iNumCoeffs4x4, 16*sizeof(Ipp32s) );
			memcpy( mbTab[ uMB-uFirstMB ].m_iLastCoeff4x4, cur_mb.m_iLastCoeff4x4, 16*sizeof(Ipp32s) );
			mbTab[ uMB-uFirstMB ].m_uIntraCBP8x8 = cur_mb.m_uIntraCBP8x8;
			memcpy( mbTab[ uMB-uFirstMB ].m_iNumCoeffs8x8, cur_mb.m_iNumCoeffs8x8, 16*sizeof(Ipp32s) );
			memcpy( mbTab[ uMB-uFirstMB ].m_iLastCoeff8x8, cur_mb.m_iLastCoeff8x8, 16*sizeof(Ipp32s) );
			memcpy( mbTab[ uMB-uFirstMB ].intra_types, cur_mb.intra_types, sizeof(T_AIMode) );
			mbTab[ uMB-uFirstMB ].mb4x4 = cur_mb.mb4x4;
			mbTab[ uMB-uFirstMB ].mb8x8 = cur_mb.mb8x8;
			mbTab[ uMB-uFirstMB ].mb16x16 = cur_mb.mb16x16;
			mbTab[ uMB-uFirstMB ].mbInter = cur_mb.mbInter;
			mbTab[ uMB-uFirstMB ].mbChromaInter = cur_mb.mbChromaInter;
			mbTab[ uMB-uFirstMB ].mbChromaIntra = cur_mb.mbChromaIntra;

			memcpy( mbTab[ uMB-uFirstMB ].MVs[0], cur_mb.MVs[0], sizeof(H264MacroblockMVs) );
			memcpy( mbTab[ uMB-uFirstMB ].MVs[1], cur_mb.MVs[1], sizeof(H264MacroblockMVs) );
			memcpy( mbTab[ uMB-uFirstMB ].MVs[2], cur_mb.MVs[2], sizeof(H264MacroblockMVs) );
			memcpy( mbTab[ uMB-uFirstMB ].MVs[3], cur_mb.MVs[3], sizeof(H264MacroblockMVs) );
			memcpy( mbTab[ uMB-uFirstMB ].RefIdxs[0], cur_mb.RefIdxs[0], sizeof(H264MacroblockRefIdxs) );
			memcpy( mbTab[ uMB-uFirstMB ].RefIdxs[1], cur_mb.RefIdxs[1], sizeof(H264MacroblockRefIdxs) );
			mbTab[ uMB-uFirstMB ].MacroblockNeighbours = cur_mb.MacroblockNeighbours;
			mbTab[ uMB-uFirstMB ].BlockNeighbours = cur_mb.BlockNeighbours;

			uRecompressMB = 0;
		} while (uRecompressMB);        // End of the MB recompression loop.
	}

	bSeenFirstMB = false;





	// loop over all MBs in the picture
// encoding is done here
	for (Ipp32u uMB = uFirstMB; uMB < uFirstMB + uNumMBs; uMB++)
	{
		// Is this MB in the current slice?  If not, move on...
		if (core_enc->m_pCurrentFrame->m_mbinfo.mbs[uMB].slice_id != slice_num) 
		{
			continue;
		} 
		else if (!bSeenFirstMB) 
		{
			// Reset xpos and ypos in framedata struct
			// This is necessary because the same slice may be recoded multiple times.

			// reset intra MB counter per slice
			curr_slice->m_Intra_MB_Counter = 0;
			curr_slice->m_MB_Counter = 0;

			// Fill in the first mb in slice field in the slice header.
			curr_slice->m_first_mb_in_slice = is_first_mb ? 0 : uMB - uFirstMB;

			// Fill in the current deblocking filter parameters.
			curr_slice->m_slice_alpha_c0_offset = (Ipp8s)core_enc->m_info.deblocking_filter_alpha;
			curr_slice->m_slice_beta_offset = (Ipp8s)core_enc->m_info.deblocking_filter_beta;
			curr_slice->m_disable_deblocking_filter_idc =  core_enc->m_info.deblocking_filter_idc;
			curr_slice->m_cabac_init_idc = core_enc->m_info.cabac_init_idc;

			bSeenFirstMB = true;

			// Fill in the correct value for m_iLastXmittedQP, used to correctly code
			// the per MB QP Delta
			curr_slice->m_iLastXmittedQP = core_enc->m_PicParamSet.pic_init_qp + curr_slice->m_slice_qp_delta;
			Ipp32s SliceQPy = curr_slice->m_iLastXmittedQP;

			if (core_enc->m_info.entropy_coding_mode)
			{
				if (slice_type==INTRASLICE)
					H264ENC_MAKE_NAME(H264BsReal_InitializeContextVariablesIntra_CABAC)(
					pBitstream,
					SliceQPy);
				else
					H264ENC_MAKE_NAME(H264BsReal_InitializeContextVariablesInter_CABAC)(
					pBitstream,
					SliceQPy,
					curr_slice->m_cabac_init_idc);
			}

			// Initialize the MB skip run counter
			curr_slice->m_uSkipRun = 0;
		}

		cur_mb.lambda = lambda_sq[curr_slice->m_iLastXmittedQP];
		cur_mb.uMB = uMB;
		cur_mb.chroma_format_idc = core_enc->m_PicParamSet.chroma_format_idc;
		cur_mb.mbPtr = core_enc->m_pCurrentFrame->m_pYPlane + core_enc->m_pMBOffsets[uMB].uLumaOffset[core_enc->m_is_cur_pic_afrm][curr_slice->m_is_cur_mb_field];
		cur_mb.mbPitchPixels =  core_enc->m_pCurrentFrame->m_pitchPixels << curr_slice->m_is_cur_mb_field;
		cur_mb.uMBx = uMB % core_enc->m_WidthInMBs;
		cur_mb.uMBy = uMB / core_enc->m_WidthInMBs - MBYAdjust;
		H264ENC_MAKE_NAME(H264CoreEncoder_UpdateCurrentMBInfo)(state, curr_slice);
		cur_mb.lumaQP = getLumaQP(cur_mb.LocalMacroblockInfo->QP, core_enc->m_PicParamSet.bit_depth_luma);
		cur_mb.lumaQP51 = getLumaQP51(cur_mb.LocalMacroblockInfo->QP, core_enc->m_PicParamSet.bit_depth_luma);
		cur_mb.chromaQP = getChromaQP(cur_mb.LocalMacroblockInfo->QP, core_enc->m_PicParamSet.chroma_qp_index_offset, core_enc->m_SeqParamSet.bit_depth_chroma);
		pSetMB8x8TSFlag(curr_slice->m_cur_mb.GlobalMacroblockInfo, 0);
		curr_slice->m_MB_Counter++;
		H264BsBase_GetState(&pBitstream->m_base, &pStartBits, &uStartBitOffset);
		iLastQP = curr_slice->m_iLastXmittedQP;
		uSaved_Skip_Run = curr_slice->m_uSkipRun;   // To restore it if we recompress
		uUsePCM = 0;    // Don't use the PCM mode initially.
		do 
		{    // this is to recompress MBs that are too big.

// we restore cur_mb
			cur_mb.uMB = mbTab[ uMB-uFirstMB ].uMB;
			cur_mb.uMBpair = mbTab[ uMB-uFirstMB ].uMBpair;
			cur_mb.uMBx = mbTab[ uMB-uFirstMB ].uMBx;
			cur_mb.uMBy = mbTab[ uMB-uFirstMB ].uMBy;
			cur_mb.mbPtr = mbTab[ uMB-uFirstMB ].mbPtr;
			cur_mb.mbPitchPixels = mbTab[ uMB-uFirstMB ].mbPitchPixels;
			cur_mb.lambda = mbTab[ uMB-uFirstMB ].lambda;
			cur_mb.chroma_format_idc = mbTab[ uMB-uFirstMB ].chroma_format_idc;
			cur_mb.lumaQP = mbTab[ uMB-uFirstMB ].lumaQP;
			cur_mb.lumaQP51 = mbTab[ uMB-uFirstMB ].lumaQP51;
			cur_mb.chromaQP = mbTab[ uMB-uFirstMB ].chromaQP;
			memcpy( cur_mb.LocalMacroblockInfo, mbTab[ uMB-uFirstMB ].LocalMacroblockInfo, sizeof(H264MacroblockLocalInfo) );
			memcpy( cur_mb.LocalMacroblockPairInfo, mbTab[ uMB-uFirstMB ].LocalMacroblockPairInfo, sizeof(H264MacroblockLocalInfo) );
			memcpy( cur_mb.GlobalMacroblockInfo, mbTab[ uMB-uFirstMB ].GlobalMacroblockInfo, sizeof(H264MacroblockGlobalInfo) );
			memcpy( cur_mb.GlobalMacroblockPairInfo, mbTab[ uMB-uFirstMB ].GlobalMacroblockPairInfo, sizeof(H264MacroblockGlobalInfo) );
			memcpy( cur_mb.MacroblockCoeffsInfo, mbTab[ uMB-uFirstMB ].MacroblockCoeffsInfo, sizeof(H264MacroblockCoeffsInfo) );
			cur_mb.m_uIntraCBP4x4 = mbTab[ uMB-uFirstMB ].m_uIntraCBP4x4;
			memcpy( cur_mb.m_iNumCoeffs4x4, mbTab[ uMB-uFirstMB ].m_iNumCoeffs4x4, 16*sizeof(Ipp32s) );
			memcpy( cur_mb.m_iLastCoeff4x4, mbTab[ uMB-uFirstMB ].m_iLastCoeff4x4, 16*sizeof(Ipp32s) );
			cur_mb.m_uIntraCBP8x8 = mbTab[ uMB-uFirstMB ].m_uIntraCBP8x8;
			memcpy( cur_mb.m_iNumCoeffs8x8, mbTab[ uMB-uFirstMB ].m_iNumCoeffs8x8, 16*sizeof(Ipp32s) );
			memcpy( cur_mb.m_iLastCoeff8x8, mbTab[ uMB-uFirstMB ].m_iLastCoeff8x8, 16*sizeof(Ipp32s) );
			memcpy( cur_mb.intra_types, mbTab[ uMB-uFirstMB ].intra_types, sizeof(T_AIMode) );
			cur_mb.mb4x4 = mbTab[ uMB-uFirstMB ].mb4x4;
			cur_mb.mb8x8 = mbTab[ uMB-uFirstMB ].mb8x8;
			cur_mb.mb16x16 = mbTab[ uMB-uFirstMB ].mb16x16;
			cur_mb.mbInter = mbTab[ uMB-uFirstMB ].mbInter;
			cur_mb.mbChromaInter = mbTab[ uMB-uFirstMB ].mbChromaInter;
			cur_mb.mbChromaIntra = mbTab[ uMB-uFirstMB ].mbChromaIntra;

			memcpy( cur_mb.MVs[0], mbTab[ uMB-uFirstMB ].MVs[0], sizeof(H264MacroblockMVs) );
			memcpy( cur_mb.MVs[1], mbTab[ uMB-uFirstMB ].MVs[1], sizeof(H264MacroblockMVs) );
			memcpy( cur_mb.MVs[2], mbTab[ uMB-uFirstMB ].MVs[2], sizeof(H264MacroblockMVs) );
			memcpy( cur_mb.MVs[3], mbTab[ uMB-uFirstMB ].MVs[3], sizeof(H264MacroblockMVs) );
			memcpy( cur_mb.RefIdxs[0], mbTab[ uMB-uFirstMB ].RefIdxs[0], sizeof(H264MacroblockRefIdxs) );
			memcpy( cur_mb.RefIdxs[1], mbTab[ uMB-uFirstMB ].RefIdxs[1], sizeof(H264MacroblockRefIdxs) );
			cur_mb.MacroblockNeighbours = mbTab[ uMB-uFirstMB ].MacroblockNeighbours;
			cur_mb.BlockNeighbours = mbTab[ uMB-uFirstMB ].BlockNeighbours;

			Ipp32s mb_bits;
			Ipp32s bit_offset;
			if (core_enc->m_PicParamSet.entropy_coding_mode) 
			{
				bit_offset = pBitstream->m_base.m_nReadyBits;
				if (pBitstream->m_base.m_nReadyBits == 9) bit_offset = 8;
			}
			// Code the macroblock, all planes
			cur_mb.LocalMacroblockInfo->cbp_bits = 0;
			cur_mb.LocalMacroblockInfo->cbp_bits_chroma = 0;
			uSaved_Skip_Run = curr_slice->m_uSkipRun;
			H264ENC_MAKE_NAME(H264CoreEncoder_CEncAndRecMB)(state, curr_slice);

			mb_bits = 0;
			status = H264ENC_MAKE_NAME(H264CoreEncoder_Put_MB_Real)(state, curr_slice);
			if (status != UMC_OK) 
				goto done;

			Ipp8u *pEndBits;
			Ipp32u uEndBitOffset;
			H264BsBase_GetState(&pBitstream->m_base, &pEndBits, &uEndBitOffset);

			mb_bits += (Ipp32u) (pEndBits - pStartBits)*8;
			if (uEndBitOffset >= uStartBitOffset)
				mb_bits += uEndBitOffset - uStartBitOffset;
			else
				mb_bits -= uStartBitOffset - uEndBitOffset;

			// Should not recompress for CABAC
			if (!core_enc->m_PicParamSet.entropy_coding_mode && (mb_bits > MB_RECODE_THRESH) && core_enc->m_info.rate_controls.method == H264_RCM_QUANT)
			{
				// OK, this is bad, it's not compressing very much!!!
				// TBD: Tune this decision to QP...  Higher QPs will progressively trash PSNR,
				// so if they are still using a lot of bits, then PCM coding is extra attractive.

				// We're going to be recoding this MB, so reset some stuff.
				H264BsBase_SetState(&pBitstream->m_base, pStartBits, uStartBitOffset);
				// Zero out unused bits in buffer before OR in next op
				// This removes dependency on buffer being zeroed out.
				*pStartBits = (Ipp8u)((*pStartBits >> (8-uStartBitOffset)) << (8-uStartBitOffset));

				curr_slice->m_iLastXmittedQP = iLastQP; // Restore the last xmitted QP
				curr_slice->m_uSkipRun = uSaved_Skip_Run;   // Restore the skip run

				// If the QP has only been adjusted up 0 or 1 times, and QP != 51
				if (((cur_mb.LocalMacroblockInfo->QP -
					core_enc->m_PicParamSet.pic_init_qp + curr_slice->m_slice_qp_delta) < 2) &&
					(cur_mb.LocalMacroblockInfo->QP != 51))
				{
					// Quantize more and try again!
					cur_mb.LocalMacroblockInfo->QP++;
					uRecompressMB = 1;
				} 
				else 
				{
					// Code this block as a PCM MB next time around.
					uUsePCM = 1;
					uRecompressMB = 0;
					// Reset the MB QP value to the "last transmitted QP"
					// Since no DeltaQP will be transmitted for a PCM block
					// This is important, since the Loop Filter will use the
					// this value in filtering this MB
					cur_mb.LocalMacroblockInfo->QP = curr_slice->m_iLastXmittedQP;
				}

			} 
			else
			{
				uRecompressMB = 0;
			}
		} while (uRecompressMB);        // End of the MB recompression loop.

		// If the above MB encoding failed to efficiently predict the MB, then
		// code it as raw pixels using the mb_type = PCM
		if (uUsePCM)
		{
			cur_mb.GlobalMacroblockInfo->mbtype = MBTYPE_PCM;
			cur_mb.LocalMacroblockInfo->cbp_luma = 0xffff;

			memset(cur_mb.MacroblockCoeffsInfo->numCoeff, 16, 24);

			Ipp32s  k;     // block number, 0 to 15
			for (k = 0; k < 16; k++) {
				cur_mb.intra_types = 2;
				cur_mb.MVs[LIST_0]->MotionVectors = null_mv;
				cur_mb.MVs[LIST_1]->MotionVectors = null_mv;
				cur_mb.RefIdxs[LIST_0]->RefIdxs = -1;
				cur_mb.RefIdxs[LIST_1]->RefIdxs = -1;
			}

			H264ENC_MAKE_NAME(H264CoreEncoder_Put_MBHeader_Real)(state, curr_slice);   // PCM values are written in the MB Header.
		}

		if (core_enc->m_PicParamSet.entropy_coding_mode){
			H264ENC_MAKE_NAME(H264BsReal_EncodeFinalSingleBin_CABAC)(
				pBitstream,
				(uMB == uFirstMB + uNumMBs - 1) ||
				(core_enc->m_pCurrentFrame->m_mbinfo.mbs[uMB + 1].slice_id != slice_num));
			H264ENC_MAKE_NAME(H264CoreEncoder_ReconstuctCBP)(&cur_mb);
		}
	}   // loop over MBs

#ifndef NO_FINAL_SKIP_RUN
	// Check if the last N MBs were skip blocks.  If so, write a final skip run
	// NOTE!  This is _optional_.  The encoder is not required to do this, and
	// decoders need to be able to handle it either way.

	// Even though skip runs are not written for I Slices, m_uSkipRun can only be
	// non-zero for non-I slices, so the following test is OK.
	if (curr_slice->m_uSkipRun !=0 && core_enc->m_info.entropy_coding_mode==0) {
		H264ENC_MAKE_NAME(H264BsReal_PutVLCCode)(pBitstream, curr_slice->m_uSkipRun);
	}

#endif // NO_FINAL_SKIP_RUN

	// save the frame class

done:
	if (core_enc->m_PicParamSet.entropy_coding_mode) {
		H264ENC_MAKE_NAME(H264BsReal_TerminateEncode_CABAC)(pBitstream);
	}
	else {
		H264BsBase_WriteTrailingBits(&pBitstream->m_base);
	}

	return status;

}[/cpp]

Do you have an idea of what is wrong ?

Re: Speeding up H264 encoding with GPU

andrewk88 — Tue, 17 Feb 2009 14:50:07 GMT

>>My problem is there, I'm stuck doing it : when I separate ME and encoding, the final encoded video is not as >>expected (it's all grey...).

Regardless of your code, I'm afraid, there is a conceptual problem with your approach - you can't seperate "ME and encoding" due to the MB mode decision aimingatminimization ofthe encoding cost usually expressed as "D+lambda*R" to "optimize" a video quality loss vs. number of bits used for encoding. In other words,if youwant to encode a current MB in a reasonable/"optimal way"you should take a look atwhat is a modeof previous MB that has already been encoded, in particular a value of MV predictor.
Hope it helps,

AndrewK

PS.I'd reccomend to read some literature regarding the H.264 inter mode decision. Depending on what is exactly your ME algorithm youmightbe ableto"offload a low-level processing intensive part of ME" to GPU, but rather try to keep "encoding part" and mode decision on a single CPU / multicore with a shared memory. Alternatively, you might attempta multi-slice encoding but it has its own challenges.

Re: Speeding up H264 encoding with GPU

chrisdo — Tue, 17 Feb 2009 17:15:06 GMT

Quoting - andrewk88

Regardless of your code, I'm afraid, there is a conceptual problem with your approach - you can't seperate "ME and encoding" due to the MB mode decision aimingatminimization ofthe encoding cost usually expressed as "D+lambda*R" to "optimize" a video quality loss vs. number of bits used for encoding. In other words,if youwant to encode a current MB in a reasonable/"optimal way"you should take a look atwhat is a modeof previous MB that has already been encoded, in particular a value of MV predictor.
Hope it helps,

AndrewK

PS.I'd reccomend to read some literature regarding the H.264 inter mode decision. Depending on what is exactly your ME algorithm youmightbe ableto"offload a low-level processing intensive part of ME" to GPU, but rather try to keep "encoding part" and mode decision on a single CPU / multicore with a shared memory. Alternatively, you might attempta multi-slice encoding but it has its own challenges.

Hello Andrew, thanks for your answer!

It seems I was a little too optimistic... (and not enough documented) :D

Do you think I could enhance something doing SAD calculation for the macroblocks of a frame on the GPU before doing the real ME and encoding ?

Re: Speeding up H264 encoding with GPU

andrewk88 — Wed, 18 Feb 2009 14:25:34 GMT

Quoting - chrisdo

Definitely, you could do a full pel SADs on the GPU assuming some knowledge of a "promising" Motion Vectors candidates/predictors unless you can afford doing the Full Search on the GPU and sendinga portion of the bestSADs back to CPU to make a MB mode decision there. Depending on your ME algorithm, very often those MVs candidates/predictors are derived based on previously encoded MBs, so you'd need to establish a CPU-GPU communication protocol on a MB processing basis andI'm not sure if that is your intention and how beneficial that might be vs. processing& communcation on a frame level basis (half pel interpolationwould bea good candidate to be offloaded on a frame basis)

Best regards,

Andrew

Re: Speeding up H264 encoding with GPU

Priya_Natarajan — Mon, 23 Mar 2009 21:51:26 GMT

Quoting - andrewk88

Refer to the example/explanation in this article :
http://ieeexplore.org/stamp/stamp.jsp?tp=&arnumber=424284972

Definitely, you could do a full pel SADs on the GPU assuming some knowledge of a "promising" Motion Vectors candidates/predictors unless you can afford doing the Full Search on the GPU and sendinga portion of the bestSADs back to CPU to make a MB mode decision there. Depending on your ME algorithm, very often those MVs candidates/predictors are derived based on previously encoded MBs, so you'd need to establish a CPU-GPU communication protocol on a MB processing basis andI'm not sure if that is your intention and how beneficial that might be vs. processing& communcation on a frame level basis (half pel interpolationwould bea good candidate to be offloaded on a frame basis)

Best regards,

Andrew

topic Re: Speeding up H264 encoding with GPU in Intel® Integrated Performance Primitives

Speeding up H264 encoding with GPU

Re: Speeding up H264 encoding with GPU

Re: Speeding up H264 encoding with GPU

Re: Speeding up H264 encoding with GPU

Re: Speeding up H264 encoding with GPU