I am developing a transcode application using Intel Media SDK. As suggested by the Link, all units should be asynchronous to get the best performance. I have gone through the Intel Media SDK examples, which demonstrate a single transcode in a single thread while ensuring the buffers between individual units are thread safe. But I doubt the same logic would work in an application where each unit is asynchronous.
So do I need to make a copy of the decoder's output surface and then feed it to VPP?
If yes, please explain why. Do we need to call SyncOperation before copying the data?
If no, please suggest a method that ensures the surface buffers between the decode, VPP and encode units are thread safe.
Happy New Year! If I understand your question, you want to create a Decode->VPP->Encode pipeline with asynchronous operations and good performance. We have a tutorial that implements exactly this pipeline with async operations and is well commented for understanding; I hope you will find it useful. If I misunderstood your question, please let me know and provide more details.
Link to download tutorials: https://software.intel.com/en-us/intel-media-server-studio-support/training
Specific tutorial I am referring to: simple_5_transcode_opaque_async_vppresize
Thanks for the reply. A very Happy New year to you too.
I went through this tutorial; it shows an example where the Decode->VPP->Encode calls happen in a single thread. I want Decode, VPP and Encode to live in separate threads so that each independently produces output whenever input is available. Is there any example or guide for doing this? If not, please guide me on how to maintain the frame buffers at the Decoder/VPP and VPP/Encode interfaces. The design I suggest should give better performance in cases where some tasks are scheduled on the CPU and others on the GPU.
Please give your suggestions, and if other members have any ideas about it, please share your inputs as well.
Rishab - Hope Nina's pointer is what you're looking for. FYI - You can find the samples package and some information of each of the samples here - https://software.intel.com/en-us/intel-media-server-studio-support/code-samples
Hello Nina and Sravanthi,
Thanks for the reply.
I went through sample_multi_transcode and have come to the following understanding of the design:
1) Each transcode thread is independent: it takes one stream as input and produces one stream as output.
2) For the decoder, DecodeFrameAsync is called; if sts == MFX_ERR_NONE, the output frame is sent to VPP immediately. On success, the VPP unit likewise sends its frame immediately to Encode.
3) If the Encode call returns MFX_ERR_MORE_DATA or MFX_ERR_NONE, the bitstream is read further and decode is called again.
4) If the VPP call returns MFX_ERR_MORE_DATA, the bitstream is read and decode is called again.
5) SyncOperation happens when encode has produced outputs equal to AsyncDepth and no free task is found.
So I conclude that each unit sends its generated output immediately to the next unit and always probes the previous unit for an input. Thus, even in steady state, each unit waits for the previous unit to generate an output.
Please correct me if my understanding is wrong anywhere.
The design I want is like the diagram below:
In the above design, the Decode unit hands its output to VPP and returns immediately to make the next Decode call. VPP and Encode follow the same pattern. This can happen only if Decode, VPP and Encode run in different threads and there is a buffer queue at each unit's interface.
So I want to confirm: do we have to implement each unit as a separate unit and use buffer copies for data transfer, or does Media SDK offer a method that avoids these buffer copies?
The way Media SDK operates is that each stage is assigned a pool of surface buffers to operate on (depending on the async depth and other parameters). The async calls operate on this surface pool, and when the pool is fully taken, they wait for a sync operation to release surfaces before processing the next frame.
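As a rough illustration of this surface-pool convention: the SDK keeps a lock count non-zero on each surface that an in-flight async operation still uses, and the application picks any surface whose lock count is zero (in the real SDK this is the Locked field of mfxFrameSurface1's Data; the struct below is a simplified stand-in, not the actual API type):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Simplified model of a video surface: only the lock counter is shown.
// In Media SDK the runtime increments/decrements the counter; the app
// must not reuse a surface while the counter is non-zero.
struct Surface {
    std::atomic<int> locked{0};
    // ... frame data would live here ...
};

// Returns the index of a free surface in the pool, or -1 if every
// surface is busy (the caller would then sync to drain in-flight work).
int GetFreeSurfaceIndex(std::vector<Surface>& pool) {
    for (std::size_t i = 0; i < pool.size(); ++i)
        if (pool[i].locked.load() == 0)
            return static_cast<int>(i);
    return -1;
}
```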
When you have a pipeline (Decode->VPP->Encode), the output of one stage is the input of the next stage, as you have noticed in our samples and tutorials. If you want to design your pipeline as described above, you can achieve that without buffer copies as long as you have enough surfaces allocated (so that the output surface of one stage can go as input to the next stage).
Note: You have to ensure that there is enough memory available for the surfaces from the decoder. This is because our decoder performance is much faster than VPP performance; by the time you have completed one frame of processing in VPP, you have many outstanding decoded surfaces waiting to be processed by VPP, so the decoder will run way ahead of the VPP and Encoder. You can parallelize the VPP and Encode stages (but not the decoder) so that the decoder does not run ahead. You also have to implement locks to ensure that no two stages are reading/writing the same buffer.
Good luck! Will update if we have more useful information for you.