Need help in understanding surface pool management.

squ · ‎09-19-2012

Hi,

I am trying to understand how I should manage the output surface pool when running a decoding session. I have a Sandy Bridge processor (HW API 1.1 / SW API 1.4). In my application I'd like to decode frames into system memory and post-process the decoded frames. I have no intention of providing an external buffer/frame allocator, so I assume the SDK library will allocate and manage its own DPB buffers no matter which API (HW/SW) is used? I do not care about the performance penalty of copying from video memory to system memory, which I assume happens when the HW API is called.

Based on the information above, my question is: should I call MFXVideoDECODE_QueryIOSurf() to obtain the minimum number of buffers and allocate surfaces accordingly? Since the SDK has its own DPB buffers, will it still lock my externally provided surface for a longer period if that surface happens to store an IDR frame?

I am running the sample_decode application and I do not fully understand the behavior below:

1. HW API: I can see that the external surface is only locked between the MFXVideoDECODE_DecodeFrameAsync() call and the MFXVideoCORE_SyncOperation() call. My take is that the lock only covers filling the surface. During the whole session, the same single external surface is always supplied. This seems to confirm my understanding above.

2. SW API: I can see that the external surfaces are locked in turn and that more than one surface is locked at a time. This seems to contradict my understanding above.

So is there any difference between calling the HW and SW APIs? Is the external surface pool used as the DPB buffer pool when calling the SW API?

Another question is about async depth. According to one article in the forum on video conferencing, it is recommended to set AsyncDepth to 1 and feed complete frames. This is understandable for conference video since B frames are not used.

But what about other main-profile videos where frame reordering definitely happens? Can I still set AsyncDepth to 1 when I want to pull decoded frames ASAP? If I can, what happens if I feed incomplete frames? Will the output be corrupted? I have actually set AsyncDepth to 1 and fed incomplete frames, and the output seems okay when calling the HW API.

I have encountered other API failures in my multithreaded application, such as the sync operation returning -1 (HW API) and a failure-to-lock-memory error (SW API), but I hope your clarification will show that those issues are caused by my incorrect management of the surface pool.

Regards,

Petter_L_Intel · ‎09-20-2012

Hi,

For the case you're pursuing, using system memory surfaces, you are not required to use an external allocator. If you look at "sample_decode" you can see that the "m_bExternalAlloc" flag is not set, and therefore there is no need to lock the surface before writing to file. The behavior is the same for both SW and HW decode (when using system memory surfaces).

Even though you are not using an external allocator, you must still allocate your own surface header structures (mfxFrameSurface1) and the actual frame surfaces required when calling DecodeFrameAsync. The number of surfaces you need is provided by the QueryIOSurf call.

Regarding AsyncDepth: for lowest latency it is recommended to set AsyncDepth to 1, which minimizes the number of internally cached frames. Feeding incomplete frames to DecodeFrameAsync is fine. The function will report that it needs more data until it is ready to deliver a frame.

Regards, Petter
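The surface pool handling described above can be sketched roughly as follows. This is a minimal illustration, not SDK code: `Surface`, `make_pool` and `find_free_surface` are hypothetical names, and the `locked` field only stands in for the `Data.Locked` counter of `mfxFrameSurface1`, which the library itself raises and lowers while it uses a surface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for mfxFrameSurface1. Only the lock counter and
// the system-memory buffer are modelled here.
struct Surface {
    std::uint16_t locked = 0;          // nonzero while the decoder still uses it
    std::vector<std::uint8_t> pixels;  // application-owned frame storage
};

// Allocate `count` surfaces of `bytes` each. In a real application `count`
// would come from the surface query and `bytes` from the aligned frame size.
std::vector<Surface> make_pool(std::size_t count, std::size_t bytes) {
    assert(count > 0);
    std::vector<Surface> pool(count);
    for (auto& s : pool) s.pixels.resize(bytes);
    return pool;
}

// Return the index of the first surface the decoder is not holding,
// or -1 when every surface in the pool is currently locked.
int find_free_surface(const std::vector<Surface>& pool) {
    for (std::size_t i = 0; i < pool.size(); ++i) {
        if (pool[i].locked == 0) return static_cast<int>(i);
    }
    return -1;
}
```

A decode loop would call `find_free_surface` before each submission and wait or retry when it returns -1.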

squ · ‎09-20-2012

Hi Petter,

Thanks for the clarification. With that in mind, I am puzzled by the error code -1 returned by the MFXVideoCORE_SyncOperation() call.

In my application, I have a decoding task D that uses the QS API. I have another post-processing task P, which processes the decoded frames from D. If I run D and P within the same thread, the program works perfectly fine, which might indicate that my calling sequence to the QS API is correct? However, once I place D and P into different threads and use a FIFO between them, I see -1 from the sync call.

Each time the decoding call is made, there is an attempt to grab a blank surface from the FIFO. If no blank surface is available, the D thread is blocked until one is available. Since D runs much faster than P, it is blocked almost immediately.

Interestingly, the point at which the sync error happens correlates very well with the FIFO size. If I allocate 2 surfaces in the FIFO, -1 is returned when syncing for the 2nd output frame. If I allocate 16 surfaces, -1 is returned when syncing around the 18th output frame. If I allocate 32 surfaces, -1 is returned around the 38th frame. It almost feels as if the media library does not like being blocked by the unavailability of an output surface, since it is running much faster than task P in the other thread.

Now the question is: what does -1 mean in the sync call? I have enabled tracing and attached a dump. You can see that up to the -1 return, everything is fine. No bitstream errors or decoding errors are reported. The FIFO size in this case is 16.

Regards,
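For readers unfamiliar with the arrangement described above, a blocking FIFO between a decode thread and a post-processing thread might look something like this. It is only a sketch under assumptions: `IndexFifo` is a hypothetical helper, it passes surface indices rather than surfaces, and the poster's actual implementation is not shown in the thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical bounded FIFO of surface indices, not part of the SDK.
// One instance can carry blank surfaces to the decode thread (D) and a
// second one can carry filled surfaces to the post-processing thread (P).
class IndexFifo {
public:
    explicit IndexFifo(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks the caller while the queue is full.
    void push(int index) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push_back(index);
        not_empty_.notify_one();
    }

    // Blocks the caller while the queue is empty, then returns the oldest index.
    int pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        int index = items_.front();
        items_.pop_front();
        not_full_.notify_one();
        return index;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::deque<int> items_;
};
```

With this layout D blocks in `pop()` on the blank-surface queue whenever P falls behind, which matches the behaviour described in the post.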

Petter_L_Intel · ‎09-24-2012

Hi,

I am not sure what the reason is for the SyncOperation error code you get. My guess is that there is something wrong with surface management, or that syncpoint handles are being lost in your code. The log unfortunately does not reveal much. I cannot see any call to "QueryIOSurf" in the log. Is there a reason you omitted querying for the number of surfaces recommended by the decoder?

In general there should not be much reason to run decode submission and sync in separate threads. Adding threading may also add unnecessary complexity. The approach used in "sample_decode", provided with the SDK, utilizes the HW to a very high degree even though it is synchronous. If you require more asynchronous behavior, you can take a look at "sample_encode", which uses tasks (still single threaded!) to achieve better performance than a purely synchronous approach.

Regards, Petter
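As a small debugging aid, numeric status codes can be mapped back to their names. The values below are a subset of the `mfxStatus` enum in `mfxdefs.h` as I recall it, so please verify them against the header shipped with your SDK version. `status_name` is a hypothetical helper, not an SDK function.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: translate a raw status value into its mfxStatus name.
// Only a subset of codes is listed; check mfxdefs.h for the full enum.
std::string status_name(int sts) {
    switch (sts) {
        case 0:   return "MFX_ERR_NONE";
        case -1:  return "MFX_ERR_UNKNOWN";         // the -1 seen from the sync call
        case -2:  return "MFX_ERR_NULL_PTR";
        case -4:  return "MFX_ERR_MEMORY_ALLOC";
        case -6:  return "MFX_ERR_INVALID_HANDLE";
        case -7:  return "MFX_ERR_LOCK_MEMORY";     // failure to lock memory
        case -10: return "MFX_ERR_MORE_DATA";       // decoder needs more bitstream
        case -11: return "MFX_ERR_MORE_SURFACE";    // decoder needs another surface
        case 1:   return "MFX_WRN_IN_EXECUTION";
        case 2:   return "MFX_WRN_DEVICE_BUSY";
        default:  return "unlisted status " + std::to_string(sts);
    }
}
```

Note that -1 is the generic "unknown" error, which is consistent with the log not revealing a more specific cause.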

squ · ‎09-24-2012

Hi Petter,

Unfortunately that's all I could dig up, and the return code together with the documentation is not very helpful. Is there any other debugging tool that I am unaware of?

The reason I did not call "QueryIOSurf" is that an external allocator was not used. As I mentioned in the original post, the library never asked for more than one surface when calling the HW API without an external allocator. Does it have to be called? I did not call it in the single-thread case and everything was fine.

To clarify, the decode submission and sync are in one thread. I have a post-processing task in another thread. I double-checked that the sync point was not lost: the submission and sync calls are literally several lines apart, with only return-code checks in between, just like in the sample_decode application.

Could you clarify what you meant by "which uses tasks (still single threaded!)"? According to the Developer's Guide, the session starts with thread-pool creation, and the scheduler manages the assignment of threads to tasks. Doesn't that mean tasks would run in multiple threads? I have my own thread pool in the application, and the calls to the QS API all happen in one (decoding) thread. Could my thread pool somehow interfere with the thread pool created during session initialization? For example, what if the decoding thread goes to sleep when it runs out of output buffers and is woken later when an empty output buffer becomes available? I would assume that should not be a problem?

Regards,

squ · ‎09-25-2012

Hi Petter,

Thanks a lot for your help. I am posting this in case it is helpful to other people: the root cause has been found. The design of my application is perfectly fine. However, there was a bug in my code: the surface was allocated with 1080 lines when I assumed it had 1088 lines.

I imagine that in the single-thread scenario, the D and P tasks are executed separately in time, so the two tasks never access the small overlapping region of 8 lines at the same time. In threaded mode, however, an exception occurs when the two threads access the same memory area. That is still only a theory. Once the surface allocation was fixed, the program worked perfectly.

Regards,
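The 1080 versus 1088 mismatch comes from height alignment. As a rough sketch, assuming the usual Media SDK convention that surface height is rounded up to a multiple of 16 for progressive content (32 for interlaced) and that the NV12 pitch equals the aligned width, the buffer size can be computed like this. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round `value` up to the next multiple of `alignment` (a power of two).
std::uint32_t align_up(std::uint32_t value, std::uint32_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Number of lines the surface buffer must cover for a given display height.
// Assumption: 16-line alignment for progressive, 32-line for interlaced.
std::uint32_t aligned_height(std::uint32_t crop_height, bool progressive) {
    return align_up(crop_height, progressive ? 16u : 32u);
}

// Bytes needed for one NV12 frame: a full-size luma plane plus a
// half-size interleaved chroma plane, assuming pitch == width.
std::size_t nv12_bytes(std::uint32_t width, std::uint32_t height) {
    return static_cast<std::size_t>(width) * height * 3 / 2;
}
```

For a 1920x1080 stream this gives 1088 lines and 3,133,440 bytes, against 3,110,400 bytes if only 1080 lines are allocated, so a buffer sized for 1080 lines is 23,040 bytes short.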

Petter_L_Intel · ‎09-25-2012

Hi,

Yes, when explicitly setting AsyncDepth to 1 and using system memory you will only need one decode surface. However, if in the future you decide to change to another surface type, or leave AsyncDepth at its default value, the number of required surfaces will change. This is why we recommend always calling QueryIOSurf to ensure that enough surfaces are allocated.

Unfortunately there is no tool that will give you more details beyond the error code you received.

Why do you sync between decode and VPP? This approach will impact performance. A more efficient approach is to chain decode and VPP together. I have attached a simplified code sample which does decode + VPP using system memory. Hopefully this will help you hunt down the issue.

Regarding your question about "still single threaded!": what I mean is that the application code is single threaded. Media SDK certainly does internal threading, but it is hidden from the developer. There are no known issues with regard to threading interference.

Regards, Petter
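When decode and VPP are chained, the decoder output pool is also the VPP input pool, so its size has to satisfy both components. The sketch below assumes the sizing rule I recall from Intel's tutorial samples (add the two suggested counts); the attached sample is not reproduced in this thread, so treat the rule as an assumption. `AllocRequest` is a hypothetical stand-in for the `NumFrameMin` and `NumFrameSuggested` fields of `mfxFrameAllocRequest`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two counts reported by a surface query.
struct AllocRequest {
    std::uint16_t num_frame_min;        // smallest pool the component accepts
    std::uint16_t num_frame_suggested;  // pool size recommended for performance
};

// Size of the pool shared by decoder output and VPP input.
// Assumption: summing both suggestions is sufficient for the chain.
std::uint32_t shared_pool_size(const AllocRequest& dec_out,
                               const AllocRequest& vpp_in) {
    assert(dec_out.num_frame_suggested >= dec_out.num_frame_min);
    assert(vpp_in.num_frame_suggested >= vpp_in.num_frame_min);
    return static_cast<std::uint32_t>(dec_out.num_frame_suggested) +
           vpp_in.num_frame_suggested;
}
```

The VPP output side would get its own, separately sized pool.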