I've written some code to do a performance analysis on decoding H264 streams (60 fps, 1080p, 20 Mb/s) and noticed a discrepancy between a Haswell box and a Coffee Lake (8400) box. When I aggregate the results, the Haswell decodes at around 6 ms per frame (the distribution is not fat-tailed, so this is a meaningful average across 17,000 frames). The exact same setup on the Coffee Lake box averages around 16 ms per frame.
I managed to narrow the delay down to the SyncOperation I'm doing on the pipeline. For example:
Frame Wait (ms)
[851] 16
[852] 1
[853] 16
[854] 17
[855] 1
[856] 16
[857] 17
[858] 1
[859] 16
[860] 17
[861] 1
[862] 16
[863] 17
[864] 1
[865] 16
[866] 16
[867] 1
[868] 16
[869] 17
[870] 1
[871] 16
The pipeline is essentially:
DecodeFrameAsync -> RunFrameVPPAsync -> SyncOperation -> Lock -> Memcpy -> Unlock
I'm using the D3D11 surface allocator and doing an NV12 -> YUY2 VPP conversion on the image format. The 16/17 ms blocking in SyncOperation is suspiciously close to a 60 fps frame period, but also suspiciously close to Windows scheduler shenanigans. It's not a spinlock, because I see no issue with CPU usage.
I wonder if anyone has any insight into what may be going on here? Note that in my setup I'm using Quick Sync for video encode/decode but a discrete card for display.
- Tags:
- Development Tools
- Graphics
- Intel® Media SDK
- Intel® Media Server Studio
- Media Processing
- Optimization
Does it happen with the Intel samples?
Hi Robinson,
Agree with Bruno: if you can reproduce it with the Intel MSDK samples, could you post the command line?
Mark
It doesn't seem to, no. However, sample_decode (and all of the examples, really) is a somewhat tangled mess, so it's very difficult to follow what's actually going on. I put a stopwatch in there and the SyncOperation is mostly 0. I will say that you always point people towards the samples to fix issues, and that's perfectly fine, but the samples are kind of a dog's breakfast. It wouldn't surprise me if many people just give up and use something else when they encounter problems.
Anyway I can see a difference in the way surfaces are synchronised that seems to have changed through versions of the SDK. There's now an mfxSyncPoint with each individual surface which I don't remember from before. At least it looks different to the SDK from a few years ago when I originally developed the codebase I'm working with today.
I think what I'll do is investigate correct use of mfxSyncPoint and SyncOperation. I propose to do something like the following:
Fetch a surface from my decode surface pool
    If no available surface
        SyncOperation on the oldest in-use surface's sync point
        If that fails
            game over
        Else
            return the frame to the pool and fetch again
DecodeFrameAsync
Fetch a surface from my VPP surface pool
    If no available surface
        SyncOperation on the oldest in-use surface's sync point
        If that fails
            game over
        Else
            Lock, Memcpy, Unlock, return the frame to the pool and fetch again
RunFrameVPPAsync
So I spent most of today and a good deal of yesterday refactoring my test project to see what was up. I changed it to be more along the lines of sample_decode, keeping a pool of surfaces to be synchronised and only doing so when I've run out of surfaces to submit with (sample_decode is needlessly complex in this respect, at least compared to my fairly minimal version). My decode timings (each time is elapsed time for 60 frames decoded divided by 60) are:
2.3
2.6
2.2
2.0
2.1
2.2
2.0
1.9
2.4
2.0
2.3
2.7
2.6
2.5
2.5
2.7
2.5
2.5
2.5
2.5
This is much more like it. However I've now got another weird problem concerning MFX_ERR_INCOMPATIBLE_VIDEO_PARAM that I just cannot get my head around. I think I'll post a new question for it though!
Thanks.
Thanks so much for the thorough investigation.
Although we would still appreciate a reproducer, I will check the sync usage you posted to see if I can help.
Mark
Hi Robinson,
Thanks so much for the good feedback; I have forwarded it to the dev team so they can understand more from a user's point of view.
Here is some advice I hope is useful to you:
Media SDK doesn't actually manage a surface pool. (Though that is a good idea…) All it has is a "Locked" field on each surface indicating whether Media SDK is currently using it.
The sync operation helps manage the asynchronous implementation of Media SDK: the state of the output surface or bitstream is unknown before the sync. Surfaces don't have a sync point; rather, each Media SDK operation (decode, VPP, encode) can have its own mfxSyncPoint.
The sequence is more like:
- get a free surface index for decode
- loop through decode's state machine; this may involve adding many new surfaces before one is ready to use, depending on what is in the bitstream
- get a free surface index for VPP
- loop through VPP's state machine; this may involve adding many new surfaces in cases like composition
- sync
- application gets access to the VPP output
The Lock, Memcpy, Unlock in your sequence is probably adding a lot of inefficiency. What may be better is to use one pool of video (or opaque) memory for decode and VPP, plus another pool of system memory for VPP output.
Understanding the sequence of steps from sample_decode is not easy. For now, the tutorials may be our best bet as an example of the recommended steps.
You can download the tutorial from this page.
Mark
Thanks Mark.
You've anticipated a question I was going to ask the other day (do surfaces have sync points?). I was under that impression from studying the samples, which seem to marry them together in certain structures. Yesterday I decided to scrap what I'd done and start over. I now just have a simple std::vector of unique_ptr to surfaces; to "get_next_free_surface" I just find one that's not locked. My main loop has a sync_point, and if it's not nullptr I SyncOperation on it when I've run out of surfaces.
The main loop is extremely simple (more developer guide than sample_decode). I think the samples hide the simplicity of the process. The developer guide/manual should explicitly state that the sync point refers to the component, not to a surface, though; that was where my original confusion arose.
Thanks again.