I've written some code to do a performance analysis on decoding H264 streams (60 fps, 1080p, 20 Mb/s) and noticed a discrepancy between a Haswell box and a Coffee Lake (8400) box. When I aggregate the results, the Haswell decodes at around 6 ms per frame (the distribution is not fat-tailed, so this is a meaningful average across 17,000 frames). The exact same setup on the Coffee Lake box averages around 16 ms per frame.
I managed to narrow the delay down to the SyncOperation I'm doing on the pipeline. For example:
Frame Wait (ms)
[851] 16
[852] 1
[853] 16
[854] 17
[855] 1
[856] 16
[857] 17
[858] 1
[859] 16
[860] 17
[861] 1
[862] 16
[863] 17
[864] 1
[865] 16
[866] 16
[867] 1
[868] 16
[869] 17
[870] 1
[871] 16
The pipeline is essentially:
DecodeFrameAsync -> RunFrameVPPAsync -> SyncOperation -> Lock -> Memcpy -> Unlock
I'm using the D3D11 surface allocator and doing an NV12 -> YUY2 VPP conversion on the image format. The 16/17 ms blocking in SyncOperation is suspiciously close to a 60 fps frame period, but also suspiciously close to Windows scheduler shenanigans. It's not a spinlock, because I see no issue with CPU usage.
I wonder if anyone has any insight into what may be going on here? Note that in my setup I'm using Quick Sync for video encode/decode but a discrete card for display.
- Tags:
- Development Tools
- Graphics
- Intel® Media SDK
- Intel® Media Server Studio
- Media Processing
- Optimization
Does it happen with the Intel samples?
Hi Robinson,
Agree with Bruno: if you can reproduce it with the Intel MSDK samples, could you post the command line?
Mark
It doesn't seem to, no. However, sample_decode (and all of the examples, really) is a somewhat tangled mess, so it's very difficult to follow what's actually going on. I put a stopwatch in there and the SyncOperation is mostly 0. I will say that you always point people towards the samples to fix issues, and that's perfectly fine, but the samples are kind of a dog's breakfast. It wouldn't surprise me if many people just give up and use something else when they encounter problems.
Anyway I can see a difference in the way surfaces are synchronised that seems to have changed through versions of the SDK. There's now an mfxSyncPoint with each individual surface which I don't remember from before. At least it looks different to the SDK from a few years ago when I originally developed the codebase I'm working with today.
I think what I'll do is investigate correct use of mfxSyncPoint and SyncOperation. I propose to do something like the following:
Fetch a surface from my decode surface pool
    If no available surface
        SyncOperation on the oldest in-use surface's sync point
        If that fails
            game over
        Else
            return the frame to the pool and fetch again
DecodeFrameAsync
Fetch a surface from my VPP surface pool
    If no available surface
        SyncOperation on the oldest in-use surface's sync point
        If that fails
            game over
        Else
            Lock, Memcpy, Unlock, return the frame to the pool and fetch again
RunFrameVPPAsync
So I spent most of today and a good deal of yesterday refactoring my test project to see what was up. I changed it to be more along the lines of sample_decode, keeping a pool of surfaces to be synchronised and only doing so when I've run out of surfaces to submit with (sample_decode is needlessly complex in this respect, at least compared to my fairly minimal version). My decode timings (each time is elapsed time for 60 frames decoded divided by 60) are:
2.3
2.6
2.2
2.0
2.1
2.2
2.0
1.9
2.4
2.0
2.3
2.7
2.6
2.5
2.5
2.7
2.5
2.5
2.5
2.5
This is much more like it. However I've now got another weird problem concerning MFX_ERR_INCOMPATIBLE_VIDEO_PARAM that I just cannot get my head around. I think I'll post a new question for it though!
Thanks.
Thanks so much for the thorough investigation.
Although we would still appreciate a reproducer, I will check the sync usage you posted to see if I can help.
Mark
Hi Robinson,
Thanks so much for the good feedback; I have forwarded it to the dev team so they can understand more from a user's point of view.
Here is some advice I hope is useful to you:
Media SDK doesn't actually manage a surface pool. (Though that is a good idea…) All it has is a "Locked" field on each surface indicating whether Media SDK is currently using it.
The sync operation helps manage the asynchronous implementation of Media SDK: the state of the output surface or bitstream is unknown before the sync. Surfaces don't have a sync point; rather, each Media SDK operation (decode, VPP, encode) can have its own mfxSyncPoint.
The sequence is more like:
- get a free surface index for decode
- loop through decode's state machine; this may involve adding many new surfaces before one is ready to use, depending on what is in the bitstream
- get a free surface index for VPP
- loop through VPP's state machine; this may involve adding many new surfaces in cases like composition
- sync
- application gets access to the VPP output
The Lock, Memcpy, Unlock in your sequence is probably adding a lot of inefficiency. What may be better is to use one pool of video (or opaque) memory for decode and VPP, plus another pool of system memory for VPP output.
Understanding the sequence of steps from sample_decode is not easy. For now, the tutorials may be our best bet as an example of the recommended steps.
You can download the tutorial from this page.
Mark
Thanks Mark.
You've anticipated a question I was going to ask the other day (do surfaces have sync points?). I was under that impression from studying the samples, which seem to marry them together in certain structures. Yesterday I decided to scrap what I'd done and start over. I now just have a simple std::vector of unique_ptr to surfaces; to "get_next_free_surface" I just find one that's not locked. My main loop has a sync_point, and if it's not nullptr I SyncOperation on it when I've run out of surfaces.
The main loop is extremely simple (more developer guide than sample_decode). I think the samples hide the simplicity of the process. The developer guide/manual should explicitly state that the sync point refers to the component, not to a surface, though; that was where my original confusion arose.
Thanks again.