We're encoding three 20 Mbit/s streams (1280x720 at 60 fps) to H.264 and decoding six simultaneously on an 8400 CPU. This works reasonably well. Every few hours, however, we hit an MFX_ERR_DEVICE_FAILED error. After the error, QuickSync doesn't work until we reset and recreate all of our objects. We don't get this error with only two encodes and four decodes, or with one encode and two decodes.
Are there theoretical limits to the number of streams across different processors? It feels like some resource is being consumed and sometimes (randomly) is exhausted. But what? Apart from restarting (we'll drop frames during the restart of course), what mitigating action can we take?
Would the logging facility be of any use here, or would it just add to the burden the processor is under?
Thanks for the info. From your scenario, this looks like a real problem and it needs more investigation.
In general, I doubt the current logging method would work in this case, so we need another method. One way is to use the open source Media SDK; the other is to use our sample code. Before we move on, could you answer the following questions:
Which OS are you running on?
Which release are you using?
Are you running your own application or our samples?
In last night's soak test it fell over once while running two encodes and four decodes, but in the middle of the night, not after one hour. The bug report is from our release build system. We have a test project we intend to get up and running that we can hammer in the same way, but we won't be able to show results from that until sometime next week. Note that (if it's of any use) each stream encode and decode runs on a different thread. We do not share resources across threads. We are using D3D11 surfaces. We have a secondary NVIDIA GPU driving the displays (the Intel 630 is used "headless").
To answer your questions:
We aren't running your samples, we have our own codebase.
We're using Windows 10 1903.
The Intel Display Driver is 22.214.171.12462 (25/09/2019).
I understand you can be of limited help here unless we can reproduce in your samples and/or provide more information. I was just curious as to whether there are known issues.
I have posted a ticket anyway to see if the dev team can come up with any suggestions.
I know that in the open source Media SDK you can add MFX_DEBUG_TRACE to log information, but I am not sure how to rebuild the libraries on Windows.
Are you using 2019R1?
My guess is that the processor is an i5-8400 with UHD 630?
Sorry for the late reply. We've been trying out various things: firstly a mitigation (restart on error), and secondly we've updated from the 2018R2 SDK to the latest 2019R1. Since the update we've had no errors or restarts reported in the log. Early days, though; we'll have confidence if it lasts until Friday. The main difference here is that we're now using Coffee Lake rather than Skylake or Haswell. I don't know if that makes any difference, or if the issue is driver changes downstream requiring changes in the mfx DLL, etc. Either is possible (would love a peek at your internal bug tracker!).
Thanks so much,
Good to know it is resolved on Coffee Lake. I agree with you that this is not very solid, since the root cause was not identified.
I also checked the tags on the open source Media SDK, thinking that the changes between the two releases could give us a clue, but I can't find them.
Let me record it now in case something happens later.
It would be appreciated if you could give us the command line if you see the issue with our samples. My guess is you were using sample_multi_transcode?
So, it's not resolved :(. After moving to the new SDK, our PC running one encode and two decodes worked solidly over the weekend. However, the system we set up with 3 encodes and 6 decodes experienced very frequent driver crashes (a couple of hundred in that 48-hour period). We restart QuickSync whenever we detect the failure. We can see it fall over in Windows Event Viewer (graphics driver reset, or something like that). Do you have any test setups that can do continuous encode/decode with many streams on the same PC? We're encoding 1280x720 @ 60 fps at 20 Mbit/s. We encode to 1-second GOP files and decode the same.
I don't see how I can set up this kind of use in your sample application to give you a repro.
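For anyone reading along, the restart-on-failure mitigation mentioned above can be sketched roughly as follows. This is only an illustration, not our production code: `Status`, `Pipeline`, `RunStats` and `runWithRestart` are hypothetical names, and `Status::DeviceFailed` merely stands in for MFX_ERR_DEVICE_FAILED. In a real application the `processFrame` step would wrap the actual Media SDK encode or decode calls, and `createPipeline` would rebuild the session, device and surfaces from scratch.

```cpp
#include <functional>

// Hypothetical status codes standing in for the SDK's status values.
enum class Status { Ok, DeviceFailed, OtherError };

// Hypothetical wrapper around everything that must be recreated after a
// device failure (session, device, surfaces, codec component).
struct Pipeline {
    std::function<Status()> processFrame;  // one encode or decode step
};

struct RunStats {
    int framesDone = 0;   // frames processed successfully
    int restarts = 0;     // number of full pipeline rebuilds
    bool gaveUp = false;  // true if we stopped on an unrecoverable error
};

// Process frameCount frames. On a device failure, throw the old pipeline
// away and build a fresh one, up to maxRestarts times. Any other error, or
// a failure after the restart budget is spent, stops the loop.
RunStats runWithRestart(const std::function<Pipeline()>& createPipeline,
                        int frameCount, int maxRestarts) {
    RunStats stats;
    Pipeline pipeline = createPipeline();
    while (stats.framesDone < frameCount) {
        const Status s = pipeline.processFrame();
        if (s == Status::Ok) {
            ++stats.framesDone;
            continue;
        }
        if (s == Status::DeviceFailed && stats.restarts < maxRestarts) {
            ++stats.restarts;
            pipeline = createPipeline();  // old objects are released here
            continue;                     // next step runs on the new pipeline
        }
        stats.gaveUp = true;
        break;
    }
    return stats;
}
```

Capping the number of restarts is a design choice in this sketch: with a couple of hundred driver resets in 48 hours, an unbounded loop would hide how bad things are, so it seems better to count restarts and surface them in the log.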
Are you running them in the same process or in different processes?
I am thinking of running 3 instances of sample_encode and 6 instances of sample_decode to mimic your case. Would this be close to your case?
If the encode and decode are running continually for hours, it should replicate the scenario, yes. The encodes are in the same process, different threads. It's a background service. The decodes are in the same process, again different threads but in user space. None of the threads in either front or backend share resources. We use D3D11 for decode (CD3D11Device). If you don't do any decodes, just the 3 encodes, there's no problem and it's perfectly stable, so we think the problem is somewhere in decoding.
Another clue. We do not get this problem on systems with driver 126.96.36.19974. We do with driver 188.8.131.5262. I had missed this previously as our domain controller doesn't give us Windows updates immediately, whereas the soak tests aren't joined to the domain so do get them. I am updating my dev box right now to see if I can replicate the issue with this driver.
Thought I'd update you on this.
We have switched out our decoding threads to use a different decoder (not QuickSync). The encoder is still using QuickSync. Since we've done this the driver crash hasn't been seen. We conclude that decoding was crashing it, not encoding. Remember that this all works OK on Haswell and Skylake boxes, so I believe it's a Coffee Lake specific driver or hardware issue.
Thanks for the update and happy new year,
This is important for this issue; I will add it to the bug report.
From what I can see, it has not been investigated yet, so let me try to push it a little bit more.
Sorry for the late response.
I had a discussion with the dev team today and they would prefer a reproducer for this issue.
Could you give me a reproducer, or tell me how to reproduce this issue? It would be best if we could use sample_multi_transcode from the Media SDK.
It looks very hard to reproduce, so let's go back to your original question, since I realized I might have missed some of your points:
- For Media SDK, there is no software limit on the number of streams across different processors. The practical limits depend on the platform, user expectations, etc.
- I think you are asking: in the case of resource exhaustion, is there a better mitigation method for releasing resources than resetting?
Let me know if this is the correct direction.
It has been a while since my last question; I am not sure if you are still waiting on this.
I will wait for a week, and if there is no response I will close the ticket with the development team. If you ask again, I will re-open it.