I'm writing a server application that records and merges video from several IP cameras and renders the result using the Intel Media SDK. For this I use a modified version of the sample encoder. However, the application sporadically freezes the entire system: nothing echoes to the console and nothing is written to any system log. More importantly, the server becomes unresponsive and has to be rebooted by physically powering it off and on. We have two such servers, each controlling 11 cameras; our resulting frames are 1920x480 at 50 fps.
There is also a "light" version of the above where we get error code -17 (DEVICE_FAILURE), in which case we can abort and reset the server. This is not as bad as the above, but still not acceptable long term.
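To make the "abort on -17" path concrete, here is a minimal sketch of an encode loop that records how many frames succeeded before the failure. The names `run_encode_loop`, `encode_one` and `LoopResult` are hypothetical and not part of the Media SDK; the status values are defined locally so the sketch compiles without the SDK headers, and it assumes negative status values are errors.

```cpp
#include <cassert>
#include <functional>

// Status values defined locally for this sketch (assumption: 0 means success,
// -17 is the device failure discussed in this thread).
constexpr int kStatusOk = 0;
constexpr int kStatusDeviceFailed = -17;

enum class LoopResult { Finished, DeviceFailed, OtherError };

// Calls the hypothetical callback encode_one(frame_index) once per frame.
// Stops at the first negative status and reports in frames_done how many
// frames were encoded successfully before the loop ended.
LoopResult run_encode_loop(long frame_count,
                           const std::function<int(long)>& encode_one,
                           long& frames_done) {
    assert(frame_count >= 0);
    frames_done = 0;
    for (long i = 0; i < frame_count; ++i) {
        const int status = encode_one(i);
        if (status == kStatusDeviceFailed) {
            return LoopResult::DeviceFailed;  // caller tears down and resets
        }
        if (status < 0) {
            return LoopResult::OtherError;    // any other error also stops the loop
        }
        ++frames_done;
    }
    return LoopResult::Finished;
}
```

Logging `frames_done` on every failure makes it easier to compare runs, since the failure does not appear after a fixed number of frames.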
We are currently using Media SDK 1.6. We've successfully run the modified encoder; I've essentially just replaced the parts that read and allocate from file with parts that allocate from frames in memory. It usually crashes within the first 50 000 or so frames; the shortest run, I believe, was just 5-6000 frames, and the longest was 266 000.
What we have tried so far in locating the error:
* Removed the lines in the large server program that initialize the encoder, send frames to it, stop it and close it (four lines in total), and ran the entire app for well over a million frames. No problems.
* Ran the modified encoder class in a minimalistic program that created frames by drawing simple patterns (lines moving, circles expanding, etc.). It seems we can run it more or less indefinitely; however, when run on both servers we once got very different file sizes (about 14.4 GB on one server, 12.5 GB on the other).
* Ran the full program, but instead of removing the encoder lines we generated the same frames as in the test program. Although this appeared more stable, it resulted in several -17 errors.
So to try to formulate some questions:
What are potential causes of the -17 error? The docs don't say anything more detailed than "You are screwed, shut down the encoder". I had the idea that the frames I feed it could be bad. We do a lot of processing in CUDA and the final merged frame resides in CUDA memory, so I convert to NV12 directly before extracting, using http://en.wikipedia.org/wiki/YUV (the part relating to 709). However, FourCC has a long text about how such transforms are wrong, Y has to be in [16:235], etc. So are there any such requirements that I've missed? On successful runs the resulting videos look great, with perhaps slightly too strong colours.
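For reference, a CPU-side sketch of a BT.709 limited-range RGB to NV12 conversion is given here; it could be used to spot-check a few frames produced by the CUDA path. The function name `rgb_to_nv12_bt709_limited`, the packed 8-bit RGB input layout and the 2x2 chroma averaging are assumptions made for illustration, not the code used in the application. Whether full-range values can trigger -17 is not established here; the sketch only shows what the [16:235] constraint looks like in practice.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round to nearest and clamp into [lo, hi].
static std::uint8_t clamp_round(double v, long lo, long hi) {
    const long r = std::lround(v);
    return static_cast<std::uint8_t>(std::min(std::max(r, lo), hi));
}

// Convert packed 8-bit RGB (3 bytes per pixel, full range 0..255) to NV12
// with BT.709 coefficients and limited range: Y in [16, 235], Cb/Cr in [16, 240].
// NV12 layout: full-resolution Y plane, then one half-resolution plane of
// interleaved (Cb, Cr) pairs. Chroma is averaged over each 2x2 block.
std::vector<std::uint8_t> rgb_to_nv12_bt709_limited(const std::uint8_t* rgb,
                                                    int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    const double kr = 0.2126, kb = 0.0722, kg = 1.0 - kr - kb;
    const std::size_t w = static_cast<std::size_t>(width);
    const std::size_t h = static_cast<std::size_t>(height);
    std::vector<std::uint8_t> nv12(w * h * 3 / 2);
    std::uint8_t* y_plane = nv12.data();
    std::uint8_t* uv_plane = y_plane + w * h;

    for (std::size_t by = 0; by < h; by += 2) {
        for (std::size_t bx = 0; bx < w; bx += 2) {
            double sum_cb = 0.0, sum_cr = 0.0;
            for (std::size_t dy = 0; dy < 2; ++dy) {
                for (std::size_t dx = 0; dx < 2; ++dx) {
                    const std::size_t idx = (by + dy) * w + (bx + dx);
                    const std::uint8_t* p = rgb + 3 * idx;
                    const double r = p[0] / 255.0;
                    const double g = p[1] / 255.0;
                    const double b = p[2] / 255.0;
                    const double luma = kr * r + kg * g + kb * b;  // 0..1
                    y_plane[idx] = clamp_round(16.0 + 219.0 * luma, 16, 235);
                    sum_cb += (b - luma) / (2.0 * (1.0 - kb));     // -0.5..0.5
                    sum_cr += (r - luma) / (2.0 * (1.0 - kr));     // -0.5..0.5
                }
            }
            const std::size_t uv = (by / 2) * w + bx;
            uv_plane[uv]     = clamp_round(128.0 + 224.0 * sum_cb / 4.0, 16, 240);
            uv_plane[uv + 1] = clamp_round(128.0 + 224.0 * sum_cr / 4.0, 16, 240);
        }
    }
    return nv12;
}
```

With this mapping, black maps to Y=16 and white to Y=235, both with neutral chroma of 128. A full-range conversion would instead produce Y=0 and Y=255 for the same inputs.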
I'm currently running the encoder class inside a separate thread. The thread responsible for calling and retrieving from CUDA places NV12 frames in a thread-safe queue using preallocated (thread-safe) memory, and the encoder thread pops frames and sends them to the encoder. There are several more threads (one for each camera, among others); could that be a problem? If so, would it help to run the encoder as a separate process instead?
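The producer/consumer hand-off described above can be sketched as a bounded blocking queue. This is a generic illustration, not the queue used in the application; `BoundedQueue` and its methods are hypothetical names, and the element type would be whatever handle refers to a preallocated NV12 frame.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Bounded blocking queue: the producer blocks when the queue is full, the
// consumer blocks when it is empty, and close() wakes both sides for shutdown.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks while the queue is full. Returns false if the queue was closed.
    bool push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return closed_ || items_.size() < capacity_; });
        if (closed_) return false;
        items_.push_back(std::move(item));
        not_empty_.notify_one();
        return true;
    }

    // Blocks while the queue is empty. Returns false once the queue is
    // closed and all remaining items have been drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        not_full_.notify_one();
        return true;
    }

    // Marks the queue as closed and wakes every waiting thread.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        not_empty_.notify_all();
        not_full_.notify_all();
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::deque<T> items_;
    bool closed_ = false;
};
```

The bound matters when frames live in preallocated memory: if the encoder thread stalls, the producer blocks instead of overwriting a buffer that the encoder is still reading.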
When I run the standalone test program I'm not using anywhere near the same resources (memory, CPU) as in the full server application. Could this affect performance?
Has anyone else experienced similar issues?
I'll supply my modified class and test programs if it helps; however, as of yet I can't reproduce the error using them. I can't supply the source for the entire program as easily, and you would need a fairly specific hardware configuration with lots of cameras adhering to particular APIs, so it probably won't be useful anyhow, especially since the problem is barely reproducible there either.
Thanks for your report. As you've mentioned, -17 (DEVICE_FAILURE) can be caused by many things. While I'm hoping we can find more ways in the future to simplify diagnostics so it is easier to find the root cause, today the cause could be in many places in the application -> Media SDK library -> libva/drm -> driver stack.
We'd like to help get past this error. Ideally a reproducer from you would be best. In the meantime, could you let us know a bit more about your system?
Does the problem occur on several machines or just one?
Is it easily possible to reinstall Linux and Media SDK? This may help with any stack corruption issues.
Thanks for your reply, it happens on two separate machines, both (unless we messed something up during installation) identical both in hardware and software.
Processor: i7-4770K CPU @ 3.50GHz
Linux: Ubuntu 12.04.3 LTS
Kernel: 3.8.0-34-generic x86_64
Media SDK: not sure if this is the correct number, but the file is named 184.108.40.20696
libva seems to be libva.so.1.3400.0 (same number for libva-drm etc)
The attached file includes a program that has caused error -17, although it took about 1.4 million frames. It is very simplistic and only modifies 1 pixel at a time, starting from zero-initialized memory. Something worrying about it is that it doesn't throw -17 after the same number of frames on reruns.
Thank you for the details and reproducer. One thing I can see right away which might cause problems is your kernel level. Intel Media SDK ships a kernel mode driver (KMD) with patches intended for a very specific kernel/config combination. For 4th Generation Core processors running Ubuntu this is the 220.127.116.11-generic kernel which can be installed via apt-get. The patches are specifically for stability issues like you've described.
Without this kernel the patched i915 kernel-mode driver is not installed. The Media SDK project works with SLES and Ubuntu to get fixes from close to the kernel tip backported to the LTS kernels. What isn't backported for one reason or another is patched into Media SDK's patched i915 module.
Only the configurations detailed in the release notes are supported configurations. However, if for some reason you can't use our patched KMD you may have success working with kernels much closer to the tip. The 3.8.0-34 kernel has the disadvantages of not working with the patched KMD and of being too old to have those fixes included by default.
Please set up the 18.104.22.168 kernel and re-install Media SDK. Hopefully that will clear up the problem. If not, this is a first step toward preparing to submit issues to the dev team.
We changed kernel and reinstalled the SDK. Since we had to reinstall the SDK we also upgraded it to R3.
But our application still crashes. This time it was the kind that froze the entire machine, i.e. it became completely unresponsive, with nothing in the system log files on crash and no error -17 message. Nothing informative at all to locate the reason. If I weren't able to run for hours without problems when the encoder is commented out, I wouldn't even have a suspect for the source.