Slower MJPEG decoding using video memory compared to system memory

Jason_C_ · 02-03-2017

Media SDK Client or Media Server Studio version installed: Intel Media SDK 2016 R2

Processor Type: Intel i7 6700 with 16 GB of RAM on a Gigabyte Gaming 7 Z170X motherboard

Driver Version: 15.45.10.4542

Operating System: Windows 10 Pro

Media SDK System Analyzer: I have attached the output from the system analyzer.

Concise Description of the Issue:

I have been using the Intel Media SDK 2016 R2 to get a MJPEG decoder working.
I followed the Media SDK tutorial code as well as the Media SDK samples and got an implementation using system memory up and running.

For my application I need maximum performance, so I now need to get the decoder working using video memory. I have it decoding and producing valid frames while using video memory; however, it is much slower than it should be. In fact, the system memory approach is currently 3 times faster than the video memory approach, so clearly something is wrong with my code. I get about 100 fps using system memory and only 30 fps when using video memory.

I've looked through other forum posts (there aren't many) and can't see what I am doing wrong.
I've looked through the tracer logs for any warnings about partial acceleration but there aren't any. I have also tried it on another machine (a laptop with an i7 3630QM with integrated HD4400 graphics) and saw the same result: system memory decoding was much faster than video memory decoding.

Attached with this post are tracer logs when using the system memory approach and another using video memory.
I have also included the relevant code (I'm trying to have a minimal implementation; some of the functions here directly use functionality from the Media SDK tutorials, in files such as common_directx11, common_utils, and common_utils_windows).
Please let me know if you need anything else in order to help me with this problem. I suspect I am missing something simple but I can't see it.

Your help in this would be greatly appreciated!

Mark_L_Intel1 · 02-03-2017

Hi Jason,

I have looked at all the files you attached and, like you said, there is not a lot of information related to the slowdown. I don't have much information about your application either.

Assuming the video memory and the HW decoder work fine, you need to figure out which stage in the pipeline has the most delay. I suggest you inject timing code to tag each stage and check its delay: for example, the stage that copies the input stream/file to video memory, the possible color conversion stage, and the output stage that copies from video memory to the rendering buffer/output file.
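As a rough illustration of the kind of per-stage tagging I mean, here is a minimal sketch. StageTimer and the stage names are hypothetical helpers, not part of the Media SDK; the sketch only uses std::chrono to accumulate wall-clock time per named stage, so treat it as a starting point rather than a finished profiler.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not a Media SDK type): accumulates wall-clock time
// per named pipeline stage so the slowest stage can be identified.
class StageTimer {
public:
    using Clock = std::chrono::steady_clock;

    // RAII scope: the time between construction and destruction is added
    // to the named stage.
    class Scope {
    public:
        Scope(StageTimer& owner, std::string stage)
            : owner_(owner), stage_(std::move(stage)), start_(Clock::now()) {}
        ~Scope() {
            const std::chrono::duration<double, std::milli> elapsed =
                Clock::now() - start_;
            owner_.add(stage_, elapsed.count());
        }
        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;

    private:
        StageTimer& owner_;
        std::string stage_;
        Clock::time_point start_;
    };

    // Record one measurement (in milliseconds) for a stage.
    void add(const std::string& stage, double ms) {
        assert(ms >= 0.0);
        Entry& e = stats_[stage];
        e.totalMs += ms;
        ++e.calls;
    }

    long calls(const std::string& stage) const {
        const auto it = stats_.find(stage);
        return it == stats_.end() ? 0 : it->second.calls;
    }

    double totalMs(const std::string& stage) const {
        const auto it = stats_.find(stage);
        return it == stats_.end() ? 0.0 : it->second.totalMs;
    }

    double averageMs(const std::string& stage) const {
        const long n = calls(stage);
        return n == 0 ? 0.0 : totalMs(stage) / n;
    }

    // Print one line per stage with its call count and average time.
    void report() const {
        for (const auto& kv : stats_) {
            std::printf("%-16s calls=%ld avg=%.3f ms\n", kv.first.c_str(),
                        kv.second.calls, kv.second.totalMs / kv.second.calls);
        }
    }

private:
    struct Entry {
        double totalMs = 0.0;
        long calls = 0;
    };
    std::map<std::string, Entry> stats_;
};
```

You would open one scope per stage, for example `StageTimer::Scope s(timer, "decode");` around the DecodeFrameAsync()/SyncOperation() pair and another around the copy out of the surface, then call `timer.report()` after a few hundred frames and compare the system memory and video memory runs.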

There are several possibilities at different stages. What is your pipeline?

Mark

Jason_C_ · 02-04-2017

Hi Mark,

Thanks a lot for looking at this!

I'll provide a bit of context about my application. I am working on making the Intel Media SDK MJPEG decoder available for use in the open-source library libfreenect2. Essentially it is an open-source library for interfacing with the Kinect 2 depth camera (which also has a colour camera that uses MJPEG); you may have heard of it. It has a CPU-based MJPEG decoder, and under Linux it also has another MJPEG implementation that uses VAAPI to accelerate MJPEG decoding on Intel hardware. Internally I believe VAAPI either directly uses Media SDK or interfaces with the same hardware in some other way. Either way, it shows how fast the Intel hardware can be for hardware decoding, on the order of ~220 fps. I am trying to get the same speeds under Windows using the Media SDK.

The current speed I get is fine for one camera, but the big thing with libfreenect2 is it supports connecting multiple Kinect cameras to a single PC which is why I need the implementation to decode much faster so it can support decoding multiple streams. However, at present I am just trying to get it decoding a single stream at a fast rate.

I'll provide a bit of context for the code I provided. Libfreenect2 gets the colour camera USB packet data for a single frame and passes it off to the MJPEG implementation, in this case my code with class name "MediaSdkRgbPacketProcessorImpl" that I attached. The initializeMediaSdk function in my code is called which creates the session etc.

When a new frame comes in, the "decompress" function is called. For the first header seen, the "decodeFirstHeader" function is called, which sets up the mfxVideoParam structure and the mfxBitstream structure, decodes the header, sets up the surfaces the decoder will use, and initialises the decoder. The frame is then decoded. A lot of this code closely follows the tutorial samples (from here) called "simple_decode" for system memory and "simple_decode_d3d" for video memory, but these show an example of decoding an H.264 stream, so I fear there may be differences causing my performance issues.

In order to test using main memory vs video memory I have a class member variable called "useVideoMemory". The bulk of the code is the same for both approaches, but when doing things such as allocating the surfaces and performing the decode there are differences so the logic is different at those points as you will see in the code.

So to answer your question the pipeline is the same for both cases, at least until it hits the actual decoding. There aren't any differences in terms of operations that are performed on the frames after being decoded.

The thing I wonder is if I am correctly setting up the surfaces in the video memory case, or perhaps the actual decoding loop is incorrect in some way. I get valid frames but perhaps there is something that isn't correct that is leading to slow operation.

Hope this helps,

Jason

Mark_L_Intel1 · 02-17-2017

Hi Jason,

Sorry for the late response; I have been busy with other projects and haven't had a chance to look at all the forum questions. I need to improve this.

I was asking about the pipeline because there are a lot of places that can cause a slowdown. For example, if you have a resize function that involves VPP, this can make it more complex.

Anyway, I looked at your code and I understand the code roughly, so I have a question:

In our tutorial, we have a function called ReadBitStreamData() which loads frames continuously in the decode loop, but it is missing in your code. I also noticed the buffer is passed in from the decompress() call. Are you reading the buffer outside of this call?

You should do it inside the loop because DecodeFrameAsync() dispatches the frames to a multi-threaded engine to utilize the hardware. If you force it onto a single thread, it will be much slower.
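To illustrate the idea in general terms, here is a small sketch of keeping several operations in flight and only waiting on the oldest one once the queue is full. It does not call the Media SDK at all: decodePipelined is a hypothetical name, std::async stands in for the submit step and future::get() stands in for the synchronization step, and the callable you pass must be safe to run concurrently. Whether your real pipeline can be structured this way is an assumption you would need to check.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <future>
#include <vector>

// Hypothetical sketch: run 'decode' on every frame while keeping up to
// 'depth' operations in flight. Results are returned in submission order.
// 'decode' is a stand-in for the real submit + sync calls and must be safe
// to invoke from several threads at once.
template <typename Frame, typename Decode>
auto decodePipelined(const std::vector<Frame>& frames, Decode decode,
                     std::size_t depth)
    -> std::vector<decltype(decode(frames.front()))> {
    using Result = decltype(decode(frames.front()));

    if (depth == 0) {
        depth = 1;  // at least one operation must be allowed in flight
    }

    std::deque<std::future<Result>> inFlight;
    std::vector<Result> out;
    out.reserve(frames.size());

    for (const Frame& frame : frames) {
        // Queue full: wait for the oldest operation before submitting more.
        if (inFlight.size() >= depth) {
            out.push_back(inFlight.front().get());
            inFlight.pop_front();
        }
        // Submit without waiting for the result.
        inFlight.push_back(std::async(
            std::launch::async, [&decode, &frame] { return decode(frame); }));
    }

    // Drain whatever is still in flight.
    while (!inFlight.empty()) {
        out.push_back(inFlight.front().get());
        inFlight.pop_front();
    }

    assert(out.size() == frames.size());
    return out;
}
```

With depth set to 1 this degenerates into the submit-then-wait pattern, which is the single-threaded case I described above; larger depths let later frames be submitted while earlier ones are still being processed.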

Mark

Jason_C_ · 02-20-2017

Hi Mark,

Thanks for looking at this, I really appreciate your time.

To answer your question regarding ReadBitStreamData() or an equivalent: in my use case I am decoding MJPEG frames from a camera, so I get the buffer outside this loop and, as you say, pass it into the decompress function. This can't really be done any other way, as the other parts of the library have to handle the USB packets and form the compressed frame buffers. I am also processing the frames in real time, so I can't feed it multiple frames at the same time (as I could if the data were being read from a file), at least not from a single camera.

I guess I am just trying to understand why the video memory path (where I have the bool useVideoMemory set to true) is so much slower than the system memory path. The only real change between using system and video memory is how the surface buffers are allocated. At present it takes ~14.4 ms to decode a 1080p frame using video memory (about 70 fps), while using system memory it takes only 7.4 ms (about 135 fps).
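For reference, this is how I am converting per-frame decode times into those frame rates, plus a rough estimate of how many cameras a single serial decoder could keep up with. The function names are just for illustration, and the 30 fps per-camera rate in the example is my assumption for the Kinect 2 colour stream.

```cpp
#include <cassert>

// Convert a per-frame decode time in milliseconds to frames per second.
inline double framesPerSecond(double msPerFrame) {
    assert(msPerFrame > 0.0);
    return 1000.0 / msPerFrame;
}

// Rough upper bound on how many camera streams one decoder could service
// if frames are decoded strictly one after another.
// cameraFps is the per-camera frame rate (assumed, e.g. 30.0).
inline int maxRealtimeStreams(double msPerFrame, double cameraFps) {
    assert(cameraFps > 0.0);
    return static_cast<int>(framesPerSecond(msPerFrame) / cameraFps);
}
```

For example, framesPerSecond(14.4) is roughly 69.4 and framesPerSecond(7.4) is roughly 135.1, so at an assumed 30 fps per camera the video memory path would cover 2 streams and the system memory path 4.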

Thanks again,

Jason