Media (Intel® oneAPI Video Processing Library, Intel Media SDK)
Access community support with transcoding, decoding, and encoding in applications using media tools from Intel. This includes Intel® oneAPI Video Processing Library and Intel® Media SDK.

Why is a bitblt operation directly onto a buffer from a mfxFrameSurface1 slow?

Chrizzz
Beginner
832 Views

I am writing a proof-of-concept program to convince my boss that we can do this on an Intel chip and don't need to invest in an expensive GPU. This is the first time I am developing on an Intel platform with the Media SDK. However, I ran into a little issue which, while I can work around it, I would like to understand. I am hoping someone can shed some light on why this happens.

 

I wrote this little program with the Media SDK that takes in a webcam feed; basically, I add some text on the screen and then encode the frame to be pushed out.

The text is nothing unusual, just bitmaps I rendered with the FreeType library.

What I did at first was create a buffer in memory, blt the text onto that buffer, and then memcpy the buffer onto the mfxFrameSurface1 buffer. This worked fine and I got an easy, smooth 30 fps (with the J3455 CPU we were using).

However, as an experiment, I wanted to save some memory, so instead of allocating a memory buffer to perform the blt on, I decided to blt directly onto the mfxFrameSurface1 buffer. I found that performance tanked immensely.

Does this have something to do with the IOPattern being set to MFX_IOPATTERN_IN_VIDEO_MEMORY | MFX_IOPATTERN_OUT_VIDEO_MEMORY?
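For context, here is a minimal, self-contained sketch of the pitch-aware blit involved. SurfaceY is a simplified stand-in for the Y-plane fields of mfxFrameSurface1::Data (not the real SDK type), and blt_glyph is a hypothetical helper, not a Media SDK call:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified stand-in for the fields of mfxFrameSurface1::Data that a text
// blit touches: the luma (Y) plane pointer and the row pitch. The real
// surface also carries chroma planes, width/height, and more.
struct SurfaceY {
    uint8_t* Y;      // base of the luma plane
    uint16_t Pitch;  // bytes per row; can be larger than the visible width
};

// Hypothetical helper: copy an 8-bit glyph bitmap (as produced by FreeType)
// into the luma plane at (x, y), honoring the surface pitch. This is the
// "direct blt" variant; the staging-buffer variant does the same writes into
// a plain heap buffer first and then memcpys whole rows into the surface.
void blt_glyph(SurfaceY& dst, int x, int y,
               const uint8_t* glyph, int gw, int gh) {
    for (int row = 0; row < gh; ++row) {
        uint8_t* out = dst.Y + static_cast<std::size_t>(y + row) * dst.Pitch + x;
        std::memcpy(out, glyph + static_cast<std::size_t>(row) * gw, gw);
    }
}
```

The per-pixel work is identical in both variants; the difference the poster saw comes from where the destination memory lives, not from the blit itself.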

 

Thanks for any help or answers you can give.


12 Replies
JananiC_Intel
Moderator
820 Views

Hi,

 

Thanks for posting in Intel forums.

 

We will check on this. Meanwhile, could you share your Media SDK version and OS details?

 

Regards,

Janani Chandran

 

Chrizzz
Beginner
817 Views

Hi Janani,

 

thanks for the reply. I believe the Media SDK version is 1.3. I cloned it from GitHub, and the commit hash I built from is d058a653. The OS I am using is Ubuntu 18.04.5.

 

 

Best regards,

Chrizzz

JananiC_Intel
Moderator
773 Views

Hi,

 

Thanks for the update.

 

Could you let us know whether you are using system memory or video memory?

 

If you are using video memory, that might be the reason for the slowness. As a workaround, you can use MFX_IOPATTERN_IN_SYSTEM_MEMORY and MFX_IOPATTERN_OUT_SYSTEM_MEMORY instead of MFX_IOPATTERN_IN_VIDEO_MEMORY and MFX_IOPATTERN_OUT_VIDEO_MEMORY.
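For anyone landing here, the workaround amounts to one line at initialization. A minimal sketch, assuming the usual mfxVideoParam setup from the Media SDK samples (only the IOPattern line differs from a video-memory configuration):

```cpp
// Sketch: request system-memory surfaces so the CPU can write frames directly.
// mfxVideoParam and the MFX_IOPATTERN_* flags come from mfxvideo.h.
mfxVideoParam par = {};
// ... codec, resolution, frame rate, etc. filled in as before ...
par.IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY;
```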

 

Refer to the links below for further details:

https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#iopattern

https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#mfxframesurface1

 

Try this and let us know.

 

Regards,

Janani Chandran

 

Chrizzz
Beginner
768 Views

Hi,

 

yes, I am using video memory. I set the IOPattern field to MFX_IOPATTERN_IN_VIDEO_MEMORY | MFX_IOPATTERN_OUT_VIDEO_MEMORY.

The reason I did this was that I thought the memory region would be used as video memory, so it should be flagged as such. Also, isn't the J3455 supposed to have unified memory between system and video? Unless I misunderstood, should there be any penalty for accessing the memory area that is assigned for video use?

 

I will change the IOPattern, test it out, and post my results here. Thanks for your help!

 

 

Best regards,

Chrizzz

JananiC_Intel
Moderator
759 Views

Hi,


Thanks for the quick response.


Try changing the IOPattern and let us know the outcome.


Regards,

Janani Chandran


Chrizzz
Beginner
747 Views

Hi Janani,

 

the change of IOPattern to system memory works, and there is no difference in performance. I suppose this is because the bitblt operations I perform are CPU operations, so marking the IOPattern as system memory lets the writes to the frame buffers be handled more efficiently.

 

However, the documentation doesn't quite explain why this behavior manifests on the J3455. The video memory is supposedly in the same pool as the system memory, in one unified memory region. Shouldn't there be no performance penalty for accessing any part of the memory pool, whether it is assigned as video buffers or as system memory?

 

E.g., there should be no PCI bus penalty for accessing video memory, because the memory is not located on a discrete GPU board; it's all shared in the system memory on the J3455?

 

 

Best regards,

Chris

JananiC_Intel
Moderator
714 Views

Hi,


Thanks for the update.


We will check on this internally and let you know.


Regards,

Janani Chandran


Mark_L_Intel1
Moderator
700 Views

Hi Chris,


May I learn more about your pipeline?


I think you are adding text to the raw frame buffer, so you should have the following pipeline:

webcam-->decoder-->vpp-->encoder.


So I believe the frame buffer you modified is the input of vpp.


The other question about your previous post:

You said, "the change of IOPattern to using system memory works and there is no difference in performance," and later asked, "Shouldn't there be no performance penalties accessing any part of the memory pool whether it is assigned as video buffers or system memory?" It seems you got better performance with system memory.


In our context, "video memory" means memory dedicated to the GPU (whether it is an integrated GPU or a discrete GPU); it must be accessed through the graphics hardware. So "The video memory is supposedly in the same pool as the system memory in one unified memory region." is not correct.


Mark


Chrizzz
Beginner
691 Views

Hi Mark,

 

that pipeline is correct as described. I'm just taking the webcam input, decoding it, then taking a VPP surface, modifying the frame buffer, and submitting that to the encoder.

 

And yes, by changing to system memory, I did get better performance. This is my first time using this software suite and the Intel GPU for this; I'm trying to prove we can use the Intel GPU and don't need more expensive solutions, so it's a learning curve.

 

So am I right to assume that if I set the IOPattern to video memory, then access to that memory is handled by the GPU, meaning I am in effect asking the GPU to perform the memory access on behalf of the CPU? Which would also imply I really should use Linux VAAPI to access memory that is allocated as video memory?

 

 

Thanks,

Chris

Mark_L_Intel1
Moderator
666 Views

Hi Chris,


Thanks for the reply. I understand now that you are using the Media SDK for the first time, and I am glad to help with your concerns.


The benefit of video memory is that it lets the codec/processing hardware on the GPU access the frame locally. So for the pipeline you are using:

webcam-->decoder-->vpp-->encoder


The frame buffers sit between the decoder output and the encoder input, and your manual bitblt runs on those buffers. I am not a Windows developer, so I am not quite sure what bitblt does internally; I assume it can only access system memory. So if you assign video memory for the frame buffers, the bitblt operation triggers an internal copy. If system memory is assigned instead, the decoder writes to system memory, which is slower for the hardware to access but avoids the copy.


Your comment about video memory is almost right, although it is not the "GPU" as such: in the hardware architecture, the hardware codec is located on the GPU and shares memory with the general-purpose GPU, but it is a separate piece of hardware.
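The copying described above can be sketched with a toy model. VideoSurface below is NOT a Media SDK type; it only models the Lock/Unlock contract that a video-memory frame allocator (mfxFrameAllocator) follows: Lock stages the GPU-side frame into CPU-visible memory, Unlock writes the edits back, and those two transfers are the overhead that made the direct blit slow.

```cpp
#include <cstdint>
#include <vector>

// Toy model (not the real mfxFrameAllocator) of CPU access to a
// video-memory surface: the CPU never touches gpu_side directly.
struct VideoSurface {
    std::vector<uint8_t> gpu_side;   // stands in for GPU-local frame memory
    std::vector<uint8_t> cpu_window; // CPU-visible staging copy
    bool locked = false;

    uint8_t* Lock() {   // stage the GPU frame into CPU-visible memory
        cpu_window = gpu_side;
        locked = true;
        return cpu_window.data();
    }
    void Unlock() {     // write the CPU edits back to the GPU frame
        gpu_side = cpu_window;
        locked = false;
    }
};
```

With system-memory surfaces, both transfers disappear: the decoder writes straight into CPU-addressable memory and the blit edits it in place.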


If you are using the Media SDK on Linux, yes, you can use VAAPI directly. The Media SDK uses VAAPI internally and optimizes the hardware codec operations. You can refer to the following code for the Intel media driver with VAAPI:

https://github.com/intel/media-driver


Here is the media architecture on Linux:

https://01.org/intel-media-for-linux


Here is a hardware diagram of the Intel hardware codec (also called Fixed Function):

https://sudonull.com/post/77698-VP8-VP9-and-H265-Hardware-Acceleration-of-Video-Encoding-and-Decodin...

In this diagram, the FF and GPGPU units share the L3 cache as the video memory.


Mark Liu


Chrizzz
Beginner
644 Views

Hi Mark,

 

the explanation you gave helped immensely. At least I now better understand how this SDK sits on the Linux system.

 

One final question about the sharing of the L3 cache, though: does the CPU share the same L3 cache that serves as the video memory?

 

 

Thanks,

Chris

Mark_L_Intel1
Moderator
626 Views

No problem, glad to answer your question.


I am not familiar with the detailed hardware information, but from the driver's point of view, the L3 cache is local memory and can't be shared with the CPU.


In our new product, oneVPL, we have started to improve on this. In oneVPL, if the memory is managed internally, there is a mapping operation that maps the video memory into the system memory space, so the application can access it as if it were system memory.
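A sketch of that mapping in the oneVPL (2.x) API, assuming a library-owned surface obtained via MFXMemory_GetSurfaceForEncode and omitting error checks; the mfxFrameSurfaceInterface calls are the mechanism described above:

```cpp
// oneVPL sketch: even internally managed video memory can be mapped so the
// application reads/writes it like system memory.
mfxFrameSurface1* surf = nullptr;
MFXMemory_GetSurfaceForEncode(session, &surf);        // library-owned surface

surf->FrameInterface->Map(surf, MFX_MAP_READ_WRITE);  // map into CPU space
// ... edit pixels via surf->Data.Y / surf->Data.UV, honoring Data.Pitch ...
surf->FrameInterface->Unmap(surf);

surf->FrameInterface->Release(surf);                  // drop our reference
```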


This product has been released since the end of last year and can be used as a CPU-only implementation. You can use it for prototyping and evaluation; it will soon be available for product development.


I will stop monitoring this post, feel free to submit a new post.


For detailed information, you can access our landing page:

intel.com/onevpl (the shortcut for media SDK is intel.com/mediasdk)

Here is my webinar from last month, which answers most of the questions about oneVPL:

https://techdecoded.intel.io/essentials/unlock-media-features-on-more-cpus-gpus-and-accelerators/#gs...

These are the oneAPI open source pages:

https://github.com/oneapi-src/oneVPL

https://github.com/oneapi-src/oneVPL-cpu

https://github.com/oneapi-src/oneVPL-intel-gpu

https://github.com/oneapi-src/oneAPI-samples/tree/master/Libraries/oneVPL


Again, this product is fully enabled only on the CPU side; the GPU side will be enabled within this year.


Mark Liu

