Kishor_B_
Beginner
203 Views

simple_decode_vmem - GPU to CPU memory copy - ways to optimize

Hello All,

Profiling shows that copying decoded frames from GPU memory to CPU memory is very expensive. I am looking for suggestions to reduce this performance loss.

With the decoder using video memory, I measured 1653 FPS (the output is not processed after decoding). Copying the decoder's output to system memory after decoding, in the simple_decode_vmem application, gave just 80 FPS. That drop leaves us with only 2 decodes per processor. I also tried MFX_GPUCOPY_ON, but it gave no performance benefit.
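
To put the two figures side by side, they can be converted to per-frame times. The sketch below is only arithmetic on the numbers reported above; the helper names are made up for illustration, and it assumes decode and copy run back to back with no overlap, which a pipelined application would not strictly satisfy.

```cpp
#include <cassert>

// Wall-clock time per frame, in milliseconds, for a given throughput.
double frame_time_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra time per frame attributable to the GPU-to-CPU readback.
// Assumption: decode and copy are serialized, so their times simply add up.
double implied_copy_ms(double fps_decode_only, double fps_with_copy) {
    return frame_time_ms(fps_with_copy) - frame_time_ms(fps_decode_only);
}
```

Under that assumption, `implied_copy_ms(1653.0, 80.0)` comes to roughly 11.9 ms per frame, compared with about 0.6 ms for the decode alone, so the readback accounts for nearly all of the per-frame time.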

Note: I am not convinced by the explanation at https://software.intel.com/en-us/forums/intel-media-sdk/topic/557837, because the same clip decodes at ~1000 FPS when using system memory, and that path also requires the data to be moved between memories.

Any ideas on how to deal with data movement between memories?

My setup is: SDK API v1.16, Core i7-5600U CPU @ 2.60 GHz, 4 cores, Broadwell, Turbo disabled, HD Graphics 5500, CentOS 7.1

Best Regards, Kishor

Surbhi_M_Intel
Employee
203 Views

Hi Kishor,

As explained in the thread you pointed to, the simplest and most efficient approach in Media SDK is to use IOPattern, which specifies where the data will be stored. You can output the data to system memory by setting IOPattern to SYSTEM_MEMORY; Media SDK then performs the copy from video memory to system memory efficiently. More details on IOPattern can be found in the developer's guide, p. 33. In a quick run of sample_decode, outputting to system memory via IOPattern gave around a 25-30% decrease in FPS.
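
A minimal sketch of what selecting the output location might look like. To keep it compilable without the SDK installed, it uses stand-in definitions: `sketch::VideoParam` is a hypothetical substitute for `mfxVideoParam`, and the two constants mirror `MFX_IOPATTERN_OUT_VIDEO_MEMORY` and `MFX_IOPATTERN_OUT_SYSTEM_MEMORY`. In a real application you would include `mfxvideo.h`, set `IOPattern` on the real structure, and pass it to `MFXVideoDECODE_Init`.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Stand-ins mirroring the Media SDK output patterns (assumed values; in real
// code use MFX_IOPATTERN_OUT_VIDEO_MEMORY / MFX_IOPATTERN_OUT_SYSTEM_MEMORY).
constexpr std::uint16_t IOPATTERN_OUT_VIDEO_MEMORY  = 0x10;
constexpr std::uint16_t IOPATTERN_OUT_SYSTEM_MEMORY = 0x20;

// Hypothetical stand-in for mfxVideoParam; only the field used here.
struct VideoParam {
    std::uint16_t IOPattern = 0;
};

// Choose where the decoder delivers its output surfaces.
inline void select_decode_output(VideoParam& par, bool to_system_memory) {
    par.IOPattern = to_system_memory ? IOPATTERN_OUT_SYSTEM_MEMORY
                                     : IOPATTERN_OUT_VIDEO_MEMORY;
}

// True when the decoder is configured to deliver frames in system memory.
inline bool outputs_to_system_memory(const VideoParam& par) {
    return (par.IOPattern & IOPATTERN_OUT_SYSTEM_MEMORY) != 0;
}

}  // namespace sketch
```

With the system memory pattern selected, the application does not issue its own video-to-system copy; the frames it receives are already CPU-accessible.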

Please try this and let us know whether it helps. If not, can you please explain your performance concern in more detail (provide the numbers you are getting and the numbers you are targeting)?

Thanks,
Surbhi

6 Replies
Kishor_B_
Beginner
203 Views

Hi Surbhi,

Thank you for the quick reply on this thread. I will try IOPattern and get back to you with my observations.

A 25-30% drop is far better than what we reported earlier and would work well for us.

Best Regards, Kishor

Kishor_B_
Beginner
203 Views

Surbhi M. (Intel) wrote:

Use IOPattern i.e where the data will be stored.

Thanks Surbhi. With system memory, I got similar results. You may want to close this thread.

Best Regards, Kishor

Surbhi_M_Intel
Employee
203 Views

Great, that sounds good! Closing this thread. If you have any other queries, please start a new thread.

-Surbhi

Roman_T_
New Contributor I
203 Views

Surbhi M. (Intel) wrote:

Hi Kishore, 

As explained in the thread you have pointed, simplest and efficient way done in Media SDK is to use IOPattern i.e where the data will be stored. You can o/p the data to system memory by setting IOPattern to be SYSTEM_MEMORY, here Media SDK did the copy from video to system memory in the efficient way. More details on IOPattern can be found in developers guide Pg33. I did a quick run using sample_decode in which if I out to system memory using pattern I am seeing around 25-30% decrease in FPS. 

Try and let us know if this was helpful to you, if not can you please explain your performance concern in detail(provide the no. you are getting and targeting for).

Thanks,
Surbhi

Hi Surbhi,

I need not only to show decoded frames on the screen, but also to perform some video analytics.
So I need the decoded frames to be in system memory.

Does your suggestion mean that I have to use the sample_decode data flow instead of sample_decode_vmem?

Best regards,
Roman

Surbhi_M_Intel
Employee
203 Views

Hi Roman, 

simple_decode and simple_decode_vmem are tutorials that show how to set up a pipeline using system or video memory. The samples are code samples that show the latest API features and are optimized for better performance on the underlying hardware. Depending on your pipeline you can choose either; if you want to reuse the code, I recommend sample_decode for better performance.
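
For the analytics case, once frames arrive in system memory they are usually padded: each row occupies a pitch that can be wider than the visible width. The sketch below packs such a frame into a tight buffer. It assumes NV12 output; `Nv12View` and `pack_nv12` are hypothetical names, and in Media SDK the fields would typically be filled from the surface's `Data.Y`, `Data.UV`, `Data.Pitch` and `Info.CropW`/`Info.CropH`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical read-only view of one NV12 frame in system memory.
struct Nv12View {
    const std::uint8_t* y;   // luma plane, `height` rows
    const std::uint8_t* uv;  // interleaved chroma plane, `height / 2` rows
    std::size_t pitch;       // bytes between row starts, >= width
    std::size_t width;       // visible width in pixels (even)
    std::size_t height;      // visible height in pixels (even)
};

// Copy the visible region row by row into a contiguous buffer, dropping
// the padding at the end of each row.
std::vector<std::uint8_t> pack_nv12(const Nv12View& s) {
    assert(s.pitch >= s.width && s.width % 2 == 0 && s.height % 2 == 0);
    std::vector<std::uint8_t> out(s.width * s.height * 3 / 2);
    std::uint8_t* dst = out.data();
    for (std::size_t row = 0; row < s.height; ++row) {
        std::memcpy(dst, s.y + row * s.pitch, s.width);
        dst += s.width;
    }
    // Each chroma row holds width/2 interleaved U,V pairs, i.e. `width` bytes.
    for (std::size_t row = 0; row < s.height / 2; ++row) {
        std::memcpy(dst, s.uv + row * s.pitch, s.width);
        dst += s.width;
    }
    return out;
}
```

If the analytics code can work with a pitch directly, this extra copy can be skipped and the planes read in place.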

Thanks,
Surbhi