
GPUCopy details

OTorg
New Contributor III

Hi,

 

Could you please explain how exactly the GPUCopy feature works in video encoding/decoding tasks?

 

Let's consider a scenario without the GPUCopy option first.
The application holds the video content in buffers in RAM (it has to be this way for architectural reasons).
When the encoder or decoder is initialized, the MFX_IOPATTERN_IN_VIDEO_MEMORY or MFX_IOPATTERN_OUT_VIDEO_MEMORY model is specified.
When it's time to encode the next frame, a LockSurface/Map call is made on the mfxFrameSurface1, the video content is copied between the application's RAM and the surface, then the surface is unlocked and the mfxFrameSurface1 is submitted to the VPL engine.
The copy is optimized with _mm_stream_load_si128/_mm_stream_si128 instructions, but the CPU is still the one doing the work.
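Roughly, in code, the flow I'm describing looks like this (an illustrative NV12 sketch with made-up names, not a complete program; a plain row copy stands in for the streaming-load/store loop):

```cpp
// Sketch of the no-GPUCopy path described above: the CPU copies the
// application's RAM buffer into a locked video-memory surface, which is then
// submitted to the encoder (MFX_IOPATTERN_IN_VIDEO_MEMORY).
#include <cstring>
#include <mfxvideo.h>

// Row-by-row copy of an NV12 frame from a contiguous RAM buffer into a locked
// surface; the streaming-load/store version does the same job faster, but
// either way the CPU performs the copy.
static void CopyNv12ToSurface(const mfxU8 *src, mfxU16 width, mfxU16 height,
                              mfxFrameData *dst)
{
    for (mfxU32 y = 0; y < height; ++y)                          // luma plane
        std::memcpy(dst->Y + y * dst->Pitch, src + y * width, width);
    const mfxU8 *srcUV = src + (mfxU32)width * height;
    for (mfxU32 y = 0; y < (mfxU32)height / 2; ++y)              // interleaved chroma
        std::memcpy(dst->UV + y * dst->Pitch, srcUV + y * width, width);
}

mfxStatus SubmitFromRam(mfxSession session, mfxFrameAllocator &alloc,
                        mfxFrameSurface1 *surf, const mfxU8 *appFrame,
                        mfxU16 width, mfxU16 height,
                        mfxBitstream &bs, mfxSyncPoint &syncp)
{
    // Map the video-memory surface so the CPU can write into it.
    mfxStatus sts = alloc.Lock(alloc.pthis, surf->Data.MemId, &surf->Data);
    if (sts != MFX_ERR_NONE)
        return sts;

    CopyNv12ToSurface(appFrame, width, height, &surf->Data);     // CPU is busy here

    alloc.Unlock(alloc.pthis, surf->Data.MemId, &surf->Data);

    // Hand the video-memory surface to the encoder.
    return MFXVideoENCODE_EncodeFrameAsync(session, nullptr, surf, &bs, &syncp);
}
```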


Now let's go further.
If the GPU has the GPUCopy capability, does this mean that:
When initializing the encoder/decoder, we can specify MFX_IOPATTERN_IN_SYSTEM_MEMORY/MFX_IOPATTERN_OUT_SYSTEM_MEMORY and an external allocator.
And when filling the next mfxFrameSurface1 structure, we can point it directly at the application's frame buffer.
And the GPU itself will copy the content between the application's arbitrary memory and video memory, the CPU will not be involved, and double copying will not occur.
Is that correct?
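In code, what I have in mind is something like this (an illustrative NV12 sketch with made-up names):

```cpp
// The hypothesized GPUCopy path: the surface's data pointers are aimed
// directly at the application's own NV12 buffer in RAM, with the encoder
// initialized for MFX_IOPATTERN_IN_SYSTEM_MEMORY.
#include <mfxvideo.h>

mfxStatus SubmitAppBuffer(mfxSession session, mfxFrameSurface1 *surf,
                          mfxU8 *appNv12, mfxU16 width, mfxU16 height,
                          mfxBitstream &bs, mfxSyncPoint &syncp)
{
    // No CPU-side copy: just point the surface at the application's buffer.
    surf->Data.Y     = appNv12;                             // luma plane
    surf->Data.UV    = appNv12 + (mfxU32)width * height;    // interleaved chroma
    surf->Data.Pitch = width;

    // The question: does the runtime then use the GPU (GPUCopy) to move this
    // data into video memory, leaving the CPU out of it?
    return MFXVideoENCODE_EncodeFrameAsync(session, nullptr, surf, &bs, &syncp);
}
```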


If it matters, the environment is Windows.

 

7 Replies
Rahila_T_Intel
Moderator

Hi,


Thank you for posting in Intel Communities.


Let's consider a scenario where you are building a video processing (VPP) application that performs color space conversion and resizing on a sequence of video frames.


When 'gpucopy' is enabled:

1. The video frames are first read from disk and loaded into system memory (CPU memory), into an mfxFrameSurface1.

2. The frames are then copied from system memory to GPU memory using a 'gpucopy' operation.

3. The GPU performs color space conversion and resizing on the frames.

4. The processed frames are copied back from GPU memory to system memory using another 'gpucopy' operation.

5. Finally, the processed frames are saved to disk or sent to another part of the system for further processing.


When 'gpucopy' is disabled:

1. The video frames are read from disk and loaded into system memory.

2. The GPU performs color space conversion and resizing on the frames directly from system memory, without copying the data to GPU memory first.

3. The processed frames are saved to disk or sent to another part of the system for further processing directly from system memory.
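Note that from the application's point of view both flows look the same; 'gpucopy' only changes how the runtime moves the data internally. A minimal per-frame sketch with system-memory surfaces is below (illustrative names, error handling trimmed):

```cpp
// Per-frame VPP call with system-memory input/output surfaces. Whether the
// runtime stages the data through GPU memory (gpucopy enabled) or works on
// system memory directly depends on the session settings, not on this code.
#include <mfxvideo.h>

mfxStatus ProcessOneFrame(mfxSession session, mfxFrameSurface1 *in,
                          mfxFrameSurface1 *out)
{
    mfxSyncPoint syncp = nullptr;

    // VPP is assumed to be initialized with
    // MFX_IOPATTERN_IN_SYSTEM_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY
    // and with color conversion / resize configured in mfxVideoParam::vpp.
    mfxStatus sts = MFXVideoVPP_RunFrameVPPAsync(session, in, out, nullptr, &syncp);
    if (sts != MFX_ERR_NONE)
        return sts;

    // Wait for completion; 'out' can then be written to disk or passed on.
    return MFXVideoCORE_SyncOperation(session, syncp, MFX_INFINITE);
}
```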


Disabling 'gpucopy' might seem like a more efficient approach because it avoids the overhead of copying data between the system memory and the GPU memory. However, accessing the system memory directly from the GPU can result in lower performance due to the increased latency and lower bandwidth compared to the GPU's dedicated memory. In many cases, it is more efficient to perform 'gpucopy' operations and work with the data in the GPU memory, even with the additional overhead.


The optimal approach depends on the specific hardware, the size of the data being processed, and the nature of the processing tasks. In some cases, 'gpucopy' might be essential for achieving acceptable performance, while in others, it might be possible to achieve similar performance with or without 'gpucopy'. As a developer, it is essential to profile and optimize your application for the specific target hardware and use case.



It's important to note that the actual performance improvement depends on your specific hardware and use case. Enabling MFX_GPUCOPY_ON might not always result in better performance, so it's essential to test and profile your application with and without the flag to determine the best configuration for your needs.
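For testing, the flag is set per session at initialization time. A minimal sketch using the legacy MFXInitEx path (the implementation and version values here are illustrative):

```cpp
// Create a session with GPUCopy explicitly enabled via the legacy init path.
#include <mfxvideo.h>

mfxStatus CreateSessionWithGpuCopy(mfxSession *session)
{
    mfxInitParam par   = {};
    par.Implementation = MFX_IMPL_HARDWARE_ANY;   // GPU implementation
    par.Version.Major  = 1;
    par.Version.Minor  = 0;
    par.GPUCopy        = MFX_GPUCOPY_ON;          // use MFX_GPUCOPY_OFF to compare

    return MFXInitEx(par, session);
}
```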


DeviceCopy is the oneVPL 2.x equivalent of mfxInitParam::GPUCopy. When enabled, it uses the GPU's EUs (CM kernels) for accelerated copies between video memory and system memory.
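With the oneVPL 2.x dispatcher this is requested through a config filter property, roughly as sketched below; the property name string is assumed from the mfxInitializationParam::DeviceCopy field naming convention and should be checked against the dispatcher documentation:

```cpp
// Request DeviceCopy (the 2.x counterpart of GPUCopy) through the VPL dispatcher.
#include <vpl/mfxdispatcher.h>
#include <vpl/mfxvideo.h>

mfxStatus CreateVplSessionWithDeviceCopy(mfxLoader *loaderOut, mfxSession *session)
{
    mfxLoader loader = MFXLoad();
    if (!loader)
        return MFX_ERR_NOT_INITIALIZED;

    mfxConfig cfg = MFXCreateConfig(loader);

    mfxVariant value;
    value.Type     = MFX_VARIANT_TYPE_U16;
    value.Data.U16 = MFX_GPUCOPY_ON;
    // Property path assumed from the structure/field naming convention.
    MFXSetConfigFilterProperty(cfg,
        (const mfxU8 *)"mfxInitializationParam.DeviceCopy", value);

    *loaderOut = loader;   // keep the loader alive for the session's lifetime
    return MFXCreateSession(loader, 0, session);
}
```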


If this resolves your query, kindly accept this as a solution as it will help others with a similar query.


Thanks


OTorg
New Contributor III

Hi Rahila,

Thank you for the detailed answer.

 

I would like to clarify two more details.

 

1.

Your phrase "frames are copied from the system memory to the GPU memory using a 'gpucopy' operation".

Who performs that 'gpucopy' operation: the CPU/driver or the GPU/firmware?

 

2.

When I make a LockSurface/Map call on a hardware surface, I get a pointer to the frame data.

Is that a pointer directly into GPU memory, mapped into my application's address space (so accesses to it go over the PCIe bus rather than to RAM)?

 

StreaX
Beginner
Can you tell me how to actually disable GPU copy on an Intel Arc GPU (A380)?
StreaX
Beginner
Because of this GPU copy, I always get encoder overload in OBS no matter what settings I use; an NVIDIA GPU, on the other hand, doesn't use GPU copy while encoding.
Rahila_T_Intel
Moderator

Hi,


Please find the responses to your questions:


1. The Intel VPL runtime creates CM GPU kernels for the accelerated copy between video memory and system memory, so the copy itself runs on the GPU.


2. The pointer you get when locking or mapping a hardware surface points to a region of system memory that is mapped to the corresponding GPU memory. This lets you access the frame data from the CPU while the VPL runtime manages the necessary synchronization and data transfer between GPU and system memory.
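For completeness, a minimal sketch of that mapping with the 2.x frame interface (it assumes a surface produced by the runtime, e.g. a decoder output):

```cpp
// Map a hardware surface for CPU read access through the 2.x frame interface.
// The runtime performs whatever transfer and synchronization is needed so
// that Data.Y/Data.UV are valid CPU-visible pointers between Map and Unmap.
#include <vpl/mfxvideo.h>

mfxStatus ReadBackSurface(mfxFrameSurface1 *surf)
{
    // Make sure the operation that produced the surface has finished.
    mfxStatus sts = surf->FrameInterface->Synchronize(surf, MFX_INFINITE);
    if (sts != MFX_ERR_NONE)
        return sts;

    sts = surf->FrameInterface->Map(surf, MFX_MAP_READ);
    if (sts != MFX_ERR_NONE)
        return sts;

    // surf->Data.Y / surf->Data.UV / surf->Data.Pitch can be read here;
    // copy the planes out if the data is needed after Unmap.

    surf->FrameInterface->Unmap(surf);

    // Drop the application's reference once it is done with the frame.
    return surf->FrameInterface->Release(surf);
}
```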


If this resolves your issue, make sure to accept this as a solution. This will help others with a similar issue.


Thanks


OTorg
New Contributor III

Thank you, Rahila!

Rahila_T_Intel
Moderator

Hi,


Glad to know that your query is resolved. Thanks for accepting our solution. 

If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.



Thanks

