Jetsen_W_
Beginner
59 Views

Is VPP inefficient in the Linux SDK?

My MSDK version is Linux_16.1.64.1.11164. When I use VPP in my project (yv12->nv12->vpp->H.264 encode), it is not fast enough for my needs.

My project can be described briefly as follows:

yv12->nv12->vpp->h264 encode->mpegts

I ran some experiments with the samples and got the following results:

1. H.264 encode

./sample_encode_drm h264 -f 25 -b 2048 -w 720 -h 576 -i /home/wdg/video/beijing420p.yuv -o /home/wdg/video/intel.new.h264 -hw -u quality
libva info: VA-API version 0.34.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Intel(R) Media SDK Encoding Sample Version 0.0.000.0000
Input file format YUV420
Output video AVC
Source picture:
Resolution 720x576
Crop X,Y,W,H 0,0,720,576
Destination picture:
Resolution 720x576
Crop X,Y,W,H 0,0,720,576
Frame rate 25.00
Bit rate(Kbps) 2048
Target usage quality
Memory type system
Media SDK impl hw
Media SDK version 1.6
Processing started
Intel(R) Media SDK Encoding Sample Version 0.0.000.0000
Input file format YUV420
Output video AVC
Source picture:
Resolution 720x576
Crop X,Y,W,H 0,0,720,576
Destination picture:
Resolution 720x576
Crop X,Y,W,H 0,0,720,576
Frame rate 25.00
Bit rate(Kbps) 2048
Target usage quality
Memory type system
Media SDK impl hw
Media SDK version 1.6
Frame number: 3000, fps:382.55, spend:7.84s
Processing finished
real 0m7.853s
user 0m1.932s
sys 0m0.920s

The fps is about 382.

2. VPP

./sample_vpp_drm -lib hw -sw 720 -sh 576 -scc yv12 -dw 720 -dh 576 -denoise 32 -i /home/wdg/video/beijing420p.yuv -o /home/wdg/video/out.yuv
libva info: VA-API version 0.34.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Intel(R) Media SDK VPP Sample Version 0.0.000.0000
Input format YV12
Resolution 720x576
Crop X,Y,W,H 0,0,720,576
Frame rate 30.00
PicStruct progressive
Output format NV12
Resolution 720x576
Crop X,Y,W,H 0,0,720,576
Frame rate 30.00
PicStruct progressive
Video Enhancement Algorithms
Denoise ON
VideoAnalysis OFF
ProcAmp OFF
Detail OFF
ImgStab OFF
Memory type system
MediaSDK impl hw
MediaSDK ver 1.6
VPP started
Frame number: 3000
VPP finished
real 0m28.072s
user 0m20.269s
sys 0m3.336s

fps = 3000 / 28.072 = 106.87

3. HD encode

./sample_encode_drm h264 -f 25 -b 4000 -w 1920 -h 1080 -i /home/wdg/video/hd1080p_1000.yuv -o /home/wdg/video/intel.new.hd.h264 -hw -u balanced
Input file format YUV420
Output video AVC
Source picture:
Resolution 1920x1088
Crop X,Y,W,H 0,0,1920,1080
Destination picture:
Resolution 1920x1088
Crop X,Y,W,H 0,0,1920,1080
Frame rate 25.00
Bit rate(Kbps) 4000
Target usage balanced
Memory type system
Media SDK impl hw
Media SDK version 1.6
Processing started
Intel(R) Media SDK Encoding Sample Version 0.0.000.0000
Input file format YUV420
Output video AVC
Source picture:
Resolution 1920x1088
Crop X,Y,W,H 0,0,1920,1080
Destination picture:
Resolution 1920x1088
Crop X,Y,W,H 0,0,1920,1080
Frame rate 25.00
Bit rate(Kbps) 4000
Target usage balanced
Memory type system
Media SDK impl hw
Media SDK version 1.6
Frame number: 1000, fps:95.37, spend:10.49s
Processing finished
real 0m10.568s
user 0m2.924s
sys 0m0.696s

The fps is about 95.

4. HD VPP

./sample_vpp_drm -lib hw -sw 1920 -sh 1080 -scc yv12 -dw 1920 -dh 1080 -denoise 32 -i /home/wdg/video/hd1080p_1000.yuv -o /home/wdg/video/out.yuv
Input format YV12
Resolution 1920x1088
Crop X,Y,W,H 0,0,1920,1080
Frame rate 30.00
PicStruct progressive
Output format NV12
Resolution 1920x1088
Crop X,Y,W,H 0,0,1920,1080
Frame rate 30.00
PicStruct progressive
Video Enhancement Algorithms
Denoise ON
VideoAnalysis OFF
ProcAmp OFF
Detail OFF
ImgStab OFF
Memory type system
MediaSDK impl hw
MediaSDK ver 1.6
VPP started
Frame number: 1000
VPP finished

real 0m38.521s
user 0m23.905s
sys 0m4.044s

fps = 1000/38.521 = 25.96

In all the VPP tests I enabled only the denoise filter, and VPP is still slower than encode.

In my opinion, H.264 encoding is more complicated than VPP (denoise only), so VPP should be faster than encode.

It seems that VPP in this version is inefficient. Maybe it is a software implementation?

Did I make a mistake somewhere?

5 Replies
Jeffrey_M_Intel1
Employee

In a completely synchronous setting, total time for your pipeline could be characterized as vpp+encode. However, Media SDK is asynchronous and many operations can happen simultaneously. In my tests (sorry, code isn't ready to distribute quite yet) an HD pipeline with denoise+encode executed in nearly the same time as encode alone.
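The difference between the two timing models can be illustrated with a minimal sketch. The per-frame costs used here (2 ms for VPP, 10 ms for encode) are hypothetical placeholders, not measurements, and the model assumes an idealized two-stage pipeline in which both stages overlap perfectly.

```python
def serial_time_ms(n_frames, vpp_ms, enc_ms):
    """Total time when each frame finishes VPP and encode before the next starts."""
    return n_frames * (vpp_ms + enc_ms)


def pipelined_time_ms(n_frames, vpp_ms, enc_ms):
    """Total time for an idealized two-stage pipeline with overlapping stages."""
    # The first frame passes through both stages; after that the slower
    # stage sets the pace while the faster one runs concurrently.
    return (vpp_ms + enc_ms) + (n_frames - 1) * max(vpp_ms, enc_ms)


if __name__ == "__main__":
    frames = 1000
    vpp_ms, enc_ms = 2.0, 10.0  # hypothetical per-frame costs
    print("encode alone :", frames * enc_ms, "ms")
    print("synchronous  :", serial_time_ms(frames, vpp_ms, enc_ms), "ms")
    print("asynchronous :", pipelined_time_ms(frames, vpp_ms, enc_ms), "ms")
```

Under these assumptions the overlapped pipeline takes almost exactly as long as encode alone, while the synchronous version pays the full VPP cost on every frame.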

Another thing to consider is I/O. With the encode sample, a compressed bitstream is written to disk. With the vpp sample the output is raw frames. It is a lot more data to move and the sample I/O is far from optimized. The time measured with the sample is mostly disk I/O, not denoise.
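To get a feel for the difference in output volume, here is a rough sketch using the resolutions, frame counts, and bitrates from the commands above. It assumes the VPP sample writes NV12 frames at the cropped size (12 bits per pixel), that the encoder lands close to its target bitrate, and that 1 kbps means 1000 bits per second. Both samples read the same raw input, so only the output side is compared.

```python
def nv12_frame_bytes(width, height):
    """Size of one NV12 frame: full-size Y plane plus half-size interleaved UV plane."""
    return width * height * 3 // 2


def raw_output_bytes(width, height, n_frames):
    """Bytes written by a VPP run that dumps uncompressed NV12 frames."""
    return nv12_frame_bytes(width, height) * n_frames


def bitstream_bytes(bitrate_kbps, fps, n_frames):
    """Approximate bytes written by an encode run that hits its target bitrate."""
    duration_s = n_frames / fps
    return bitrate_kbps * 1000 / 8 * duration_s


if __name__ == "__main__":
    # (label, width, height, frames, bitrate in kbps, fps) taken from the posted commands
    cases = [("SD", 720, 576, 3000, 2048, 25), ("HD", 1920, 1080, 1000, 4000, 25)]
    for label, w, h, n, kbps, fps in cases:
        raw = raw_output_bytes(w, h, n)
        enc = bitstream_bytes(kbps, fps, n)
        print(f"{label}: raw {raw / 1e6:.0f} MB, bitstream {enc / 1e6:.0f} MB, "
              f"ratio {raw / enc:.0f}x")
```

With these inputs the SD VPP run writes roughly 1.87 GB against about 31 MB for the encode run, and the HD VPP run writes roughly 3.1 GB against about 20 MB.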

I understand that evaluating performance is a big part of figuring out if you're going to make the time to write the code. The performance story is clearer for transcode, since here all stages of the pipeline can work together as intended. We're hoping to address this example/documentation gap as soon as possible.

Jetsen_W_
Beginner

Yes, I forgot to consider the I/O performance. I'll try again.

Thank you

Jetsen_W_
Beginner

I tested again with the encode sample. The sample adds a VPP resize when -dstw and -dsth differ from the source size. The results are as follows:

./sample_encode_drm h264 -f 25 -b 2048 -w 720 -h 576 -i /home/wdg/video/beijing420p.yuv -o /home/wdg/video/intel.new.h264 -hw -u quality
libva info: VA-API version 0.34.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Intel(R) Media SDK Encoding Sample Version 0.0.000.0000

Input file format       YUV420
Output video            AVC
Source picture:
        Resolution      720x576
        Crop X,Y,W,H    0,0,720,576
Destination picture:
        Resolution      720x576
        Crop X,Y,W,H    0,0,720,576
Frame rate      25.00
Bit rate(Kbps)  2048
Target usage    quality
Memory type     system
Media SDK impl          hw
Media SDK version       1.6

Processing started
Hello!
Intel(R) Media SDK Encoding Sample Version 0.0.000.0000

Input file format       YUV420
Output video            AVC
Source picture:
        Resolution      720x576
        Crop X,Y,W,H    0,0,720,576
Destination picture:
        Resolution      720x576
        Crop X,Y,W,H    0,0,720,576
Frame rate      25.00
Bit rate(Kbps)  2048
Target usage    quality
Memory type     system
Media SDK impl          hw
Media SDK version       1.6

Frame number: 3000, fps:425.65, spend:7.05s
Processing finished


./sample_encode_drm h264 -f 25 -b 2048 -w 720 -h 576 -i /home/wdg/video/beijing420p.yuv -o /home/wdg/video/intel.new.h264 -hw -u quality  -dstw 640 -dsth 480
libva info: VA-API version 0.34.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Intel(R) Media SDK Encoding Sample Version 0.0.000.0000

Input file format       YUV420
Output video            AVC
Source picture:
        Resolution      720x576
        Crop X,Y,W,H    0,0,720,576
Destination picture:
        Resolution      640x480
        Crop X,Y,W,H    0,0,640,480
Frame rate      25.00
Bit rate(Kbps)  2048
Target usage    quality
Memory type     system
Media SDK impl          hw
Media SDK version       1.6

Processing started
Hello!
Intel(R) Media SDK Encoding Sample Version 0.0.000.0000

Input file format       YUV420
Output video            AVC
Source picture:
        Resolution      720x576
        Crop X,Y,W,H    0,0,720,576
Destination picture:
        Resolution      640x480
        Crop X,Y,W,H    0,0,640,480
Frame rate      25.00
Bit rate(Kbps)  2048
Target usage    quality
Memory type     system
Media SDK impl          hw
Media SDK version       1.6

Frame number: 3000, fps:188.20, spend:15.94s
Processing finished

I expected the result to be about the same or faster (since the output resolution is lower than the source), but it is not.

It seems that the method used in the sample is not good enough. Can you provide some details on using VPP + encode?

Regards

Jeffrey_M_Intel1
Employee

A simple thing to add which will improve performance is the -vaapi flag.

By default, the samples use system memory.  System memory is best for software sessions.  GPU memory (VAAPI surfaces for Linux) is best for hardware sessions.  Here there is an implicit copy to the CPU between VPP and encode without -vaapi, which adds significant overhead. 
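As a back-of-the-envelope check, the fps figures reported earlier in this thread (425.65 fps for encode only, 188.20 fps with the resize added, both with system memory) can be converted to per-frame times. This is only a sketch: it shows the total extra time per frame and cannot separate the resize itself from the memory copies or from file I/O.

```python
def ms_per_frame(fps):
    """Average wall-clock time spent per frame at a given throughput."""
    return 1000.0 / fps


encode_only_ms = ms_per_frame(425.65)    # run without -dstw/-dsth
resize_encode_ms = ms_per_frame(188.20)  # run with -dstw 640 -dsth 480
extra_ms = resize_encode_ms - encode_only_ms

if __name__ == "__main__":
    print(f"encode only   : {encode_only_ms:.2f} ms/frame")
    print(f"resize+encode : {resize_encode_ms:.2f} ms/frame")
    print(f"added per frame: {extra_ms:.2f} ms")
```

The added time is roughly 3 ms per frame, more than the encode itself took in the first run, which is consistent with overhead outside the resize kernel dominating.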

The non-transcode samples all share the limitation of having I/O as the main bottleneck, which is why you might not see faster runtimes with resize as you might expect.  The multi-transcode sample is better, but also is provided more as a functional than a performance demo.  This is a recognized gap that we are working on.

Thanks!

  

Jetsen_W_
Beginner

I used the -vaapi flag, and it improved performance greatly.

Following the sample, I used system memory in my project, which is not a good idea for performance. I'll switch to video memory.

Thanks!
