Hi,
I'm using the latest Intel Media SDK (R6 2016) with all of the patches provided for the 3.14.5 kernel on Ubuntu 12.04.4, with libva and libdrm from the SDK, and I'm investigating hardware transcoding performance.
For this purpose I'm using the sample_multi_transcode sample binary and two types of sources:
1) mpeg2 720x576 interlaced, 25 fps
2) h264 1920x1080 progressive, 25 fps
I then run multiple instances of this sample in parallel, reading the input files from tmpfs and writing the output files to tmpfs. The machine has an Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz.
For type 1) sessions I use the following command:
sample_multi_transcode -i::mpeg2 video.mpeg2 -o::h264 out.h264 -u 7 -b 2000
Output for a single command:
Multi Transcoding Sample Version 6.0.16043361.361
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Pipeline surfaces number: 14
MFX HARDWARE Session 0 API ver 1.17 parameters:
Input video: MPG2
Output video: AVC
Session 0 was NOT joined with other sessions
Transcoding started
Transcoding finished
Common transcoding time is 14.64 sec
MFX session 0 transcoding PASSED:
Processing time: 14.64 sec
Number of processed frames: 7650
For type 2) sessions, this one:
sample_multi_transcode -i::h264 video.h264 -o::h264 out.h264 -b 6000 -u 7
Output for a single command:
Multi Transcoding Sample Version 6.0.16043361.361
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Pipeline surfaces number: 18
MFX HARDWARE Session 0 API ver 1.17 parameters:
Input video: AVC
Output video: AVC
Session 0 was NOT joined with other sessions
Transcoding started
Transcoding finished
Common transcoding time is 39.45 sec
MFX session 0 transcoding PASSED:
Processing time: 39.45 sec
Number of processed frames: 7663
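As a rough reference point, the two single-session logs above can be turned into a transcode speed and a realtime factor. This is only a sketch: it assumes both sources play at 25 fps, and a single session does not necessarily saturate the hardware, so the realtime factor is not a prediction of how many parallel streams will fit. The function names are made up for illustration.

```python
# Numbers copied from the two single-session logs above.
SOURCE_FPS = 25.0  # assumption: both sources are 25 fps

sessions = {
    "1) mpeg2 SD -> h264": {"frames": 7650, "seconds": 14.64},
    "2) h264 HD -> h264": {"frames": 7663, "seconds": 39.45},
}


def transcode_fps(frames, seconds):
    """Frames processed per wall-clock second."""
    return frames / seconds


def realtime_factor(frames, seconds, source_fps=SOURCE_FPS):
    """How many times faster than playback speed the session ran."""
    return transcode_fps(frames, seconds) / source_fps


for name, s in sessions.items():
    fps = transcode_fps(**s)
    factor = realtime_factor(**s)
    print(f"{name}: {fps:.1f} fps, {factor:.1f}x realtime")
```

With these inputs a lone SD session runs at roughly 21x realtime and a lone HD session at roughly 8x.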
I read in the docs that this CPU, with its integrated graphics, should be able to handle at least 13-16 HD sources in parallel. Here is what I actually get:
This hardware can transcode 10 sources of type 2) at the speed required for realtime encoding.
And I get 28 parallel streams of type 1).
That's the problem, and I can't figure out why: the frame area of source type 2) is 5 times larger than that of type 1), so I would expect close to 5x as many type 1) streams. Am I right? Why doesn't the stream count scale with frame size? Where is the bottleneck? How can I increase performance? Why is the number of type 1) and type 2) sources so small? Why this ratio? This is even without any VPP processing.
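The expectation above can be checked with a little arithmetic. The sketch below compares the frame-area ratio with the observed stream-count ratio, and also expresses both results as aggregate pixels per second. It assumes every stream runs at exactly 25 fps (realtime), and the helper name `pixel_rate` is made up for illustration.

```python
def pixel_rate(width, height, fps, streams):
    """Aggregate pixels per second across all parallel streams."""
    return width * height * fps * streams


SD = (720, 576)
HD = (1920, 1080)

area_ratio = (HD[0] * HD[1]) / (SD[0] * SD[1])  # HD frame area / SD frame area
stream_ratio = 28 / 10                          # observed SD streams / HD streams

sd_rate = pixel_rate(*SD, fps=25, streams=28)
hd_rate = pixel_rate(*HD, fps=25, streams=10)

print(f"frame area ratio:   {area_ratio:.2f}")
print(f"stream count ratio: {stream_ratio:.2f}")
print(f"SD aggregate: {sd_rate / 1e6:.1f} Mpixel/s")
print(f"HD aggregate: {hd_rate / 1e6:.1f} Mpixel/s")
```

Measured this way, the HD case pushes more pixels per second through the hardware than the SD case, even though fewer streams fit.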
I also see that my CPU spends a lot of time in wa (I/O wait) while the transcoding sessions are running (vmstat 1):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 2 20      0 25155644  96916 6701844    0    0     0    51   59    56 23  8 21 48
 2 26      0 25142528  96916 6711040    0    0     0     0 8001 26713  5  2 25 68
 0 27      0 25132948  96916 6719908    0    0     0     0 8587 27786  5  2 43 51
 0 28      0 25124520  96916 6727928    0    0     0     0 8380 29270  5  2 28 65
 1 26      0 25117160  96916 6735544    0    0     0     0 8566 28418  4  2 25 69
 0 26      0 25107272  96916 6743884    0    0     0     0 8222 26939  3  2 32 64
 0 25      0 25039424  96916 6810820    0    0     0     0 8051 25865  4  2 29 64
So it seems we have an I/O bottleneck here. Maybe it is the copying between system and video memory? How can I fix it, and how can I debug it?
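To put numbers on the vmstat observation, here is a small sketch that parses the samples quoted above and averages the `wa`, `b` and `us + sy` columns. It skips the first sample because vmstat's first report gives averages since boot. One caveat, stated as an assumption rather than a fact about this kernel: `wa` and `b` reflect tasks in uninterruptible wait, which may include waits inside the graphics driver, so a high value with zero `bi`/`bo` does not by itself prove a disk or memory-copy bottleneck.

```python
VMSTAT_OUTPUT = """\
r b swpd free buff cache si so bi bo in cs us sy id wa
2 20 0 25155644 96916 6701844 0 0 0 51 59 56 23 8 21 48
2 26 0 25142528 96916 6711040 0 0 0 0 8001 26713 5 2 25 68
0 27 0 25132948 96916 6719908 0 0 0 0 8587 27786 5 2 43 51
0 28 0 25124520 96916 6727928 0 0 0 0 8380 29270 5 2 28 65
1 26 0 25117160 96916 6735544 0 0 0 0 8566 28418 4 2 25 69
0 26 0 25107272 96916 6743884 0 0 0 0 8222 26939 3 2 32 64
0 25 0 25039424 96916 6810820 0 0 0 0 8051 25865 4 2 29 64
"""


def parse_vmstat(text):
    """Turn the column header plus data rows into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(int, row))) for row in rows]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Drop the first sample: it reports averages since boot.
samples = parse_vmstat(VMSTAT_OUTPUT)[1:]

mean_wa = mean(s["wa"] for s in samples)
mean_blocked = mean(s["b"] for s in samples)
mean_cpu_busy = mean(s["us"] + s["sy"] for s in samples)

print(f"mean wa: {mean_wa:.1f}%")
print(f"mean blocked tasks (b): {mean_blocked:.1f}")
print(f"mean us+sy: {mean_cpu_busy:.1f}%")
```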
I've also tried different values for the async parameter and see no real difference. Moreover, my results show that a value of 1 performs slightly better: with 28 parallel transcoding streams of type 1), each stream takes 309.55 s with async 10 versus 303.78 s with async 1.
But the docs say that a value >= 5 gives higher throughput. What does that mean?
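For scale, the two async timings can be compared directly. The sketch assumes each of the 28 streams processes the same 7650-frame type 1) source shown in the single-session log; the variable names are made up for illustration.

```python
FRAMES = 7650  # assumption: frame count of the type 1) source, from the log

# Per-stream wall-clock time with 28 parallel streams, keyed by async depth.
seconds_by_async = {10: 309.55, 1: 303.78}

fps_by_async = {depth: FRAMES / secs for depth, secs in seconds_by_async.items()}

# Relative slowdown of async 10 compared with async 1.
rel_diff = (seconds_by_async[10] - seconds_by_async[1]) / seconds_by_async[10]

for depth, fps in sorted(fps_by_async.items()):
    print(f"async {depth:2d}: {fps:.2f} fps per stream")
print(f"difference: {rel_diff * 100:.1f}%")
```

The gap is under 2%, and both settings land close to the 25 fps realtime threshold.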
Thanks
Hi There,
To better understand your system, can you send the output of the following commands?
$ uname -r
$ vainfo
$ ls -l /dev/dri
$ lspci -nn | grep -i vga
$ cat /proc/cpuinfo | grep -i intel
Thanks,
Zachary
$ uname -r
3.14.5-qsv-r6-2016
// vanilla 3.14.5 kernel with ubuntu config and intel patchset
// vainfo requires library to run: libva-x11-1_1.0.15-4 and it depends to libxfixes3_5.0-4ubuntu4.4 -- so I've installed them from ubuntu 12.04.4 repository
$ vainfo
error: can't connect to X server!
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.99 (libva 1.67.0.pre1)
vainfo: Driver version: 16.4.4.47109-ubit
vainfo: Supported profile and entrypoints
VAProfileH264Baseline : VAEntrypointEncSlice
VAProfileH264Baseline : <unknown entrypoint>
VAProfileH264Baseline : <unknown entrypoint>
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: <unknown entrypoint>
VAProfileH264ConstrainedBaseline: <unknown entrypoint>
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264Main : <unknown entrypoint>
VAProfileH264Main : <unknown entrypoint>
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileH264High : <unknown entrypoint>
VAProfileH264High : <unknown entrypoint>
VAProfileMPEG2Simple : VAEntrypointEncSlice
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointEncSlice
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Simple : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileVP8Version0_3 : VAEntrypointEncSlice
VAProfileVP8Version0_3 : VAEntrypointVLD
VAProfileVP8Version0_3 : <unknown entrypoint>
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileVP9Profile0 : VAEntrypointEncSlice
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile0 : <unknown entrypoint>
<unknown profile> : VAEntrypointVideoProc
VAProfileNone : VAEntrypointVideoProc
VAProfileNone : <unknown entrypoint>
$ ls -l /dev/dri
total 0
crw-rw---- 1 root video 226, 0 Mar 9 13:39 card0
crw-rw---- 1 root video 226, 64 Mar 9 13:39 controlD64
crw-rw---- 1 root video 226, 128 Mar 9 13:39 renderD128
$ lspci -nn | grep -i vga
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:162a] (rev 0a)
$ cat /proc/cpuinfo | grep -i intel
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
The forum was down yesterday, so I couldn't post an answer sooner.
Anything else?
Hi Vasily, let me explain "async". async N lets you submit N commands to the hardware independently, without waiting for the result of the previous command. The hardware puts the commands in a queue and resolves data dependencies. This makes sense when the hardware encode/decode units are underloaded. For example, when you execute just one transcode, utilization will be low, and if you look at the VTune timeline you will see very sparse occupation of the Render and Video units. The hardware is almost idle because it lacks data, and async mode lets you feed it more data. But 10 is too high: pipelining reaches saturation at 3 to 5, depending on the SKU. And all of this makes sense only if you want the highest throughput for a single-stream transcode with maximized hardware utilization. Can you give a link to the doc that says ">= 5 gives us higher throughput"? That is not strictly true and should be fixed, but maybe you misread it?
In your case you execute multiple transcodes in parallel. Each transcode submits hardware tasks in much the same way as async does. The only difference is that they come from different processes, but at the end of the day they land in the same hardware command queue. So the queue will be saturated with no need for async in each transcoding session.
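The saturation argument can be illustrated with a deliberately simple toy model. It is not how the Media SDK or the driver schedules work, and every number in it is hypothetical: each session keeps `async_depth` tasks in flight, a task takes `task_latency_ms` from submission to completion, and the shared hardware queue can retire one task every `hw_service_ms`.

```python
def hw_utilization(sessions, async_depth, task_latency_ms=4, hw_service_ms=1):
    """Fraction of hardware capacity used under the toy model (0.0 to 1.0).

    All parameters are hypothetical; the defaults are chosen only so that
    a single session saturates at an async depth of about 4.
    """
    offered = sessions * async_depth / task_latency_ms  # tasks submitted per ms
    capacity = 1.0 / hw_service_ms                      # tasks retired per ms
    return min(offered / capacity, 1.0)


# One session: deeper async helps until the queue is saturated.
for depth in (1, 2, 4, 10):
    print(f"1 session,   async {depth:2d}: {hw_utilization(1, depth):.0%}")

# Many sessions: the shared queue is already full at async 1.
for depth in (1, 10):
    print(f"28 sessions, async {depth:2d}: {hw_utilization(28, depth):.0%}")
```

In this model a lone session goes from 25% to 100% utilization as async grows from 1 to 4 and gains nothing beyond that, while 28 parallel sessions are saturated at any depth.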
Thanks, Alexey
And just a few hints to consider in your benchmark:
- Output bitrate matters. Isn't 2 Mbps too high for SD resolution?
- Progressive and interlaced encoding run at different speeds. Is your output for the SD case interlaced?
- For a quick check of the output files you may try Video Pro Analyzer.
- You may want to try metrics_monitor from the same MSS package to monitor GFX unit utilization in real time.
Anyway, do not expect throughput to scale by a simple factor of 5 just because the picture is 5 times smaller. Hardware throughput efficiency increases with larger resolutions.
Regards,
Alexey
Hi, Alexey!
Thank you for the detailed explanation of how the async parameter works. Maybe I misunderstood it, but I read the documentation (readme-multi-transcode.pdf, provided with the sample) again and it clearly says:
1. To achieve maximum throughput use –async >= 5 and the –join option when running several transcoding pipelines.
So, if the hardware has one queue for all commands, then this option makes no difference here, because all running sessions submit their commands in parallel and the hardware is already fully loaded. OK.
About the benchmark: no, 2 Mbps is fine for an SD stream to achieve the desired quality level. I'll try reducing the bitrate and check again, but I doubt I'll get an observable performance boost. What is the correlation between bitrate and workload? Does a higher bitrate require more time to encode? Why?
As you can see, my tests are HD progressive -> progressive and SD interlaced -> interlaced. There is no deinterlacing and no scaling, so the output has the same characteristics as the input. But the SD case is mpeg2 -> h264, while for HD I tried h264 -> h264. Does that make a difference? Why?
I've tried metrics_monitor while many parallel sessions were encoding, but it just shows me this:
RENDER usage: 100.00, VIDEO usage: 100.00, VIDEO_E usage: 0.00 VIDEO2 usage: 100.00
RENDER usage: 100.00, VIDEO usage: 100.00, VIDEO_E usage: 0.00 VIDEO2 usage: 100.00
RENDER usage: 100.00, VIDEO usage: 100.00, VIDEO_E usage: 0.00 VIDEO2 usage: 100.00
And so on.
According to the documentation for this CPU, this hardware should be able to run more parallel streams.
And what about the CPU wa values? Are they OK?
I've benchmarked a little more and the results look confusing. I used different input samples with different codecs and also specified different output codecs, and the results differ dramatically!
I ran 30 parallel streams with "balanced" compression settings (-u 4) and a bitrate of 2 Mbit/s, with the same input and output frame size: 720x576. Below I give the command line, the encoding time for one session, and the calculated average fps:
1) for i in `seq 0 30`; do ./sample_multi_transcode -i::h264 video_sd_es.h264 -o::mpeg2 tmp/out$i.mpeg2 -b 2000 -u 4 & done
Processing time: 50.17 sec
Number of processed frames: 3133
AVERAGE FPS: 62.5
2) for i in `seq 0 30`; do ./sample_multi_transcode -i::mpeg2 video.mpeg2 -o::mpeg2 tmp/out$i.mpeg2 -b 2000 -u 4 & done
Processing time: 176.04 sec
Number of processed frames: 7650
AVERAGE FPS: 43.5
3) for i in `seq 0 30`; do ./sample_multi_transcode -i::h264 video_sd_es.h264 -o::h264 tmp/out$i.h264 -b 2000 -u 4 & done
Processing time: 141.53 sec
Number of processed frames: 3133
AVERAGE FPS: 22.1
4) for i in `seq 0 30`; do ./sample_multi_transcode -i::mpeg2 video.mpeg2 -o::h264 tmp/out$i.h264 -b 2000 -u 4 & done
Processing time: 595.19 sec
Number of processed frames: 7650
AVERAGE FPS: 12.9
So what is going on here? Why are the results so different?
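The per-session averages above can be recomputed and scaled to an aggregate figure with a short sketch. Two things to keep in mind: `seq 0 30` produces 31 values, so each loop launches 31 processes rather than 30, and the aggregate assumes every session in a run took about as long as the one reported, which the logs above do not confirm.

```python
STREAMS = len(range(0, 31))  # `seq 0 30` yields 0..30, i.e. 31 processes

# (label, processed frames, processing time in seconds) for one session per run
runs = [
    ("1) h264  -> mpeg2", 3133, 50.17),
    ("2) mpeg2 -> mpeg2", 7650, 176.04),
    ("3) h264  -> h264", 3133, 141.53),
    ("4) mpeg2 -> h264", 7650, 595.19),
]

results = []
for label, frames, seconds in runs:
    per_stream_fps = frames / seconds
    aggregate_fps = per_stream_fps * STREAMS  # assumes all sessions behave alike
    results.append((label, per_stream_fps, aggregate_fps))
    print(f"{label}: {per_stream_fps:5.1f} fps per stream, "
          f"~{aggregate_fps:6.0f} fps aggregate")
```

Only the runs with a per-stream rate at or above 25 fps keep up with realtime for all streams at once.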
Here is the mediainfo output for these files:
video.mpeg2
Complete name : video.mpeg2
Format : MPEG Video
Format version : Version 2
File size : 214 MiB
Overall bit rate mode : Variable
Video
Format : MPEG Video
Format version : Version 2
Format profile : Main@Main
Format settings, BVOP : Yes
Format settings, Matrix : Custom
Format settings, GOP : M=3, N=33
Bit rate mode : Variable
Maximum bit rate : 15.0 Mbps
Width : 720 pixels
Height : 576 pixels
Display aspect ratio : 16:9
Frame rate : 25.000 fps
Standard : PAL
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Interlaced
Scan order : Top Field First
Compression mode : Lossy
video_sd_es.h264
Complete name : video_sd_es.h264
Format : AVC
Format/Info : Advanced Video Codec
File size : 18.8 MiB
Video
Format : AVC
Format/Info : Advanced Video Codec
Format profile : Main@L3.0
Format settings, CABAC : Yes
Format settings, ReFrames : 3 frames
Format settings, GOP : M=4, N=24
Width : 720 pixels
Height : 576 pixels
Display aspect ratio : 4:3
Frame rate : 25.000 fps
Standard : Component
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Interlaced
Scan order : Top Field First
Color primaries : BT.470-6 System B, BT.470-6 System G, BT.601-6 625, BT.1358 625, BT.1700 625 PAL, BT.1700 625 SECAM
Transfer characteristics : BT.470-6 System B, BT.470-6 System G
Matrix coefficients : BT.470-6 System B, BT.470-6 System G, BT.601-6 625, BT.1358 625, BT.1700 625 PAL, BT.1700 625 SECAM, IEC 61966-2-4 601
So as you can see, output to mpeg2 is faster. Another really strange result is h264 to mpeg2: the fps is enormous.
So it seems we can sort the operations by speed, from fastest to slowest: decoding h264 -> decoding mpeg2 -> encoding mpeg2 -> encoding h264.
Is that true? Is it OK? Why do I get these results?