Strange performance issue

VASILY_V_ · ‎03-10-2016

Hi,
I'm using the latest Intel Media SDK R6 2016 with all the patches provided for the 3.14.5 kernel on Ubuntu 12.04.4, with libva and libdrm from the SDK, and I'm investigating the performance of hardware transcoding.
For this purpose I'm using the sample_multi_transcode sample binary and two types of sources:

1) mpeg2 720x576 interlaced 25
2) h264 1920x1080 progressive 25

I then run multiple instances of this software in parallel, reading the input files from tmpfs and writing the output files to tmpfs. The machine has an Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz.

For type 1) sources I'm using the following command:

sample_multi_transcode -i::mpeg2 video.mpeg2 -o::h264 out.h264 -u 7 -b 2000

Output for a single command:

Multi Transcoding Sample Version 6.0.16043361.361

libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Pipeline surfaces number: 14
MFX HARDWARE Session 0 API ver 1.17 parameters: 
Input  video: MPG2
Output video: AVC 

Session 0 was NOT joined with other sessions

Transcoding started

Transcoding finished

Common transcoding time is  14.64 sec 
MFX session 0 transcoding PASSED:
Processing time: 14.64 sec 
Number of processed frames: 7650

For type 2) sources, this one:

sample_multi_transcode -i::h264 video.h264 -o::h264 out.h264 -b 6000 -u 7

Output for a single command:

Multi Transcoding Sample Version 6.0.16043361.361

libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Pipeline surfaces number: 18
MFX HARDWARE Session 0 API ver 1.17 parameters: 
Input  video: AVC 
Output video: AVC 

Session 0 was NOT joined with other sessions

Transcoding started

Transcoding finished

Common transcoding time is  39.45 sec 
MFX session 0 transcoding PASSED:
Processing time: 39.45 sec 
Number of processed frames: 7663
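As a quick sanity check on the two single-session runs above, the average throughput can be computed from the reported frame counts and processing times. This is only a sketch: the 25 fps realtime rate is taken from the source descriptions, and the names used here are illustrative.

```python
# Figures copied from the two single-session runs above.
REALTIME_FPS = 25  # both sources are described as 25 fps


def throughput(frames, seconds):
    """Average frames per second for one run."""
    return frames / seconds


sd_fps = throughput(7650, 14.64)  # 1) mpeg2 720x576 -> h264
hd_fps = throughput(7663, 39.45)  # 2) h264 1920x1080 -> h264

for name, fps in (("SD", sd_fps), ("HD", hd_fps)):
    # The second figure is how many times faster than realtime one session ran.
    print(f"{name}: {fps:.1f} fps, {fps / REALTIME_FPS:.1f}x realtime")
```

A single session alone does not necessarily keep the hardware busy, so these multiples are not an upper bound on the number of parallel realtime streams.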

According to the docs for this CPU with integrated graphics, it should be able to transcode at least 13-16 HD sources in parallel. Here is what I get:

This hardware can transcode 10 sources of type 2) at the speed required for realtime encoding.

And I get 28 parallel streams for type 1).

And that's the problem, and I can't figure out why: the frame area of source type 2) is 5 times larger than that of type 1). So basically we should be able to transcode close to 5x as many streams of type 1). Am I right? Why are the two stream counts so close? Where is the bottleneck? How can I increase performance? Why is the number of sources so small for both type 1) and type 2)? Why this ratio? And this is even without any VPP processing.
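To make the comparison concrete, here is a rough calculation of the raw pixel rate in both cases. It is only a sketch: it assumes every stream runs at exactly 25 fps and treats a pixel of MPEG-2 interlaced input the same as a pixel of H.264 progressive input, which the hardware may well not do.

```python
def pixel_rate(width, height, fps, streams):
    """Pixels per second pushed through the pipeline by `streams` sessions."""
    return width * height * fps * streams


sd_area = 720 * 576      # frame area of source type 1)
hd_area = 1920 * 1080    # frame area of source type 2)
area_ratio = hd_area / sd_area

sd_rate = pixel_rate(720, 576, 25, 28)     # 28 realtime SD streams
hd_rate = pixel_rate(1920, 1080, 25, 10)   # 10 realtime HD streams

print(f"area ratio HD/SD: {area_ratio:.1f}")
print(f"SD: {sd_rate / 1e6:.1f} Mpixel/s, HD: {hd_rate / 1e6:.1f} Mpixel/s")
print(f"HD/SD pixel-rate ratio: {hd_rate / sd_rate:.2f}")
```

By this measure the HD case moves noticeably more pixels per second than the SD case, even though it runs fewer streams.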

I also see that the CPU spends a lot of time in wa (I/O wait) while the transcoding sessions are running (vmstat 1):

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2 20      0 25155644  96916 6701844    0    0     0    51   59   56 23  8 21 48
 2 26      0 25142528  96916 6711040    0    0     0     0 8001 26713  5  2 25 68
 0 27      0 25132948  96916 6719908    0    0     0     0 8587 27786  5  2 43 51
 0 28      0 25124520  96916 6727928    0    0     0     0 8380 29270  5  2 28 65
 1 26      0 25117160  96916 6735544    0    0     0     0 8566 28418  4  2 25 69
 0 26      0 25107272  96916 6743884    0    0     0     0 8222 26939  3  2 32 64
 0 25      0 25039424  96916 6810820    0    0     0     0 8051 25865  4  2 29 64

So it seems we have an I/O bottleneck here. Maybe it is in the copying between system and video memory? But how can I fix it? How can I debug it?
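A small script can summarize the vmstat sample above. It is a sketch that only parses the rows pasted here. The first vmstat report is an average since boot, so it is skipped. One thing the numbers show directly is that bi and bo are zero in the remaining rows, so the wa time does not come from block-device traffic. Whether it comes from processes blocked on the GPU is an assumption that this data cannot confirm.

```python
VMSTAT = """\
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2 20      0 25155644  96916 6701844    0    0     0    51   59   56 23  8 21 48
 2 26      0 25142528  96916 6711040    0    0     0     0 8001 26713  5  2 25 68
 0 27      0 25132948  96916 6719908    0    0     0     0 8587 27786  5  2 43 51
 0 28      0 25124520  96916 6727928    0    0     0     0 8380 29270  5  2 28 65
 1 26      0 25117160  96916 6735544    0    0     0     0 8566 28418  4  2 25 69
 0 26      0 25107272  96916 6743884    0    0     0     0 8222 26939  3  2 32 64
 0 25      0 25039424  96916 6810820    0    0     0     0 8051 25865  4  2 29 64
"""


def parse_vmstat(text):
    """Return one dict per sample row, skipping the since-boot first report."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(int, row))) for row in rows[1:]]


def mean(samples, column):
    return sum(s[column] for s in samples) / len(samples)


samples = parse_vmstat(VMSTAT)
for column in ("b", "bi", "bo", "us", "sy", "id", "wa"):
    print(f"{column:>2}: {mean(samples, column):.1f}")
```

The b column counts processes in uninterruptible sleep, and it stays close to the number of running sessions.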

I've also tried setting the async parameter to different values, and I see no real difference. Moreover, my results show that a value of 1 gives better performance, so I can transcode streams faster: 309.55 s per stream for 28 parallel type 1) transcoding streams with async 10 vs 303.78 s with async 1.
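To put the two timings in perspective, the relative difference and the implied per-stream fps can be worked out as follows. This is a sketch that assumes the same 7650-frame type 1) source was used in both runs, which the post does not state explicitly.

```python
FRAMES = 7650  # assumed: same type 1) source as in the single-session run

t_async10 = 309.55  # seconds per stream, 28 parallel streams, async 10
t_async1 = 303.78   # seconds per stream, 28 parallel streams, async 1

# Relative slowdown of async 10 compared with its own run time.
diff_pct = (t_async10 - t_async1) / t_async10 * 100

fps_async10 = FRAMES / t_async10
fps_async1 = FRAMES / t_async1

print(f"difference: {diff_pct:.1f}%")
print(f"async 10: {fps_async10:.1f} fps per stream")
print(f"async 1:  {fps_async1:.1f} fps per stream")
```

The gap is small in relative terms, but under this assumption it sits right around the 25 fps realtime threshold.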

But the docs say that async >= 5 gives higher throughput. What does that mean?

Thanks

Jiandong_Z_Intel · ‎03-11-2016

Hi there,

To better understand your system, can you send the output of the following commands?

$ uname -r

$ vainfo

$ ls -l /dev/dri

$ lspci -nn | grep -i vga

$ cat /proc/cpuinfo | grep -i intel

Thanks,

Zachary

VASILY_V_ · ‎03-12-2016

$ uname -r
3.14.5-qsv-r6-2016 // vanilla 3.14.5 kernel with ubuntu config and intel patchset 

// vainfo requires a library to run: libva-x11-1_1.0.15-4, which depends on libxfixes3_5.0-4ubuntu4.4, so I've installed them from the ubuntu 12.04.4 repository
$ vainfo
error: can't connect to X server!
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.99 (libva 1.67.0.pre1)
vainfo: Driver version: 16.4.4.47109-ubit
vainfo: Supported profile and entrypoints
      VAProfileH264Baseline           :	VAEntrypointEncSlice
      VAProfileH264Baseline           :	<unknown entrypoint>
      VAProfileH264Baseline           :	<unknown entrypoint>
      VAProfileH264ConstrainedBaseline:	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline:	<unknown entrypoint>
      VAProfileH264ConstrainedBaseline:	<unknown entrypoint>
      VAProfileH264Main               :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointEncSlice
      VAProfileH264Main               :	<unknown entrypoint>
      VAProfileH264Main               :	<unknown entrypoint>
      VAProfileH264High               :	VAEntrypointVLD
      VAProfileH264High               :	VAEntrypointEncSlice
      VAProfileH264High               :	<unknown entrypoint>
      VAProfileH264High               :	<unknown entrypoint>
      VAProfileMPEG2Simple            :	VAEntrypointEncSlice
      VAProfileMPEG2Simple            :	VAEntrypointVLD
      VAProfileMPEG2Main              :	VAEntrypointEncSlice
      VAProfileMPEG2Main              :	VAEntrypointVLD
      VAProfileVC1Advanced            :	VAEntrypointVLD
      VAProfileVC1Main                :	VAEntrypointVLD
      VAProfileVC1Simple              :	VAEntrypointVLD
      VAProfileJPEGBaseline           :	VAEntrypointVLD
      VAProfileJPEGBaseline           :	VAEntrypointEncPicture
      VAProfileVP8Version0_3          :	VAEntrypointEncSlice
      VAProfileVP8Version0_3          :	VAEntrypointVLD
      VAProfileVP8Version0_3          :	<unknown entrypoint>
      VAProfileHEVCMain               :	VAEntrypointEncSlice
      VAProfileVP9Profile0            :	VAEntrypointEncSlice
      VAProfileVP9Profile0            :	VAEntrypointVLD
      VAProfileVP9Profile0            :	<unknown entrypoint>
      <unknown profile>               :	VAEntrypointVideoProc
      VAProfileNone                   :	VAEntrypointVideoProc
      VAProfileNone                   :	<unknown entrypoint>


$ ls -l /dev/dri
total 0
crw-rw---- 1 root video 226,   0 Mar  9 13:39 card0
crw-rw---- 1 root video 226,  64 Mar  9 13:39 controlD64
crw-rw---- 1 root video 226, 128 Mar  9 13:39 renderD128


$ lspci -nn | grep -i vga
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:162a] (rev 0a)


$ cat /proc/cpuinfo | grep -i intel
vendor_id	: GenuineIntel
model name	: Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz

VASILY_V_ · ‎03-12-2016

The forum was down yesterday, so I couldn't reply sooner.

Anything else?

Alexey_F_Intel · ‎03-12-2016

Hi Vasily. Let me explain "async". async N lets you send N commands to the hardware independently, without waiting for the result of the previous command. The hardware puts the commands in a queue and resolves data dependencies. This makes sense when the hardware encode/decode units are underloaded. For example, when you execute just one transcode, utilization will be low, and if you look at the VTune timeline you will see very sparse occupation of the Render and Video units. So the hardware is almost idle because it lacks data. Async mode lets you feed the hardware with more data. But 10 is too high: pipelining reaches saturation at 3 to 5, depending on the SKU. And all this makes sense if you want to achieve the highest throughput for a single-stream transcode with maximized hardware utilization. Can you give a link to the doc that says ">= 5 gives us higher throughput"? This is not absolutely true and should be fixed, but maybe you got it wrong?

In your case you execute multiple parallel transcodes. Each transcode submits hardware tasks in a similar way to what you do with async. The only difference is that they come from different processes, but at the end of the day they all end up in the same hardware command queue. So the queue will be saturated without any need for async in each transcoding session.
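The argument can be illustrated with a deliberately crude toy model. It is not how the driver or the SDK actually schedules work: it simply counts tasks in flight across all sessions and compares that count with an assumed saturation depth of 5, the upper end of the 3 to 5 range mentioned above. All names are hypothetical.

```python
def tasks_in_flight(sessions, async_depth):
    """Toy model: every session keeps `async_depth` tasks queued at once."""
    return sessions * async_depth


def is_saturated(sessions, async_depth, saturation_depth=5):
    """True when the shared queue holds at least `saturation_depth` tasks.

    saturation_depth=5 is an assumption taken from the "3 to 5" range above.
    """
    return tasks_in_flight(sessions, async_depth) >= saturation_depth


for sessions, depth in ((1, 1), (1, 5), (28, 1), (28, 10)):
    print(f"{sessions:>2} sessions, async {depth:>2}: "
          f"{tasks_in_flight(sessions, depth):>3} in flight, "
          f"saturated={is_saturated(sessions, depth)}")
```

In this model a single session needs a deeper async setting to fill the queue, while 28 sessions already fill it with async 1, so raising async further adds queued tasks but no throughput.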

Thanks, Alexey

Alexey_F_Intel · ‎03-12-2016

And just a few hints to consider in your benchmark.

Output bitrate matters. Isn't 2 Mbps too high for SD resolution?
Progressive and interlaced encoding run at different speeds. Is your output for the SD case interlaced?
- For a quick check of the output files you may try Video Pro Analyzer.
You may want to try metrics_monitor from the same MSS package to monitor gfx unit utilization in real time.

Anyway, do not expect throughput to scale by a simple factor of 5 just because the picture is 5 times smaller. Hardware throughput efficiency increases with larger resolutions.

Regards,

Alexey

VASILY_V_ · ‎03-12-2016

Hi, Alexey!

Thank you for the detailed explanation of how the async parameter works. Maybe I misunderstood it, but I read the documentation (readme-multi-transcode.pdf, provided with the sample) again and it clearly says:

1. To achieve maximum throughput use -async >= 5 and the -join option when running several transcoding pipelines.

So, if the hardware has one queue for all commands, then it seems this option makes no difference in my case, because all running sessions submit their commands in parallel and the hardware is already fully loaded. OK.

About the benchmark: no, 2 Mbps is fine for an SD stream at the quality level we want. I'll try reducing the bitrate and check again, but I'm not sure I'll get a noticeable performance boost. What is the correlation between bitrate and workload? Does a higher bitrate take more time to encode? Why?

As you can see, in my tests the HD case is progressive -> progressive and the SD case is interlaced -> interlaced. There is no deinterlacing and no scaling here, so the output has the same characteristics as the input. But the SD case is mpeg2 -> h264, while for HD I tried h264 -> h264. Does that make a difference? Why?

I've tried metrics_monitor while many parallel sessions were encoding, but it just shows me this:

RENDER usage: 100.00,	VIDEO usage: 100.00,	VIDEO_E usage: 0.00	VIDEO2 usage: 100.00
RENDER usage: 100.00,	VIDEO usage: 100.00,	VIDEO_E usage: 0.00	VIDEO2 usage: 100.00
RENDER usage: 100.00,	VIDEO usage: 100.00,	VIDEO_E usage: 0.00	VIDEO2 usage: 100.00

And so on.

According to the documentation for this CPU, this hardware should run more parallel streams.

And what about the CPU wa values? Are they OK?

VASILY_V_ · ‎03-13-2016

I've benchmarked a little more, and the new results look confusing. I used different input samples with different codecs and also specified different output codecs. The results are dramatically different!

I ran about 30 parallel streams (note that seq 0 30 in the loops below actually starts 31 processes) with "balanced" compression settings (-u 4) and a bitrate of 2 Mbit/s, with the same input and output frame size: 720x576. For each case I give the command line, the encoding time for one session, and the calculated average fps:

1) for i in `seq 0 30`; do ./sample_multi_transcode -i::h264 video_sd_es.h264 -o::mpeg2 tmp/out$i.mpeg2 -b 2000 -u 4 & done

Processing time: 50.17 sec
Number of processed frames: 3133

AVERAGE FPS 62.5

2) for i in `seq 0 30`; do ./sample_multi_transcode -i::mpeg2 video.mpeg2 -o::mpeg2 tmp/out$i.mpeg2 -b 2000 -u 4 & done

Processing time: 176.04 sec 
Number of processed frames: 7650

AVERAGE FPS 43.5

3) for i in `seq 0 30`; do ./sample_multi_transcode -i::h264 video_sd_es.h264 -o::h264 tmp/out$i.h264 -b 2000 -u 4 & done

Processing time: 141.53 sec 
Number of processed frames: 3133

AVERAGE FPS: 22.1


4) for i in `seq 0 30`; do ./sample_multi_transcode -i::mpeg2 video.mpeg2 -o::h264 tmp/out$i.h264 -b 2000 -u 4 & done

Processing time: 595.19 sec 
Number of processed frames: 7650

AVERAGE FPS 12.9


So what is going on here? Why are the results so different?
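The four results can be put side by side with a short script. It is a sketch: it recomputes the per-session fps from the reported frame counts and times, then estimates the aggregate rate by assuming that all sessions in a case took about the same time as the one reported, and that 31 sessions ran (seq 0 30 yields 31 values).

```python
PARALLEL = 31      # `seq 0 30` yields 31 values, so 31 sessions per case
REALTIME_FPS = 25  # frame rate of both input files

# (case, frames, seconds) as reported for one session in each case
RUNS = [
    ("h264 -> mpeg2", 3133, 50.17),
    ("mpeg2 -> mpeg2", 7650, 176.04),
    ("h264 -> h264", 3133, 141.53),
    ("mpeg2 -> h264", 7650, 595.19),
]


def summarize(runs):
    """Return (case, per-session fps, aggregate fps) sorted fastest first."""
    rows = []
    for case, frames, seconds in runs:
        fps = frames / seconds
        rows.append((case, fps, fps * PARALLEL))
    return sorted(rows, key=lambda row: row[1], reverse=True)


summary = summarize(RUNS)
for case, fps, total in summary:
    print(f"{case:<15} {fps:6.1f} fps/session  {total:7.1f} fps total  "
          f"~{total / REALTIME_FPS:.1f} realtime streams")
```

The recomputed per-session figures agree with the averages quoted above to within rounding.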

Here is the mediainfo output for these files:

video.mpeg2

Complete name                            : video.mpeg2
Format                                   : MPEG Video
Format version                           : Version 2
File size                                : 214 MiB
Overall bit rate mode                    : Variable

Video
Format                                   : MPEG Video
Format version                           : Version 2
Format profile                           : Main@Main
Format settings, BVOP                    : Yes
Format settings, Matrix                  : Custom
Format settings, GOP                     : M=3, N=33
Bit rate mode                            : Variable
Maximum bit rate                         : 15.0 Mbps
Width                                    : 720 pixels
Height                                   : 576 pixels
Display aspect ratio                     : 16:9
Frame rate                               : 25.000 fps
Standard                                 : PAL
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Interlaced
Scan order                               : Top Field First
Compression mode                         : Lossy

video_sd_es.h264

Complete name                            : video_sd_es.h264
Format                                   : AVC
Format/Info                              : Advanced Video Codec
File size                                : 18.8 MiB

Video
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : Main@L3.0
Format settings, CABAC                   : Yes
Format settings, ReFrames                : 3 frames
Format settings, GOP                     : M=4, N=24
Width                                    : 720 pixels
Height                                   : 576 pixels
Display aspect ratio                     : 4:3
Frame rate                               : 25.000 fps
Standard                                 : Component
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Interlaced
Scan order                               : Top Field First
Color primaries                          : BT.470-6 System B, BT.470-6 System G, BT.601-6 625, BT.1358 625, BT.1700 625 PAL, BT.1700 625 SECAM
Transfer characteristics                 : BT.470-6 System B, BT.470-6 System G
Matrix coefficients                      : BT.470-6 System B, BT.470-6 System G, BT.601-6 625, BT.1358 625, BT.1700 625 PAL, BT.1700 625 SECAM, IEC 61966-2-4 601

So as you can see, output to mpeg2 is faster. Another really strange result is h264 to mpeg2: the fps is enormous.

So it seems we can sort the operations by speed, fastest first: decoding h264 -> decoding mpeg2 -> encoding mpeg2 -> encoding h264.

Is that true? Is it OK? Why do I get these results?