Hello.
The hardware is an Intel Xeon E3-1285L v4 CPU (Intel® Iris™ Pro Graphics P6300). The OS is CentOS 7.1 for SDK 2016 and CentOS 7.2 for SDK 2017.
I'm using it to encode 18 SD videos to H.264 in parallel.
With SDK 2016, the GPU load is around 70%.
After installing SDK 2017, the GPU load jumps to 90-95%, which is roughly a 20% performance drop. The hardware and the application software are the same; only the SDK version is different. Has anyone else seen such issues?
The image attached below shows some perf data from our internal regression tool on 1285 v4 (0x162a), 3.0 / 1.1 GHz parts. The S-curve (sorted ratio of performance) compares MSS 2017 (CentOS 7.2, numerator) against MSS 2016 (CentOS 7.1, denominator), running with frequency and power defaults. All workloads below use an N:N transcode model. Within measurement noise, most workloads are faster on 2017, and SD should show the same performance. We haven't seen any cases of a 20% reduction in performance.
So a few questions ...
1) You mention utilization; was performance impacted?
2) Did you collect GPU utilization stats using metrics monitor (which improved absolute results considerably between the two releases) or VTune?
3) Can you say more about your workload? For example: is it progressive to progressive, AVC to AVC, graphics to graphics memory; any use of VPP?
Hello,
Every input source (currently 6 sources) is encoded four times to H.264 High profile with the "balanced" target usage at different resolutions: 712x576, 640x512, 480x384, 360x288.
In this scenario I have a total of 24 encodes running in parallel.
> 1) you mention utilization; was performance impacted?
Yes. If I start encoding another source, the RENDER usage hits 100% and everything stops encoding properly.
> 2) Did you collect GPU utilization stats using metrics monitor (which improved absolute results considerably between the two releases) or vTune?
I measure GPU usage with two tools: intel_gpu_top and metrics_monitor from the SDK. Here are the results with the same hardware and user software:
SDK 2016 intel_gpu_top:
render busy: 69%: █████████████▉ render space: 1548/131072
task percent busy
CS: 68%: █████████████▋ vert fetch: 0 (0/sec)
GAM: 68%: █████████████▋ prim fetch: 0 (0/sec)
TSG: 66%: █████████████▎ VS invocations: 0 (0/sec)
VFE: 35%: ███████ GS invocations: 0 (0/sec)
TDG: 0%: GS prims: 0 (0/sec)
RS: 0%: CL invocations: 0 (0/sec)
VF: 0%: CL prims: 0 (0/sec)
SVG: 0%: PS invocations: 0 (0/sec)
GAFM: 0%: PS depth pass: 0 (0/sec)
SOL: 0%:
CL: 0%:
VS: 0%:
SF: 0%:
GAFS: 0%:
DS: 0%:
HS: 0%:
SDK 2016 metrics_monitor:
RENDER usage: 70.00, VIDEO usage: 70.00, VIDEO_E usage: 0.00 VIDEO2 usage: 65.00
RENDER usage: 72.00, VIDEO usage: 69.00, VIDEO_E usage: 0.00 VIDEO2 usage: 64.00
RENDER usage: 70.00, VIDEO usage: 64.00, VIDEO_E usage: 0.00 VIDEO2 usage: 63.00
RENDER usage: 71.00, VIDEO usage: 62.00, VIDEO_E usage: 0.00 VIDEO2 usage: 63.00
RENDER usage: 68.00, VIDEO usage: 65.00, VIDEO_E usage: 0.00 VIDEO2 usage: 61.00
SDK 2017 intel_gpu_top:
render busy: 80%: ████████████████ render space: 56/16384
task percent busy
CS: 80%: ████████████████ vert fetch: 0 (0/sec)
TSG: 75%: ███████████████ prim fetch: 0 (0/sec)
GAM: 49%: █████████▉ VS invocations: 0 (0/sec)
VFE: 39%: ███████▉ GS invocations: 0 (0/sec)
TDG: 0%: GS prims: 0 (0/sec)
RS: 0%: CL invocations: 0 (0/sec)
VF: 0%: CL prims: 0 (0/sec)
SVG: 0%: PS invocations: 0 (0/sec)
SF: 0%: PS depth pass: 0 (0/sec)
GAFS: 0%:
GAFM: 0%:
SOL: 0%:
DS: 0%:
VS: 0%:
GS: 0%:
SDK 2017 metrics_monitor:
RENDER usage: 84.00, VIDEO usage: 7.00, VIDEO_E usage: 0.00 VIDEO2 usage: 1.00 GT Freq: 1150.00
RENDER usage: 83.00, VIDEO usage: 9.00, VIDEO_E usage: 0.00 VIDEO2 usage: 3.00 GT Freq: 1150.00
RENDER usage: 86.00, VIDEO usage: 4.00, VIDEO_E usage: 0.00 VIDEO2 usage: 4.00 GT Freq: 1150.00
RENDER usage: 85.00, VIDEO usage: 6.00, VIDEO_E usage: 0.00 VIDEO2 usage: 1.00 GT Freq: 1150.00
RENDER usage: 84.00, VIDEO usage: 4.00, VIDEO_E usage: 0.00 VIDEO2 usage: 3.00 GT Freq: 1150.00
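For reference, collecting these numbers amounts to running the two tools and reading their periodic output, roughly along these lines (the log file name is arbitrary, and the metrics_monitor location depends on where the tool was installed or built from the MSS package):
sudo intel_gpu_top
./metrics_monitor | tee metrics.log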
> 3) Can you say more about your workload? For example: is it progressive to progressive, AVC to AVC, graphics to graphics memory; any use of VPP?
The input is MPEG-2 (as shown below). I've tried FFmpeg versions 2.8, 3.0.5, 3.1.6, and 3.2.2, but the results are the same. In their code I cannot see any video pre-processing (VPP) usage.
Input video as diagnosed by Mediainfo tool is:
Video
ID                        : 256 (0x100)
Menu ID                   : 1 (0x1)
Format                    : MPEG Video
Format version            : Version 2
Format profile            : Main@Main
Format settings, BVOP     : Yes
Format settings, Matrix   : Custom
Format settings, GOP      : M=2, N=12
Format settings, picture structure : Frame
Codec ID                  : 2
Duration                  : 10 s 200 ms
Bit rate mode             : Constant
Bit rate                  : 5 623 kb/s
Maximum bit rate          : 5 467 kb/s
Width                     : 720 pixels
Height                    : 576 pixels
Display aspect ratio      : 16:9
Frame rate                : 25.000 FPS
Standard                  : PAL
Color space               : YUV
Chroma subsampling        : 4:2:0
Bit depth                 : 8 bits
Scan type                 : Interlaced
Scan order                : Top Field First
Compression mode          : Lossy
Bits/(Pixel*Frame)        : 0.542
Time code of first frame  : 09:49:11:03
Time code source          : Group of pictures header
GOP, Open/Closed          : Open
Stream size               : 6.84 MiB (92%)
BRS/
Hello,
Which application do you use? Is it your internal application, or did you collect the data with sample_multi_transcode? If it was your own application, have you tried to replicate the result with sample_multi_transcode? Any difference?
At least one of the sources you are using is interlaced. What is the output: interlaced or progressive? In case you have progressive streams on the output, how many VPP deinterlace components are there in the pipeline? Considering your scenario, it would be reasonable to (a rough sketch follows the list):
- Have a single VPP deinterlace component right after the decoder
- Split the data flow after the VPP deinterlace component to feed the 6 VPP scale components
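In ffmpeg terms, that topology would look roughly like the sketch below. This is only an illustration of the structure, not a tested command: the input and output names are placeholders, the software yadif filter stands in for a hardware VPP deinterlace, and the per-branch encode settings would still need your real bitrates and GOP structure.
ffmpeg -i input.ts \
  -filter_complex "[0:v]yadif,split=4[v1][v2][v3][v4];[v1]scale=712:576[o1];[v2]scale=640:512[o2];[v3]scale=480:384[o3];[v4]scale=360:288[o4]" \
  -map "[o1]" -c:v h264_qsv out_712x576.ts \
  -map "[o2]" -c:v h264_qsv out_640x512.ts \
  -map "[o3]" -c:v h264_qsv out_480x384.ts \
  -map "[o4]" -c:v h264_qsv out_360x288.ts
The key point is that deinterlacing happens once per source, before the split, rather than once per output.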
Also, do you have SW memory between components somewhere in the pipeline for any reason?
Dmitry
I would also like to comment on the GPU engine usage data between the two releases:
RENDER usage: 71.00, VIDEO usage: 62.00, VIDEO_E usage: 0.00 VIDEO2 usage: 63.00                    # for 2016
RENDER usage: 84.00, VIDEO usage: 7.00, VIDEO_E usage: 0.00 VIDEO2 usage: 1.00 GT Freq: 1150.00     # for 2017
There are two notable things:
- An increase in GPGPU (RENDER) usage from ~70% to 85-90%
- A decrease in VDBOX 1 and 2 usage (VIDEO and VIDEO2) from ~65% to ~5-10%
The reason for the VDBOX usage drop is a completely different GPU task scheduling scheme introduced in MSS 2017, which is capable of managing inter-dependencies (a kernel-mode driver level change). So the reason for the ~65% VDBOX 1 and 2 utilization in MSS 2016 was that the VDBOXes were stalled waiting for dependencies (executed on the GPGPU) to be resolved. In MSS 2017 this was reworked, and the low VDBOX engine utilization you now see means those engines are free to execute something else.
That said, I have one more question: when you compared MSS 2016 and MSS 2017, did you configure the pipelines to produce data at fixed output rates, or were they permitted to transcode as fast as possible? If the latter, could you please provide elapsed times and CPU% data comparing MSS 2016 and MSS 2017?
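For example, GNU time can capture both in one run (the input/output names below are placeholders; wrap whatever transcode command you actually launch):
/usr/bin/time -v ffmpeg -i input.mpg -c:v h264_qsv -b:v 2500k out.ts
The "Elapsed (wall clock) time" and "Percent of CPU this job got" lines in its report are the two numbers of interest.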
Dmitry.
Hi,
SDK 2016 "top" output:
top - 10:35:37 up 184 days, 20:52,  1 user,  load average: 4.33, 3.90, 3.97
Tasks: 185 total,   2 running, 183 sleeping,   0 stopped,   0 zombie
%Cpu(s): 33.9 us,  4.2 sy,  0.0 ni, 61.6 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16351768 total, 11743684 free,  1470532 used,  3137552 buff/cache
KiB Swap:  1564668 total,  1564668 free,        0 used. 12257720 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20203 root      20   0 1420144 251660  42168 R  57.7  1.5   2363:14 test
  327 root      20   0 1418096 248568  42472 S  54.7  1.5   8203:06 test
 1084 root      20   0 1418288 257196  43956 S  51.3  1.6   6457:13 test
  329 root      20   0 1417328 249608  42812 S  51.0  1.5   6923:24 test
19963 root      20   0 1286924 206528  33872 S  45.3  1.3   1937:12 test
 1372 root      20   0 1284108 201040  34608 S  43.0  1.2   5008:25 test
SDK 2017 "top" output:
top - 10:37:53 up 6 days, 17:14,  6 users,  load average: 3.31, 3.55, 3.48
Tasks: 220 total,   4 running, 216 sleeping,   0 stopped,   0 zombie
%Cpu(s): 31.1 us,  3.2 sy,  0.0 ni, 65.5 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 16185324 total,  3822944 free,  1470464 used, 10891916 buff/cache
KiB Swap:  3129340 total,  3129340 free,        0 used. 12694764 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26273 root      20   0 2090420 149288  21432 R  47.5  0.9  16:57.85 test
26009 root      20   0 2089788 148748  21440 R  45.8  0.9  16:59.13 test
25899 root      20   0 2089788 150452  21400 S  45.5  0.9  16:28.78 test
25890 root      20   0 1892292 219528  21552 S  39.2  1.4  13:37.96 test
25896 root      20   0 1712796 132356  18052 R  38.2  0.8  13:25.99 test
25885 root      20   0 1847880 217068  21360 S  34.6  1.3  12:04.93 test
The output is interlaced, the same as the input. The speed is limited per resolution.
As I've mentioned earlier, I'm using ffmpeg for the tests. Here is an example ffmpeg pipeline:
ffmpeg -loglevel info -re -i 'udp://@239.238.1.0:7000?localaddr=172.18.0.9&fifo_size=100000&timeout=10&overrun_nonfatal=1' \
  -filter_complex [0:v]setdar=ratio=16/9:max=1000,split=3[out1][out2][out3] \
  -map [out1] -vcodec h264_qsv -profile:v high -preset medium -s 640x512 -b:v 1700k -minrate 1500k -maxrate 1900k -bufsize:v 2.8M -pix_fmt nv12 -g 25 -flags +cgop+ilme \
  -map 0:a:0 -c:a:0 mp2 -b:a:0 192000 -map 0:a:0 -c:a:1 aac -b:a:1 192000 -flush_packets 0 \
  -f mpegts -mpegts_flags pat_pmt_at_frames -mpegts_flags resend_headers 'udp://239.204.1.2:7000?localaddr=10.0.8.36&pkt_size=1316&buffer_size=65536' \
  -map [out2] -vcodec h264_qsv -profile:v high -preset medium -s 480x384 -b:v 900k -minrate 800k -maxrate 1100k -bufsize:v 2.8M -pix_fmt nv12 -g 25 -flags +cgop+ilme \
  -map 0:a:0 -c:a:0 mp2 -b:a:0 192000 -map 0:a:0 -c:a:1 aac -b:a:1 192000 -flush_packets 0 \
  -f mpegts -mpegts_flags pat_pmt_at_frames -mpegts_flags resend_headers 'udp://239.204.1.3:7000?localaddr=10.0.8.36&pkt_size=1316&buffer_size=65536' \
  -map [out3] -vcodec h264_qsv -profile:v high -preset medium -s 720x576 -b:v 2560k -minrate 1024k -maxrate 3072k -bufsize:v 2.8M -pix_fmt nv12 -g 25 -flags +cgop+ilme \
  -map 0:a:0 -c:a:0 mp2 -b:a:0 192000 -map 0:a:0 -c:a:1 aac -b:a:1 192000 -flush_packets 0 \
  -f mpegts 'udp://239.204.1.1:7000?localaddr=10.0.8.36&pkt_size=1316&buffer_size=65536'
Unfortunately I'm unable to use "multi_transcode", because some errors occur:
[root@transcoder-1 x64]# ./sample_multi_transcode -i::mpeg2 -i::../content/test_stream.mpeg2 -o::h264 out.h264
Multi Transcoding Sample Version 7.0.16053497
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Return on error: error code -2, /home/lab_msdk/buildAgentDir/buildAgent_MediaSDK4/git/mdp_msdk-samples/samples/sample_multi_transcode/src/pipeline_transcode.cpp 3372
Return on error: error code -2, /home/lab_msdk/buildAgentDir/buildAgent_MediaSDK4/git/mdp_msdk-samples/samples/sample_multi_transcode/src/sample_multi_transcode.cpp 277
What do you mean by "SW memory"?
Hello,
>> Unfortunately I'm unable to use "multi_transcode", because some errors occur
The correct command line would be:
./sample_multi_transcode -i::mpeg2 ../content/test_stream.mpeg2 -o::h264 out.h264 -hw
If you want to replicate your ffmpeg experiment, you need to use a par file instead of command-line arguments. Something like this:
$ ./sample_multi_transcode -par ffmpeg-1n.par
$ cat ffmpeg-1n.par
-i::mpeg2 ../content/test_stream.mpeg2 -o::sink -hw -async 1
-i::source -o::h264 out_640x512.264 -w 640 -h 512 -hw -async 1
-i::source -o::h264 out_480x384.264 -w 480 -h 384 -hw -async 1
-i::source -o::h264 out_720x576.264 -w 720 -h 576 -hw -async 1
You will also probably want to align other parameters like the GOP structure, bitrates, etc. Please refer to the sample_multi_transcode help (the -? option) and the sample manual for the list of supported encoding options (an example session line is sketched below). Let me know if you encounter any problems.
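For instance, assuming the -gop_size (GOP length in frames) and -b (bitrate in kbps) options, one session line roughly aligned to your 640x512 ffmpeg output could look like:
-i::source -o::h264 out_640x512.264 -w 640 -h 512 -gop_size 25 -b 1700 -hw -async 1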
>> As I've mentioned earlier, I'm using ffmpeg for the tests. Here is the example ffmpeg pipeline:
Unfortunately, I don't have experience with the ffmpeg support myself, so I can't say right away whether the behavior you observe is related to some mediasdk/ffmpeg integration specifics or not. I will try to find someone here at Intel who has worked on the ffmpeg integration. In the meantime, it would be helpful if you could check and confirm whether you see the problem with the mediasdk sample application - that would greatly narrow down the root cause.
Dmitry.
Hello,
Thank you for the help :). So:
SDK 2017 result:
RENDER usage: 14.00, VIDEO usage: 2.00, VIDEO_E usage: 0.00 VIDEO2 usage: 2.00 GT Freq: 1150.00
RENDER usage: 12.00, VIDEO usage: 1.00, VIDEO_E usage: 0.00 VIDEO2 usage: 1.00 GT Freq: 1150.00
RENDER usage: 14.00, VIDEO usage: 1.00, VIDEO_E usage: 0.00 VIDEO2 usage: 2.00 GT Freq: 300.00
RENDER usage: 13.00, VIDEO usage: 2.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1150.00
RENDER usage: 15.00, VIDEO usage: 1.00, VIDEO_E usage: 0.00 VIDEO2 usage: 2.00 GT Freq: 1100.00
RENDER usage: 13.00, VIDEO usage: 2.00, VIDEO_E usage: 0.00 VIDEO2 usage: 1.00 GT Freq: 1150.00
RENDER usage: 15.00, VIDEO usage: 0.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1150.00
RENDER usage: 13.00, VIDEO usage: 1.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1100.00
RENDER usage: 14.00, VIDEO usage: 2.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1150.00
RENDER usage: 15.00, VIDEO usage: 1.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1150.00
SDK 2016 result:
RENDER usage: 12.00, VIDEO usage: 12.00, VIDEO_E usage: 0.00 VIDEO2 usage: 5.00
RENDER usage: 12.00, VIDEO usage: 10.00, VIDEO_E usage: 0.00 VIDEO2 usage: 8.00
RENDER usage: 14.00, VIDEO usage: 12.00, VIDEO_E usage: 0.00 VIDEO2 usage: 7.00
RENDER usage: 15.00, VIDEO usage: 10.00, VIDEO_E usage: 0.00 VIDEO2 usage: 9.00
RENDER usage: 15.00, VIDEO usage: 14.00, VIDEO_E usage: 0.00 VIDEO2 usage: 9.00
RENDER usage: 13.00, VIDEO usage: 13.00, VIDEO_E usage: 0.00 VIDEO2 usage: 11.00
RENDER usage: 14.00, VIDEO usage: 12.00, VIDEO_E usage: 0.00 VIDEO2 usage: 7.00
RENDER usage: 15.00, VIDEO usage: 11.00, VIDEO_E usage: 0.00 VIDEO2 usage: 10.00
RENDER usage: 13.00, VIDEO usage: 10.00, VIDEO_E usage: 0.00 VIDEO2 usage: 9.00
RENDER usage: 12.00, VIDEO usage: 10.00, VIDEO_E usage: 0.00 VIDEO2 usage: 10.00
The par file used is:
-i::mpeg2 /root/tests/raw.mpeg2 -o::sink -fps 25 -hw -async 1
-i::source -o::h264 out_640x512.264 -w 640 -h 512 -gop_size 25 -b 1700 -hw -async 1
-i::source -o::h264 out_480x384.264 -w 480 -h 384 -gop_size 25 -b 900 -hw -async 1
-i::source -o::h264 out_720x576.264 -w 720 -h 576 -gop_size 25 -b 2500 -hw -async 1
With the ffmpeg pipeline from my previous post (over the same test input file), SDK 2017 (only one instance):
RENDER usage: 12.00, VIDEO usage: 0.00, VIDEO_E usage: 0.00 VIDEO2 usage: 3.00 GT Freq: 1150.00
RENDER usage: 11.00, VIDEO usage: 1.00, VIDEO_E usage: 0.00 VIDEO2 usage: 2.00 GT Freq: 1100.00
RENDER usage: 10.00, VIDEO usage: 1.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1150.00
RENDER usage: 9.00, VIDEO usage: 1.00, VIDEO_E usage: 0.00 VIDEO2 usage: 3.00 GT Freq: 1150.00
RENDER usage: 10.00, VIDEO usage: 0.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1100.00
RENDER usage: 10.00, VIDEO usage: 0.00, VIDEO_E usage: 0.00 VIDEO2 usage: 1.00 GT Freq: 1100.00
RENDER usage: 13.00, VIDEO usage: 2.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1150.00
RENDER usage: 8.00, VIDEO usage: 0.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1150.00
RENDER usage: 9.00, VIDEO usage: 0.00, VIDEO_E usage: 0.00 VIDEO2 usage: 1.00 GT Freq: 1150.00
RENDER usage: 11.00, VIDEO usage: 0.00, VIDEO_E usage: 0.00 VIDEO2 usage: 0.00 GT Freq: 1150.00
Hm....
What is the relation between "GT Freq" and RENDER? I've noticed that a higher frequency value sometimes means a lower RENDER usage % per instance.
Hello,
I found something. Every time the 2017 load is significantly higher than the 2016 load, some errors occur at startup. Here is an example core dump backtrace:
(gdb) where full
#0  0x00007fdb0a6bf976 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#1  0x00007fdb0a6c7b3c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#2  0x00007fdb0a6c31b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#3  0x00007fdb0a6c71ab in _dl_open () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#4  0x00007fdb0a4b102b in dlopen_doit () from /lib64/libdl.so.2
No symbol table info available.
#5  0x00007fdb0a6c31b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#6  0x00007fdb0a4b162d in _dlerror_run () from /lib64/libdl.so.2
No symbol table info available.
#7  0x00007fdb0a4b10c1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
No symbol table info available.
#8  0x00007fdac14f84b0 in ?? () from /opt/intel/mediasdk/lib64/iHD_drv_video.so
No symbol table info available.
#9  0x00007fdac14cf8ce in ?? () from /opt/intel/mediasdk/lib64/iHD_drv_video.so
No symbol table info available.
#10 0x00007fdb03d8a168 in va_openDriver () from /lib64/libva.so.1
No symbol table info available.
#11 0x00007fdb03d8b048 in vaInitialize () from /lib64/libva.so.1
Or more frequently just a console message:
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva error: dlopen of /opt/intel/mediasdk/lib64/iHD_drv_video.so failed: /opt/intel/mediasdk/lib64/iHD_drv_video.so: undefined symbol: clock_getres, version GLIBC_2.2.5
libva info: va_openDriver() returns -1
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva error: dlopen of /opt/intel/mediasdk/lib64/iHD_drv_video.so failed: /opt/intel/mediasdk/lib64/iHD_drv_video.so: undefined symbol: clock_getres, version GLIBC_2.2.5
libva info: va_openDriver() returns -1
Maybe this is the root cause of my problem.
When no SIGSEGV is caught and no dlopen error occurs, the load is almost the same for SD channels with both SDK versions.
Hello,
Let me try to answer some of your questions.
>> What is the relation between "GT Freq" and RENDER?
GT Freq stands for GPU frequency. RENDER, VIDEO, VIDEO2, and VIDEO_E are parts of the GPU called GPU engines, each dedicated to specific functionality (they can also be referred to as GPGPU, VDBOX-1, VDBOX-2, and VEBOX). The fact that the GPU frequency fluctuates may mean either that you did not pin it to a specific value (recommended during benchmarking for stable results) or that you hit throttling. Mind that the CPU and GPU may not both be able to run in their turbo frequency ranges at the same time; this is highly dependent on the particular tasks you send to the GPU. Thus, you need to consider your workload and define the best strategy to negotiate between CPU and GPU frequencies. One strategy to consider for GPU-bound server workloads is to disable CPU Turbo Boost in the BIOS.
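For a quick experiment, the GPU frequency can usually be pinned through the i915 sysfs interface, and CPU turbo can be disabled through intel_pstate; for example (run as root; the card0 path is an assumption and may differ on your system):
echo 1150 > /sys/class/drm/card0/gt_max_freq_mhz
echo 1150 > /sys/class/drm/card0/gt_min_freq_mhz
cat /sys/class/drm/card0/gt_cur_freq_mhz
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo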
>> When no SIGSEGV is caught and no dlopen error occurs, the load is almost the same for SD channels with both SDK versions.
The dlopen error "undefined symbol: clock_getres, version GLIBC_2.2.5" sounds like an environment or installation problem. With such an error the driver does not load, and I can hardly imagine how any GPU load would be possible. Maybe you have several HW components in the pipeline and the error happens for some of them but not for others. It is also possible that on this error ffmpeg simply falls back to software mode, only partly utilizing the GPU. Could you please pay attention to how you run your workloads: 1) under which user you run, and 2) which environment variables are set up when the workload fails and when it succeeds. My guess would be something like: the run is good under root, but it fails under a non-privileged user.
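A quick way to compare a failing and a working environment (run as the same user that launches the workload) might be something like:
ldd -r /opt/intel/mediasdk/lib64/iHD_drv_video.so | grep -i undefined
env | grep -E 'LIBVA|LD_LIBRARY_PATH'
vainfo
The first command checks whether the driver library resolves all of its symbols in that environment, and vainfo confirms whether the iHD driver loads at all.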
>> What do you mean by "SW memory"?
I meant system memory. That is where you may get inefficiencies in the pipeline: if you have two components, one working on the CPU and another on the GPU, they need to exchange frames, and there will be a copy operation between system memory and video memory. That is expensive. In light of your undefined-symbol error, that could be the reason: for example, if in MSS 2016 both components worked on the GPU, but in MSS 2017 the decoder failed to initialize and fell back to the CPU, we would still see GPU load, but we would also get this inefficiency.
Dmitry.