TurtleCrazy
Beginner
96 Views

[HD505] profiling


I am checking out the GPU (HD505) encoding performance of an E3950 board that runs a custom Linux image:

  • kernel 4.14 (with i915 patches)
  • Media driver (master branch, HEAD)
  • Media SDK (master branch, HEAD)

Both are built from source (open source GitHub repositories).

I am using the 'sample_encode' tool to encode 100 YUV frames (4096x2160) into HEVC format.
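
For a sense of the data volume involved in this test, here is a minimal sketch of the raw input size. It assumes 8-bit YUV 4:2:0 (1.5 bytes per pixel); the tool only reports "YUV420", so the bit depth is an assumption, and the function name is purely illustrative.

```python
# Assumption: 8-bit YUV 4:2:0, i.e. one full-size luma plane plus two
# quarter-size chroma planes (1.5 bytes per pixel overall).
WIDTH, HEIGHT, FRAMES = 4096, 2160, 100


def yuv420_frame_bytes(width: int, height: int) -> int:
    """Return the size in bytes of one 8-bit YUV 4:2:0 frame."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma


frame_bytes = yuv420_frame_bytes(WIDTH, HEIGHT)
total_bytes = frame_bytes * FRAMES
print(f"{frame_bytes} bytes per frame, {total_bytes / 1e9:.2f} GB for {FRAMES} frames")
```

Under that assumption, each frame is about 13.3 MB and the 100-frame input file is about 1.33 GB.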

Compared to another board that ships an HD530 GPU, it takes 4 times longer to encode the file (~25s vs ~6s).
Profiling the system with VTune, I saw that the 'sample_encode' tool spent most of its time waiting for the GPU.

To make sure I did not forget anything while building the media stack on the image,
do those results seem OK on your side?

I mean, I was expecting to lose some performance on the E3950 board, but this difference sounds excessive to me.

 

 

 

7 Replies
Mark_L_Intel1
Moderator
96 Views

Hi TurtleCrazy,

No, sample_encode should not be that slow. It looks like you have an installation issue.

Yes, the E3950 is Apollo Lake; its graphics core should have a hardware video codec, and it is supported through the Media SDK for Embedded Linux release. You can get the release package from the following site:

https://software.intel.com/en-us/media-sdk

But this release requires an old kernel; it should be kernel 4.1 as I remember (you can check the release notes to confirm). If you want to use the latest kernel, you can build our open source stack by following this article:

https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack

Mark Liu

TurtleCrazy
Beginner
96 Views

Hi,

 

Liu, Mark (Intel) wrote:

Hi TurtleCrazy,

No, sample_encode should not be that slow. It looks like you have an installation issue.

OK. Given that the GPU/CPU concurrency is 89% GPU, what may I have missed?

The VTune drivers are installed; let me know which entries in the profiling reports may be relevant in my case.

Yes, the E3950 is Apollo Lake; its graphics core should have a hardware video codec, and it is supported through the Media SDK for Embedded Linux release. You can get the release package from the following site:

https://software.intel.com/en-us/media-sdk

But this release requires an old kernel; it should be kernel 4.1 as I remember (you can check the release notes to confirm).

Not only is the kernel outdated, so are the Yocto BSPs. I installed and checked this release once, for a quick look, and it seems faster than the open source stack, but not by much (~16s vs 20s for the same test).

If you want to use the latest kernel, you can build our open source stack by following this article:

https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack

Mark Liu

That's what I did; see my previous message. The open source media stack seems to be running correctly.

Kernel: Linux 4.14.35 (with Intel patches for Yocto), with the following patches for i915:

From the Intel repo on GitHub

Output:

Linux 4.14.35-intel-pk-standard #1 SMP PREEMPT Thu May 31 15:13:26 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
e3950:~# vainfo 
error: can't connect to X server!
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.1 (libva 2.1.1.pre1)
vainfo: Driver version: Intel iHD driver - 2.0.0
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointFEI
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileVP9Profile0            : VAEntrypointVLD
e3950:~# lsmod
Module                  Size  Used by
intel_rapl             20480  0
pwm_lpss_pci           16384  0
x86_pkg_temp_thermal    16384  0
coretemp               16384  0
pwm_lpss               16384  1 pwm_lpss_pci
igb                   172032  0
spi_pxa2xx_platform    24576  0
i915                 1351680  1
mei_me                 28672  0
mei                    61440  1 mei_me
uio                    16384  0
e3950:~# dmesg | grep i915
[    0.000000] Kernel command line: BOOT_IMAGE=/bzImage root=PARTUUID=a1e761f4-3445-4474-b25f-213365295c23 rootwait rootfstype=ext4 console=ttyS0,115200 console=tty0 i915.modeset=1 i915.fastboot=1 nopti nokaslr nospectre_v2 spectre_v2=off
[    3.913038] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    3.929474] [drm] Finished loading DMC firmware i915/bxt_dmc_ver1_07.bin (v1.7)
[    5.059890] [drm] Initialized i915 1.6.0 20170818 for 0000:00:02.0 on minor 0
[    5.475086] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device

e3950:~# time /opt/intel/media/sdk/samples/sample_encode h265 -i datatest/v.yuv -o t.h265 -w 4096 -h 2160
plugin_loader.h :186 [INFO] Plugin was loaded from GUID: { 0x6f, 0xad, 0xc7, 0x91, 0xa0, 0xc2, 0xeb, 0x47, 0x9a, 0xb6, 0xdc, 0xd5, 0xea, 0x9d, 0xa3, 0x47 } (Intel (R) Media SDK HW plugin for HEVC ENCODE)
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Encoding Sample Version 8.3.26.

Input file format YUV420
Output video  HEVC
Source picture:
 Resolution 4096x2160
 Crop X,Y,W,H 0,0,4096,2160
Destination picture:
 Resolution 4096x2160
 Crop X,Y,W,H 0,0,4096,2160
Frame rate 30.00
Bit rate(Kbps) 60928
Gop size 0
Ref dist 0
Ref number 0
Idr Interval 0
Target usage balanced
Memory type system
Media SDK impl  hw
Media SDK version 1.26

Processing started
Frame number: 1
Frame number: 100
Frame number: 100

plugin_loader.h :212 [INFO] MFXBaseUSER_UnLoad(session=0x0x6c3b70), sts=0

Processing finished

real 0m18.367s
user 0m3.096s
sys 0m1.869s

 

Note the difference between the real time and the (user + sys) time.
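
To put a number on that gap, here is a minimal sketch that splits the `time` output into CPU time and off-CPU (waiting) time. The helper name is illustrative, and the reading is only a rough one: `time` cannot tell whether the process was waiting on the GPU or on I/O, and in a multithreaded process user + sys can exceed real.

```python
def cpu_wait_breakdown(real_s: float, user_s: float, sys_s: float):
    """Split wall-clock time into CPU time and time spent off the CPU."""
    cpu_s = user_s + sys_s
    wait_s = real_s - cpu_s
    return cpu_s, wait_s, wait_s / real_s


# Values taken from the `time` output: real 0m18.367s, user 0m3.096s, sys 0m1.869s
cpu_s, wait_s, wait_frac = cpu_wait_breakdown(18.367, 3.096, 1.869)
print(f"CPU: {cpu_s:.3f}s, waiting: {wait_s:.3f}s ({wait_frac:.0%} of wall time)")
```

With these figures the process is on the CPU for roughly 5 s and off it for roughly 13.4 s, or about 73% of the wall-clock time.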

 

TurtleCrazy
Beginner
96 Views

To complete the picture: some additional features of the i915 driver, such as HuC and GuC, are now enabled:

 

options i915 enable_fbc=1 enable_guc_loading=2 enable_guc_submission=2 enable_rc6=3

 

As a result, from /sys/kernel/debug/dri/0/:

i915_guc_load_status:
GuC firmware status:
 path: i915/bxt_guc_ver8_7.bin
 fetch: SUCCESS
 load: SUCCESS
 version wanted: 8.7
 version found: 8.7
 header: offset is 0; size = 128
 uCode: offset is 128; size = 140544
 RSA: offset is 140672; size = 256

GuC status 0x800330ed:
 Bootrom status = 0x76
 uKernel status = 0x30
 MIA Core status = 0x3
...

i915_huc_load_status:
HuC firmware status:
 path: i915/bxt_huc_ver01_07_1398.bin
 fetch: SUCCESS
 load: SUCCESS
 version wanted: 1.7
 version found: 1.7
 header: offset is 0; size = 128
 uCode: offset is 128; size = 154048
 RSA: offset is 154176; size = 256

HuC status 0x00006080:


But this doesn't make any difference.

Mark_L_Intel1
Moderator
96 Views

I did the calculation based on your log.

So you have 100 frames of 4096x2160 YUV420 in 25 seconds, which is about 4 FPS.

This should be close to what we saw in our testing.
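
The same arithmetic can be applied to the other timings quoted in this thread. This is only a sketch of the calculation, using the approximate durations reported by the original poster; the function name and labels are illustrative.

```python
def fps(frames: int, seconds: float) -> float:
    """Average throughput in frames per second."""
    return frames / seconds


FRAMES = 100
# Durations as reported in the thread (the first two are approximate).
runs = {
    "E3950 / HD505, first report (~25 s)": 25.0,
    "HD530 board (~6 s)": 6.0,
    "E3950 / HD505, posted log (18.367 s)": 18.367,
}
for label, seconds in runs.items():
    print(f"{label}: {fps(FRAMES, seconds):.1f} FPS")
```

This gives about 4.0 FPS, 16.7 FPS, and 5.4 FPS respectively.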

You can improve it by tuning some of the parameters. I suggest you try the following arguments on the command line:

sample_encode h265 -hw -i datatest/v.yuv -o t.h265 -h 2160 -w 4096 -vaapi -u speed -async 3

Let me know how much it improves.

Mark Liu 

TurtleCrazy
Beginner
96 Views

Liu, Mark (Intel) wrote:

I did the calculation based on your log.

So you have 100 frames of 4096x2160 YUV420 in 25 seconds, which is about 4 FPS.

This should be close to what we saw in our testing.

OK, thanks. That is the main idea behind the question: I want to make sure that the distribution, and especially the encoding stack I built for this board, runs correctly. I was wondering about this because of the large gap between this board and the other one.

You can improve it by tuning some of the parameters. I suggest you try the following arguments on the command line:

sample_encode h265 -hw -i datatest/v.yuv -o t.h265 -h 2160 -w 4096 -vaapi -u speed -async 3

Let me know how much it improves.

Thanks. I do not have access to the board for now; I'll check this later on.

One question: what does the "-vaapi" option stand for?

BTW, I'll now hand the board over to the encoding experts, who will be happy to tweak the process according to their needs. :)

Regards,

 

Mark_L_Intel1
Moderator
96 Views

These are the tweaks targeting the improvement:

-vaapi: specifies video memory instead of system memory during encoding, which avoids the extra copy.

-u: sets the target usage to speed (7), which might reduce quality somewhat in exchange for a speedup.

-async: the whole MSDK API follows an asynchronous model: it keeps sending video frames to the hardware without blocking, but it requires a sync operation. This argument specifies how many frames can be in the queue before the sync operation. 3 or 4 should be optimal; higher values will not improve things further.
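
The async-depth idea can be illustrated with a small, generic sketch. This is not Media SDK code: `fake_encode` is a hypothetical stand-in for the hardware call, and a thread pool merely simulates work that proceeds while the caller keeps submitting. The point is only the queueing pattern, where frames are submitted without blocking until `async_depth` are in flight, and then the oldest one is synced.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
import time


def fake_encode(frame_id: int) -> str:
    """Hypothetical stand-in for an asynchronous hardware encode call."""
    time.sleep(0.01)  # pretend the hardware is busy
    return f"bitstream-{frame_id}"


def encode_all(num_frames: int, async_depth: int) -> list:
    """Keep up to async_depth frames in flight; sync on the oldest when full."""
    output = []
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=async_depth) as pool:
        for frame_id in range(num_frames):
            if len(in_flight) == async_depth:
                # Sync point: block until the oldest submitted frame is done.
                output.append(in_flight.popleft().result())
            in_flight.append(pool.submit(fake_encode, frame_id))
        # Drain whatever is still queued once all frames are submitted.
        while in_flight:
            output.append(in_flight.popleft().result())
    return output


result = encode_all(num_frames=10, async_depth=3)
print(len(result), result[0], result[-1])
```

With `async_depth=1` every submission is followed immediately by a sync, which is the fully serialized case; larger values let submission and processing overlap, up to the point where the simulated hardware is the bottleneck.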

Mark Liu

TurtleCrazy
Beginner
96 Views

Liu, Mark (Intel) wrote:

You can improve it by tuning some of the parameters. I suggest you try the following arguments on the command line:

sample_encode h265 -hw -i datatest/v.yuv -o t.h265 -h 2160 -w 4096 -vaapi -u speed -async 3

Let me know how much it improves.

Mark Liu 

Except for the "-u" flag, which may lower quality, it does not make a significant difference to the encoding duration. Furthermore, turning on the "-vaapi" flag may increase the elapsed time; in fact, the process "user" time increased a lot in this case.