
Bug? Multi-stream encode in the tutorials encode sample.

minliang_o_
Beginner
374 Views
    Hi, I want to use the Media SDK to encode multiple streams in hardware (HW) mode, but I may have found a bug in this usage.
    I modified the code to create 18 encode blocks, each with its own file, session, and system-memory surfaces.
    In my test I use a single thread: each block encodes one frame and, once its surface is no longer locked, the next block encodes its frame.
    The number of blocks is set by the line "int nDest=15"; change it to any value from 1 to 18 to see the result.
    My results are strange:
    nDest    FPS               GPU load (Process Explorer)
    1-10     500-700           20%
    11       500+ or 300-      4-20%
    >=12     100-300           4%
    As shown above, the result is unstable: with nDest<10 throughput is above 500 FPS, but with nDest>12 it drops to 100-300 FPS and the GPU load drops as well.
    Every block's output file is written successfully and all the files are identical, so the encoding itself works.
    In software (SW) mode the result is similar: throughput falls from 200 to 70 FPS.
    My goal is to encode multiple streams in HW mode, but the test result is not satisfactory and I cannot use it in my project. Is this a bug?
    
    PS:
    Media SDK version: I tried 2014, 2015 R6, and 2015 R7. I now use R7, which is better than the earlier versions (nDest=12 sometimes reaches 700 FPS), but it is still unstable.
    PC: CPU i7-4790, 8 GB memory, Windows 7.
    My code and test file are here:
    http://yun.baidu.com/s/1sjmUwaD   simple_3_encode_Test.rar
    http://pan.baidu.com/s/1ntiYJot  Screenshot.
     
    How to use:
    1. Extract simple_3_encode_Test.rar and put it in the mediasdk-tutorials-0.0.3 folder.
    2. Extract yuv420.yuv into simple_3_encode_Test.
    3. Copy yuv420.yuv so that there are 18 files in total: yuv420.yuv, yuv4201.yuv, yuv4202.yuv, ..., yuv42017.yuv.
    4. Open simple_encode.sln and run it; remember to switch the platform to x64.
    5. Change nDest to see the difference.
     
    I hope you can try my code and see the result. Thank you.
    Thanks for every reply.
5 Replies
Bjoern_B_Intel
Employee

Hi,

Thanks for providing the source code here and the instructions on how to use.

If I understand correctly, you are trying to encode multiple video streams simultaneously, in a kind of “time slicing” / round-robin fashion, and you observe different behavior depending on the number of encodes you are running.

The answer here is quite complex and depends on how the system is set up. Only an intensive performance tuning session on that system would give an answer. A trace file is needed to fully understand what's happening. You might want to look at the following two tools, which will give you more insight:

  1. Intel® VTune Amplifier

  2. Intel® Extreme Tuning Utility

Not having access to the same platform you are using, I reran your code on a Core i7-4650U Windows 8.1 system and got the following performance figures:

MultiEncodePerf.jpg

Given the way your software is architected, I don’t see anomalies here. The tutorial sample you are using is meant for getting started with Intel® Media SDK. For best performance, look at the code samples here, which we also use for performance measurements.

You are measuring performance including disk I/O, which might affect what you see. Remove the writing of the elementary stream to disk and try to use just one RAW media stream held in memory. This way you limit the influence of disk I/O. As a general guideline and starting point, ensure that you are using GPU HW acceleration and that the GPU is always running at maximum frequency. Joining sessions, use async acceleration structure with opaque memory will improve performance demonstrated in the additional encoding tutorial samples. Keep in mind that the CPU and the GPU share the same TDP and that Turbo Boost effects influence the performance behavior.
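
To illustrate the "one RAW stream in memory" suggestion, here is a minimal C++ sketch. It is not taken from the tutorial code: `InMemoryClip` and `Yuv420FrameSize` are hypothetical names, and the clip is assumed to be planar YUV 4:2:0 with even dimensions. The idea is to read the file once before timing starts and then hand the same frames to every encoder in a loop, so the timed section contains no file reads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Bytes in one planar YUV 4:2:0 frame: a full-size Y plane plus
// quarter-size U and V planes. Assumes even width and height.
inline std::size_t Yuv420FrameSize(std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height * 3 / 2;
}

// Hypothetical helper: keeps a whole raw clip in memory and hands out
// its frames in an endless loop.
class InMemoryClip {
public:
    InMemoryClip(std::vector<std::uint8_t> data, std::size_t frameSize)
        : data_(std::move(data)),
          frameSize_(frameSize),
          frameCount_(frameSize ? data_.size() / frameSize : 0) {}

    // Reads the whole file once; call this before the timed section.
    static InMemoryClip FromFile(const std::string& path, std::size_t frameSize) {
        std::ifstream in(path, std::ios::binary);
        std::vector<std::uint8_t> bytes((std::istreambuf_iterator<char>(in)),
                                        std::istreambuf_iterator<char>());
        return InMemoryClip(std::move(bytes), frameSize);
    }

    // Number of complete frames; a trailing partial frame is ignored.
    std::size_t FrameCount() const { return frameCount_; }

    // Pointer to the next frame, wrapping around at the end of the clip.
    // Returns nullptr if the clip holds no complete frame.
    const std::uint8_t* NextFrame() {
        if (frameCount_ == 0) return nullptr;
        const std::uint8_t* frame = data_.data() + next_ * frameSize_;
        next_ = (next_ + 1) % frameCount_;
        return frame;
    }

private:
    std::vector<std::uint8_t> data_;
    std::size_t frameSize_;
    std::size_t frameCount_;
    std::size_t next_ = 0;
};
```

In a test like the one in this thread, a single clip loaded this way could feed all encode blocks, and the output bitstream could simply be discarded instead of written to disk.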

As you are looking for best platform throughput you want to look at the Multi-Transcoding-Sample for best performance as well.

Best Regards

Bjoern

 

minliang_o_
Beginner
Bjoern,
    I appreciate your reply. I am sorry that the abnormal behavior could not be reproduced and that it took up your time. Thanks again.
    Yes, you understood me correctly. I want to encode more than 20 streams at the same time.
    In response to your comments, here is what I have tried:
    1. I/O: I tried removing the disk I/O and nothing changed, so I left the I/O in the sample.
    2. I also downloaded the new samples, but I found that the final function they call is also EncodeFrameAsync, so I expected the result to be the same. I will spend time studying them.
 
    Following your suggestions, I will try the following:
    1. Run on Windows 8.1.
    2. Use the two tools.
    3. Find a way to ensure the GPU runs at maximum frequency.
    4. Study the new sample code and the Multi-Transcoding-Sample.
 
    Could you tell me where I can find more information on, or the source for, "use async acceleration structure with opaque memory will improve performance demonstrated in the additional encoding tutorial samples"?
 
Best Regards
minliang ou.
Bjoern_B_Intel
Employee

Hi Minliang,

Sure, please take a look at the following samples:

  1. Tutorial: simple_3_encode_vmem

  2. Tutorial: simple_3_encode_vmem_async – go for async of 2 for best performance

  3. Samples: sample_multi_transcode

Those samples have additional optimizations integrated which you don’t want to miss integrating.
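
As a rough illustration of what "async of 2" means, here is a minimal C++ sketch of the control flow only. It is not the code of simple_3_encode_vmem_async, and it does not call the Media SDK: `RunAsyncPipeline`, `PipelineStats`, `submit`, and `wait` are hypothetical names. In a real application `submit` would stand for the EncodeFrameAsync call and `wait` for SyncOperation. The point is that the loop waits for the oldest operation only once `asyncDepth` operations are already in flight, rather than syncing after every frame.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>

// Counters describing one pipeline run.
struct PipelineStats {
    std::size_t submitted = 0;    // frames handed to the encoder
    std::size_t completed = 0;    // frames whose output was waited for
    std::size_t maxInFlight = 0;  // peak number of outstanding operations
};

// Hypothetical sketch: `submit` queues frame i and returns a handle,
// `wait` blocks until the operation behind that handle has finished.
PipelineStats RunAsyncPipeline(std::size_t frameCount, std::size_t asyncDepth,
                               const std::function<int(std::size_t)>& submit,
                               const std::function<void(int)>& wait) {
    assert(asyncDepth >= 1);
    PipelineStats stats;
    std::deque<int> inFlight;  // handles in submission order

    auto retireOldest = [&]() {
        wait(inFlight.front());
        inFlight.pop_front();
        ++stats.completed;
    };

    for (std::size_t i = 0; i < frameCount; ++i) {
        // Only block when the pipeline is full.
        if (inFlight.size() == asyncDepth) retireOldest();
        inFlight.push_back(submit(i));
        ++stats.submitted;
        stats.maxInFlight = std::max(stats.maxInFlight, inFlight.size());
    }

    // Drain whatever is still outstanding at the end of the stream.
    while (!inFlight.empty()) retireOldest();
    return stats;
}
```

With `asyncDepth` set to 1 this degenerates to the submit-then-wait pattern; with 2, the next frame is submitted while the previous one may still be encoding.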

One more thing I observed: you allocate resources for all 18 streams at the beginning, even when you run a lower number of encodes. Make sure you only acquire the resources you actually use. Do not oversubscribe the HW resources: the driver might handle it, but you will lose performance.

Once you reached maximum throughput, bundle streams up in junks of work. In the above graph, your code works best running 3 encodes in parallel (adding optimizations above, you can do more!). So only add an additional stream if one encode completed. This way you are always running at maximum throughput.
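
The "only add a stream when one has completed" idea can be sketched as a small scheduler. This is a simplified simulation under stated assumptions, not Media SDK code: `StreamJob`, `ScheduleResult`, and `RunBounded` are hypothetical names, and decrementing `framesLeft` is a placeholder for encoding one frame of that stream. `maxParallel` would be the stream count at which your measurements show maximum throughput.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// One stream waiting to be encoded (hypothetical type).
struct StreamJob {
    int id;
    std::size_t framesLeft;
};

struct ScheduleResult {
    std::vector<int> completionOrder;  // stream ids in the order they finished
    std::size_t peakActive = 0;        // most streams ever active at once
};

// Runs all pending streams while never keeping more than `maxParallel`
// of them active. A new stream is admitted only when a slot is free.
ScheduleResult RunBounded(std::deque<StreamJob> pending, std::size_t maxParallel) {
    assert(maxParallel >= 1);
    ScheduleResult result;
    std::vector<StreamJob> active;

    while (!pending.empty() || !active.empty()) {
        // Fill free slots from the pending queue.
        while (active.size() < maxParallel && !pending.empty()) {
            active.push_back(pending.front());
            pending.pop_front();
        }
        result.peakActive = std::max(result.peakActive, active.size());

        // One round-robin pass: each active stream encodes one frame.
        for (StreamJob& job : active) {
            if (job.framesLeft > 0) --job.framesLeft;  // placeholder for the encode
        }

        // Retire finished streams, which frees their slots for the next pass.
        for (auto it = active.begin(); it != active.end();) {
            if (it->framesLeft == 0) {
                result.completionOrder.push_back(it->id);
                it = active.erase(it);
            } else {
                ++it;
            }
        }
    }
    return result;
}
```

The sketch only models the bookkeeping; a stream's encoder resources would be acquired when it is admitted and released when it is retired.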

Let me know how it worked out for you.

Best,

Bjoern

minliang_o_
Beginner

Bjoern,

     Thank you so much.

     1. I allocated 18 streams at the beginning because I wanted to see whether this would affect the result. I tested again with R7 and it did: after reducing the allocated resources to 12, it sometimes runs 12 streams at 500+ FPS, but sometimes drops to 200 FPS. This only reduces how often the problem occurs. I wonder whether it matters that I allocate the resources in HW mode but allocate the surfaces in system memory. Does that affect performance?

     2. I encode the streams in a single thread inside a while(){} loop. I encode one frame, wait for the result with SyncOperation, and then encode the next frame. This means only one frame is being encoded at any time: no matter how many resources I allocate, the GPU encodes only one frame at a time. Performance should therefore change very little. Is that wrong?

     3. I would like to know how many streams can be encoded at the same time, in your projects or by other means. Is there another way to encode more than 20 streams? I want to use the Media SDK for real-time monitoring, so I cannot encode the streams one by one.

     4. "Once you reached maximum throughput, bundle streams up in junks of work." Does that mean that with nDest>12 it has reached maximum throughput?

     Now I will take time to study the code you mentioned. Thanks again. I see that your post was written at 10/19/2015 - 23:14. You work so hard; take care and good night.

Best wishes

minliang ou.

 

Bjoern_B_Intel
Employee

Hi Minliang,

Let me close this thread here.

Best,

Bjoern Bruecher
