I have just upgraded my rather old code from IPP 5.3 to 7.1.1. Turned out to be a huge job due to the API changes, but that's OK, it happens. And my code ended up being much smaller and cleaner as many features I had to try and emulate are now in the sample code.
My problem now is that I am getting very high CPU usage and very slow frame rates, the two being, of course, closely related. First, I am using the "max slice size" option, as I am trying to send RFC 3984 compliant packetisation mode zero RTP packets. Second, I am encoding a y4m file to minimise any possible interactions with cameras etc. Finally, I am using constant bit rate set to 2Mbps. My test code (effectively) takes a YUV420P frame from the file, feeds it through the codec, then splits the output into separate RTP packets/NALUs by searching for the start codes, and finishes by throwing away the result. The build is using VS2012, 32 bit.
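For anyone following along, the start-code search described above can be sketched roughly as follows. This is not the poster's code or anything from the IPP samples: `NalSpan`, `splitAnnexB` and `fitsInOnePacket` are hypothetical names, and the 1460-byte payload budget assumes a 1500-byte Ethernet MTU with IPv4, UDP and RTP headers and no extensions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One NAL unit inside the encoder output: offset of its first byte
// (just past the start code) and its length in bytes.
struct NalSpan {
    std::size_t offset;
    std::size_t size;
};

// Assumed payload budget: 1500-byte MTU minus IPv4 (20), UDP (8), RTP (12).
constexpr std::size_t kMaxRtpPayload = 1500 - 20 - 8 - 12;

// Scan an Annex B byte stream for 00 00 01 start codes and return the NAL
// units between them. Zero bytes directly before a start code (the 4-byte
// start code form, or trailing padding) are not counted as NAL payload.
std::vector<NalSpan> splitAnnexB(const std::uint8_t* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::vector<NalSpan> nals;
    bool open = false;       // true once the first start code has been seen
    std::size_t start = 0;   // first byte of the NAL unit being collected
    auto close = [&](std::size_t end) {
        while (end > start && data[end - 1] == 0) --end;  // strip zero padding
        if (end > start) nals.push_back({start, end - start});
    };
    std::size_t i = 0;
    while (i + 3 <= len) {
        if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1) {
            if (open) close(i);
            start = i + 3;
            open = true;
            i += 3;
        } else {
            ++i;
        }
    }
    if (open) close(len);
    return nals;
}

// In packetisation mode zero each NAL unit travels whole in one RTP packet,
// so it has to fit the payload budget on its own.
bool fitsInOnePacket(const NalSpan& nal, std::size_t maxPayload = kMaxRtpPayload) {
    return nal.size <= maxPayload;
}
```

Calling `splitAnnexB` on each encoded frame and checking `fitsInOnePacket` on every span is a cheap way to confirm that the "max slice size" setting is actually being honoured.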
A CIF sized image maxes out the single thread and yields 10fps. HD720 is about 4fps and HD1080 is about 2fps.
That is seriously non-linear for a start. The HD stuff, I can sort of understand eating CPU for breakfast, but really, CIF should be able to do 30fps, with time to spare. Even in one thread.
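To put a number on "non-linear", here is a small sketch that converts the three measurements above into pixels per second. The frame sizes (352x288, 1280x720, 1920x1080) are the standard ones for those formats; `Measurement`, `pixelsPerSecond` and `linearFps` are names made up for this illustration.

```cpp
#include <cstddef>

// One benchmark result: frame size and the frame rate achieved.
struct Measurement {
    const char* name;
    int width;
    int height;
    double fps;
};

// Raw pixel throughput of a run.
inline double pixelsPerSecond(const Measurement& m) {
    return static_cast<double>(m.width) * m.height * m.fps;
}

// Frame rate the target size would reach if encode cost scaled purely with
// pixel count, extrapolated from the reference run.
inline double linearFps(const Measurement& ref, const Measurement& target) {
    const double refPixels = static_cast<double>(ref.width) * ref.height;
    const double targetPixels = static_cast<double>(target.width) * target.height;
    return ref.fps * refPixels / targetPixels;
}

// The figures quoted in this post.
const Measurement kCif{"CIF", 352, 288, 10.0};
const Measurement kHd720{"HD720", 1280, 720, 4.0};
const Measurement kHd1080{"HD1080", 1920, 1080, 2.0};
```

With these figures CIF manages about 1.0 Mpixel/s while HD720 and HD1080 manage about 3.7 and 4.1 Mpixel/s. Pure pixel scaling from the CIF result would predict about 1.1fps at HD720, not 4fps, so the small frames are the ones being disproportionately slow. That pattern would be consistent with some fixed per-frame or per-slice overhead, though the numbers alone do not prove it.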
I have played with the num_slices and m_iThreads parameters, as well as resolutions and CBR bit rates, and nothing seems to make much difference.
Can anyone think of something I am doing wrong?
Oh yeah, this is on a relatively old i7, but I got 10 times this performance with my old code, and IPP 5.3, two years ago.
I had a very similar problem while migrating from IPP 6.1 to IPP 7.1. Performance dropped by a factor of 4.
The solution was to manually set the number of threads inside IPP to 1:
ippSetNumThreads(1)
After this, performance became normal.
Hi Robert,
"max slice size" mode doesn't support threading, so I recommend completely disabling OpenMP in the project properties. There are some performance problems with OpenMP when the number of threads is limited.
Roman T. wrote:
The solution was to manually set the number of threads inside IPP to 1:
ippSetNumThreads(1)
After this, performance became normal.
Unfortunately, this made no difference, thanks anyway.
Pavel V.Vlasov (Intel) wrote:
"max slice size" mode doesn't support threading, so I recommend completely disabling OpenMP in the project properties. There are some performance problems with OpenMP when the number of threads is limited.
Thank you for the suggestion, that certainly helps. CIF now gives me 30fps and uses 50% of one core. However, HD720 is still only 7.5fps and HD1080 is only up to 3fps. At least the ratios look a bit more linear. :-)
This, along with observations from Task Manager, implies that we are now doing the full encoding inside one thread. Well, to get HD you will need to use multiple threads; the CPU is just not fast enough in raw clock. There must be a way to do both. Someone out there must be doing HD encoding to RTP, surely!
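As a rough back-of-envelope for how much threading HD would need, the sketch below turns the single-thread figures above into a core count. It assumes a 30fps target for HD (the post only states that target for CIF), that the HD runs saturate one core, and that encoding would scale perfectly across cores, which is optimistic. The function names are hypothetical.

```cpp
#include <cmath>

// Frame rate one fully used core would deliver, extrapolated from a run that
// only used part of a core (for example 30fps at 50% of one core).
inline double fpsAtFullCore(double measuredFps, double coreFraction) {
    return measuredFps / coreFraction;
}

// Whole cores needed to reach targetFps, given the frame rate measured while
// the encoder saturates one core. Assumes perfect linear scaling.
inline int coresNeeded(double fpsOnOneCore, double targetFps) {
    return static_cast<int>(std::ceil(targetFps / fpsOnOneCore));
}
```

On those assumptions HD720 at 7.5fps would need 4 cores and HD1080 at 3fps would need 10, while CIF extrapolates to about 60fps on one full core. So even ideal threading would leave HD1080 out of reach of a quad-core part at these settings.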
A further question: how does one turn off OpenMP when using the build.pl script? To test the above, I just did it manually in the VS2012 IDE.
Sergey Kostrov wrote:
>>...Finally, I am using constant bit rate set to 2Mbps. My test code...
>>
>>...The build is using VS2012, 32 bit.
I understand that you're using the Microsoft C++ compiler. Could you post the command line options for review?
I use what is generated by the IPP samples supplied script: perl build.pl --cmake=audio-video-codecs,ia32,vc2012,d,mt,release
/GS /TP /analyze- /W3 /Zc:wchar_t /I"C:/Program Files (x86)/Intel/Composer XE 2013/ipp/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/codec/video/h264/enc/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/codec/video/common/cc/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/io/umc/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/core/umc/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/core/vm/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/core/vm_plus/include" /Zi /Gm- /Od /Fd"C:/Work/IPP-Codecs/ipp-samples/__cmake/audio-video-codecs.ia32.vc2012.d.mt/__lib/debug/h264_enc.pdb" /fp:fast /D "WIN32" /D "_WINDOWS" /D "_DEBUG" /D "IA32" /D "WINDOWS" /D "_SBCS" /D "_WIN32" /D "_WIN32_WINNT=0x501" /D "CMAKE_INTDIR=\"debug\"" /errorReport:prompt /WX- /Zc:forScope /GR /Gd /Oy- /MDd /openmp- /Fa"debug" /EHsc /Fo"h264_enc.dir\debug\" /Fp"h264_enc.dir\debug\h264_enc.pch"
This, along with observations from Task Manager, implies that we are now doing the full encoding inside one thread. Well, to get HD you will need to use multiple threads; the CPU is just not fast enough in raw clock. There must be a way to do both. Someone out there must be doing HD encoding to RTP, surely!
Our encoder's threading implementation relies mainly on slicing, and the slice size limiter implementation conflicts with it.
A further question: how does one turn off OpenMP when using the build.pl script? To test the above, I just did it manually in the VS2012 IDE.
perl build.pl --cmake=audio-video-codecs,ia32,vc2012,d,st,release - in the cmake script, OpenMP is tied to the st/mt key for threaded libs
Pavel V.Vlasov (Intel) wrote:
This, along with observations from Task Manager, implies that we are now doing the full encoding inside one thread. Well, to get HD you will need to use multiple threads; the CPU is just not fast enough in raw clock. There must be a way to do both. Someone out there must be doing HD encoding to RTP, surely!
Our encoder's threading implementation relies mainly on slicing, and the slice size limiter implementation conflicts with it.
So, what you appear to be telling me is, it is impossible to do RFC 3984 compliant packetisation, except for low resolution. That is extremely disappointing!
While the IPP sample code may not allow it, it should be possible to mix the two modes. I cannot think of any reason why you could not divide the video frame into mega-slices of X scan lines each, with each one producing a sequence of "real" NAL slices up to the max size limit. The mega-slices could be encoded in parallel. Sure, you might sometimes get a tiny slice of one macroblock at the end of a mega-slice, but that is a small price to pay for it to be possible.
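A minimal sketch of that mega-slice idea, to make the proposal concrete. It is not how the IPP encoder works and does not call any IPP API: `Band`, `Slice`, `splitIntoBands` and `packBand` are hypothetical names, and the per-macroblock coded sizes are supplied as plain numbers, whereas a real encoder only knows them after coding each macroblock and would also have to account for slice headers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A band ("mega-slice") of consecutive macroblock rows.
struct Band {
    int firstRow;
    int rowCount;
};

// One size-limited slice: a run of macroblocks in raster order.
struct Slice {
    int firstMb;
    int mbCount;
    std::size_t bytes;
};

// Split mbRows macroblock rows into contiguous bands of near-equal height.
std::vector<Band> splitIntoBands(int mbRows, int bands) {
    bands = std::max(1, std::min(bands, mbRows));
    const int base = mbRows / bands;
    const int extra = mbRows % bands;  // the first `extra` bands get one more row
    std::vector<Band> out;
    int row = 0;
    for (int b = 0; b < bands; ++b) {
        const int n = base + (b < extra ? 1 : 0);
        out.push_back({row, n});
        row += n;
    }
    return out;
}

// Pack the macroblocks of one band into slices of at most maxBytes each.
// mbBytes[i] is the assumed coded size of macroblock i in raster order.
// A slice never crosses a band boundary, so each band can be packed
// independently of the others (for example on its own thread).
// A single macroblock larger than maxBytes still gets its own oversized slice.
std::vector<Slice> packBand(const Band& band, int mbPerRow,
                            const std::vector<std::size_t>& mbBytes,
                            std::size_t maxBytes) {
    std::vector<Slice> slices;
    const int first = band.firstRow * mbPerRow;
    const int last = first + band.rowCount * mbPerRow;
    Slice cur{first, 0, 0};
    for (int mb = first; mb < last; ++mb) {
        const std::size_t sz = mbBytes[static_cast<std::size_t>(mb)];
        if (cur.mbCount > 0 && cur.bytes + sz > maxBytes) {
            slices.push_back(cur);      // current slice is full, start a new one
            cur = Slice{mb, 0, 0};
        }
        ++cur.mbCount;
        cur.bytes += sz;
    }
    if (cur.mbCount > 0) slices.push_back(cur);  // possibly small tail slice
    return slices;
}
```

For a CIF frame (22x18 macroblocks) split into 4 bands, with every macroblock assumed to cost 100 bytes and a 1400-byte limit, the first band packs into 7 full slices plus one short tail slice, which is the "tiny slice at the end" cost mentioned above.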
Pavel V.Vlasov (Intel) wrote:
A further question: how does one turn off OpenMP when using the build.pl script? To test the above, I just did it manually in the VS2012 IDE.
perl build.pl --cmake=audio-video-codecs,ia32,vc2012,d,st,release - in the cmake script, OpenMP is tied to the st/mt key for threaded libs
I saw that, but assumed it would then link with single threaded run time, as in completely single threaded. No mutexes, nothing. I don't want that. If you are saying that only disables OpenMP, then very good.
Apologies, I should have made it clear. Of course I have tried the release, fully optimised version. It makes some difference, maybe 10-20%, but I need a factor of 10!
Turning on OpenMP made it slower, especially on smaller images.
Here is the optimised command line, stripped of the /I arguments to make it smaller:
/GS /TP /analyze- /W3 /Zc:wchar_t /Gm- /O2 /fp:fast /D "WIN32" /D "_WINDOWS" /D "IA32" /D "WINDOWS" /D "_SBCS" /D "_WIN32" /D "_WIN32_WINNT=0x501" /D "CMAKE_INTDIR=\"release\"" /errorReport:prompt /WX- /Zc:forScope /GR /Gd /Oy- /MD /openmp- /Fa"release" /EHsc
So, what you appear to be telling me is, it is impossible to do RFC 3984 compliant packetisation, except for low resolution. That is extremely disappointing!
I can only recommend tuning the encoding parameters through the par file, such as iQuality, iRefFramesNum, bEntropyMode, iSearchX/Y and others. Decreasing the quality and the number of reference frames can give a significant speed-up.
I tried all those compiler flags and a few more besides. No significant effect. The only compiler flag that made a difference was turning OFF OpenMP, which gave a five-fold speed-up. Now that is totally non-intuitive! Actually, it smells like a bug ...
Pavel V.Vlasov (Intel) wrote:
I can only recommend tuning the encoding parameters through the par file, such as iQuality, iRefFramesNum, bEntropyMode, iSearchX/Y and others. Decreasing the quality and the number of reference frames can give a significant speed-up.
So, m_iQuality is not used by anything other than mpeg2. Perhaps you mean m_QualitySpeed? It was zero which is the fastest. I tried several of those fields and some made no difference, some made small differences. But nothing like the order of magnitude needed.
Now here is a simple comparison: using x264, I get VGA at 30fps and use 7%-9% of total CPU on my 2.6GHz i7. HD720 starts to hit limits, getting 26fps at around 30% of the CPU. On the same hardware, for VGA, IPP can't even get 5fps and uses 12% of the CPU, basically maxing out one core. This cannot be right.
You appear to be telling me I have to debug the Intel proprietary code?
Right.
Wading through thousands of lines of unfamiliar, and I have to say, not very "pretty" code is not going to happen.
I will tell my boss that IPP is a bust and we should negotiate a license for x264.