Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

IPP 7.0 Beta: SSE2 on IPP 7.0

jblogh
Beginner
2,020 Views
Hi,

You state that SSE2 optimization layers (t7/m7) and 32-bit SSE3 optimization layer (w7) have been removed, but also state that the base 32-bit optimization layer of the library (px) has been compiled for higher performance and now requires a processor that conforms to the SSE2 processor architecture.

So does that mean that CPUs that only support SSE2 are still supported with the same performance as that achieved with IPP 6.1, or to support these processors which are still widely used, we need to stick with IPP 6.1?

Thanks

Jonathan
0 Kudos
41 Replies
PaulF_IntelCorp
Employee
1,129 Views
Hello Jonathan,

In the 32-bit version of the 7.0 beta library, the SSE2 and SSE3 optimizations have been removed. In the 64-bit version of the library, the SSE3 optimization has been removed (there never was an SSE2 optimization layer). The base layer of the 32-bit library (px) has been compiled for SSE2, so SSE2 optimization in that layer is provided by the compiler. This is consistent with the previously existing situation in the 64-bit version of the library (the "mx" layer).

You may see a reduction in performance for some primitives on SSE2 processors compared to the 6.x version of the library when running 32-bit code. How much reduction, if any, you experience depends on your application and the number and frequency of the IPP functions you use. Some functions will see little or no reduction in performance.

Other than the base level optimizations ("px" and "mx"), the lowest level optimization in both the 32-bit and 64-bit versions of the library is now tuned for SSSE3, which corresponds to the Atom and Core 2 processors.

Your feedback regarding target processor platforms for 2011 and beyond is welcome.

Regards,

Paul
0 Kudos
oxydius
New Contributor I
1,129 Views
Hi Paul,

Isn't that change intentionally crippling all AMD processors? A bleeding edge Phenom II hexa-core CPU is detected as ippCpuSSE3 and you're saying this hand-tuned code path was removed in favor of a compiler-generated SSE2 code path. All 64-bit AMD processors support only SSE2 and SSE3 and there is no more hand-tuned code for them in the newest IPP.

We use IPP so our software works best on the latest Intel processors, but supporting the latest Intel offerings shouldn't so drastically reduce performance for older or competitive offerings. This leaves us with the option of making slower processors even slower, or making faster processors (SSE4, AVX) even faster. Slower processors need the speed boost the most so we're stuck with IPP 6, meaning we'll never be able to encourage the adoption of newer Intel processors through competitive performance-enhancing instruction sets.

Couldn't you merge the existing hand-tuned SSE2/SSE3 code to the px/mx baseline layer to at least let older/other processors perform at their best? Until now, fairness in IPP performance on all x86 processors is what kept us investing in IPP development. It was kind of nice knowing the library would get the most of Nehalem while also getting the most of a VIA C7 or any other 3rd-party CPU supporting only SSE3. The lowest level optimization (SSSE3) is now only supported by Intel processors. :(
0 Kudos
PaulF_IntelCorp
Employee
1,129 Views

Thanks for your feedback regarding SSE2 and SSE3 support. I will forward it to the appropriate individuals.

Can you tell me what sort of IPP applications you are creating and what target platforms you will build IPP applications for in the 2011 time frame and beyond?

Please reply with a private thread if you do not wish to share such information publicly.

Paul

0 Kudos
Thomas_Jensen1
Beginner
1,129 Views
Since my end users use a mix of processors from any source (Dell, street shops, A-brands), I strongly disagree with any decision to drop IPP support for SSE2 and SSE3, as that would partly remove the advantage of using IPP at all.

I understand that Intel must be careful to not drag legacy code into the future, but that is why PX exists. PX should support legacy, and some other (SX?) should support SSE2 and SSE3.
0 Kudos
jblogh
Beginner
1,129 Views
Hi Paul

Thank you for the clarification, but that's not what I wanted to hear!

If we purchase 7.0 can we still legally use 6.1?

We need to support clients with older CPUs, and as another poster has said, they're the ones who need the greatest performance benefit! IMO the best compromise would be to make the base level (px) the SSE2 optimised code rather than rely on the compiler's optimised version. This would appear to maintain the same level of CPU support but with maximal optimisation for all CPUs.

There are still an awful lot of P4 class CPUs out there unfortunately and we're using the IPP to extract the absolute best of them, and think that you need to support them for a little while longer!

Regards

Jonathan
0 Kudos
oxydius
New Contributor I
1,129 Views
Paul,

We are creating image processing and video coding applications with IPP, although we really use almost all libraries (ippi, ipps, ippvc, ippsc, ippj, ippdc, etc.). Our target platforms are whatever mainstream customers still use. That means an overwhelmingly large proportion of SSE3-only processors, even among Intel's own processors. While we do ship bleeding-edge Nehalem Xeon systems to some customers, a quick look in our engineering department reveals 99% support for SSE3 but only roughly a quarter with SSE4 support, despite all processors being at least dual-core. We do test and optimize our software evenly on AMD processors with lots of Opterons and Phenoms around and we would really appreciate even performance improvements rather than degradation. Regarding our 2011 targets, please note AMD will still be shipping brand new SSE3-only processors. SSE4a doesn't count.

There was already a big scandal with the Intel compilers generating optimized code paths only for Intel CPU's in the past. In fact, it was settled only 6 months ago.

2.3 TECHNICAL PRACTICES

Intel shall not include any Artificial Performance Impairment in any Intel product or require any Third Party to include an Artificial Performance Impairment in the Third Party's product. As used in this Section 2.3, "Artificial Performance Impairment" means an affirmative engineering or design action by Intel (but not a failure to act) that (i) degrades the performance or operation of a Specified AMD product, (ii) is not a consequence of an Intel Product Benefit and (iii) is made intentionally to degrade the performance or operation of a Specified AMD Product. For purposes of this Section 2.3, "Product Benefit" shall mean any benefit, advantage, or improvement in terms of performance, operation, price, cost, manufacturability, reliability, compatibility, or ability to operate or enhance the operation of another product.

http://download.intel.com/pressroom/legal/AMD_settlement_agreement.pdf

Removing SSE3 optimizations is not a failure to act but would be seen as a design action reducing 3rd-party performance, so I'm really hoping this was done to reduce library size rather than to give Intel an unfair advantage, as this would mistakenly hurt your own IPP customers. Intel already has the performance crown regardless, so please bring optimized code paths for weaker SSE2/SSE3 processors in the non-beta release.

P.S.: I don't work for AMD. I am just a 3rd-party engineer with no CPU bias whatsoever, as IPP should be.
0 Kudos
Mark_Rubelmann
Beginner
1,129 Views
Wow, sneaky! I think not providing those optimizations would fall under the category of "failure to act" since they're not actively checking to see if they're running on an AMD and turning stuff off. They're degrading performance on their own processors as well. IANAL however. If I'm interpreting this correctly, it also sounds like there will be a performance hit if you're using MS's compiler since I don't think it injects SSE instructions for anything other than doing division. I could be wrong though. Dirty, dirty, dirty.

If this goes through and we measure a performance hit on those platforms, I doubt we'll move from 6.1 to 7.0. If we do, we'll at least try to limit our usage of IPP to the bare minimum.

-Mark
0 Kudos
Vladimir_Dudnik
Employee
1,129 Views
Intel does not sell SSE3 processors anymore. I do not think there is any legal obligation for any company to support end-of-life products. Otherwise we would not be able to deliver new technologies like Westmere or AVX processors (which are coming soon).

The functionality you are looking for will still be available in the IPP 6.1 product.

By the way, performance-oriented customers are migrating to the newest platforms. I personally would not consider those who use old or even end-of-life platforms to be performance-oriented customers. If they do not care about performance, why should anyone else?

Regards,
Vladimir
0 Kudos
PaulF_IntelCorp
Employee
1,129 Views
Hello Jonathan,

Yes, you can still use your 6.1 product, even if you purchase the 7.0 product. Once you have purchased the product there is no expiration on your use of that product to build and distribute applications. The expiration of a development license impacts your ability to get access to upgrades and prior versions of the product from our download site, and your access to premier support; a development license expiration does not impact your ability to build or distribute products based on old versions of the library.

I will forward your concerns about the changes in the optimization layers to the appropriate managers.

Regards,

Paul
0 Kudos
oxydius
New Contributor I
1,129 Views
Paul, I just benchmarked IPP 6.1 vs 7.0 to quantify the impact of the potential performance loss you mentioned. I used the H.264 decoder sample as it's a pretty complete code base using multiple IPP primitives.

Intel Xeon X5560 (Nehalem, SSE4.2) : 4% faster
AMD Phenom II X6 1055T (Thuban, SSE3) : 431% slower! (mx vs m7)
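
A note on reading percentages like these: "N% slower" can be computed in two common ways, and the post does not say which one was used. The Python sketch below shows both conventions. The helper names and the timings in the comments are purely illustrative assumptions, not measurements from this thread.

```python
def percent_slower(baseline_time, new_time):
    """Extra run time relative to the baseline, in percent.

    Under this convention, "431% slower" means the new run takes
    about 5.31 times as long as the baseline.
    """
    return (new_time / baseline_time - 1.0) * 100.0


def percent_of_baseline(baseline_time, new_time):
    """New run time expressed as a percentage of the baseline.

    Under this convention, "431%" means the new run takes
    about 4.31 times as long as the baseline.
    """
    return new_time / baseline_time * 100.0


# Hypothetical timings (seconds), chosen only to illustrate the two readings:
#   percent_slower(1.0, 5.31)       is about 431
#   percent_of_baseline(1.0, 4.31)  is about 431
```

Either way, the reported figure corresponds to a slowdown of roughly 4x to 5x.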

Since many additions to IPP 7.0 were results of our feature requests, we would hate to be stuck with 6.1 but such degradation on modern, high-end competing hardware would leave us with no choice. I know you do not sell SSE3 parts anymore, but for your library to be viable in the real world, it must extract the best of the latest Intel parts without crippling the rest. Consider those SSE3 parts will still be around in 5 years and as an ISV we want our software to be competitive on them. If IPP offers such pitiful performance on anything but Intel i7 then IPP 7.0 will see no meaningful adoption outside of technophile walls and we'll all be stuck in 2009, not taking advantage of AVX.

Isn't IPP supposed to be win-win for Intel and ISV's? We used to write our own SSE code and wouldn't initially bother with AVX due to limited market share. Now with IPP we could, but we won't since that'd be shooting ourselves in the foot. Sure there's some AVX support in 6.1, but just replace AVX by whatever comes next in 7.1.

Thank you for considering our concerns.
0 Kudos
Ying_S_Intel
Employee
1,129 Views
Dear Customers,

Thanks for your input here.

We are almost certainly raising more fears than needed here - you are technically correct that the changes may reduce performance on some older generation processors, including processors sold by Intel. That's not what we expect, and that's not why we made the change. We made the change to reduce the size of IPP, which has become a concern that we needed to deal with - and we did reduce IPP's size with this release. Size was a significant challenge for us and one we decided to address. Of course, we will happily make the 6.1 version available if that is needed - but that is not a good long-term solution for any of us. We would benefit from feedback if the changes cause actual reductions in performance, and would appreciate understanding whether that happens in practice for you. We know it is theoretically possible to see reductions, but we made a choice that reducing the size and complexity of IPP in future releases was more real and important. If we've erred for any of your applications, please help us understand the details and we'll revisit our decision.

Thanks,
Ying
0 Kudos
Thomas_Jensen1
Beginner
1,129 Views
1. I cannot agree with "changes may reduce performance on some older generation processors".
A Phenom X4 is not an older generation processor. It is a current generation processor that does not have SSE4, so it can only use SSE3. By removing the hand-optimized SSE3 library, you cripple end-users with Phenom processors.

2. I fully agree that the IPP size is something to improve. 275MB for IPP 6.1.5 is a lot of bytes for a graphics library. I have personally attacked this problem in another way; by removing unused IPP functions, by compiling my own custom set of DLL files:
Ipp.dll (main library, detects CPU, loads specific DLL, also contains bunch of lib code, such as IPP core, JPEG, etc).
Ipp_gen.dll
Ipp_sse2.dll
Ipp_sse3.dll
Ipp_sse41.dll
Ipp_ssse3.dll

The size is 23.2MB, and the loaded size = 7.5MB (Main DLL + CPU DLL).
This is very manageable, both for the setup and for the loading time.
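
The layout described above (a main DLL that detects the CPU and then loads exactly one CPU-specific DLL) comes down to a small selection step. Here is a rough Python sketch of that logic only. The DLL names come from the list above; the feature-name strings, the priority order, and the `choose_cpu_dll` helper are assumptions for illustration, and a real loader would be native code using actual CPU detection.

```python
# CPU-specific DLLs from the post, ordered from most to least demanding.
# The feature-name strings are illustrative assumptions.
CPU_DLLS = [
    ("sse4.1", "Ipp_sse41.dll"),
    ("ssse3", "Ipp_ssse3.dll"),
    ("sse3", "Ipp_sse3.dll"),
    ("sse2", "Ipp_sse2.dll"),
]

# Fallback for CPUs with none of the features above.
GENERIC_DLL = "Ipp_gen.dll"


def choose_cpu_dll(features):
    """Return the name of the single CPU-specific DLL to load.

    `features` is a set of feature-name strings detected on the machine.
    The first (most demanding) match wins; otherwise the generic DLL is used.
    """
    for feature, dll_name in CPU_DLLS:
        if feature in features:
            return dll_name
    return GENERIC_DLL
```

With this scheme only the main DLL plus one CPU DLL is ever mapped into the process, which is where the difference between the total size and the loaded size comes from.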

I had to do very hard work to get this setup to work.
I ask Intel to create a new framework that implements this, of course with OMP support.
I can imagine an EXE file that scans ippi.h and displays all functions, and keeps a config file where it saves checkbox selections of all chosen functions. It would also contain a Build button that generates proper make files and then optionally calls the Intel or MS compiler to build the DLL files.

If I could do it, Intel can do it also.
0 Kudos
gol
Beginner
1,129 Views

I strongly disagree with this change as well, I was really surprised to see that the next version is supposed to require at least SSE3 (assuming that no SSE3 = generic, FPU-based layer?).
Edit: wait, I read it's worse, is it really

We do mainstream software (I would even say a little higher than mainstream, as it's music sequencing/audio processing), and I can tell you that new instructions take A LOT of time to reach everyone. It's only last year that we started requiring SSE1 for our software (and we still have a few users who can't run it). I don't think we will dare to require SSE2 for a few more years, so SSE3?

I also don't understand this layer system. SSE2 brought important new instructions, but SSE3 & 4 are kinda marginal. Shouldn't 3/4 of the functions in IPP be able to use SSE1 & 2 instructions only? I know I never needed SSE3 for my own code. So I'm pretty sure that in your SSE3 layer, most of the functions use SSE1 & 2 instruction only, but they're in the SSE3 layer because it's a "one layer for all functions" system?

Microsoft too seems to have no clue (or doesn't care) about the mainstream market, they keep introducing APIs that (on top of being Windows-only, but that's obvious) are restricted to the latest OS, not understanding that by the time they become usable, they're already obsolete.

0 Kudos
Thomas_Jensen1
Beginner
1,129 Views
If I understand correctly:

SSE2 is now minimum required CPU. Less than this, IPP 7 will not work at all.
SSE2 is placed in PX, only with compiler optimization for SSE2, not with hand-optimized code for SSE2.
SSE3 is not used in IPP 7.

If hand-optimized code is required (and of course it is required!), you need an SSE4 or AVX CPU, because the only hand-optimized libraries are V8 (SSE4) and G9 (AVX).

Intel's choice means that ONLY SSE4 and higher will achieve highest performance.

I strongly hope they at least put back the SSE3 library (T7).

I also hope they read and implement my text earlier in this topic about reducing the size of IPP.

0 Kudos
Vladimir_Dudnik
Employee
1,129 Views
That's not completely right. Let me try to explain more details on cpu-specific code available in IPP 7.0 beta:

PX library (which is 32-bit generic code) supports the SSE2 instruction set. That is applicable to the Intel Pentium 4 processor family.
MX library (which is 64-bit generic code) supports the SSE3 instruction set. The reason is that SSE3 was the minimal instruction set for processors that support the Intel64 architecture. That is applicable to the processor family code-named Prescott (brand name was Intel Pentium 4 processor with Hyper-Threading Technology).

V8 (32-bit) and U8 (64-bit) libraries support SSSE3 (note the additional 'S' letter), basically optimized for the Intel Core2 processor family (code-named Merom architecture).
P8 (32-bit) and Y8 (64-bit) libraries support the SSE4.x instruction set, introduced in the processors code-named Penryn, Nehalem and Westmere (branded mostly as Intel Core iX processors, where X might be 3, 5, 7).
G9 (32-bit) and E9 (64-bit) libraries support the AVX instruction set.

In the dynamic libraries there are also S8 (32-bit) and N8 (64-bit) variants, which contain Atom-specific code.
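
To keep the layer list above in one place, here is a minimal Python sketch of the mapping as described in this post. The table layout, the feature-name strings, and the `pick_layer` helper are illustrative assumptions and not part of the IPP API; the library does its own dispatching internally. The Atom variants (S8/N8) are left out because the post describes them as processor-specific rather than tied to an instruction set.

```python
# (instruction set, (32-bit layer, 64-bit layer)), from least to most demanding.
# None means the post lists no layer for that combination.
LAYERS = [
    ("sse2", ("px", None)),   # 32-bit generic code, compiled for SSE2
    ("sse3", (None, "mx")),   # 64-bit generic code, compiled for SSE3
    ("ssse3", ("v8", "u8")),  # Core 2 (Merom)
    ("sse4", ("p8", "y8")),   # "SSE4.x": Penryn, Nehalem, Westmere
    ("avx", ("g9", "e9")),
]


def pick_layer(features, bits):
    """Return the most demanding layer code usable with `features`.

    `features` is a set of instruction-set names; `bits` is 32 or 64.
    Returns None when no layer in the table applies.
    """
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    column = 0 if bits == 32 else 1
    chosen = None
    for isa, codes in LAYERS:
        code = codes[column]
        if code is not None and isa in features:
            chosen = code  # later entries are more demanding, so keep the last hit
    return chosen
```

For example, a 64-bit CPU reporting only SSE2 and SSE3 ends up on "mx" under this table, while a 32-bit CPU without SSE2 matches no layer at all.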

Regards,
Vladimir
0 Kudos
Mark_Rubelmann
Beginner
1,129 Views
Wow, now I'm confused again. PX and MX exist in IPP 7, right? And then the kicker is that instead of containing SSE2 and SSE3 instructions written by a human, it's done by the compiler, correct? Does IPP just crash and burn if you try to run it on a P3 or does it have a way of reverting to non-SSE code?

-Mark
0 Kudos
oxydius
New Contributor I
1,129 Views
Vladimir, the main problem with the IPP 7.0 beta is that PX and MX offer a 4x performance decrease compared to previously hand-tuned SSE2/SSE3 code paths (W7, T7, M7) at this time. The compiler auto-vectorization isn't clever enough for all primitives.

That is a substantial regression affecting over half of the CPU market, therefore over half of our own customers. There is no way we can upgrade to IPP 7 under such conditions.

The previous suggestion about library size reduction is a good one. We already use static libraries for that reason, but you should make an easy tool that creates custom libraries/DLL's including X functions/domains with variants for Y requested optimization layers (SSE2, SSE3, SSSE3, SSE4, AVX). Honestly, at this point I would only include SSE2, SSE4 and AVX layers and perfectly cover the whole market (from P4 to i7 and beyond).

Also, if you could merge W7 & T7 into PX and M7 into MX, we wouldn't have to deal with performance regressions in the first place. That might be acceptable for everyone.
0 Kudos
Vladimir_Dudnik
Employee
1,129 Views
That's correct: PX and MX are compiler-generated code, with auto-vectorization done by the compiler where possible using the SSE2 or SSE3 instruction set (not earlier). Hand-tuned SSE2 and SSE3 code was removed in IPP 7.0.

And yes, pre-Pentium 4 processors (including the Pentium III) are not supported by IPP 7.0. There will be an invalid instruction exception when you run an IPP 7.0-based application on a Pentium III or earlier processor.
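
Because an unsupported CPU fails with an invalid instruction exception rather than a clean error, an application may want to check the baseline itself before calling into the library. The sketch below is a minimal, Linux-only illustration in Python that parses the `flags` line of `/proc/cpuinfo` text. The helper names are hypothetical, and the 64-bit requirement follows the earlier statement in this thread that MX is built for SSE3. Note that the Linux kernel reports SSE3 under the flag name `pni`.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            return set(value.split())
    return set()  # no flags line found (e.g. not an x86 Linux system)


def meets_ipp7_baseline(flags, bits):
    """Check the baseline described in this thread for the IPP 7.0 beta.

    32-bit (px) needs SSE2; 64-bit (mx) needs SSE3, which Linux lists as 'pni'.
    """
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    required = "sse2" if bits == 32 else "pni"
    return required in flags
```

A caller could read `/proc/cpuinfo`, pass its contents to `parse_cpu_flags`, and show a readable message or fall back to another code path when `meets_ipp7_baseline` returns False.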


Vladimir
0 Kudos
oxydius
New Contributor I
1,129 Views
Vlad, it would be preferable if MX only included SSE2 code, rather than SSE3. Before Intel adopted AMD-64, there were AMD Athlon processors with SSE2 only. This would cause them to crash.
http://en.wikipedia.org/wiki/Opteron#Opteron_.28130_nm_SOI.29

Since SSE3 really only adds LDDQU, it would be better left in M7, which hopefully comes back!
0 Kudos
oxydius
New Contributor I
1,037 Views
Vladimir, would it be possible to replace the compiler-generated SSE2/SSE3 by the previously hand-tuned code? Wouldn't it result in smaller libraries and higher performance?
0 Kudos