Hello,
This looks a bit strange: we use the same code for the dynamic and static libraries. The only difference is that the dynamic version of IPP uses OpenMP threading in some functions, but that does not seem to be the case here, since Resize does not contain OpenMP threading. Could you also specify:
- which version of IPP, which exact processor, and which OS did you use to test this function?
- what kind of interpolation did you use in the function?
- what sizes of images did you test on?
- did you use aligned memory (allocated with IPP functions)?
Regards,
Vladimir
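The alignment question above can be checked without any IPP calls. The following is a minimal sketch; the helper names is_aligned and rows_are_aligned are hypothetical, and the 32-byte boundary used in the usage note is only an assumed example, so the required alignment should be taken from the IPP documentation for the version in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if ptr sits on an `alignment`-byte boundary, 0 otherwise.
   `alignment` must be a nonzero power of two. */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Returns 1 if every image row starts on an aligned address, i.e. the
   base pointer is aligned and the row step (in bytes) is a multiple of
   the alignment. */
int rows_are_aligned(const void *base, int step_bytes, size_t alignment)
{
    assert(step_bytes > 0);
    return is_aligned(base, alignment)
        && ((size_t)step_bytes % alignment) == 0;
}
```

For example, calling rows_are_aligned(pSrc, srcStep, 32) on the source buffer before timing a resize would show whether a buffer from plain malloc happens to differ in alignment from one allocated with the IPP allocation functions.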
Good luck,
Scott
Hello Richard.
could you please print the output of any ippGetLibVersion function for the static linking case? Code like this should be enough (please insert it after the ippStaticInit call):
const IppLibraryVersion* ippi = ippiGetLibVersion();
printf("Intel Integrated Performance Primitives\n");
printf(" version: %s, [%d.%d.%d.%d]\n",
ippi->Version, ippi->major, ippi->minor, ippi->build, ippi->majorBuild);
printf(" name: %s\n", ippi->Name);
printf(" date: %s\n", ippi->BuildDate);
You can check the ippi->Name string to see whether the static dispatcher chose the right CPU-specific code.
Regards,
Vladimir
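Checking the Name string can also be done programmatically. This is a minimal sketch that needs no IPP headers: cpu_code_from_lib_name is a hypothetical helper, the list of codes is just the set mentioned in this thread (px, a6, w7, t7, v8), and the assumed name layout ("ipp", one domain letter, then the two-character CPU code) is inferred from names such as ippiw7l.lib and may not hold for every library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* CPU-specific codes mentioned in this thread (not an exhaustive list). */
static const char *const k_cpu_codes[] = { "px", "a6", "w7", "t7", "v8" };

/* Returns the CPU code found in a library name such as "ippiw7l.lib",
   or NULL if the name does not match the assumed layout:
   "ipp" + one domain letter + two-character CPU code. */
const char *cpu_code_from_lib_name(const char *name)
{
    assert(name != NULL);
    if (strncmp(name, "ipp", 3) != 0 || strlen(name) < 6)
        return NULL;

    const char *tail = name + 4;  /* skip "ipp" and the domain letter */
    for (size_t i = 0; i < sizeof k_cpu_codes / sizeof k_cpu_codes[0]; ++i) {
        if (strncmp(tail, k_cpu_codes[i], 2) == 0)
            return k_cpu_codes[i];
    }
    return NULL;
}
```

Passing ippi->Name to this helper after ippStaticInit would give "w7" on a machine where the dispatcher picked the W7 code, and "px" if it fell back to the generic code.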
OK. The output from the static linked version is :
Intel Integrated Performance Primitives
version: v5.1, [5.1.217.80]
name: ippiw7l.lib
date: Mar 1 2006
Doing the same for the dynamically linked version (without ippStaticInit) gives:
Intel Integrated Performance Primitives
version: v5.1, [5.1.217.80]
name: ippiw7-5.1.dll
date: Feb 28 2006
I have copied the first lines of the disassembly from when I call ippiResize_8u_C3R in each case. I might as well include this for info. The static case is:
_w7_ippiResize_8u_C3R@68:
004983B0 push ebp
004983B1 mov ebp,esp
004983B3 and esp,0FFFFFFC0h
004983B6 sub esp,80h
004983BC mov ecx,dword ptr [ebp+8]
004983BF fld qword ptr [ebp+38h]
004983C2 fld qword ptr [ebp+40h]
004983C5 mov edx,dword ptr [ebp+48h]
004983C8 cmp edx,8
004983CB je _w7_ippiResize_8u_C3R@68+0B6h (498466h)
004983D1 fld qword ptr ds:[5411A8h]
004983D7 mov eax,dword ptr [ebp+10h]
004983DA fxch st(2)
004983DC fstp qword ptr [esp+30h]
004983E0 fstp qword ptr [esp+38h]
004983E4 fst qword ptr [esp+40h]
004983E8 fstp qword ptr [esp+48h]
004983EC mov dword ptr [esp],ecx
004983EF mov ecx,dword ptr [ebp+0Ch]
...etc
The dynamic version calls :
00ACDDB8 push ebp
00ACDDB9 mov ebp,esp
00ACDDBB and esp,0FFFFFFC0h
00ACDDBE push edi
00ACDDBF sub esp,7Ch
00ACDDC2 movsd xmm3,mmword ptr [ebp+38h]
00ACDDC7 movsd xmm4,mmword ptr [ebp+40h]
00ACDDCC mov eax,dword ptr [ebp+48h]
00ACDDCF cmp eax,8
00ACDDD2 je 00ACDE6D
00ACDDD8 movsd xmm2,mmword ptr ds:[0D885C0h]
00ACDDE0 mov edx,dword ptr [ebp+0Ch]
00ACDDE3 mov ecx,dword ptr [ebp+10h]
00ACDDE6 mov edi,dword ptr [ebp+14h]
00ACDDE9 movsd xmm0,mmword ptr [ebp+18h]
00ACDDEE movsd xmm1,mmword ptr [ebp+20h]
00ACDDF3 movsd mmword ptr [esp+30h],xmm3
00ACDDF9 movsd mmword ptr [esp+38h],xmm4
00ACDDFF movsd mmword ptr [esp+40h],xmm2
00ACDE05 movsd mmword ptr [esp+48h],xmm2
...etc
I know this is an old thread, but I didn't see any response to it. I am using IPP 5.1.1, and my application uses IIR and FFT functions from the Signal Processing library. Previously I was using dynamic linking, which finds the DLLs at run time. But recently I figured out how to do static linking with the emerged and merged lib files. This is preferable, since I don't have to ship so many different DLLs with my app.
But I found that there is a slight performance decrease when I use the static emerged/merged libs, about 10% slower than the DLL version.
I am testing on a Core Duo, and I verified through ippGetLibVersion() that in the static lib case, I am using the t7 version functions.
This was kind of disappointing, I was hoping to use the merged static linkage.
Is there any difference in the performance of the static merged libs vs the DLL libs? Has this been fixed in particular versions?
As a side note, I noticed that at least for my program, which only uses IIR and FFT functions, there was no performance improvement going from the generic px version to t7 version. Is this expected?
Thanks,
Ching-Wei
Hello,
could you please specify exactly which functions you use? It would be nice if you could provide a simple test case that demonstrates the performance issue. We expect the difference between the PX and T7 variants of these functions to be at least 2..4X for a reasonable signal length.
Regards,
Vladimir
It hasn't changed much in 5.2.57
_w7_ippiResize_8u_C3R@68:
00: 55 push ebp
01: 8B EC..mov ebp,esp
03: 83 E4..and esp,0FFFFFFC0h
06: 81 EC..sub esp,80h
0C: 8B 4D..mov ecx,dword ptr [ebp+8]
0F: DD 45..fld qword ptr [ebp+38h]
12: DD 45..fld qword ptr [ebp+40h]
15: 8B 55..mov edx,dword ptr [ebp+48h]
18: 83 FA..cmp edx,8
1B: 0F 84..je 000000B2
21: 8B 45..mov eax,dword ptr [ebp+10h]
24: D9 EE..fldz
26: 89 0C..mov dword ptr [esp],ecx
29: 8B 4D..mov ecx,dword ptr [ebp+0Ch]
2C: 89 4C..mov dword ptr [esp+4],ecx
30: 89 44..mov dword ptr [esp+8],eax
34: 8B 45..mov eax,dword ptr [ebp+14h]
37: 89 44..mov dword ptr [esp+0Ch],eax
3B: 8B 4D..mov ecx,dword ptr [ebp+18h]
3E: 8D 44..lea eax,[esp+10h]
42: 89 08..mov dword ptr [eax],ecx
44: 8B 4D..mov ecx,dword ptr [ebp+1Ch]
47: 89 48..mov dword ptr [eax+4],ecx
4A: 8B 4D..mov ecx,dword ptr [ebp+20h]
4D: 89 48..mov dword ptr [eax+8],ecx
50: 8B 4D..mov ecx,dword ptr [ebp+24h]
53: 89 48..mov dword ptr [eax+0Ch],ecx
56: 8B 45..mov eax,dword ptr [ebp+28h]
59: D9 CA..fxch st(2)
5B: DD 5C..fstp qword ptr [esp+30h]
5F: DD 5C..fstp qword ptr [esp+38h]
63: DD 54..fst qword ptr [esp+40h]
67: DD 5C..fstp qword ptr [esp+48h]
which is nearly the same as _a6_'s, and from first looks, is the same as _px_'s. _t7_ and _v8_ are the same, too. x87 code, you know, not SSE2. Same with the _??_ippiResize_8u_C1R code. I can imagine there being more routines like that. Bummer, dude.
There is optimized code for the ippiResize function. Of course, it depends a lot on the parameters you use. So, what was your image size, did you try zooming or decimating, and what was the interpolation parameter? Did you call the ippStaticInit function at the beginning of your application (in the case of static linkage)?
Actually, with the IPP static libraries you can compare the performance of different processor-specific code by calling ippStaticInitCpu with the desired processor type as a parameter. I think you should see a performance difference between the PX and T7 code.
Please let us know if you still have a performance issue.
Regards,
Vladimir
Thanks for your response. Unfortunately I don't have time to give you a test case...But I can tell you I am using ippsIIR_64f and ippsIIR_64f_I (both with N=256) and for the FFT I am using ippsFFTFwd_RToCCS_32f with N=2048.
I'm not expecting too much support, since I don't have time to spend diving into this too far. Just wanted to get a quick sense of:
1) Why do I get a slight performance decrease when I use the static linking vs the DLL linking? Is this unexpected?
2) Why don't I see much difference between px and t7 versions? This question may have more to do with my particular code, I may not be using IPP aligned buffers, my N is only 256 in the IIR case, or there could be other bottlenecks etc...But just wondering...
Thanks!!!
-Ching-Wei
"Actually, with IPP static libraries you can compare performance of different processor specific code with call ippStaticInitCpu with desired processor type as a parameter. I think you should see performance difference between PX and T7 code."
I don't think so. Here's the ever lovin' proof that px is a6 is w7 is (t7) is v8 (all static is using the px code, at least in the imaging I've looked at - not much, but so far .000)
ipp 5.2.57 static
_w7_ownpiDecimateSuper:
00: push ebp
01: mov ebp,esp
03: and esp,0FFFFFFC0h
:
90: fldz
92: fld qword ptr [ebp+38h]
95: fcomp st(1)
97: fnstsw ax
99: sahf
9A: jbe 00000995
A0: fld qword ptr [ebp+40h]
A3: fcomp st(1)
A5: fnstsw ax
A7: sahf
A8: jbe 00000995
AE: lea eax,[ecx+edx]
B1: cmp edi,eax
B3: jge 000000B9
B5: mov edx,edi
B7: sub edx,ecx
B9: mov eax,dword ptr [ebp+1Ch]
BC: lea edi,[eax+ebx]
BF: cmp esi,edi
C1: jge 000000C8
C3: mov ebx,esi
C5: sub ebx,dword ptr [ebp+1Ch]
C8: fld qword ptr [_2il0floatpacket.1]
CE: mov dword ptr [esp+78h],edx
D2: fild dword ptr [esp+78h]
and then there is
ipp 5.2.57 w7 dll ( 5,902,336 bytes : ippiw7_5.2.dll )
ownpiDecimateSuper:
E174: push ebp
E175: mov ebp,esp
E177: and esp,0FFFFFFC0h
:
E208: movsd xmm0,mmword ptr [ebp+38h]
E20D: pxor xmm3,xmm3
E211: comisd xmm0,xmm3
E215: jbe 1033EB46
E21B: movsd xmm0,mmword ptr [ebp+40h]
E220: comisd xmm0,xmm3
E224: jbe 1033EB46
E22A: mov eax,dword ptr [ebp+18h]
E22D: movsd xmm1,mmword ptr ds:[10523468h]
E235: mov dword ptr [esp+6Ch],edi
E239: mov edi,dword ptr [ebp+10h]
E23C: mov dword ptr [esp+70h],edx
E240: mov ecx,esi
E242: mov dword ptr [esp+74h],ebx
E246: mov ebx,dword ptr [ebp+20h]
E249: lea edx,[eax+ebx]
E24C: sub ecx,eax
E24E: cmp esi,edx
E250: cmovge ecx,ebx
E253: cvtsi2sd xmm0,ecx
E257: mulsd xmm0,mmword ptr [ebp+38h]
E25C: mov ebx,edi
E25E: addsd xmm0,xmm1
E262: mov edx,dword ptr [ebp+24h]
E265: mov esi,dword ptr [ebp+1Ch]
E268: lea eax,[esi+edx]
E26B: sub ebx,esi
E26D: cmp edi,eax
E26F: cmovge ebx,edx
E272: cvtsi2sd xmm2,ebx
E276: mulsd xmm2,mmword ptr [ebp+40h]
In other words, the static imaging library, at least the resize and perhaps more, was compiled for straight x87 math, no SSEx, for a6, w7, t7, and v8. All are the same code as the _px path. Denmark usually doesn't smell rotten, but . . .
Resize is in the critical path, so its running at half speed or worse makes quite a difference.
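The comparison made by eye in the two listings above can be roughly automated. This is a minimal sketch, not a disassembler: classify_line and has_token are hypothetical helpers, the x87 mnemonic list contains only the instructions that appear in this thread, and any line that references an xmm register is simply treated as SSE.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum { INSN_OTHER = 0, INSN_X87 = 1, INSN_SSE = 2 } insn_kind;

/* Returns 1 if `tok` occurs in `line` as a whole whitespace-delimited word. */
static int has_token(const char *line, const char *tok)
{
    size_t n = strlen(tok);
    for (const char *p = strstr(line, tok); p != NULL; p = strstr(p + 1, tok)) {
        int left_ok  = (p == line) || isspace((unsigned char)p[-1]);
        int right_ok = (p[n] == '\0') || isspace((unsigned char)p[n]);
        if (left_ok && right_ok)
            return 1;
    }
    return 0;
}

/* Rough classification of one disassembly line. */
insn_kind classify_line(const char *line)
{
    /* x87 mnemonics seen in the listings of this thread only. */
    static const char *const x87[] = {
        "fld", "fldz", "fild", "fst", "fstp", "fxch", "fcomp", "fnstsw"
    };
    assert(line != NULL);

    if (strstr(line, "xmm") != NULL)   /* any xmm register operand */
        return INSN_SSE;
    for (size_t i = 0; i < sizeof x87 / sizeof x87[0]; ++i) {
        if (has_token(line, x87[i]))
            return INSN_X87;
    }
    return INSN_OTHER;
}
```

Counting the INSN_X87 and INSN_SSE results over the dump of one routine from the static library and the same routine from the DLL gives a quick indication of which floating-point path each build uses. It says nothing about speed, which still has to be measured.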
Hello,
I see two issues here:
1. There was an error in IPP 5.1 that caused the PX code to be used in the static version of the ippiResize function. This bug was fixed in IPP 5.2.
2. There are several execution branches inside ippiResize. Those branches are optimized differently, depending on how much we can gain from SSE versus x87. Supersampling interpolation is not optimized with SSE, but with the other interpolations you should clearly see the difference between the PX and T7 code. Did you try other interpolations?
Regards,
Vladimir
Why does the DLL use SSE2 code while the very same API in the static library uses x87? The two examples already given should be reason enough to check why this is going on. Super is already very slow, so having to always do it in x87 is not what IPP is all about. Is it? The w7 DLL uses SSE2 for Super. Why not the static library? (As well as any other APIs, for that matter.) How can this possibly be the plan? What gain is there in making the w7 DLL have better code than the w7 static lib? Yes, some decimates are SSE2 in the w7 static lib, but most are x87. But even if everything but the resize/super were SSE2, that pair IS x87 in the static lib but SSE2 in the w7 DLL. Why? These are the very same APIs (never mind "branches" -- this is the same code path), one going to the static lib and one to the DLL. The path is the same. The code is different. Why the code is different is the question. Or better, why not fix the static library to use SSE2 for those APIs where the DLL is already using SSE2 -- never mind why it isn't doing so now.
I know I've said the same thing several times.
Also, the problem that this thread started with back in May 2006 is still there in 5.2.57, as I have already shown with the disasm. What I've done is to carry on with that same report, because it seems it was not fixed: I have already shown that the original report is as valid today as it was in May 2006. OK, then, perhaps this thread will simply die like it did last year. As long as I know why it's slow I can deal with it. Many won't ever know, unless they read this particular topic.
Adios
Hi, Adios,
For the latest IPP 5.2 release, what performance difference do you find now?
Also, what arguments are used for ippiResize (ROI, factors, etc.)? If you have test code that shows the performance problem, you can submit it to the premier support website (https://premier.intel.com).
Below are some performance results we measured for the function. The dynamic and static code have very similar performance.
function | data | chan | size | inter | factor | w7,static | w7,dynamic | s/d
ippiResize | 8u | C1R | 64x64 | super | 0.66 0.66 | 142 | 144 | 0.99
ippiResize | 8u | C1R | 256x256 | super | 0.66 0.66 | 138 | 140 | 0.99
ippiResize | 8u | C1R | 720x480 | super | 0.66 0.66 | 138 | 141 | 0.98
ippiResize | 8u | C3R | 64x64 | super | 0.66 0.66 | 86 | 86 | 1.00
ippiResize | 8u | C3R | 256x256 | super | 0.66 0.66 | 85 | 85 | 1.00
ippiResize | 8u | C3R | 720x480 | super | 0.66 0.66 | 85 | 85 | 1.00
ippiResize | 8u | C4R | 64x64 | super | 0.66 0.66 | 85 | 85 | 1.00
ippiResize | 8u | C4R | 256x256 | super | 0.66 0.66 | 83 | 84 | 0.99
ippiResize | 8u | C4R | 720x480 | super | 0.66 0.66 | 84 | 84 | 1.00
ippiResize | 8u | AC4R | 64x64 | super | 0.66 0.66 | 87 | 87 | 1.00
ippiResize | 8u | AC4R | 256x256 | super | 0.66 0.66 | 85 | 85 | 1.00
ippiResize | 8u | AC4R | 720x480 | super | 0.66 0.66 | 86 | 86 | 1.00
ippiResize | 8u | P3R | 64x64 | super | 0.66 0.66 | 142 | 144 | 0.99
ippiResize | 8u | P3R | 256x256 | super | 0.66 0.66 | 138 | 140 | 0.99
ippiResize | 8u | P3R | 720x480 | super | 0.66 0.66 | 139 | 141 | 0.99
ippiResize | 8u | P4R |
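The s/d column in the table appears to be the static figure divided by the dynamic figure, rounded to two decimals. The sketch below reproduces that arithmetic under that assumption; sd_ratio_hundredths is a hypothetical helper, and the unit of the two timing columns is not stated in the thread, so the inputs are treated as plain numbers.

```c
#include <assert.h>

/* Static-to-dynamic ratio in hundredths, rounded to nearest.
   For example, 99 means s/d = 0.99. Both inputs must use the same unit. */
int sd_ratio_hundredths(double static_value, double dynamic_value)
{
    assert(static_value >= 0.0 && dynamic_value > 0.0);
    return (int)(static_value / dynamic_value * 100.0 + 0.5);
}
```

With the first C1R row, sd_ratio_hundredths(142, 144) gives 99, and the 720x480 C1R row, sd_ratio_hundredths(138, 141), gives 98, matching the 0.99 and 0.98 entries in the table.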
I can't use DLLs for projects. I cannot, and have not, claimed that the DLLs are faster (I would assume well-done SSE2 code is faster than x87 code, however). I have claimed, and proven beyond any doubt, that the DLL code uses SSE2 while the static library code uses x87, for resize/super and, I will venture, many other routines.
Why Intel Software has released separate DLLs for different CPUs is a mystery, then, if, as you attempt to show with your table from nowhere, the performance is the same regardless of the generated code.
I will say it again. The w7 static library is using 100% x87 code for resize/super (at least those, but likely much more). The w7 DLL is using SSE2 for resize/super. I have SHOWN this to be the case. There is no denying it. If you want to throw out a table that claims x87 code runs the same as SSE2, it makes me wonder whether you have run the testing correctly, or are using data from some internal release. It is a simple matter to LOOK at the generated code. Is it that no one available understands disassembly? Find someone and let him consider this problem.
I'm using IPP 5.2.057 and I experience exactly the same problem. When I link my app with the static libs it runs about 3 times slower than when it is linked with the dynamic libs. Yes, I'm calling ippStaticInit(). My configuration is Win 2003 SP2 32bit, Athlon64.
Is there a plan to fix this problem?
Smith
Hi Smith,
we are investigating the issue reported for the ippiResize function. Could you please provide the exact function name and the parameters you use when you see the 3X performance difference between the dynamic and static versions of IPP?
Keep in mind that the IPP dynamic libraries incorporate internal threading in some functions, which is not the case for the static libraries, so you can see different performance on multi-core systems.
Regards,
Vladimir
Thanks for the quick response. It turned out that the problem was on my side. One of my tests was running slowly (3x) because it was not calling ippStaticInit() :-(. I fixed it and now it is running only about 20% slower than with the dynamic libs. I cannot say exactly why it is running 20% slower because there are a lot of IPP calls (video transcoding application). Maybe it is because of the threading; 20% is OK.
Smith
Thanks, all, for pointing out the issue. We are trying to investigate the reason for it, but unfortunately we have not been able to reproduce it.
For your reference, I've attached a simple test case for the ippiResize function and the results we got on an Intel Core 2 Duo system.
We would appreciate it if you could reproduce the performance issue with this simple test (in that case, please provide as many details as possible to help us identify the reason).
The attachment contains a simple command-line application that calls the ippiResize function and prints information about the IPP version and performance.
The performance measured for the DLLs is:
PX - 240 processor clocks per output image pixel
W7 - 96 processor clocks per output image pixel
and for the IPP static libraries:
PX - 241 processor clocks per output image pixel
W7 - 96 processor clocks per output image pixel
Please see the attached report for more details.
Regards,
Vladimir
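For anyone repeating this measurement, the metric quoted above is simple arithmetic. The sketch below assumes the total clock count for one resize call has already been obtained from some cycle counter; how the attached test measures it is not shown in the thread, and clocks_per_pixel and speedup are hypothetical helper names.

```c
#include <assert.h>

/* Processor clocks spent per output pixel for one resize call. */
double clocks_per_pixel(unsigned long long total_clocks,
                        int out_width, int out_height)
{
    assert(out_width > 0 && out_height > 0);
    return (double)total_clocks / ((double)out_width * (double)out_height);
}

/* How many times faster the optimized code is than the baseline,
   given both costs in clocks per pixel. */
double speedup(double baseline_cpp, double optimized_cpp)
{
    assert(optimized_cpp > 0.0);
    return baseline_cpp / optimized_cpp;
}
```

With the figures reported above, speedup(240.0, 96.0) is 2.5 for the DLLs, which falls inside the 2..4X range mentioned earlier in the thread for PX versus CPU-specific code.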
