Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Puzzled by IPP performance

dreamcast
Beginner
1,149 Views
I was puzzled by the results of an IPP performance test (version 5.1). I wrote a very simple program that reads a 1024x768 bitmap and then flips it.
The flip function is ippiMirror_8u_C3IR. I timed the function's execution with code like this:

int myGetTime(void)
{
    LARGE_INTEGER ticks, freq;
    QueryPerformanceCounter(&ticks);
    QueryPerformanceFrequency(&freq);
    return (int)(ticks.QuadPart * 1000000 / freq.QuadPart);
}

void myFunc()
{
...
IppiSize iSize;
iSize.width = 1024;
iSize.height = 768;
lineStep = ((1024 * 24 + 31) / 32) * 4;

int startTime, endTime;
for (int i = 0; i < 10; i++)
{
startTime = myGetTime();
ippiMirror_8u_C3IR(bmpData, lineStep, iSize, ippAxsHorizontal);
endTime = myGetTime();
printf("Time: %d ", endTime - startTime);
}
...
}
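
For reference, the lineStep expression in the code above works out to 3072 bytes per row for a 1024-pixel, 24-bit image. The sketch below, in plain C and without IPP, reproduces that arithmetic and shows the kind of work the timed call performs, assuming ippAxsHorizontal means a mirror about the horizontal axis (top and bottom rows exchanged). The names bmp_line_step and mirror_rows_c3_inplace are made up for illustration and are not IPP functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes per BMP row: the pixel bits are rounded up to a whole number
   of 32-bit words, as in the lineStep expression above. */
int bmp_line_step(int width, int bits_per_pixel)
{
    assert(width > 0 && bits_per_pixel > 0);
    return ((width * bits_per_pixel + 31) / 32) * 4;
}

/* Plain-C reference for an in-place mirror about the horizontal axis
   of a 3-channel, 8-bit image: row y is exchanged with row
   (height - 1 - y). Only the width * 3 pixel bytes of each row are
   touched; any padding bytes at the end of a row are left alone.
   (Hypothetical helper, not an IPP function.) */
void mirror_rows_c3_inplace(unsigned char *data, int step,
                            int width, int height)
{
    unsigned char tmp[256];                 /* small swap buffer */
    size_t row_bytes = (size_t)width * 3;

    for (int top = 0, bot = height - 1; top < bot; ++top, --bot) {
        unsigned char *a = data + (size_t)top * (size_t)step;
        unsigned char *b = data + (size_t)bot * (size_t)step;

        /* Swap the two rows chunk by chunk through tmp. */
        for (size_t off = 0; off < row_bytes; off += sizeof tmp) {
            size_t n = row_bytes - off;
            if (n > sizeof tmp)
                n = sizeof tmp;
            memcpy(tmp, a + off, n);
            memcpy(a + off, b + off, n);
            memcpy(b + off, tmp, n);
        }
    }
}
```

The operation is almost pure memory traffic, so how the rows are laid out in memory can matter as much as the instructions used to move them.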

I used the CPU-specific definitions to compare the different code paths. To my surprise, on my PC, a6 (PIII SSE) gives the best result: it is twice as fast as the other three (px, w7, t7), which all perform almost the same. But I am using a Pentium D 830!

Can anybody tell me what I did wrong here? How can I get the best performance in this case?

Thanks.
6 Replies
Vladimir_Dudnik
Employee

Hello,

Yes, these are interesting results. Let's try to understand what happened. Could you please specify which linkage you used for that test: DLLs or static libraries? Did you call ippStaticInit (or ippStaticInitCpu) in the case of static linkage?

Regards,
Vladimir

dreamcast
Beginner
I have attached a small example including all source code and project files (for VS.NET 2003). To run it, you need to put a 1024x768 bitmap in the debug directory.

If you just compile the project, it will run in PIII/a6 mode; if you define TEST_PRESCOTT, it will run in P4+Prescott/t7 mode.

In my D830 environment, a6 mode runs twice as fast as t7 mode.
dreamcast
Beginner
Is there any update on this?
Vladimir_Dudnik
Employee

Hello,

We analyzed this issue. The reason for the performance degradation of the T7-optimized code is unaligned data access. The IPP documentation notes that you should organize your data so that memory addresses are 16-byte aligned wherever possible. In that case Intel architecture allows the data to be accessed more efficiently, and the IPP functions were optimized to take advantage of this. Please take a look at the attached modified sample, which eliminates this performance issue and gives you the best possible performance.
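
A minimal sketch of the alignment advice follows. It is not the attached sample; it only illustrates, in standard C, how one might check an address for 16-byte alignment, pad a row size to a multiple of 16, and obtain a buffer that starts on a 16-byte boundary. The helper names is_aligned, aligned_step, and alloc_aligned are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Nonzero when p is a multiple of alignment (which must be a power of two). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Round a row size in bytes up to the next multiple of alignment. */
size_t aligned_step(size_t row_bytes, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (row_bytes + alignment - 1) & ~(alignment - 1);
}

/* Allocate size bytes starting at an address that is a multiple of
   alignment. The pointer obtained from malloc is stored in *raw and is
   the one that must later be passed to free(). Returns NULL on failure. */
unsigned char *alloc_aligned(size_t size, size_t alignment, void **raw)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    *raw = malloc(size + alignment - 1);
    if (*raw == NULL)
        return NULL;
    /* Advance to the first aligned address inside the allocation. */
    uintptr_t addr = ((uintptr_t)*raw + alignment - 1)
                     & ~(uintptr_t)(alignment - 1);
    return (unsigned char *)addr;
}
```

For the image in this thread, 1024 * 3 = 3072 bytes per row is already a multiple of 16, so every row is aligned as soon as the first one is; only the starting address needs care. IPP also ships its own image allocators such as ippiMalloc_8u_C3, which return an aligned buffer together with its step and may be the simpler route when IPP is already linked.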

Regards,
Vladimir

Vladimir_Dudnik
Employee

By the way, you used static libraries without static dispatching of code. There is nothing wrong with this, but for such purposes static dispatching, with a particular CPU-specific code path forced at run time, seems more convenient. You have only one executable, and you control at run time which code to use (PX, A6, W7, T7). To do this you need to link with *emerged.lib, which contains the static dispatcher itself, and at the beginning of your program call the ippStaticInitCpu() function, which takes an IppCpuType enumerator as its parameter and gives you explicit control over which CPU-specific code is dispatched.

PS
If you do not need explicit control, you can call the ippStaticInit() function, which detects your CPU type at run time and dispatches the most appropriate code.
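
To illustrate the general idea of selecting one of several CPU-specific implementations at run time within a single executable, here is a small stand-alone sketch. It is not IPP's dispatcher and uses no IPP calls: CodePath, select_code_path, sum_bytes, and the two stand-in implementations are all hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels echoing the PX/A6/W7/T7 naming used in this thread. */
typedef enum { CODE_PX, CODE_A6, CODE_W7, CODE_T7, CODE_COUNT } CodePath;

typedef int (*SumFn)(const unsigned char *, size_t);

/* Stand-in for a generic build of a routine. */
int sum_generic(const unsigned char *p, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}

/* Stand-in for a tuned build: same result, different inner loop. */
int sum_unrolled(const unsigned char *p, size_t n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements */
        s0 += p[i];
    return s0 + s1 + s2 + s3;
}

/* One entry per code path; all variants are linked into the executable. */
static const SumFn sum_table[CODE_COUNT] = {
    sum_generic, sum_generic, sum_unrolled, sum_unrolled
};

static SumFn current_sum = sum_generic;

/* Force a particular code path. Returns 0 on success, -1 if path is invalid. */
int select_code_path(CodePath path)
{
    if ((int)path < 0 || path >= CODE_COUNT)
        return -1;
    current_sum = sum_table[path];
    return 0;
}

/* Public entry point: always calls through the selected implementation. */
int sum_bytes(const unsigned char *p, size_t n)
{
    return current_sum(p, n);
}
```

A benchmark built this way can loop over the code paths in one run instead of rebuilding with different definitions for each one.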

Vladimir

dreamcast
Beginner
Thank you for your reply. I used different CPU types just for testing purposes, and that result scared me. From now on I will be careful to use aligned memory.

Thanks again.