Puzzled by IPP performance

dreamcast · ‎04-25-2006

I am puzzled by the results of an IPP performance test (version 5.1). I wrote a very simple program that reads a 1024x768 bitmap and then flips it.
The flip function is ippiMirror_8u_C3IR. I timed the function's execution with code like this:

int myGetTime(void)
{
    /* Returns the current time in microseconds. */
    LARGE_INTEGER ticks, freq;
    QueryPerformanceCounter(&ticks);
    QueryPerformanceFrequency(&freq);
    return (int)(ticks.QuadPart * 1000000 / freq.QuadPart);
}

void myFunc()
{
    ...
    IppiSize iSize;
    iSize.width = 1024;
    iSize.height = 768;
    /* Row size in bytes, padded to a 4-byte boundary as in BMP files. */
    lineStep = ((1024 * 24 + 31) / 32) * 4;

    int startTime, endTime;
    for (int i = 0; i < 10; i++)
    {
        startTime = myGetTime();
        ippiMirror_8u_C3IR(bmpData, lineStep, iSize, ippAxsHorizontal);
        endTime = myGetTime();
        printf("Time: %d ", endTime - startTime);
    }
    ...
}

I used the CPU-specific definitions to compare the performance of the different code paths. To my surprise, on my PC a6 (PIII, SSE) gives the best result: it is twice as fast as the other three (px, w7, t7), which all perform about the same. But I am using a Pentium D 830!

Can anybody tell me what I did wrong here? How can I get the best performance in this case?

Thanks.

Vladimir_Dudnik · ‎04-30-2006

Hello,

Yes, these are interesting results. Let's try to understand what happened. Could you please specify which linkage you used for that test: DLLs or static libraries? Did you call ippStaticInit (or ippStaticInitCpu) in the case of static linkage?

Regards,
Vladimir

dreamcast · ‎05-01-2006

I have attached a small example including all source code and project files (for VS.NET 2003). To run it, you need to put a 1024x768 bitmap in the debug directory.

If you just compile the project, it runs in PIII/a6 mode; if you define TEST_PRESCOTT, it runs in P4+Prescott/t7 mode.

In my D830 environment, a6 mode runs twice as fast as t7 mode.

dreamcast · ‎05-03-2006

Is there any update on this?

Vladimir_Dudnik · ‎05-04-2006

Hello,

We analyzed the issue. The reason for the performance degradation in the T7-optimized code is unaligned data access. The IPP documentation notes that you should organize your data so that memory addresses are 16-byte aligned wherever possible. Intel architecture can then access the data more efficiently, and the IPP functions were optimized with that in mind. Please take a look at the attached modified sample, which eliminates this performance issue and gives you the best possible performance.
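The attached sample is not reproduced in this thread, so the following is only a minimal sketch of one way to obtain 16-byte aligned image memory in plain C, not the actual modified code. The helper names aligned_step and alloc_aligned are hypothetical and are not part of IPP. Note that for a 1024-pixel, 3-channel row the step is 3072 bytes, which is already a multiple of 16, so in that case it is the start address of the buffer that has to be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: row size in bytes, rounded up to a multiple of
   `alignment` (a power of two), so every row starts on an aligned address
   when the first row does. */
int aligned_step(int width, int channels, int alignment)
{
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    return (width * channels + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: returns a pointer inside a malloc'ed block whose
   address is a multiple of `alignment` (a power of two). The pointer
   stored in *raw is the one that must later be passed to free(). */
unsigned char *alloc_aligned(size_t size, size_t alignment, void **raw)
{
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    *raw = malloc(size + alignment - 1);
    if (*raw == NULL)
        return NULL;
    /* Round the raw address up to the next multiple of `alignment`. */
    uintptr_t addr = ((uintptr_t)*raw + alignment - 1)
                     & ~(uintptr_t)(alignment - 1);
    return (unsigned char *)addr;
}
```

With these helpers, the bitmap rows would be copied into the aligned buffer, and the aligned pointer and step would be passed to ippiMirror_8u_C3IR in place of bmpData and lineStep. IPP also has its own image allocator, ippiMalloc_8u_C3 (released with ippiFree), which returns an aligned buffer together with its step and is likely the simpler choice when IPP is already linked.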

Regards,
Vladimir

Vladimir_Dudnik · ‎05-04-2006

By the way, you used the static libraries without static dispatching of the code. There is nothing wrong with this, but for such purposes it seems more convenient to use static dispatching and force a particular CPU-specific code path at run time. You then have only one executable, and you can control at run time which code to use (PX, A6, W7, T7). To do this, you need to link with *emerged.lib, which contains the static dispatcher itself, and call the ippStaticInitCpu() function at the beginning of your program. It takes an IppCpuType enumerator as its parameter, giving you explicit control over which CPU-specific code is dispatched.

PS
If you do not need explicit control, you can call the ippStaticInit() function, which detects your CPU type at run time and dispatches the most appropriate code.

Vladimir

dreamcast · ‎05-04-2006

Thank you for your reply. I used the different CPU types just for testing purposes, and that result scared me. From now on I will be careful to use aligned memory.

Thanks again.