Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

ippiFilterBox_8u on IPP 6.1

zab
Beginner
489 Views
Does this function lack any SSE/SSE2/SSE3 optimization? I found that its implementation does not use any SIMD instructions, even though it has different implementations for windows with areas below 8100 pixels and above. It is hard to believe that a box filter function couldn't utilize SIMD power.
0 Kudos
9 Replies
Vladimir_Dudnik
Employee
Hello,

I'm not sure how you determined which instructions this function does and does not use. The FilterBox function does have processor-specific code, and you may notice that it performs differently on different processors.

Regards,
Vladimir
zab
Beginner
In Debug mode, stepping through the disassembly, it never hit any SSE instruction, even though the code comes from *t7.dll. Maybe it actually uses SSE3, but only under some conditions on image size and window size, and I don't want to check every conditional branch in this library. Can you name a case in which it relies on SSE3 optimization?
Vladimir_Dudnik
Employee
We provide performance data for IPP functions with each IPP release. These data are obtained on the different supported processors, and for image processing functions you may see a difference in the performance of the same function when it runs on different processors.

Vladimir
zab
Beginner
Where can I find this performance data in my IPP release?
PaulF_IntelCorp
Employee
One way to compare the differences in performance for a specific function is to use the perfsys tool in the tools directory. You can trick perfsys into using a specific SSE instruction set by temporarily removing all the DLL files except those that match the architecture of interest from the path. For example, if you want to test the performance of ippiFilterBox on the px architecture, remove the p8, s8, t7, and w7 versions of the DLL and then run the perfsys test. Repeat for each of the other architectures. Remember to leave the unadorned DLL in place (the one without an architecture designator), since that is the entry point that then dispatches to the architecture-specific version of the library.
zab
Beginner
So, I ran this tool with these command lines:
ps_ippi.exe -f=ippiFilterBox_8u_C1R -YHIGH -dMethod=Manual -dImageSize=1920x1080 -dNumLoops=50000 -N1 -TPP -RPP.csv
ps_ippi.exe -f=ippiFilterBox_8u_C1R -YHIGH -dMethod=Manual -dImageSize=1920x1080 -dNumLoops=50000 -N1 -TSSE3 -RSSE3.csv
The first one is for simple assembler optimization and the second for P4 with SSE3.
The result for the first one is
ippiFilterBox,8u,C1R,1920x1080,3x3,-,-,-,-,nLps=5000,18.2,pxch,1.16e+004,-
ippiFilterBox,8u,C1R,1920x1080,5x5,-,-,-,-,nLps=5000,18.2,pxch,1.16e+004,-
and for the second
ippiFilterBox,8u,C1R,1920x1080,3x3,-,-,-,-,nLps=5000,8.75,pxch,5.56e+003,-
ippiFilterBox,8u,C1R,1920x1080,5x5,-,-,-,-,nLps=5000,20.5,pxch,1.31e+004,-
For the box filter with a 3x3 window, *t7.dll actually runs faster: 8.75 ticks per pixel compared to 18.2 tpp. But filtering with the bigger 5x5 window even loses some speed: 20.5 tpp vs. 18.2 tpp!
I don't know how to set a window bigger than 90x90 in this tool. Looking at the disassembly, I found three different modes: one for 3x3 windows, one for windows smaller than 90x90, and one for bigger window sizes.
Can you please ask an actual developer of the IPP image library to clarify in which cases, and for which window sizes, this function is optimized? I need to know this!
If Intel has not optimized this function for windows bigger than 3x3, I will try to find a faster implementation of the box filter, or even write my own implementation that utilizes the power of SSE instructions. But if Intel tried to make it faster and failed, then I will not try to speed up this part of the filter for now and will focus on the other bottlenecks of the program.
Vladimir_Dudnik
Employee
Hello,

ippiFilterBox_8u_C1R has low-level optimization for 3x3, 5x5 and 7x7 kernels. Note that the 5x5 and 7x7 cases are optimized for Core 2 processors and higher.

Regards,
Vladimir
zab
Beginner
I've tested it on a Core 2 Quad processor:
-TPP:
ippiFilterBox,8u,C1R,1920x1080,3x3,-,-,-,-,nLps=5000,7.31,pxch,5.42e+003,-
ippiFilterBox,8u,C1R,1920x1080,5x5,-,-,-,-,nLps=5000,7.31,pxch,5.41e+003,-
-TSSE3:
ippiFilterBox,8u,C1R,1920x1080,3x3,-,-,-,-,nLps=5000,3.72,pxch,2.75e+003,-
ippiFilterBox,8u,C1R,1920x1080,5x5,-,-,-,-,nLps=5000,11.3,pxch,8.36e+003,-
-TSSE41:
ippiFilterBox,8u,C1R,1920x1080,3x3,-,-,-,-,nLps=5000,2.20,pxch,1.63e+003,-
ippiFilterBox,8u,C1R,1920x1080,5x5,-,-,-,-,nLps=5000,11.5,pxch,8.5e+003,-
So 5x5 is still slower in the modern optimized libraries. How can that be? You've never tested it? Too bad for Intel.
Anyway, I need to use a box filter with windows bigger than 7x7. A box filter can be computed in linear time, and one part of the filter needs to add and subtract a row; I believe this could be done faster with SIMD.
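To make the add-and-subtract-a-row idea concrete, here is a minimal scalar sketch in Python. It is not IPP's implementation. The function name, the list-of-rows image layout, the valid-region-only output, and the truncating integer division are all assumptions made for illustration. The cost per pixel does not depend on the window size.

```python
def box_filter_running_sum(img, kw, kh):
    """Mean filter over the valid region using running column sums.

    img is a list of equal-length rows of ints; the window is kw wide
    and kh high. Returns (H - kh + 1) rows of (W - kw + 1) means.
    Rounding is truncating integer division (an assumption, not IPP's rule).
    """
    h, w = len(img), len(img[0])
    area = kw * kh
    # Column sums over the first kh rows.
    col = [sum(img[y][x] for y in range(kh)) for x in range(w)]
    out = []
    for top in range(h - kh + 1):
        if top > 0:
            # Slide the window down one row:
            # add the entering row, subtract the leaving row.
            new_row, old_row = img[top + kh - 1], img[top - 1]
            for x in range(w):
                col[x] += new_row[x] - old_row[x]
        # Slide horizontally over the column sums.
        s = sum(col[:kw])
        row = [s // area]
        for x in range(kw, w):
            s += col[x] - col[x - kw]
            row.append(s // area)
        out.append(row)
    return out
```

The inner loop that updates `col` touches contiguous elements with no dependency between them, which is the part that would map naturally onto SIMD adds and subtracts in a C version.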
Vladimir_Dudnik
Employee

I think the Core 2 optimization for this function was added in IPP 7.0 beta. You are right, it makes sense to use a different algorithm for relatively big filter kernels to provide linear computation time. We are considering that improvement for future versions of IPP. For now, you may consider combining an integral image computation with simple arithmetic operations on the kernel corner points to compute the box filter in linear time for big kernels.
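As a rough illustration of the integral image approach, the following Python sketch computes each window sum from four corner lookups. It is only a sketch under stated assumptions and does not use IPP calls. The function names, the list-of-rows image layout, the valid-region-only output, and the truncating division are all hypothetical choices for the example.

```python
def integral_image(img):
    """Return ii with one extra zero row and column, where
    ii[y][x] is the sum of img[0..y-1][0..x-1]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        run = 0  # running sum of the current row
        for x in range(w):
            run += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + run
    return ii


def box_filter_integral(img, kw, kh):
    """Mean filter over the valid region; four lookups per output pixel."""
    ii = integral_image(img)
    h, w = len(img), len(img[0])
    area = kw * kh
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            # Window sum from the four corner points of the kernel.
            total = (ii[y + kh][x + kw] - ii[y][x + kw]
                     - ii[y + kh][x] + ii[y][x])
            row.append(total // area)  # truncating mean (assumption)
        out.append(row)
    return out
```

For 8-bit input, the integral image needs a wider accumulator type in a C implementation, since the sums quickly exceed 255; Python integers hide that detail here.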

Regards,
Vladimir
