Topic: ok, ippiConv with "Valid" ROI, in Intel® Integrated Performance Primitives

ippConv; Separable filter

C_W_ — Mon, 02 Nov 2015 23:38:00 GMT

Does ippiConv internally perform a separable filter if the kernel parameters allow it?
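For context, "separable" means the 2D kernel is rank 1, i.e. an outer product of a column vector and a row vector. Any library or caller that wanted to take a separable path from a full 2D kernel would first have to detect this. The following is a minimal C sketch of such a check, assuming a row-major float kernel and an absolute tolerance. `split_separable` is a hypothetical helper, not an IPP function, and this says nothing about what ippiConv does internally.

```c
#include <assert.h>

/* Absolute value without pulling in <math.h>. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/*
 * Test whether a kh x kw kernel (row-major) is separable (rank 1):
 * k[i][j] == col[i] * row[j] for all i, j, within an absolute tolerance.
 * On success writes col[kh] and row[kw] and returns 1; otherwise returns 0.
 */
static int split_separable(const float *k, int kh, int kw, float tol,
                           float *col, float *row)
{
    assert(kh > 0 && kw > 0);

    /* Pick the largest-magnitude entry as the pivot for numerical stability. */
    int pr = 0, pc = 0;
    for (int i = 0; i < kh; ++i)
        for (int j = 0; j < kw; ++j)
            if (abs_f(k[i * kw + j]) > abs_f(k[pr * kw + pc])) { pr = i; pc = j; }

    float pivot = k[pr * kw + pc];
    if (pivot == 0.0f) {                 /* all-zero kernel: trivially separable */
        for (int i = 0; i < kh; ++i) col[i] = 0.0f;
        for (int j = 0; j < kw; ++j) row[j] = 0.0f;
        return 1;
    }

    for (int i = 0; i < kh; ++i) col[i] = k[i * kw + pc];          /* pivot column */
    for (int j = 0; j < kw; ++j) row[j] = k[pr * kw + j] / pivot;  /* pivot row, scaled */

    /* Rank 1 iff every entry is reproduced by the outer product. */
    for (int i = 0; i < kh; ++i)
        for (int j = 0; j < kw; ++j)
            if (abs_f(k[i * kw + j] - col[i] * row[j]) > tol) return 0;
    return 1;
}
```

For example, the 3x3 binomial kernel {1,2,1; 2,4,2; 1,2,1} splits into col = {2,4,2} and row = {0.5,1,0.5}, while the Laplacian {0,1,0; 1,-4,1; 0,1,0} is rejected.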

I have implemented convolution both with ippiConv and with the pair ippiFilterRowBorderPipeline_32f_C1R / ippiFilterColumnPipeline_32f_C1R. For each, I have a single-threaded version and a multi-threaded version (which breaks the convolution up into chunks).
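The thread doesn't say exactly how the chunks were formed. One common way, assumed here, is to give each thread a horizontal strip of output rows. For a "valid" convolution, output row y reads input rows y to y + kh - 1, so each strip needs kh - 1 extra input rows (a halo) that overlap the next strip. `Strip` and `strip_for` are hypothetical helpers. Each strip would presumably be handed to the IPP call by offsetting the source and destination pointers by whole rows (in_begin * srcStep and out_begin * dstStep bytes).

```c
#include <assert.h>

/* One horizontal strip: output rows [out_begin, out_end) read
 * input rows [in_begin, in_end) for a "valid" convolution. */
typedef struct {
    int out_begin, out_end;
    int in_begin, in_end;
} Strip;

/* Split dst_height output rows into n_strips near-equal strips and
 * return strip number `index`, including its kh - 1 halo input rows. */
static Strip strip_for(int dst_height, int kh, int n_strips, int index)
{
    assert(dst_height > 0 && kh > 0 && n_strips > 0);
    assert(index >= 0 && index < n_strips);

    int base = dst_height / n_strips;
    int extra = dst_height % n_strips;   /* the first `extra` strips get one more row */

    Strip s;
    s.out_begin = index * base + (index < extra ? index : extra);
    s.out_end = s.out_begin + base + (index < extra ? 1 : 0);
    s.in_begin = s.out_begin;
    s.in_end = s.out_end + kh - 1;       /* halo rows shared with the next strip */
    return s;
}
```

With 73 output rows (a 77-row source and a 5x5 kernel) and 4 threads, the strips cover output rows 0-18, 19-36, 37-54 and 55-72, and the last strip reads up to input row 76.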

In all cases ippiConv is faster than calling the ippiFilterRow/Column pair.

I didn't expect ippiConv to handle the separable case specially, so I expected the ippiFilterRow/Column pair to be faster.

I am wondering whether I did something wrong or whether this is expected (the outputs are numerically identical, so the implementation itself is correct).

I'm using IPP 8. The images being convolved are roughly 512x512 pixels, float, on a 4-core i7 CPU.

Thanks.

Igor_A_Intel — Thu, 05 Nov 2015 08:01:23 GMT

Hello,

Could you be more specific? Please give the OS; static or dynamic IPP; multi- or single-threaded; ia32 or x64; and the other convolution parameters: the sizes of both convolved images (are both 512x512?), data type, number of channels, and Full or Valid. Better still, give the full function name, all the parameters used, and the output of ippiGetLibVersion (the snippet below uses ippcvGetLibVersion, which returns the same IppLibraryVersion structure):

       const IppLibraryVersion* lib = ippcvGetLibVersion();
       printf("%s %s %d.%d.%d.%d\n", lib->Name, lib->Version, lib->major, lib->minor, lib->majorBuild, lib->build);

In 8.x, ippiConv uses a fairly complex internal criterion to switch between two implementations: direct, and FFT-based (using the convolution theorem). The FFT-based version processes the image in chunks when the kernel is significantly smaller than the image. If the kernel is rather small (3x3 to 11x11), I think it is better to use the ippiFilter function.
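For reference, here is the textbook relationship the FFT-based path relies on, in standard notation for an H x W image f and a k_h x k_w kernel g. This is the general convolution theorem with the zero padding needed to avoid circular wrap-around, not a description of how IPP chunks the computation.

```latex
% Full linear convolution
(f * g)[y,x] = \sum_{i=0}^{k_h-1} \sum_{j=0}^{k_w-1} g[i,j]\, f[y-i,\, x-j]

% Convolution theorem: zero-pad both operands to P x Q with
% P \ge H + k_h - 1 and Q \ge W + k_w - 1 so the circular DFT product does not wrap
f * g = \mathcal{F}^{-1}\!\left\{ \mathcal{F}\{f_{P \times Q}\} \cdot \mathcal{F}\{g_{P \times Q}\} \right\}

% "Valid" part: only outputs whose window lies entirely inside f
y \in [k_h - 1,\; H - 1], \quad x \in [k_w - 1,\; W - 1]
\;\Rightarrow\; (H - k_h + 1) \times (W - k_w + 1) \text{ outputs}
```

The direct sum costs k_h k_w multiply-adds per output regardless of image size, while the FFT cost per pixel grows only logarithmically with the padded size and is nearly independent of the kernel size, which is consistent with FFT paying off only for larger kernels.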

regards, Igor

C_W_ — Thu, 05 Nov 2015 21:53:35 GMT

Windows 7, IPP dynamic (custom dll), single threaded, x64.

Approximate image parameters: the source image varies between 256x256 and 512x512 (the aspect ratio is not necessarily square, but both x and y are at least mod 8). The convolution kernel is square, between 3x3 and 7x7. Kernels are guaranteed to be separable and square. Type is 32-bit float, 1 channel. Valid convolution.

IPP info: ippCV AVX (e9) 8.2.1 (r44077) 8.2.1.44077

I am using ippAlgDirect, as I found that ippAlgFFT is slower.

For example, I am using:

ippiConv_32f_C1R(x, 516 * sizeof(float), {516, 77}, x, 5 * sizeof(float), {5,5}, x, 516 * sizeof(float), ippiROIValid | ippAlgDirect | ippiNormNone, x);
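With ippiROIValid, a 516x77 source and a 5x5 kernel give a (516 - 5 + 1) x (77 - 5 + 1) = 512x73 result, which fits in a destination whose step is 516 * sizeof(float), since that is at least 512 * sizeof(float). As a plain-C reference for what a direct "valid" 2D convolution computes, here is a minimal sketch assuming row-major float images addressed by byte steps like IPP's srcStep/dstStep. `conv_valid_direct` is hypothetical and unoptimized, not IPP's code. It uses the textbook definition with the kernel flipped; for symmetric kernels this is the same as correlation.

```c
#include <assert.h>
#include <stddef.h>

/* Address row y of a float image stored with a step in bytes (IPP-style). */
static const float *row_c(const float *base, int step_bytes, int y)
{
    return (const float *)((const char *)base + (size_t)y * (size_t)step_bytes);
}
static float *row_m(float *base, int step_bytes, int y)
{
    return (float *)((char *)base + (size_t)y * (size_t)step_bytes);
}

/*
 * Direct "valid" 2D convolution: dst is (src_h - kh + 1) x (src_w - kw + 1).
 * kernel is kh x kw, row-major and contiguous; it is flipped in both axes.
 */
static void conv_valid_direct(const float *src, int src_step, int src_w, int src_h,
                              const float *kernel, int kw, int kh,
                              float *dst, int dst_step)
{
    assert(kw >= 1 && kh >= 1 && kw <= src_w && kh <= src_h);
    int dst_w = src_w - kw + 1, dst_h = src_h - kh + 1;

    for (int y = 0; y < dst_h; ++y) {
        float *out = row_m(dst, dst_step, y);
        for (int x = 0; x < dst_w; ++x) {
            float acc = 0.0f;                        /* kw * kh multiply-adds */
            for (int i = 0; i < kh; ++i) {
                const float *in = row_c(src, src_step, y + kh - 1 - i);
                for (int j = 0; j < kw; ++j)
                    acc += kernel[i * kw + j] * in[x + kw - 1 - j];
            }
            out[x] = acc;
        }
    }
}
```

Called on a 516x77 source with a 5x5 kernel, this writes a 512x73 result, matching the ippiROIValid output size above.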

-----------

An update: I have tried ippiFilter, and the timings are almost identical to ippiConv. This suggests that ippiConv does not internally detect separability, so there should be room for optimization here. That is why I expected to see some improvement from ippiFilterRowBorderPipeline_32f_C1R.

--------

I have a stand-alone project that I could clean up and provide if you think it would help.

Thanks,

Igor_A_Intel — Fri, 06 Nov 2015 14:51:00 GMT

OK. In the "direct" case, ippiConv with a "Valid" ROI uses the same code as ippiFilter, since both perform exactly the same computation. FFT-based convolution starts to be faster than direct convolution for kernels larger than about 20x20 (this depends on the architecture). You are right: in your case the separable approach should be faster than direct 2D convolution. I'll check the separable row-column algorithm for your sizes to see whether there are any problems.
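Counting multiply-adds shows why the separable approach should win at these sizes. Per output pixel, a k x k kernel costs k^2 multiply-adds directly versus 2k separably: 9 vs 6 for 3x3, 25 vs 10 for 5x5, and 49 vs 14 for 7x7. A worked total for an assumed 516x516 source with a 5x5 kernel (valid output 512x512), where the row pass must filter all H source rows:

```latex
% Direct 2D: k_h k_w multiply-adds per output pixel
N_{\text{direct}} = (H - k_h + 1)(W - k_w + 1)\, k_h k_w
                  = 512 \cdot 512 \cdot 25 = 6\,553\,600

% Separable: row pass over all H source rows, then column pass
N_{\text{sep}} = H\,(W - k_w + 1)\, k_w + (H - k_h + 1)(W - k_w + 1)\, k_h
               = 516 \cdot 512 \cdot 5 + 512 \cdot 512 \cdot 5 = 2\,631\,680
```

That is roughly a 2.5x reduction in arithmetic. Memory traffic for the intermediate buffer and how well each path vectorizes also affect real timings, so this count is an upper bound on the expected gain, not a prediction.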

regards, Igor

C_W_ — Fri, 06 Nov 2015 15:26:22 GMT

OK. Here is how I use the separable functions: row pass, then column pass. When I multithread them, I split the filter processing into one-line segments (i.e. roiSize = {512, 1}); I do all the rows first, then all the columns.

status = ippiFilterRowBorderPipeline_32f_C1R(
       pfImgIn,
       iXRes * sizeof(float),          // srcStep, e.g. 512 * 4
       ppfBufferRow,                   // Ipp32f **ppfBufferRow
       roiSize,                        // e.g. {512, 512}
       m_pfKernelRow,                  // e.g. {1, 2, 3, 4, 5}
       m_iFilterSize,                  // 5
       iAnchor,                        // 2
       IppiBorderType::ippBorderConst,
       0,                              // border value used with ippBorderConst
       pbtTempBuffer);                 // work buffer

status = ippiFilterColumnPipeline_32f_C1R(
       ppfBufferRow,                   // rows produced by the row pass
       pfImgOut,
       iXRes * sizeof(float),          // dstStep
       roiSize,
       m_pfKernelColumn,
       m_iFilterSize,
       pbtTempBuffer);                 // work buffer
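The row-then-column scheme above can be sketched in plain C without IPP, which is handy for checking results against a direct 2D version. This is a minimal sketch assuming contiguous row-major float images and "valid" output; `conv_valid_separable` is hypothetical and is not how IPP implements its pipeline functions. The malloc'd `tmp` buffer loosely plays the role of the intermediate row buffer.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Separable "valid" convolution with kernel = col (kh taps) x row (kw taps).
 * Pass 1 filters every source row horizontally into tmp (src_h x dst_w);
 * pass 2 filters tmp vertically into dst (dst_h x dst_w).
 * Kernels are flipped, as in textbook convolution.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated.
 */
static int conv_valid_separable(const float *src, int src_w, int src_h,
                                const float *row, int kw,
                                const float *col, int kh,
                                float *dst)
{
    assert(kw >= 1 && kh >= 1 && kw <= src_w && kh <= src_h);
    int dst_w = src_w - kw + 1, dst_h = src_h - kh + 1;

    float *tmp = malloc((size_t)src_h * (size_t)dst_w * sizeof *tmp);
    if (!tmp) return -1;

    /* Row pass: all src_h rows first (kh - 1 more rows than the output has). */
    for (int y = 0; y < src_h; ++y)
        for (int x = 0; x < dst_w; ++x) {
            float acc = 0.0f;
            for (int j = 0; j < kw; ++j)
                acc += row[j] * src[y * src_w + x + kw - 1 - j];
            tmp[y * dst_w + x] = acc;
        }

    /* Column pass: then all columns, reading kh consecutive tmp rows. */
    for (int y = 0; y < dst_h; ++y)
        for (int x = 0; x < dst_w; ++x) {
            float acc = 0.0f;
            for (int i = 0; i < kh; ++i)
                acc += col[i] * tmp[(y + kh - 1 - i) * dst_w + x];
            dst[y * dst_w + x] = acc;
        }

    free(tmp);
    return 0;
}
```

Note that the row pass has to produce kh - 1 more rows than the final output, because each output row of the column pass reads kh intermediate rows. With a 3x3 box kernel (row = col = {1, 1, 1}) on a 4x4 image where pixel (x, y) = x + 10y, the 2x2 result is {99, 108; 189, 198}, the same as the direct 2D sum.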