Is there any plan to add the 32f_C1PV datatype for the ipprFilter function? In particular, I apply large (9x9x9) Gaussian filters to volume data extensively, and currently do it with two calls to ippiFilterGaussianBorder_32f_C1R, once for XY-planes and once for Z-planes. This incurs a lot of data transfer overhead, especially for filtering in Z. A single call to an ipprFilter_32f_C1PV or an ipprFilterGaussianBorder_32f_C1PV would be a huge help.
IPP filter functions are one- and two-dimensional only.
The two-step approach you describe works and is efficient on small data volumes. When your volume is large, you might run into memory bandwidth issues. You can quickly check this with tools like Intel VTune™. If you find that your algorithm is memory-bandwidth limited, consider changing the algorithm.
Small volume data means data up to the size of the processor's cache. For best performance, make sure that the data you are working on fits into the cache. To do so, split your volume into multiple smaller data cubes, each small enough to fit into the processor's cache. Then apply your two-step filter completely to the first cube before moving on to the next cube. This significantly reduces memory traffic on the bus, because each piece of data is loaded from memory only once. Algorithms that take the physical limits of the processor cache into account are often called cache-blocking algorithms. Searching for that term will turn up plenty of material.
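To make the cube splitting a bit more concrete, here is a minimal sketch in plain C. It is not IPP code: `block_edge_for_cache`, `for_each_block` and `count_voxels` are hypothetical helpers made up for this post, and the assumption that two float buffers (source and destination) have to share the cache is only a starting point that you would tune by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Largest cube edge (in voxels) such that `buffers` float cubes of that
   edge fit into `cache_bytes`. Hypothetical helper, not an IPP function. */
static int block_edge_for_cache(size_t cache_bytes, int buffers)
{
    assert(buffers > 0);
    size_t voxels = cache_bytes / (sizeof(float) * (size_t)buffers);
    int edge = 0;
    while ((size_t)(edge + 1) * (edge + 1) * (edge + 1) <= voxels)
        ++edge;
    return edge;
}

/* Callback invoked once per cube with half-open ranges [x0,x1) etc. */
typedef void (*block_fn)(int x0, int x1, int y0, int y1,
                         int z0, int z1, void *ctx);

/* Visit the volume cube by cube; cubes at the far faces are clipped
   to the volume extent. */
static void for_each_block(int nx, int ny, int nz, int edge,
                           block_fn fn, void *ctx)
{
    assert(edge > 0);
    for (int z0 = 0; z0 < nz; z0 += edge)
        for (int y0 = 0; y0 < ny; y0 += edge)
            for (int x0 = 0; x0 < nx; x0 += edge) {
                int x1 = x0 + edge < nx ? x0 + edge : nx;
                int y1 = y0 + edge < ny ? y0 + edge : ny;
                int z1 = z0 + edge < nz ? z0 + edge : nz;
                fn(x0, x1, y0, y1, z0, z1, ctx);
            }
}

/* Example callback: adds the number of voxels in each cube to a long
   counter. In real code this is where both filter passes would run. */
static void count_voxels(int x0, int x1, int y0, int y1,
                         int z0, int z1, void *ctx)
{
    *(long *)ctx += (long)(x1 - x0) * (y1 - y0) * (z1 - z0);
}
```

In your case the callback would run the XY pass and then the Z pass on one cube before the loop moves on, so the intermediate result never has to leave the cache.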
From personal experience, you can easily get a performance boost of 30% or more (depending on data, algorithm, and compute complexity). When doing so, pay special attention to the cube borders during calculation: to filter a voxel at the edge of a cube you also need the neighboring voxels within the kernel radius (4 voxels on each side for a 9x9x9 kernel), and these belong to the adjacent cubes. Also check the processor's cache size for the target system at ark.intel.com, as cache sizes differ significantly across processor segments and SKUs (IPP can help you here too, since it provides functions that return the cache size of the processor).
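The border handling can be sketched in plain C as well. This is a 1D stand-in for your two-step filter, not the IPP calls themselves: the same kernel is applied twice along one axis, and all function names here are hypothetical. The point it shows is that the first pass has to be computed for the block plus a halo of `r` samples on each side, so that the second pass finds every intermediate value it needs. Replicated samples at the ends of the signal are an assumption. Use whatever border type you pass to IPP.

```c
#include <assert.h>
#include <stdlib.h>

static int clampi(int i, int lo, int hi)
{
    return i < lo ? lo : (i > hi ? hi : i);
}

/* One filter pass for output indices [begin, end) of a signal of total
   length n. `in` stores the signal starting at index in_off, `out`
   starting at out_off. Indices outside [0, n-1] replicate the edge. */
static void pass_1d(const float *in, int in_off, int n,
                    float *out, int out_off, int begin, int end,
                    const float *k, int r)
{
    for (int i = begin; i < end; ++i) {
        float acc = 0.0f;
        for (int t = -r; t <= r; ++t)
            acc += k[t + r] * in[clampi(i + t, 0, n - 1) - in_off];
        out[i - out_off] = acc;
    }
}

/* Reference: both passes over the whole signal, full-size temporary. */
static void two_pass_full(const float *src, float *dst, int n,
                          const float *k, int r)
{
    float *tmp = malloc((size_t)n * sizeof *tmp);
    assert(tmp);
    pass_1d(src, 0, n, tmp, 0, 0, n, k, r);
    pass_1d(tmp, 0, n, dst, 0, 0, n, k, r);
    free(tmp);
}

/* Blocked: per block, run pass 1 on the block plus its halo into a small
   scratch buffer, then run pass 2 on the block itself from that buffer. */
static void two_pass_blocked(const float *src, float *dst, int n,
                             int block, const float *k, int r)
{
    float *scratch = malloc((size_t)(block + 2 * r) * sizeof *scratch);
    assert(scratch);
    for (int b0 = 0; b0 < n; b0 += block) {
        int b1 = b0 + block < n ? b0 + block : n;
        int h0 = clampi(b0 - r, 0, n);   /* halo start, clipped to signal */
        int h1 = clampi(b1 + r, 0, n);   /* halo end, clipped to signal */
        pass_1d(src, 0, n, scratch, h0, h0, h1, k, r);
        pass_1d(scratch, h0, n, dst, 0, b0, b1, k, r);
    }
    free(scratch);
}
```

The price of blocking is that the halo samples of the first pass are computed twice, once for each of the two neighboring blocks. With a radius of 4 that overhead shrinks as the cubes get larger, which is one more reason to make them as big as the cache allows.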
Thanks & Best,