How to set a 'rec' structuring element in ippiMorphCloseBorder_8u_C1R



I am using IPP 9.0 (Intel 64) to try to reduce the processing time of a morphological closing operation on a 2048x2048 image. I have run perfsys and see a significant reduction in processing time when Parm5 is 'rec'. I assume this denotes a rectangular structuring element, which is what I'm using.

function              Parm1  Parm2  Parm3      Parm4  Parm5  Parm6  Parm7  Parm8  Comment  Clocks  per   Time (usec)
ippiMorphCloseBorder  8u     C1R    2048x2048  9x9    gen    repl   -      -      nLps=4   26.3    pxch  3.94E+04
ippiMorphCloseBorder  8u     C1R    2048x2048  9x9    ell    repl   -      -      nLps=4   26.4    pxch  3.96E+04
ippiMorphCloseBorder  8u     C1R    2048x2048  9x9    crs    repl   -      -      nLps=4   26.2    pxch  3.92E+04
ippiMorphCloseBorder  8u     C1R    2048x2048  9x9    rec    repl   -      -      nLps=4   2.2     pxch  3.29E+03
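
As a rough cross-check of the table, the clocks-per-pixel column can be converted into a whole-image time. The sketch below assumes a CPU clock of about 2.8 GHz (taken from the test hardware described in this post; perfsys may have used a slightly different measured frequency). The helper name clocksPerPixelToUsec is hypothetical and not part of IPP.

```cpp
#include <cassert>

// Convert a perfsys "clocks per pixel" figure into microseconds for one
// full image. cpuHz is an assumption (nominal clock of the test machine).
double clocksPerPixelToUsec(double clocksPerPixel, int width, int height,
                            double cpuHz) {
    assert(cpuHz > 0.0);
    // Total clock cycles spent on the whole image.
    const double totalClocks = clocksPerPixel * width * height;
    // Cycles -> seconds -> microseconds.
    return totalClocks / cpuHz * 1e6;
}
```

Under that assumption, clocksPerPixelToUsec(26.3, 2048, 2048, 2.8e9) comes out at roughly 3.9E+04 usec and clocksPerPixelToUsec(2.2, 2048, 2048, 2.8e9) at roughly 3.3E+03 usec, in line with the gen and rec rows. The first figure is also about 0.039 seconds.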


Here is the C++ code I'm using in my test program. It takes 0.039 seconds, which suggests a non-'rec' Parm5. I'm using Visual Studio 2013, compiling for x64, and testing on Windows 7 64-bit. My test hardware is an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz, 2801 Mhz, 2 Core(s), 4 Logical Processor(s):


        IppiMorphAdvState* pSpec = NULL;
        Ipp8u* pBuffer = NULL;
        IppiSize roiSize = { src.cols, src.rows };
        int specSize = 0, bufferSize = 0;
        IppiBorderType borderType = ippBorderRepl;
        Ipp16u borderValue = 0;
        Ipp8u pMask[9 * 9] = { 0 };
        IppiSize maskSize = { 9, 9 };
        ippiSet_8u_C1R(1, pMask, 9, maskSize);

        IppStatus status = ippiMorphAdvGetSize_8u_C1R(roiSize, maskSize, &specSize, &bufferSize);
        if (status != ippStsNoErr) return status;

        pSpec = (IppiMorphAdvState*)ippsMalloc_8u(specSize);
        pBuffer = (Ipp8u*)ippsMalloc_8u(bufferSize);

        status = ippiMorphAdvInit_8u_C1R(roiSize, pMask, maskSize, pSpec, pBuffer);
        if (status != ippStsNoErr) {
            return status;
        }

        t = (double)getTickCount();

        status = ippiMorphCloseBorder_8u_C1R(, detected_edges.step,
  , joined.step, roiSize, borderType, borderValue, pSpec, pBuffer);

        t = ((double)getTickCount() - t) / getTickFrequency();


The source and destination images are OpenCV Mat objects constructed from a pointer and step that were allocated with ippiMalloc_8u_C1.

How can I modify my code to get times similar to the perfsys results for the 'rec' type?

Thank you,

