Intel® oneAPI Math Kernel Library

vslsConvExecX performance

Beckett__Tony — Fri, 06 Jul 2018 19:21:33 GMT

vslsConvExecX is 10x slower than the IPP function IppFilter. Does this seem correct?

Ying_H_Intel — Mon, 09 Jul 2018 01:57:49 GMT

Hi Tony

Thank you for reporting the problem.
If possible, could you give us some background, such as your test CPU type, the vector size, and how you link MKL and IPP? A small reproducer would be helpful. If it is private, please submit the information to the Intel Online Service Center: http://supporttickets.intel.com/

Thanks
Ying

Beckett__Tony — Mon, 09 Jul 2018 12:42:52 GMT

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 142
model name	: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
stepping	: 9
cpu MHz		: 2904.004
cache size	: 4096 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch avx2 rdseed clflushopt
bogomips	: 5808.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual

#define IPP_VERSION_STR "2018.0.3"

#define INTEL_MKL_VERSION 20180002

    libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f986c843000)
   libmkl_gnu_thread.so => /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so (0x00007f986b130000)
   libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007f9867126000)

libippcore.so => /opt/intel/ipp/lib/intel64/libippcore.so (0x00007f529092b000)
   libippcc.so => /opt/intel/ipp/lib/intel64/libippcc.so (0x00007f5290710000)
   libippch.so => /opt/intel/ipp/lib/intel64/libippch.so (0x00007f529050a000)
   libippcv.so => /opt/intel/ipp/lib/intel64/libippcv.so (0x00007f52902e4000)
   libippdc.so => /opt/intel/ipp/lib/intel64/libippdc.so (0x00007f52900dc000)
   libippi.so => /opt/intel/ipp/lib/intel64/libippi.so (0x00007f528fe2a000)
   libipps.so => /opt/intel/ipp/lib/intel64/libipps.so (0x00007f528fbe0000)
   libippvm.so => /opt/intel/ipp/lib/intel64/libippvm.so (0x00007f528f9c9000)

partial code
 
    const int x_stride[2] = { 256, 1 };
    const int y_stride[2] = {   8, 1 };
    const int z_stride[2] = { 256, 1 };

    status = vslsConvNewTaskX(&task,
                              VSL_CONV_MODE_AUTO,   /* or VSL_CONV_MODE_DIRECT */
                              2,
                              x_shape,
                              y_shape,
                              z_shape,
                              x,
                              x_stride);

    /* Start at the kernel center unless an explicit anchor is given */
    const int conv_start[2] = { (anchor.y == -1) ? (y_shape[0] - 1) / 2 : anchor.y,
                                (anchor.x == -1) ? (y_shape[1] - 1) / 2 : anchor.x };

    status = vslConvSetStart(task, conv_start);

    status = vslsConvExecX(task,
                           y,
                           y_stride,
                           z,
                           z_stride);

    status = vslConvDeleteTask(&task);
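
For reference, the effect of the start offset in the partial code can be sketched in plain C. This is only an illustration, assuming the start index selects the first element of the full convolution result that gets written to the output. The function name conv2d_direct and its parameters are hypothetical and not part of MKL or IPP, and the sketch uses dense row-major arrays rather than the strides shown above.

```c
#include <assert.h>

/* Direct 2D convolution of x (xr-by-xc) with kernel y (yr-by-yc).
 * All arrays are dense and row-major. Writes a zr-by-zc window of the
 * full result, beginning at full-result index (start_r, start_c).
 * Hypothetical helper for illustration only. */
void conv2d_direct(const float *x, int xr, int xc,
                   const float *y, int yr, int yc,
                   float *z, int zr, int zc,
                   int start_r, int start_c)
{
    assert(x && y && z);
    for (int r = 0; r < zr; ++r) {
        for (int c = 0; c < zc; ++c) {
            int fr = r + start_r;   /* row index in the full result */
            int fc = c + start_c;   /* column index in the full result */
            float acc = 0.0f;
            for (int i = 0; i < yr; ++i) {
                for (int j = 0; j < yc; ++j) {
                    int sr = fr - i;
                    int sc = fc - j;
                    /* Samples outside x are treated as zero */
                    if (sr >= 0 && sr < xr && sc >= 0 && sc < xc)
                        acc += x[sr * xc + sc] * y[i * yc + j];
                }
            }
            z[r * zc + c] = acc;
        }
    }
}
```

With start set to (y_shape - 1) / 2 in each dimension and the output the same shape as the input, the result is the centered part of the full convolution, which is what a typical 2D image filter returns.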

Ying_H_Intel — Mon, 23 Jul 2018 05:38:23 GMT

Hi Tony,

What is your input, and what parameters did you use for the IPP filter?

Best Regards,

Ying

Ying_H_Intel — Fri, 27 Jul 2018 05:29:49 GMT

Hi Tony,

We discussed the issue internally. As you saw, there are two convolution implementations, one in MKL and one in IPP, and the IPP one performs better than vslsConvExecX. We also have a popular library for convolution, MKL-DNN: https://github.com/intel/mkl-dnn. We are interested in what kind of application you are working on. Could you give us some background?

Best Regards,
Ying

Beckett__Tony — Fri, 27 Jul 2018 19:16:41 GMT

We are doing image analysis. Currently we are using Linux as the OS. We can compile using either OpenCV or MKL/IPP. In this case, for the 2D filter function, OpenCV is 30% faster, and we thought that the Intel libraries should be faster. So we are confused.

Are you saying that for an 8x8 kernel on a 1024x1024 image, IPP should be faster?
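
To put the 8x8 on 1024x1024 case in numbers, here is a small plain-C sketch of the arithmetic. The helper names full_conv_len and direct_conv_macs are hypothetical, and the operation count is only an upper bound for a direct (non-FFT) method that produces one output per input pixel; it says nothing about how MKL, IPP, or OpenCV actually compute the result.

```c
#include <assert.h>

/* Length of a full convolution along one axis: n + k - 1. */
int full_conv_len(int n, int k)
{
    assert(n > 0 && k > 0);
    return n + k - 1;
}

/* Upper bound on multiply-adds for a direct 2D convolution that writes
 * one output per input pixel (edge outputs need fewer). */
long long direct_conv_macs(int rows, int cols, int krows, int kcols)
{
    assert(rows > 0 && cols > 0 && krows > 0 && kcols > 0);
    return (long long)rows * cols * krows * kcols;
}
```

For a 1024x1024 image and an 8x8 kernel, the full result is 1031 x 1031, and a same-size direct filter needs at most 67,108,864 multiply-adds per pass.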

Ying_H_Intel — Tue, 31 Jul 2018 08:58:17 GMT

Hi Tony,

Yes, IPP convolution is faster than vslsConvExecX. What do you mean by OpenCV being 30% faster? I would expect OpenCV to be optimized with IPP by default. Could you please provide us with a small test case?

I have attached the one we used for the IPP test.

Best Regards,
Ying

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "ipp.h"

/* check_sts is not defined in the snippet; this definition is an assumption */
#define check_sts(st) if ((st) != ippStsNoErr) { printf("IPP error %d\n", (int)(st)); return -1; }

int main(void)
{
    double time;
    clock_t t;
    IppStatus status = ippStsNoErr;
    Ipp32f* pSrc1 = NULL, *pSrc2 = NULL, *pDst = NULL; /* Pointers to source/destination images */
    int srcStep1 = 0, srcStep2 = 0, dstStep = 0; /* Steps, in bytes, through the source/destination images */
    IppiSize dstSize = { 1031, 1031 }; /* Size of the full result in pixels: 1024 + 8 - 1 */
    IppiSize src1Size = { 1024, 1024 }; /* Size of the source image ROI in pixels */
    IppiSize src2Size = { 8, 8 }; /* Size of the kernel ROI in pixels */
    int divisor = 2; /* The integer value by which the computed result is divided (not used by the 32f call below) */
    Ipp8u *pBuffer = NULL; /* Pointer to the work buffer */
    int iTmpBufSize = 0; /* Common work buffer size */
    int numChannels = 1;
    IppEnum funCfgFull = (IppEnum)(ippAlgAuto | ippiROIFull | ippiNormNone);

    pSrc2 = ippiMalloc_32f_C1(src2Size.width, src2Size.height, &srcStep2);
    pSrc1 = ippiMalloc_32f_C1(src1Size.width, src1Size.height, &srcStep1);
    pDst = ippiMalloc_32f_C1(dstSize.width, dstSize.height, &dstStep);

    check_sts( status = ippiConvGetBufferSize(src1Size, src2Size, ipp32f, numChannels, funCfgFull, &iTmpBufSize) )

    pBuffer = ippsMalloc_8u(iTmpBufSize);

    /* Fill both images with 1.0f. ippiSet honors the row step returned by
       ippiMalloc, which a flat pSrc[i] loop would not. */
    check_sts( status = ippiSet_32f_C1R(1.0f, pSrc1, srcStep1, src1Size) )
    check_sts( status = ippiSet_32f_C1R(1.0f, pSrc2, srcStep2, src2Size) )

    /* Time 100 convolutions; clock() reports CPU time */
    t = clock();
    for (int j = 0; j < 100; ++j) {
        check_sts( status = ippiConv_32f_C1R(pSrc1, srcStep1, src1Size, pSrc2, srcStep2, src2Size, pDst, dstStep, funCfgFull, pBuffer) )
    }
    t = clock() - t;
    time = (double)t / CLOCKS_PER_SEC;
    printf("%f \n", time);
    system("pause");

    return 0;