topic MKL speed performance vs. IPP in Intel® oneAPI Math Kernel Library

MKL speed performance vs. IPP

sgwood — Mon, 19 Apr 2010 21:44:43 GMT

I have noticed that a number of the vector routines in IPP are significantly faster than their counterparts in MKL. For example, on a Core i7 processor I measured the ippsMul_32fc() function from IPP to be about 45% faster than the vcMul() routine from MKL.

Is there a reasonable explanation for this?

BTW: I am comparing MKL 10.2.2 and IPP 6.1.2

-Simon

TimP — Mon, 19 Apr 2010 23:21:28 GMT

A more definite example might be required. For example, you are probably calling IPP through its C interface and getting faster results (at least for short vectors) than you could get from a Fortran-compatible call to MKL, and certainly faster than from a cblas wrapper.

Gennady_F_Intel — Tue, 20 Apr 2010 06:30:33 GMT

Simon,

The main question in such cases is: what is the input size?

--Gennady

Thomas_B_3 — Tue, 20 Apr 2010 08:56:02 GMT

Hello,

I have a very similar question regarding the performance of the eigenvalue and eigenvector calculation (IPPM vs. MKL). Which library is recommended for matrix sizes between 20 and 40?

Thank you and best regards,
Tom

Gennady_F_Intel — Tue, 20 Apr 2010 10:53:40 GMT

Tom,

For sizes like that, I would recommend trying IPP (the non-threaded version) first.

--Gennady

Thomas_B_3 — Tue, 20 Apr 2010 12:45:14 GMT

Gennady,

thank you for your quick reply.

Tom

sgwood — Tue, 20 Apr 2010 13:04:20 GMT

Gennady,
The input size in my example is 32768 complex elements. Here is an example of how my code looks.

=======================================================================

// At file scope: #include <complex>, <ipps.h> and <mkl.h>.
// "complex" is assumed here to be std::complex<float>, which has the same
// memory layout as Ipp32fc and MKL_Complex8.
typedef std::complex<float> complex;

complex x1[32768];
complex x2[32768];
complex y1[32768];   // result of the IPP call
complex y2[32768];   // result of the MKL call

// Fill the x1 and x2 arrays with random data, uniformly distributed on [-10000,10000].
// This is similar to the MKL benchmark test vectors.
// Also note that I have a version of rand() that returns a uniform RV on [0,1].

for(int ii=0; ii<32768; ii++)
{
    float r1I = -10000 + 20000 * rand();
    float r1Q = -10000 + 20000 * rand();
    float r2I = -10000 + 20000 * rand();
    float r2Q = -10000 + 20000 * rand();
    x1[ii] = complex(r1I,r1Q);
    x2[ii] = complex(r2I,r2Q);
}

// Now make the call to IPP.
// The first call "warms up" the cache. Time the second call.
ippsMul_32fc((Ipp32fc *)(&x1[0]), (Ipp32fc *)(&x2[0]), (Ipp32fc *)(&y1[0]), 32768);

// your "tic" timer here
ippsMul_32fc((Ipp32fc *)(&x1[0]), (Ipp32fc *)(&x2[0]), (Ipp32fc *)(&y1[0]), 32768);
// your "toc" timer here

//
// Now repeat the above for the MKL vector multiply routine
//
vcMul(32768, (MKL_Complex8 *)x1, (MKL_Complex8 *)x2, (MKL_Complex8 *)y2);

// your "tic" timer here
vcMul(32768, (MKL_Complex8 *)x1, (MKL_Complex8 *)x2, (MKL_Complex8 *)y2);
// your "toc" timer here

=======================================================================

The above code is put in a main() routine. The compilation takes the form:

icpc test.cpp -O3 -L/opt/intel/ipp/6.1.2.051/ia32/sharedlib -lipps -lippcore ...

-Simon
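For readers who want to fill in the "tic"/"toc" placeholders, here is a minimal sketch of a warm-up-then-time harness. It is not the timer used in the post: best_time_seconds and mul_reference are hypothetical names, the use of std::chrono::steady_clock and the "fastest of several runs" policy are assumptions, and the plain loop only stands in for the library call being measured.

```cpp
#include <algorithm>
#include <chrono>
#include <complex>
#include <cstddef>
#include <limits>

// Hypothetical helper: calls fn once untimed to warm the cache, then
// returns the fastest of `reps` timed calls, in seconds.
template <typename Fn>
double best_time_seconds(Fn&& fn, int reps = 20) {
    using clock = std::chrono::steady_clock;
    fn();  // warm-up call, not timed
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < reps; ++i) {
        const auto t0 = clock::now();   // "tic"
        fn();
        const auto t1 = clock::now();   // "toc"
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Plain-loop stand-in for the routine under test (element-wise complex
// multiply). Replace the body of the lambda passed to best_time_seconds
// with the IPP or MKL call to time those instead.
void mul_reference(const std::complex<float>* a, const std::complex<float>* b,
                   std::complex<float>* r, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        r[i] = a[i] * b[i];
    }
}
```

Typical use is to wrap each call in a lambda, for example best_time_seconds([&]{ mul_reference(x1, x2, y1, 32768); }), and compare the returned times. Taking the minimum over several runs reduces the influence of one-off interruptions, which matters when a single call takes only tens of microseconds.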

Ilya_B_Intel — Wed, 21 Apr 2010 10:53:09 GMT

Simon,

By default, the MKL function vcMul() gives a more accurate result than ippsMul_32fc().

To enable the less accurate but faster variant in MKL, call vmlSetMode(VML_EP) before calling vcMul().

For more accurate functions in IPP, see the Fixed-Accuracy Arithmetic Functions domain. ippsMul_32fc_A24() is as accurate as vcMul() in VML_HA (default) mode.

- Ilya
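The mode switch described above might be wrapped as follows. This is only a sketch: mul_fast and SKETCH_HAVE_MKL are hypothetical names, the casts assume std::complex<float> has the same layout as MKL_Complex8, and the plain-loop fallback exists only so the snippet compiles on a machine without MKL headers.

```cpp
#include <complex>

#if defined(__has_include)
#  if __has_include(<mkl.h>)
#    include <mkl.h>
#    define SKETCH_HAVE_MKL 1
#  endif
#endif

// Hypothetical wrapper: element-wise complex multiply using VML's
// enhanced-performance (EP) mode, restoring the previous mode afterwards.
void mul_fast(const std::complex<float>* a, const std::complex<float>* b,
              std::complex<float>* r, int n) {
#ifdef SKETCH_HAVE_MKL
    // vmlSetMode returns the mode that was active before the call,
    // so the caller's setting (VML_HA by default) can be put back.
    const unsigned int old_mode = vmlSetMode(VML_EP);
    vcMul(n,
          reinterpret_cast<const MKL_Complex8*>(a),
          reinterpret_cast<const MKL_Complex8*>(b),
          reinterpret_cast<MKL_Complex8*>(r));
    vmlSetMode(old_mode);
#else
    // Fallback when MKL is not available: an ordinary loop, no accuracy modes.
    for (int i = 0; i < n; ++i) {
        r[i] = a[i] * b[i];
    }
#endif
}
```

Restoring the old mode keeps the reduced accuracy from leaking into other VML calls in the same program. If the whole application can tolerate EP accuracy, a single vmlSetMode(VML_EP) at startup is simpler.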

sgwood — Fri, 30 Apr 2010 15:32:12 GMT

Ilya,

Excellent! Thanks for the explanation. I overlooked that difference.

-Simon