sgwood
Beginner
157 Views

MKL speed performance vs. IPP

I have noticed that a number of the vector routines in IPP are significantly faster than their counterparts in MKL. For example on a core i7 processor I have noted that the ippsMul_32fc() function from IPP is about 45% faster than the vcMul() routine from MKL.

Is there a reasonable explanation for this?

BTW: I am comparing MKL 10.2.2 and IPP 6.1.2

-Simon
8 Replies
TimP
Black Belt

A more concrete example may be needed. For instance, you are probably calling IPP through its C interface and getting faster results (at least for short vectors) than you could get from a Fortran-compatible call into MKL, and certainly faster than a cblas wrapper.
Gennady_F_Intel
Moderator

Simon.
The main question in such cases is: what is the input size?
--Gennady
Thomas_B_3
Beginner

Hello,

I have a very similar question regarding the performance of the eigenvalue and eigenvector calculation (IPPM vs. MKL). Which library is recommended for an input size of the matrix between 20 and 40?

Thank you and best regards,
Tom
Gennady_F_Intel
Moderator

Tom,
For matrices of that size I would recommend trying the non-threaded version of IPP first.
--Gennady
Thomas_B_3
Beginner

Gennady,

thank you for your quick reply.

Tom
sgwood
Beginner

Gennady,
The input size in my example is 32768 complex elements. Here is an example of how my code looks.

=======================================================================

std::complex<float> x1[32768];
std::complex<float> x2[32768];
std::complex<float> y1[32768];
std::complex<float> y2[32768];

// Fill the x1 and x2 arrays with random data, uniformly distributed on [-10000, 10000].
// This is similar to the MKL benchmark test vectors.
// Also note that I have a version of rand() that returns a uniform RV on [0,1].

for (int ii = 0; ii < 32768; ii++)
{
    float r1I = -10000 + 20000 * rand();
    float r1Q = -10000 + 20000 * rand();
    float r2I = -10000 + 20000 * rand();
    float r2Q = -10000 + 20000 * rand();
    x1[ii] = std::complex<float>(r1I, r1Q);
    x2[ii] = std::complex<float>(r2I, r2Q);
}

// Now make the call to IPP.
// The first call warms up the cache; time the second call.
ippsMul_32fc((Ipp32fc *)&x1[0], (Ipp32fc *)&x2[0], (Ipp32fc *)&y1[0], 32768);

// your "tic" timer here
ippsMul_32fc((Ipp32fc *)&x1[0], (Ipp32fc *)&x2[0], (Ipp32fc *)&y1[0], 32768);
// your "toc" timer here

//
// Now repeat the above for the MKL vector multiply routine.
//
vcMul(32768, (MKL_Complex8 *)x1, (MKL_Complex8 *)x2, (MKL_Complex8 *)y2);

// your "tic" timer here
vcMul(32768, (MKL_Complex8 *)x1, (MKL_Complex8 *)x2, (MKL_Complex8 *)y2);
// your "toc" timer here

=======================================================================

The above code is placed in a main() routine. The compilation takes the form:

icpc test.cpp -O3 -L/opt/intel/ipp/6.1.2.051/ia32/sharedlib -lipps -lippcore ...

-Simon
Ilya_B_Intel
Employee

Simon,

The MKL function vcMul() gives a more accurate result by default than ippsMul_32fc().

To enable a less accurate but faster mode in MKL, call vmlSetMode(VML_EP) before the call to vcMul().

To use more accurate functions in IPP, check the Fixed-Accuracy Arithmetic Functions domain. ippsMul_32fc_A24() will be as accurate as vcMul() in VML_HA (default) mode.

- Ilya

sgwood
Beginner

Ilya,

Excellent! Thanks for the explanation. I overlooked that difference.

-Simon