sgwood
Beginner
157 Views

MKL speed performance vs. IPP

I have noticed that a number of the vector routines in IPP are significantly faster than their counterparts in MKL. For example on a core i7 processor I have noted that the ippsMul_32fc() function from IPP is about 45% faster than the vcMul() routine from MKL.

Is there a reasonable explanation for this?

BTW: I am comparing MKL 10.2.2 and IPP 6.1.2

-Simon
8 Replies
TimP
Black Belt

A more concrete example may be needed. For instance, you are probably calling IPP through its C interface and getting faster results (at least for short vectors) than you could get from a Fortran-compatible call into MKL, and certainly faster than a cblas wrapper.
Gennady_F_Intel
Moderator

Simon.
The main question in such cases is: what is the input size?
--Gennady
Thomas_B_3
Beginner

Hello,

I have a very similar question regarding the performance of the eigenvalue and eigenvector calculation (IPPM vs. MKL). Which library is recommended for an input size of the matrix between 20 and 40?

Thank you and best regards,
Tom
Gennady_F_Intel
Moderator

Tom,
For matrices of that size I would recommend trying the non-threaded version of IPP first.
--Gennady
Thomas_B_3
Beginner

Gennady,

thank you for your quick reply.

Tom
sgwood
Beginner

Gennady,
The input size in my example is 32768 complex elements. Here is an example of how my code looks.

=======================================================================

std::complex<float> x1[32768];
std::complex<float> x2[32768];
std::complex<float> y1[32768];
std::complex<float> y2[32768];

// Fill the x1 and x2 arrays with random data, uniformly distributed on [-10000, 10000].
// This is similar to the MKL benchmark test vectors.
// Also note that I have a version of rand() that returns a uniform RV on [0,1].

for (int ii = 0; ii < 32768; ii++)
{
    float r1I = -10000 + 20000 * rand();
    float r1Q = -10000 + 20000 * rand();
    float r2I = -10000 + 20000 * rand();
    float r2Q = -10000 + 20000 * rand();
    x1[ii] = std::complex<float>(r1I, r1Q);
    x2[ii] = std::complex<float>(r2I, r2Q);
}

// Now make the call to IPP.
// The first call warms up the cache; time the second call.
ippsMul_32fc((Ipp32fc *)&x1[0], (Ipp32fc *)&x2[0], (Ipp32fc *)&y1[0], 32768);

// your "tic" timer here
ippsMul_32fc((Ipp32fc *)&x1[0], (Ipp32fc *)&x2[0], (Ipp32fc *)&y1[0], 32768);
// your "toc" timer here

//
// Now repeat the above for the MKL vector multiply routine.
//
vcMul(32768, (MKL_Complex8 *)x1, (MKL_Complex8 *)x2, (MKL_Complex8 *)y2);

// your "tic" timer here
vcMul(32768, (MKL_Complex8 *)x1, (MKL_Complex8 *)x2, (MKL_Complex8 *)y2);
// your "toc" timer here

=======================================================================

The above code is placed in a main() routine. The compilation takes the form:

icpc test.cpp -O3 -L/opt/intel/ipp/6.1.2.051/ia32/sharedlib -lipps -lippcore ...

-Simon
Ilya_B_Intel
Employee

Simon,

The MKL function vcMul() gives a more accurate result by default than ippsMul_32fc().

To enable a less accurate but faster mode in MKL, call vmlSetMode(VML_EP) before the call to vcMul().

To use more accurate functions in IPP, check the Fixed-Accuracy Arithmetic Functions domain. ippsMul_32fc_A24() will be as accurate as vcMul() in VML_HA (default) mode.

- Ilya

sgwood
Beginner

Ilya,

Excellent! Thanks for the explanation. I overlooked that difference.

-Simon