First test shows bad performance - what's wrong?

akerlund · ‎03-12-2008

I just installed MKL 9.1.027 and wanted to try it out with C++ in Visual Studio 2005.

First I made this simple wrapper namespace:

namespace mkl
{
    struct vec3f
    {
        vec3f(const float x, const float y, const float z)
        {
            e[0] = x;
            e[1] = y;
            e[2] = z;
        }
        float e[4];
    };

    inline void sqrt(const vec3f &in, vec3f &out)
    {
        vsSqrt(3, in.e, out.e);
    }
}

Then I also wrote this to compare with:
void oldSqrt(mkl::vec3f &in, mkl::vec3f &out)
{
    out.e[0] = sqrtf(in.e[0]);
    out.e[1] = sqrtf(in.e[1]);
    out.e[2] = sqrtf(in.e[2]);
}

And this is the testing code (I changed the function call and timed the different runs):
mkl::vec3f v(9.0f, 0.0f, 100.0f);
mkl::vec3f v2(0.0f, 0.0f, 0.0f);

for (unsigned int i = 0; i < 10000000; ++i)
    mkl::sqrt(v, v2);

Now, the run time for the standard sqrtf was 0.0003 seconds, but for the MKL version I had to wait 1.4 seconds! Why is this? These are my additional dependencies:
mkl_c_dll.lib
mkl_ia32.lib
libguide40.lib

Andrey_G_Intel2 · ‎03-13-2008

akerlund,

You are trying to calculate vsSqrt on a very short vector. In most cases you will not see a performance gain from VML on such small vectors. Try bigger vectors, with 100 elements or more.

Andrey
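
To see how the cost per element changes with vector length, a small timing helper along these lines may be useful. This is only a sketch: secondsPerElement and the volatile sink are hypothetical names that do not appear in the thread, and it times a plain sqrt loop. To time VML instead, the inner loop would be swapped for the vsSqrt call, which is not done here so that the snippet builds without MKL.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): average seconds per element
// for `repeats` passes of a scalar sqrt loop over n floats.
double secondsPerElement(std::size_t n, std::size_t repeats)
{
    if (n == 0 || repeats == 0)
        return 0.0;

    std::vector<float> in(n, 2.0f), out(n, 0.0f);
    volatile float sink = 0.0f;  // discourages the compiler from discarding the work

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < repeats; ++r)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = std::sqrt(in[i]);  // replace this loop with vsSqrt to time VML
        sink = sink + out[r % n];       // consume one result from every pass
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / (static_cast<double>(n) * static_cast<double>(repeats));
}
```

Calling it with, say, secondsPerElement(3, 1000000) and secondsPerElement(100000, 100) gives two figures that can be compared directly. An optimizing compiler may still simplify parts of such a loop, so the numbers are best treated as rough.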

akerlund · ‎03-24-2008

I only see vsSqrt running faster somewhere between 100k and 500k floats. Is this right? Here is the new code I am testing with:

int howMany;
cin >> howMany;
float *numbersIn = new float[howMany];
float *numbersOut = new float[howMany];
for (int i = 0; i < howMany; ++i)
    numbersIn[i] = (1.0f / RAND_MAX) * rand();

Timer tm;

//vsSqrt(howMany, numbersIn, numbersOut);
for (int i = 0; i < howMany; ++i)
    numbersOut[i] = sqrtf(numbersIn[i]);

float a = 0.0f;
for (int i = 0; i < howMany; ++i)
    a += numbersOut[i];

tm.Now();
printf("Time: %f | a: %f ", 1000.0f * tm.TimeElapsed(), a);

TimP · ‎03-24-2008

Your un-optimized sum reduction will take a significant part of the time, as well as likely producing insufficient accuracy, for such long vectors. Certainly, it would take rather long vectors before VML sqrt() could compete with optimized source code. If you are interested in performance on such code, you should consider SSE parallel intrinsics, or a vectorizing compiler.
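
As a rough illustration of the SSE intrinsics route mentioned above, the following sketch computes four square roots per instruction with _mm_sqrt_ps. The function name sqrtSse is hypothetical, the code assumes an x86 or x64 target with SSE enabled, and it uses unaligned loads so that it works on any float array.

```cpp
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Hypothetical helper (not from the thread): out[i] = sqrt(in[i]) for n floats.
void sqrtSse(const float *in, float *out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        const __m128 x = _mm_loadu_ps(in + i);   // unaligned load of 4 floats
        _mm_storeu_ps(out + i, _mm_sqrt_ps(x));  // 4 square roots at once
    }
    for (; i < n; ++i)  // scalar tail for the remaining 0 to 3 elements
        out[i] = std::sqrt(in[i]);
}
```

In the test code above this would be called as sqrtSse(numbersIn, numbersOut, howMany). A vectorizing compiler may generate similar code from the plain sqrtf loop on its own.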

levicki · ‎03-25-2008

Regarding your first test case: I am not sure how MKL handles floating point exceptions, but that may well be the cause of the slowdown. Try picking numbers so as to avoid denormals after repeatedly calculating square root for many iterations.
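
One way to check whether denormals are involved at all is to count them in the input and output buffers. The sketch below assumes a standard C++11 <cmath>; countSubnormals is a hypothetical helper name, not part of MKL or of the code posted in this thread.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the thread): number of subnormal (denormal)
// values among the first n floats of `data`.
std::size_t countSubnormals(const float *data, std::size_t n)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (std::fpclassify(data[i]) == FP_SUBNORMAL)
            ++count;
    }
    return count;
}
```

Running it over a test's inputs before the call and over its outputs afterwards shows whether any subnormal values are present in that test.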