Intel® oneAPI Math Kernel Library

Why is my code faster than MKL???

Michael_Stauffer
Beginner
OS X 10.4.10
Intel Core Duo 1.66 GHz
GCC 4.0.1
MKL 10.0.012

Hi,

I'm just trying out MKL. I'm doing some simple comparisons of VML with a simple complex vector math lib I wrote. My test program crunches a 1000-element complex vector 500k times in a simple loop, first using VML, then my lib.

My lib simply loops through a complex input array and performs element-by-element addition, absolute value, or multiplication, depending on the function. The multiplication and absolute value are true complex-number operations.
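
To be concrete, here's a minimal sketch of the kind of loop I mean (hypothetical names, not my actual code), over the interleaved (re,im) double arrays I use:

/* Element-wise complex add: the real and imaginary parts add
   independently, so this is just a plain loop over doubles. */
void my_cadd(double *out, const double *a, const double *b, int len)
{
    int n;
    for (n = 0; n < len; n++)   /* len = 2 * number of complex elements */
        out[n] = a[n] + b[n];
}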

Here are the timing results:
         VML   My Lib
add     1.43     1.90
mul    21.51     3.40
abs    17.18     2.21

As you can see, VML is slightly faster for addition, but *much* slower for multiplication and absolute value.

I'm compiling with gcc (from within Matlab) using these options:

-fno-common -no-cpp-precomp -fexceptions -march=pentium4 -O3 -ftree-vectorize -mfpmath=sse -msse -msse2 -Wall -pipe -DHAVE_INLINE -Wno-long-long

I figure gcc 4.0 is doing loop vectorization, so my simple code could be vectorized, but why would it be faster than VML?

I've tried building with different architectures and optimizations, but those don't affect the VML timings, which makes sense.

I've messed around with the various MKL options like accuracy modes and # of threads, but saw no changes.

Does anyone have any thoughts on this? Thanks, I'm very confused.

Cheers,
Michael
TimP
Honored Contributor III
VML would require rather long vectors to be competitive with compiler auto-vectorization; it caters more to those who continue to use compilers without auto-vectorization. If you have several operations in a single loop where you would otherwise require several VML calls, then, the way you are testing, the time required is probably at least proportional to the number of cache-to-register moves. If a single loop does the job of 2 VML calls, it should be at least twice as fast.
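
For example, computing d = a*b + c element-wise (a sketch, with hypothetical arrays):

/* Two passes, like two VML calls: the data makes two round trips
   over the cache-to-register path, plus a temporary array. */
void two_pass(double *tmp, double *d, const double *a,
              const double *b, const double *c, int n)
{
    int i;
    for (i = 0; i < n; i++) tmp[i] = a[i] * b[i];
    for (i = 0; i < n; i++) d[i] = tmp[i] + c[i];
}

/* One fused loop: the same arithmetic with roughly half the traffic. */
void fused(double *d, const double *a, const double *b,
           const double *c, int n)
{
    int i;
    for (i = 0; i < n; i++) d[i] = a[i] * b[i] + c[i];
}
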
Speaking of being confused, if you are so interested in performance, why are you using such a slow CPU and such an old gcc? How can we trust your timing method? When you say "from within Matlab," maybe that should have been a signal to ignore this?
Michael_Stauffer
Beginner
Tim, thanks for the reply.

I'm using a slow CPU cuz it's all we can afford for now in our startup. I hope to get a new Mac Pro early next year, fingers crossed. For now I'm working on this MacMini to try and optimize number-crunching intensive portions of a signal processing prototype built in Matlab. My partner does the actual scientific research using a Mac Pro where he runs large simulations. This is my first time working on vectorizing code. My previous optimization challenges have been around real-time code and memory management. If anyone has a good suggestion for where to start reading up on optimizing computationally heavy code, I'd like to look at it. What I've read so far has been fragmented.

I'm building "mex files" which are C dylibs that can be called directly from Matlab. By porting the numerical integration portion of our prototype from Matlab to C I already got a 5-10x speed increase without any particular thought yet to advanced speed optimizations. Although, it seems likely gcc is already verctorizing my code.

I'm using gcc 4.0 cuz that's what Xcode 2.4.1 has (and it's still the same in Xcode 3, it seems). Apple has their own version of gcc, so I guess the numbering may be different, although I know Apple's gcc 4.0 is based on GNU gcc 4.0.1.

I'll build a standalone C app for my tests instead of running them via dylibs with Matlab as a gui interface. That will be cleaner and isolate if something strange is happening with that setup.

As for the length of vectors that would make MKL competitive with auto-vectorization: I've tested a little more, using vectors with 1 million elements, but with less looping to keep the overall number of elemental calculations the same. It's interesting. The times for VML vzAbs and vzMul functions are practically flat. VML vzAdd and my custom routines slow down by up to 4x. But my routines are still faster than VML by up to 4x. That seems strange to me, so could there be something wrong with my testing methodology?

My test code looks like this:

start clock
for (n = 0; n < N; n++)
    vdMul(testVector)
stop clock
report time

start clock
for (n = 0; n < N; n++)
    MyComplexMul(testVector)
stop clock
report time

Does this kind of test even make sense?

Cheers and thanks,
Michael
TimP
Honored Contributor III
I got the impression from your earlier posting that you were performing multiply and add tests, where the compiler would have the advantage of "chaining" the operations (in the terminology of 25 years ago) rather than incurring cache misses between the operations.
If you select the best timers, you should be able to get adequate timing down to loop lengths of 10,000. I don't know the details of what is available on Mac OS, but you could use rdtsc low-level timers even without specific support for them.
If you have EIST, you would need to disable it in the BIOS for these timing tests.
If you upgrade to a current gcc or Intel compiler, you will have omp_get_wtime() available, which probably uses gettimeofday() or the like, which you should have already.
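
A minimal gettimeofday()-based timer would look something like this (a sketch):

#include <stdio.h>
#include <sys/time.h>

/* Wall-clock seconds via gettimeofday(): microsecond resolution,
   adequate once the timed region runs for many milliseconds. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Usage:
       double t0 = wall_seconds();
       ...timed loop...
       printf("%.6f s\n", wall_seconds() - t0);
*/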
Michael_Stauffer
Beginner
Tim,

Yes, I think my original post wasn't that clear. I'm testing each operation in its own loop.

From what you say, assuming that I have an accurate enough timer for the # of loop iterations I run, the basic testing methodology should be sound. Correct?

If I'm just using the standard library's clock() routine or gettimeofday(), and running a 100k-element vector with 5000 iterations in a loop, would you guess the timing results are accurate enough? The differences I'm getting now between my lib and VML's vzMul routine are a factor of 3, which must be way beyond any timer error.

I discovered my previous test code was a little buggy: I was reusing my test vector across the different loops, each of which tests a different function. As the vector's values changed from test to test, the timings apparently changed too. So now my results are a little different, and my lib and VML are closer:

comparison of time (sec) for 5000 iterations over a 100k-element complex vector:

          VML   my lib
vzAdd    4.97     5.41
vzMul   24.61     8.00
vzAbs   17.80    17.24

It still baffles me that vzMul is 3x slower than my lib in my tests.

Thanks for all the help.

Cheers,
Michael
Eugeny_G_Intel
Employee

Hello Michael,


I tried to reproduce the performance results you mention on a system similar to yours.
I'm not sure whether you use single or double precision complex data, but I could not reproduce similar results for either type.
(It's also hard to reproduce because we don't know the code you use for the functions in "my lib".)

But since your algorithms for Mul/Abs are faster than the MKL ones, my guess is that their accuracy is more relaxed than that of the current MKL functions.

For example, an accuracy-relaxed algorithm for multiplication could be:

for (i = 0; i < n; i++) {
    r[i].real = a[i].real*b[i].real - a[i].imag*b[i].imag;
    r[i].imag = a[i].imag*b[i].real + a[i].real*b[i].imag;
}
This algorithm can produce a completely inexact result in some cases.
For example, in single precision, when the arguments are (1+2^-23)+i*(1+2^-22) and (1+2^-23)+i*1.0, it yields a zero real part instead of approximately 1.421085e-14.
In double precision, when the arguments are (1+2^-52)+i*(1+2^-51) and (1+2^-52)+i*1.0, it yields a zero real part instead of approximately 4.93038065763132e-32.
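
A small standalone test (a sketch, using C99 ldexpf to build the arguments) demonstrates the single precision case:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* a = (1+2^-23) + i*(1+2^-22),  b = (1+2^-23) + i*1.0 */
    float ar = 1.0f + ldexpf(1.0f, -23), ai = 1.0f + ldexpf(1.0f, -22);
    float br = 1.0f + ldexpf(1.0f, -23), bi = 1.0f;

    float  naive   = ar * br - ai * bi;                 /* all in float */
    double widened = (double)ar * br - (double)ai * bi; /* exact here   */

    printf("naive real part:   %g\n", (double)naive); /* prints 0 */
    printf("widened real part: %g\n", widened);       /* ~1.421085e-14 */
    return 0;
}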

If you would like to have this inexact but fast algorithm in MKL, we could treat that as a feature request and consider relaxed-accuracy vcMul/vzMul functions in future MKL releases.

There is also the possibility of implementing relaxed accuracy for the vcAbs/vzAbs functions.

Thank you,

Eugeny

g_f_thomas
Beginner

Eugeny,

"The (Reference) manualaddresses programmers proficient in computational mathematics and assumes a working knowledge of the principles and vocabulary of linear algebra, mathematical statistics, and Fourier transforms."

I trust that you're not suggesting that future releases of MKL be dumbed down from current high standards. Given the choice between accuracy and speed I'd reckon that most MKL users would prefer the former, as now delivered. Here's hoping that it remains so.

Gerry

TimP
Honored Contributor III
I guess there is no point in providing a fast vectorized multiply which doesn't protect against cancellation, when most compilers can do the same. As the OP is using gcc, he might also try timing with a cast to force long double arithmetic, to see whether that matches the MKL timing.
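
Something like this sketch (names hypothetical):

/* Widen the intermediates so the products and the subtraction happen
   in x87 long double before rounding back to double. */
static double cmul_re(double ar, double ai, double br, double bi)
{
    return (double)((long double)ar * br - (long double)ai * bi);
}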
Michael_Stauffer
Beginner
Hi Eugeny,

Thanks for your reply and taking the time to try and reproduce my findings. Sorry for not replying sooner, I've been on deadline on another project.

I'm using double precision complex data in my tests. And yes, I'm using a straightforward algorithm like you figured:

#define _A a[n]
#define _B a[n+1]
#define _C b[n]
#define _D b[n+1]

for (n = 0; n < len; n += 2) {
    r = (_A*_C) - (_B*_D);
    i = (_B*_C) + (_A*_D);
    result[n]   = r;
    result[n+1] = i;
}

(Interestingly, using intermediate variables like this runs almost 2x faster than assigning directly into result[].)

I ran my tests using the accuracy-testing values you suggested, and yes, my real parts end up as 0. So it's an accuracy issue; that's very good to know. For my current project I'm working with an error tolerance of 1e-6 anyway, so I'll probably be fine, but I will look more closely at that.

My VML speed tests cycle through the three VML accuracy modes, and I don't see any difference in speed. Does that make sense? For each function, I'm just timing how long it takes to call it N times in a loop; each accuracy mode gets timed separately. When you suggested a feature request for faster, relaxed-accuracy versions, I take it those would be different from changing the existing accuracy modes?
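
For reference, this is roughly how I cycle the modes (a sketch; the header and constant names are my reading of the VML docs, so treat them as assumptions):

#include "mkl_vml.h"

void time_all_modes(void)
{
    vmlSetMode(VML_HA);   /* high accuracy */
    /* ...timed vzMul loop... */
    vmlSetMode(VML_LA);   /* low accuracy */
    /* ...timed vzMul loop... */
}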

I quickly tried what Tim suggested and cast my calcs to long double math. I don't see a change in my multiplication routine, but my absolute value routine slowed down by 30%. I'm not sure it's really doing long double math, though; I'll have to look closer when I'm less sleepy.

Cheers,
Michael
TimP
Honored Contributor III

Use of local scalar intermediates is one way to maintain performance without using C99 restrict qualifiers. You still haven't shown us enough to make intelligent guesses.
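
For example (a hypothetical sketch):

/* With restrict, stores to result[] cannot invalidate already-loaded
   a[]/b[] values, so the compiler can keep them in registers without
   scalar temporaries. Requires C99, or gcc's __restrict__. */
void cmul(double * restrict result, const double * restrict a,
          const double * restrict b, int len)  /* len = 2 * elements */
{
    int n;
    for (n = 0; n < len; n += 2) {
        result[n]   = a[n] * b[n] - a[n+1] * b[n+1];
        result[n+1] = a[n+1] * b[n] + a[n] * b[n+1];
    }
}
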
Certain CPUs are relatively slow at calculating sqrtl(), but you may still find
sqrtl((long double)real*real + (long double)imag*imag)
faster and more accurate than code which protects against under/overflow while using only doubles. As data movement between SSE and x87 registers is also quite slow, the context may make quite a difference.
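
Wrapped up, that is (a sketch):

#include <math.h>

/* long double intermediates have enough extra exponent range that
   re*re + im*im cannot overflow or underflow for double inputs. */
static double cabs_widened(double re, double im)
{
    return (double)sqrtl((long double)re * re + (long double)im * im);
}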