Intel Core Duo 1.66 GHz
GCC 4.0.1
MKL 10.0.012
Hi,
I'm just trying out MKL. I'm doing some simple comparisons of VML with a simple complex vector math lib I wrote. My test program crunches a 1000-element complex vector 500k times in a simple loop, first using VML, then my lib.
My lib simply loops through a complex input array and performs element-by-element addition, multiplication, or absolute value, depending on the function; the multiplication and absolute value are complex-number operations.
Here are the timing results:
Operation    VML   My Lib
add         1.43     1.90
mul        21.51     3.40
abs        17.18     2.21
As you can see, VML is slightly faster for addition, but *much* slower for multiplication and absolute value.
I'm compiling with gcc (from within Matlab) using these options:
-fno-common -no-cpp-precomp -fexceptions -march=pentium4 -O3 -ftree-vectorize -mfpmath=sse -msse -msse2 -Wall -pipe -DHAVE_INLINE -Wno-long-long
I figure gcc 4.0 is doing loop vectorization, so my simple code could be vectorized, but why would it be faster than VML?
I've tried building with different architectures and optimization levels, but those don't affect the VML timings, which makes sense.
I've messed around with the various MKL options like accuracy modes and # of threads, but saw no changes.
Does anyone have any thoughts on this? Thanks, I'm very confused.
Cheers,
Michael
Speaking of being confused, if you are so interested in performance, why are you using such a slow CPU and such an old gcc? How can we trust your timing method? When you say "from within Matlab," maybe that should have been a signal to ignore this?
I'm using a slow CPU because it's all we can afford for now in our startup. I hope to get a new Mac Pro early next year, fingers crossed. For now I'm working on this Mac Mini to try to optimize the number-crunching-intensive portions of a signal-processing prototype built in Matlab. My partner does the actual scientific research on a Mac Pro, where he runs large simulations. This is my first time vectorizing code; my previous optimization challenges have been around real-time code and memory management. If anyone has a good suggestion for where to start reading up on optimizing computationally heavy code, I'd like to look at it. What I've read so far has been fragmented.
I'm building "mex files", which are C dylibs that can be called directly from Matlab. By porting the numerical-integration portion of our prototype from Matlab to C, I already got a 5-10x speed increase without any particular thought yet to advanced speed optimizations. Though it seems likely gcc is already vectorizing my code.
I'm using gcc 4.0 because that's what Xcode 2.4.1 has (and it's still the same in Xcode 3, it seems). Apple has their own version of gcc, so I guess the numbering may be different, although I know Apple's gcc 4.0 is based on GNU gcc 4.0.1.
I'll build a standalone C app for my tests instead of running them via dylibs with Matlab as a GUI front end. That will be cleaner and will isolate whether something strange is happening with that setup.
As for the length of vectors that would make MKL competitive with auto-vectorization: I've tested a little more, using vectors with 1 million elements, but with less looping to keep the overall number of elemental calculations the same. It's interesting. The times for VML vzAbs and vzMul functions are practically flat. VML vzAdd and my custom routines slow down by up to 4x. But my routines are still faster than VML by up to 4x. That seems strange to me, so could there be something wrong with my testing methodology?
My test code looks like this:
start clock
for (n = 0; n < numLoops; n++) vzMul(testVector)
stop clock
report time

start clock
for (n = 0; n < numLoops; n++) MyComplexMul(testVector)
stop clock
report time
Does this kind of test even make sense?
Cheers and thanks,
Michael
If you select the best timers, you should be able to get adequate timing down to loop lengths of 10,000. I don't know the details of what is available on Mac OS, but you could use the low-level rdtsc timers even without specific support for them.
If you have EIST (SpeedStep), you would need to disable it in the BIOS for these timing tests.
If you upgraded to a current gcc or the Intel compiler, you would have the omp_get_wtime() timer available, which probably uses gettimeofday() or the like, which you should already have anyway.
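For reference, a gettimeofday()-based wall-clock timer is only a few lines of C and is available on Mac OS X without any extra libraries; its microsecond resolution is plenty for runs lasting whole seconds. A minimal sketch:

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, built on POSIX gettimeofday().
   Microsecond resolution; good enough for multi-second benchmark runs. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}
```

Usage is the usual pattern: `t0 = wall_seconds(); /* work */; elapsed = wall_seconds() - t0;`.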
Yes, I think my original post wasn't that clear: I'm testing each operation in its own loop.
From what you say, assuming that I have an accurate enough timer for the # of loop iterations I run, the basic testing methodology should be sound. Correct?
If I'm just using stdlib's 'clock' routine or gettimeofday(), and running a 100k-element vector with 5000 iterations in a loop, would you guess the timing results are accurate enough? The differences I'm getting now between my lib and VML's vzmul routine are a factor of 3, and that must be way beyond any timer error.
I discovered my previous test code was a little buggy, in that I was reusing my test vector for the different loops, each testing a different function. This seems like it resulted in some different optimizations as the vector's values changed. So now my results are a little different and my lib and vml are closer:
comparison of time (sec) for 5000 iterations over a 100k-element complex vector:
Function    VML   My lib
vzAdd      4.97     5.41
vzMul     24.61     8.00
vzAbs     17.80    17.24
It still baffles me that vzMul is 3x slower than my lib in my tests.
Thanks for all the help.
Cheers,
Michael
Hello Michael,
I tried to reproduce the performance results you mention on a system similar to yours.
I'm not sure whether you are using single- or double-precision complex data,
but I could not reproduce similar results for either type.
(It's also hard to reproduce because we don't know the code behind the functions in "my lib".)
But since your Mul/Abs algorithms are faster than the MKL ones,
my guess is that their accuracy is more relaxed than in the current MKL functions.
For example, an accuracy-relaxed algorithm for multiplication could be:

for (i = 0; i < n; i++) {
    r[i].real = a[i].real*b[i].real - a[i].imag*b[i].imag;
    r[i].imag = a[i].imag*b[i].real + a[i].real*b[i].imag;
}
This algorithm can produce a totally inexact result in some cases.
For example for single precision complex type, when arguments are (1+2^-23)+i*(1+2^-22) and (1+2^-23)+i*1.0,
it results in zero real part of result instead of approximately 1.421085e-14.
For example for double precision complex type, when arguments are (1+2^-52)+i*(1+2^-51) and (1+2^-52)+i*1.0,
it results in zero real part of result instead of approximately 4.93038065763132e-32.
If you would like to have this inexact but fast algorithm in MKL, we could treat that as a feature request
and consider relaxed-accuracy vcMul/vzMul functions in future MKL releases.
There is also the possibility of implementing relaxed accuracy for the vcAbs/vzAbs functions.
Thank you,
Eugeny
Eugeny,
"The (Reference) manual addresses programmers proficient in computational mathematics and assumes a working knowledge of the principles and vocabulary of linear algebra, mathematical statistics, and Fourier transforms."
I trust that you're not suggesting that future releases of MKL be dumbed down from their current high standards. Given the choice between accuracy and speed, I'd reckon most MKL users would prefer the former, as now delivered. Here's hoping it remains so.
Gerry
Thanks for your reply and for taking the time to try to reproduce my findings. Sorry for not replying sooner; I've been on deadline on another project.
I'm using double-precision complex data in my tests. And yes, I'm using a straightforward algorithm like you figured:
#define _A a[n]
#define _B a[n+1]
#define _C b[n]
#define _D b[n+1]
for( n = 0; n < 2*N; n += 2 ) {
    r = (_A*_C) - (_B*_D);
    i = (_B*_C) + (_A*_D);
    result[n] = r;
    result[n+1] = i;
}
(Interestingly, using intermediate variables runs almost 2x faster than assigning directly to result[].)
I ran my tests using the accuracy-testing values you suggested, and yes, my real parts end up as 0. So it's an accuracy issue, that's very good to know. For my current project, I'm working with an error tolerance of 1e-6 anyway, so I'll probably be fine, but will look more closely at that.
My VML speed tests cycle through the three VML accuracy modes, and I don't see any difference in speed. Does that make sense? For each function, I'm just timing how long it takes to call the function N times in a loop; each accuracy mode is timed separately. When you suggest a feature request for faster, relaxed-accuracy versions, I guess those would be different from changing the existing accuracy modes?
I quickly tried what Tim suggested, casting my calculations to long double math. I don't see a change in my multiplication routine, but my absolute-value routine slowed down by 30%. I'm not sure it's really doing long double math, though; I'll have to look closer when I'm less sleepy.
Cheers,
Michael
Use of local scalar intermediates is one way to maintain performance without using C99 restrict qualifiers. You still haven't shown us enough to make intelligent guesses.
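For reference, a restrict-qualified variant of such a loop might look like the sketch below; the function name and the interleaved re/im layout are assumptions based on the code posted earlier in the thread. `restrict` promises the compiler that result[] cannot alias a[] or b[], so stores to result[] don't force reloads, with no manual intermediates needed.

```c
/* C99 restrict: the three pointers are promised not to alias, so the
   compiler may keep a[] and b[] values in registers across the stores. */
static void cmul_restrict(const double *restrict a,
                          const double *restrict b,
                          double *restrict result, int n)
{
    for (int k = 0; k < 2 * n; k += 2) {
        result[k]     = a[k] * b[k]     - a[k + 1] * b[k + 1];
        result[k + 1] = a[k + 1] * b[k] + a[k] * b[k + 1];
    }
}
```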
Certain CPUs will be relatively slow at calculating sqrtl(), but you may still find
sqrtl((long double)real*real + (long double)imag*imag)
faster and more accurate than code that protects against under/overflow while using only doubles. Since data movement between SSE and x87 registers is also quite slow, the context can make quite a difference.
