Intel® oneAPI Math Kernel Library

MKL 8.0 performance (or lack thereof)

darin1
Beginner
I'm using a Pentium D (dual core, with Hyper-Threading activated). I'm testing/evaluating MKL 8.0 performance with a C++ application (which uses OpenCV and IPP 5.0 beta) on Windows XP.

Previously I was using CLAPACK compiled in release mode with Visual Studio 2005 beta. I replaced that library with MKL, using the IA-32 libraries. The conversion was simply a matter of changing the libraries passed to the linker.

Note: I tried using the EM64T versions, but I was getting an error during some C++ initialization code in the Microsoft libraries (i.e. creating the application context before my application's main is called). I need to investigate this further, since the Pentium D supports EM64T. Perhaps I need to change a compiler option.

What I found is that CLAPACK performs faster than the MKL library. With MKL my application takes approximately 8000 seconds, while with CLAPACK it takes 6500 seconds, a difference of 1500 seconds (or 25 minutes). This seems to be a significant performance penalty for an optimized library.

The only explanation I can find for this difference is that CLAPACK is using imprecise floating point while MKL is using precise floating point.

I have reviewed the MKL documentation, and it says that I don't have to define any environment variables (on Windows) and that the number of threads will be determined automatically.

So currently I'm hard pressed to justify the cost of MKL based on this, i.e. spending money to get less performance.

Can someone identify how to improve performance with MKL?

For reference, I'm using the DLL version of MKL, since OpenCV uses libguide40. I think there should be only a slight difference between using a DLL and a static library.

Is it possible that MKL is selecting the wrong processor type and supported features (i.e. not using SSE3)? How would I verify this?
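
One way to check which processor-specific branch MKL dispatched to is to query its version information at run time. Below is a minimal sketch, assuming the MKLGetVersion/MKLVersion helpers shipped with this generation of MKL; the header and field names (mkl.h, Build, Processor) are assumptions from memory and should be checked against the installed mkl_types.h:

    #include <cstdio>
    #include <mkl.h>   // assumed to declare MKLVersion and MKLGetVersion in this MKL generation

    int main()
    {
        MKLVersion ver;
        MKLGetVersion(&ver);   // fills in version info, including the detected processor branch

        // The "Build" and "Processor" field names are assumptions -- verify against mkl_types.h.
        std::printf("MKL %d.%d (build %s)\n", ver.MajorVersion, ver.MinorVersion, ver.Build);
        std::printf("Processor branch MKL dispatched to: %s\n", ver.Processor);
        return 0;
    }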
TimP
Honored Contributor III
MKL defaults to 1 thread. If you want it to use more threads, you must set the maximum number of threads in the environment variable OMP_NUM_THREADS.
I generally find the best performance with HT disabled and OMP_NUM_THREADS=2. You are welcome to experiment.
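
For reference, the thread count can also be pinned from inside the application rather than through the environment. A minimal sketch, assuming the libguide OpenMP runtime that MKL links against honors omp_set_num_threads; the value 2 is just the dual-core case discussed here:

    #include <cstdio>
    #include <omp.h>   // the OpenMP runtime provided by libguide40

    int main()
    {
        // Cap the number of OpenMP threads MKL's threaded kernels may use;
        // equivalent to setting OMP_NUM_THREADS=2 before starting the program.
        omp_set_num_threads(2);
        std::printf("max OpenMP threads: %d\n", omp_get_max_threads());

        // ... MKL/LAPACK calls made after this point may use up to 2 threads ...
        return 0;
    }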
I'm confused about which compiler you are using, and whether you are using 32- or 64-bit mode. For ICL, you would probably want the options -QxP -Qansi_alias -Qopenmp.
SSE3 would likely make a difference in complex functions (ZGEMM et al.).
I don't think you have given us a clue about where you got your super CLAPACK. If you compiled all the BLAS used by CLAPACK yourself from the public source, MKL should equal or beat its performance. MKL should also work in combination with CLAPACK. Some of the MKL functions will not be optimized beyond what you would get by building them yourself.
darin1
Beginner
So for the Pentium D with HT enabled, would I use OMP_NUM_THREADS=2 or OMP_NUM_THREADS=4?

For the Pentium D with HT disabled, it seems obvious that OMP_NUM_THREADS=2.

The compiler I'm using is Visual Studio 2005 beta.

I just looked into the 32- vs 64-bit question and realized that EM64T requires Windows XP x64, which I'm not using. I'm using standard Windows XP with the IA-32 libraries.

I'm using the open-source CLAPACK that I downloaded from the internet and built with Visual Studio 2005 beta. I tried building and using ATLAS for the BLAS routines, but I ran into several issues that keep it from building: in P4 mode it can't get timing information and aborts during the build, and in P4E mode it generates makefiles and source code that won't compile. So as far as this test is concerned, it is using the freely available CLAPACK/BLAS libraries.

When I was testing, I was using either CLAPACK/BLAS or MKL exclusively (one or the other). I would expect at least equal performance; that is what is so surprising.

The main CLAPACK functions that I'm using are sgesvx and sgesvd, and to a lesser extent sgelsd and sgelss. Hopefully these functions are optimized in MKL over CLAPACK.
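
For reference, the call itself uses the same Fortran-style interface under either library, which is why swapping libraries is just a link change. Below is a minimal sketch of the usual two-phase sgesvd call (column-major data, workspace-size query first); the sgesvd_ symbol decoration is an assumption matching CLAPACK's convention and may need to be sgesvd or SGESVD depending on how the library exports its names:

    #include <cstdio>
    #include <vector>

    // Fortran-style prototype; the trailing underscore is an assumption about the
    // exported symbol name (CLAPACK declares sgesvd_; MKL exports several decorations).
    extern "C" void sgesvd_(const char* jobu, const char* jobvt,
                            const int* m, const int* n, float* a, const int* lda,
                            float* s, float* u, const int* ldu,
                            float* vt, const int* ldvt,
                            float* work, const int* lwork, int* info);

    int main()
    {
        const int m = 4, n = 3;
        std::vector<float> a(m * n, 1.0f);              // column-major input matrix
        std::vector<float> s(n), u(m * m), vt(n * n);   // singular values, U, V^T
        int info = 0;

        // Workspace query: lwork = -1 asks the routine for the optimal work size.
        float wkopt = 0.0f;
        int lwork = -1;
        sgesvd_("A", "A", &m, &n, a.data(), &m, s.data(), u.data(), &m,
                vt.data(), &n, &wkopt, &lwork, &info);

        lwork = static_cast<int>(wkopt);
        std::vector<float> work(lwork);
        sgesvd_("A", "A", &m, &n, a.data(), &m, s.data(), u.data(), &m,
                vt.data(), &n, work.data(), &lwork, &info);

        std::printf("sgesvd info = %d, largest singular value = %f\n", info, s[0]);
        return 0;
    }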
Intel_C_Intel
Employee

Let's talk about CLAPACK. CLAPACK is Fortran LAPACK that has been run through the tool f2c, nothing more. So whatever operations exist in LAPACK now exist in a straightforwardly translated version of the code. The only reason CLAPACK exists is that not everyone has a Fortran compiler. In general, I would expect CLAPACK to perform worse than LAPACK when each is compiled with a similar compiler and similar optimization switches.

We will check on the performance of sgesv. While MKL generally has high performance on BLAS-based codes, we have tended to spend less time on single-precision optimizations compared to double precision, and it might be that this is what you are seeing here.

Bruce

darin1
Beginner
I wouldn't expect the C version of LAPACK to be much slower. The Fortran language is pretty simple and has a fairly direct correspondence/translation from Fortran to C. After all, how many ways are there to translate for loops, gotos, arrays and array accesses, and parameter passing?

The way performance is usually optimized is to do some loop unrolling, convert the loops into vector CPU instructions, and reorder instructions/data to prevent pipeline stalls or cache misses.

But to me this is not the issue. The issue is that by performing a simple library swap on the linker line and running the app, I got a 1500-second performance decrease.

Mucking around with the thread count for MKL has increased performance to better than CLAPACK. Luckily I had a dual-core processor, so this worked, although I'm not clear why it worked, nor do I see much change in my CPU utilization.

I haven't done extensive research into which operations are faster or slower; I just identified some that I know are called a lot of times.

When one uses a tuned library, one expects performance equal to or better than a non-tuned library. CLAPACK, as you were pointing out, would seem to qualify as non-tuned, while MKL would be the tuned library.

Perhaps the APIs that I'm using aren't tuned very much, as you are indicating. I would have to consider that as part of my evaluation. I doubt that the double-precision routines are going to be faster than the single-precision ones by enough to justify a change there.

Maybe I should review and identify more closely which APIs MKL has improved performance on, and verify that I'm using those APIs.
Intel_C_Intel
Employee
SGESV should perform well. The key functions are strsm and sgemm; all other functions are subordinate to them.
While there are things that can be done at the LU factorization level, namely using recursive code in the column factorization, the main performance comes from the BLAS functions I mentioned above.
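
One way to confirm whether the BLAS level is really where the time goes is to time sgemm in isolation. Below is a minimal sketch using the CBLAS interface that MKL ships (the mkl_cblas.h header name is assumed here); the matrix size is arbitrary:

    #include <cstdio>
    #include <ctime>
    #include <vector>
    #include <mkl_cblas.h>   // MKL's CBLAS interface (assumed header name)

    int main()
    {
        const int n = 1000;   // square matrices, arbitrary size
        std::vector<float> a(n * n, 1.0f), b(n * n, 1.0f), c(n * n, 0.0f);

        std::clock_t start = std::clock();
        // C = 1.0 * A * B + 0.0 * C, single precision, column-major
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, a.data(), n, b.data(), n, 0.0f, c.data(), n);
        double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;

        // A square matrix multiply does about 2*n^3 floating-point operations.
        std::printf("sgemm %dx%d: %.3f s, about %.2f GFLOP/s\n",
                    n, n, seconds, 2.0 * n * n * n / seconds / 1e9);
        return 0;
    }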
We admittedly have not spent as much time optimizing single-precision code as we have double precision, because the latter is used so much more extensively.
We will review this code and evaluate the performance on it.
Bruce