I've compared the performance of an f77 test program which:
a) calls BLAS1 (daxpy) and BLAS2 (dgemv) subroutines, and
b) does (theoretically) the same work, but with the loops coded explicitly in f77 instead of the library calls.
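Roughly, the two variants look like this (a minimal sketch only; the array size and data here are placeholders, not the sizes I actually timed):

      program blas1
c     variant a): BLAS1 library call; variant b): the same update
c     written as an explicit f77 loop (size n is a placeholder)
      integer n, i
      parameter (n = 100000)
      double precision x(n), y(n), a
      a = 2.0d0
      do 10 i = 1, n
         x(i) = dble(i)
         y(i) = 1.0d0
 10   continue
c     a) library call: y := a*x + y
      call daxpy(n, a, x, 1, y, 1)
c     b) the same work coded as an explicit loop
      do 20 i = 1, n
         y(i) = a*x(i) + y(i)
 20   continue
      end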
(I compiled with ifc -O3 -tpp6 on my old Celeron 433 with RH 6.2.)
I didn't find any significant performance increase from the MKL (or ATLAS) library calls compared with direct compilation of the f77 loops :-( (I measured this on some PIII-based systems as well).
I understand that BLAS3 calls work much better and that using DGEMM (for example) will give a speed-up, but that's BLAS3... Is there any available data on the speed-up (compared with explicitly coded loops and compiler optimization) from using BLAS1 or dgemv calls on modern Intel P4 (w/SSE2) CPUs? Or perhaps similar data for IA-64?
Mikhail Kuzminsky
Zelinsky Inst. of Organic Chemistry
Moscow
1 Reply
In general, MKL is optimized for large problems and efficient use of cache, but may have more overhead than straightforward compilation for small problems, especially for level 1 BLAS. If you compile for Pentium III or Pentium 4/Xeon using the SSE instructions with -xK or -xW, the compiler uses the short vector math library and can generate pretty efficient code. The advantage of MKL becomes evident for big problems and especially with BLAS3.
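For illustration, a BLAS3 comparison of the kind where MKL's blocking for cache pays off might look like the sketch below (the matrix size and initialization are placeholders, not a tuned benchmark):

      program blas3
c     dgemm (BLAS3) call vs. the naive triple loop; n is a placeholder
      integer n, i, j, k
      parameter (n = 512)
      double precision a(n,n), b(n,n), c(n,n)
      do 20 j = 1, n
         do 10 i = 1, n
            a(i,j) = 1.0d0
            b(i,j) = 2.0d0
            c(i,j) = 0.0d0
 10      continue
 20   continue
c     library call: C := A*B
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
c     reset C, then the same product as explicit f77 loops
      do 40 j = 1, n
         do 30 i = 1, n
            c(i,j) = 0.0d0
 30      continue
 40   continue
      do 70 j = 1, n
         do 60 k = 1, n
            do 50 i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
 50         continue
 60      continue
 70   continue
      end

On large matrices the dgemm call typically runs far faster than the triple loop, whereas for the level 1 and 2 cases in your test the gap is much smaller.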
You can find some performance data at the software tools website http://www.intel.com/software/products, e.g. http://www.intel.com/software/products/mkl/mkl52/specs.htm
Incidentally, another option for small problems is to use the Intel Performance Primitives; see the above URL.
Martyn