I've compared the performance of an f77 test program which:
a) calls BLAS1 (daxpy) and BLAS2 (dgemv) subroutines, and
b) does (theoretically) the same work, but with the loops coded explicitly in f77 instead of the library calls.
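Roughly, the two variants look like this (a minimal sketch only; the array size and data here are placeholders, not the sizes I actually timed):

      program blas1
c     variant a): BLAS1 library call; variant b): the same update
c     written as an explicit f77 loop (size n is a placeholder)
      integer n, i
      parameter (n = 100000)
      double precision x(n), y(n), a
      a = 2.0d0
      do 10 i = 1, n
         x(i) = dble(i)
         y(i) = 1.0d0
 10   continue
c     a) library call: y := a*x + y
      call daxpy(n, a, x, 1, y, 1)
c     b) the same work coded as an explicit loop
      do 20 i = 1, n
         y(i) = a*x(i) + y(i)
 20   continue
      end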
(I compiled with ifc -O3 -tpp6 on my old Celeron 433 with RH 6.2.)
I didn't find any significant performance increase from the MKL (or ATLAS) library calls compared with direct compilation of the f77 loops :-( (I measured this on some PIII-based systems as well).
I understand that BLAS3 calls work much better and that using DGEMM (for example) will give a speed-up, but that's BLAS3... Is there any available data on the speed-up (compared with explicitly coded loops and compiler optimization) from using BLAS1 or dgemv calls on modern Intel P4 (w/SSE2) CPUs? Or perhaps similar data for IA-64?
Mikhail Kuzminsky
Zelinsky Inst. of Organic Chemistry
Moscow
1 Reply
In general, MKL is optimized for large problems and efficient use of cache, but may have more overhead than straightforward compilation for small problems, especially for level 1 BLAS. If you compile for Pentium III or Pentium 4/Xeon using the SSE instructions with -xK or -xW, the compiler uses the short vector math library and can generate pretty efficient code. The advantage of MKL becomes evident for big problems and especially with BLAS3.
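For illustration, a BLAS3 comparison of the kind where MKL's blocking for cache pays off might look like the sketch below (the matrix size and initialization are placeholders, not a tuned benchmark):

      program blas3
c     dgemm (BLAS3) call vs. the naive triple loop; n is a placeholder
      integer n, i, j, k
      parameter (n = 512)
      double precision a(n,n), b(n,n), c(n,n)
      do 20 j = 1, n
         do 10 i = 1, n
            a(i,j) = 1.0d0
            b(i,j) = 2.0d0
            c(i,j) = 0.0d0
 10      continue
 20   continue
c     library call: C := A*B
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
c     reset C, then the same product as explicit f77 loops
      do 40 j = 1, n
         do 30 i = 1, n
            c(i,j) = 0.0d0
 30      continue
 40   continue
      do 70 j = 1, n
         do 60 k = 1, n
            do 50 i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
 50         continue
 60      continue
 70   continue
      end

On large matrices the dgemm call typically runs far faster than the triple loop, whereas for the level 1 and 2 cases in your test the gap is much smaller.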
You can find some performance data at the software tools website http://www.intel.com/software/products, e.g. http://www.intel.com/software/products/mkl/mkl52/specs.htm
Incidentally, another option for small problems is to use the Intel Performance Primitives; see the above URL.
Martyn