I am using BLAS with my software, especially various GEMM & GEMV routines.
I profiled my software with Intel VTune and found that my own BLAS library (compiled with the Intel Fortran Compiler) runs 5-10% faster than Intel MKL.
Does this make sense? Is it possible that taking the BLAS sources from www.netlib.org/blas/ and compiling them myself results in a better-optimized library than Intel MKL?
Morag Agmon (Intel)
That's not what we would expect. Where do you see the 5-10% performance gap relative to MKL? Is it in the ?gemm routines? What is the problem size?
Why do you use VTune (did you use the hotspot analysis?) instead of directly measuring the execution time of these routines? What CPU type are you running on?
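
For reference, here is a minimal sketch of what measuring a routine's execution time directly could look like. It is an illustration only, not code from this thread: `naive_gemm` is a hypothetical pure-Python stand-in for the GEMM call under test, and the helper names `best_time` and `gemm_gflops` are assumptions made for this example. The idea is to run one warm-up call, time several repetitions, keep the fastest, and convert it to GFLOP/s using the usual 2*m*n*k operation count for a matrix multiply.

```python
import time


def naive_gemm(a, b, m, n, k):
    """Hypothetical stand-in kernel: returns C = A * B for row-major nested lists.

    Replace the call to this function with the BLAS routine you want to time.
    """
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            aip = a[i][p]
            for j in range(n):
                c[i][j] += aip * b[p][j]
    return c


def best_time(fn, repeats=5, warmup=1):
    """Call fn() `warmup` times untimed, then return the fastest of `repeats` timed calls."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def gemm_gflops(m, n, k, seconds):
    """Convert a GEMM run time to GFLOP/s, counting 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e9


if __name__ == "__main__":
    size = 48  # arbitrary small size so the pure-Python stand-in finishes quickly
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    seconds = best_time(lambda: naive_gemm(a, b, size, size, size))
    print(f"n={size}: best of 5 = {seconds:.6f} s, "
          f"{gemm_gflops(size, size, size, seconds):.4f} GFLOP/s")
```

Running the same harness once against each library, with identical matrix sizes and thread settings, gives numbers that can be compared directly and posted together with the problem size and CPU model.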