Solved: cblas_dnrm2 much slower than cblas_ddot

Bernd_Doser · ‎07-22-2015

Dear all,

I ran benchmarks on a Sandy Bridge Intel processor (E5-4620) using Intel MKL 11.1. I found that cblas_dnrm2 is significantly slower (3.4 s) than the corresponding cblas_ddot call (0.5 s) using one thread. This is very surprising to me, because computing the 2-norm with cblas_ddot is faster (0.3 s) than calling cblas_dnrm2.

I compiled with gcc-4.8.3 using the following flags:

CXXFLAGS += -O3 -I${MKLROOT}/include

LDLIBS += -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${INTEL_CC_HOME}/compiler/lib/intel64/libiomp5.a -Wl,--end-group -ldl -lpthread -lm

The files are attached. Is there any known issue for nrm2?

Best regards,

Bernd

mecej4 · ‎07-22-2015

Your conclusions hold as long as the vector does not simultaneously contain numbers of O(1) and very large or very small numbers (I have not looked at your code, just your comments). DNRM2 is designed to handle those pathological cases, and the extra code needed to detect and manage them slows it down. Consider, for example, that if the result from DNRM2 is 1D200, the sum of squares returned by DDOT would be +INF, and if the result from DNRM2 is 1D-200, the value returned by DDOT would be 0.

The comparison is between "robust but slow" and "fast but only correct most of the time".

Bernd_Doser · ‎07-23-2015

Dear mecej4,

Thank you very much for your answer. That would mean that nrm2 is more than 10 times slower than dot just to handle pathological cases. Splitting the functionality into a robust algorithm and a fast one does not seem like a good solution to me, and I have found nothing about this in the documentation.

Thanks again and best regards,

Bernd

mecej4 · ‎07-23-2015

You can see the details in the source code for DNRM2: http://www.netlib.org/lapack/explore-3.1.1-html/dnrm2.f.html or the f2c translation result, http://www.netlib.org/clapack/cblas/dnrm2.c . Note that processing each non-zero element of the vector requires two comparisons and one division. In other words, each such element has to be scaled before multiplication and accumulation, and the scale factor may itself need to be updated.

Fortran 2008 provides a new intrinsic function, NORM2.

Bernd_Doser · ‎07-23-2015

Ok! I see it. Looking into the source code is always a good idea ;)

Thanks again very much. Your answer was very helpful.

Murat_G_Intel · ‎07-27-2015

In addition to the implementation difference that mecej4 mentioned, dnrm2 is currently not threaded in MKL, whereas ddot is multithreaded. We are looking into improving dnrm2 serial and multithreaded performance in upcoming MKL releases.