cblas_dnrm2 much slower than cblas_ddot

Bernd_Doser — Wed, 22 Jul 2015 12:39:50 GMT

Dear all,

I ran benchmarks on a Sandy Bridge Intel processor (E5-4620) using Intel MKL 11.1. I found that cblas_dnrm2 is significantly slower (3.4 s) than the corresponding cblas_ddot call (0.5 s) using one thread. This surprises me, because computing the 2-norm with cblas_ddot is faster (0.3 s) than calling cblas_dnrm2.

I compiled with gcc-4.8.3 using the following flags:

CXXFLAGS += -O3 -I${MKLROOT}/include

LDLIBS += -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${INTEL_CC_HOME}/compiler/lib/intel64/libiomp5.a -Wl,--end-group -ldl -lpthread -lm

The files are attached. Is there a known issue with nrm2?

Best regards,

Bernd

mecej4 — Wed, 22 Jul 2015 14:49:00 GMT

Your conclusions hold as long as the vector does not simultaneously contain numbers of O(1) and very large/small numbers (I have not looked at your code, just your comments). DNRM2 is designed to handle those pathological cases and the extra code needed to detect and manage such cases slows down DNRM2. Consider, for example, that if the result value from DNRM2 is 1D200 the value returned by DDOT would be +INF, and if the result from DNRM2 is 1D-200 the value returned by DDOT would be 0.

The comparison is between "robust but slow" and "fast but only correct most of the time".

Bernd_Doser — Thu, 23 Jul 2015 07:39:34 GMT

Dear mecej4,

Thank you very much for your answer. That would mean that, in order to handle pathological cases, nrm2 is more than 10 times slower than dot. Splitting the routines into robust ones and fast ones does not seem like a good solution to me, and I have found nothing about this in the documentation.

Thanks again and best regards,

Bernd

mecej4 — Thu, 23 Jul 2015 11:29:00 GMT

You can see the details in the source code for DNRM2: http://www.netlib.org/lapack/explore-3.1.1-html/dnrm2.f.html or the f2c translation result, http://www.netlib.org/clapack/cblas/dnrm2.c . Note that processing each non-zero element of the vector requires two comparisons and one division. In other words, each such element has to be scaled before multiplication and accumulation, and the scale factor may itself need to be updated.

Fortran 2008 provides a new intrinsic function, NORM2.

Bernd_Doser — Thu, 23 Jul 2015 11:35:32 GMT

Ok! I see it. Looking into the source code is always a good idea ;)

Thanks again very much. Your answer was very helpful.

Murat_G_Intel — Mon, 27 Jul 2015 18:08:26 GMT

In addition to the implementation difference that mecej4 mentioned, dnrm2 is currently not threaded in MKL, whereas ddot is multithreaded. We are looking into improving both the serial and the multithreaded performance of dnrm2 for upcoming MKL releases.