Dear all,
I am running benchmarks on a Sandy Bridge Intel processor (E5-4620) using Intel MKL 11.1. With one thread, I found that cblas_dnrm2 is significantly slower (3.4 s) than the corresponding cblas_ddot call (0.5 s). This surprises me, because computing the 2-norm with cblas_ddot is faster (0.3 s) than calling cblas_dnrm2.
I compiled with gcc-4.8.3 and the following flags:
CXXFLAGS += -O3 -I${MKLROOT}/include
LDLIBS += -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${INTEL_CC_HOME}/compiler/lib/intel64/libiomp5.a -Wl,--end-group -ldl -lpthread -lm
The files are attached. Is there a known issue with nrm2?
Best regards,
Bernd
Your conclusions hold as long as the vector does not simultaneously contain numbers of O(1) and very large/small numbers (I have not looked at your code, just your comments). DNRM2 is designed to handle those pathological cases and the extra code needed to detect and manage such cases slows down DNRM2. Consider, for example, that if the result value from DNRM2 is 1D200 the value returned by DDOT would be +INF, and if the result from DNRM2 is 1D-200 the value returned by DDOT would be 0.
The comparison is between "robust but slow" and "fast but only correct most of the time".
Dear mecej4,
Thank you very much for your answer. That would mean that, in order to handle pathological cases, nrm2 is more than 10 times slower than dot. Splitting the routines into robust and fast algorithms does not seem like a good solution to me, and I found nothing about this in the documentation.
Thanks again and best regards,
Bernd
You can see the details in the source code for DNRM2: http://www.netlib.org/lapack/explore-3.1.1-html/dnrm2.f.html or the f2c translation result, http://www.netlib.org/clapack/cblas/dnrm2.c . Note that processing each non-zero element of the vector requires two comparisons and one division. In other words, each such element has to be scaled before multiplication and accumulation, and the scale factor may itself need to be updated.
Fortran 2008 provides a new intrinsic function, NORM2.
OK, I see it now. Looking into the source code is always a good idea ;)
Thanks again very much. Your answer was very helpful.
In addition to the implementation difference that mecej4 mentioned, dnrm2 is currently not threaded in MKL, whereas ddot is multithreaded. We are looking into improving the serial and multithreaded performance of dnrm2 for upcoming MKL releases.