Would I be better off using an MKL dot product call or relying on ICC to optimise a dot product function?
I have some code that is spending most of its time in dot product calls. From a performance perspective, would I be better off replacing these dot product calls with an MKL dot product call or relying on ICC to optimise the dot product function? The dot product code is very simple and could have restrict qualifiers added to it. The target CPU supports the SSE4 instructions, so it can make use of compiler vectorisation.
In principle, if the arrays aren't large enough to benefit from a combination of vectorized and threaded parallel reduction, the compiler's inline optimization can outperform the MKL dot product. For array sizes around 1000, I would expect similar performance either way; smaller problems should run faster with the compiler's inline code. Unfortunately, with standards-compatibility options such as "icc -fp-model source", compiler optimization of the dot product is disabled, in which case you would be more likely to consider MKL. You must also take care in how the source code is written so that the compiler can optimize it: you may need to accumulate in a local scalar, or possibly use restrict qualifiers, to eliminate aliasing concerns. A BLAS function call implicitly prevents aliasing. STL inner_product(), if applicable, supports only unit strides, so it avoids the time the BLAS function would spend checking which method is appropriate. SSE4 would be needed only for non-unit strides. I don't know whether MKL implements both unit-stride and non-unit-stride vectorized versions (which would take additional time to choose between them).
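As a rough illustration of the source-level advice above (accumulate in a local scalar and add restrict qualifiers), here is a minimal C99 sketch. The function name dot and its signature are made up for this example and are not taken from the original code; whether the compiler actually vectorizes the loop still depends on the options in use.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only (name and signature are assumptions).
 * "restrict" (C99) promises the compiler that a and b are not aliased through
 * these pointers, and the sum is kept in a local scalar rather than being
 * written through a pointer on every iteration. */
double dot(const double *restrict a, const double *restrict b, size_t n)
{
    double sum = 0.0;                 /* local scalar accumulator */
    for (size_t i = 0; i < n; ++i)    /* unit-stride access to both arrays */
        sum += a[i] * b[i];
    return sum;
}
```

The MKL counterpart for the same unit-stride case would be a call such as cblas_ddot(n, a, 1, b, 1), so the two are easy to time against each other at your actual array sizes.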