Vector math library speedup

kt_mic · 01-12-2006

Hello,

For evaluation purposes I have downloaded the vector math library; I am particularly interested in exp and log for vectors of length about 20. From the information I found on the Intel website I had expected a considerable speed-up, but I achieved only marginal gains over the conventional scalar compiler library functions (Intel Fortran v. 9.0) when the compiler was using SSE2 instructions (option -QxW). Can this be correct, or am I missing something?

I do get a significant speedup compared with compilation for the default P4.

Michael

TimP · 01-12-2006

ifort auto-vectorization should automatically invoke the SVML library functions, so you would not expect much difference in performance compared with a direct call to the same library. If the Fortran is a little slower, it might be that you have not arranged for the compiler to recognize aligned data, so that run-time versioning with remainder loops is employed.
By default P4, do you mean compiling for processors earlier than P4?

kt_mic · 01-12-2006

Thank you, that explains it. By default, I just mean the default settings of the Fortran compiler, i.e. (from the command line)

ifort xxx.for yyy.lib

whereas the fast version is

ifort xxx.for yyy.lib -QxW

Michael

Andrey_N_Intel · 01-13-2006

"I have for evaluation purposes downloaded the vector math library, where I am particularly interested in exp and log for vectors of length about 20."
I don't think that VML will bring you a real advantage on vector lengths of about 20. To gain a real advantage you should think about modifying the code or loops so that VML functions are called on lengths of ~100 or even larger. (This is typically done by buffering.)
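
A minimal sketch of the buffering idea, assuming the short vectors are available as the columns of a 2-D array. The module and routine names (exp_buffer_sketch, vector_exp, exp_buffered) are hypothetical, and the intrinsic exp stands in for the vector math call so that the example compiles without MKL. With MKL linked, the body of vector_exp would be replaced by the corresponding VML routine (vdExp for double precision, if memory serves; the MKL Reference Manual has the exact interface).

```fortran
module exp_buffer_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Stand-in for a vector math routine (assumption: replaced by a VML
  ! call in a real build). Works on one contiguous vector of length n.
  subroutine vector_exp(n, x, y)
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: x(n)
    real(dp), intent(out) :: y(n)
    y = exp(x)
  end subroutine vector_exp

  ! Evaluate exp on all short vectors (columns of a) with ONE long call
  ! instead of one short call per column.
  subroutine exp_buffered(a, b)
    real(dp), intent(in)  :: a(:,:)   ! a(:,j) is the j-th short vector
    real(dp), intent(out) :: b(:,:)   ! same shape as a
    real(dp), allocatable :: xbuf(:), ybuf(:)
    integer :: m, nvec, j, k

    m    = size(a, 1)                 ! length of each short vector (~20)
    nvec = size(a, 2)                 ! number of short vectors
    allocate(xbuf(m*nvec), ybuf(m*nvec))

    ! Gather: pack the short vectors back to back into one long buffer.
    do j = 1, nvec
      k = (j - 1)*m
      xbuf(k+1:k+m) = a(:, j)
    end do

    ! One call on m*nvec elements rather than nvec calls on m elements.
    call vector_exp(m*nvec, xbuf, ybuf)

    ! Scatter: copy the results back to their original places.
    do j = 1, nvec
      k = (j - 1)*m
      b(:, j) = ybuf(k+1:k+m)
    end do
  end subroutine exp_buffered

end module exp_buffer_sketch
```

With six vectors of length 20 this turns six calls on 20 elements into one call on 120 elements. The copies in and out of the buffer are extra work, so whether the change pays off has to be measured.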

"From the information I found on the Intel website I had expected considerable speed-up"
Perhaps the information you are talking about is the peak VML performance compared with a conventional scalar math library. The peak performance is achieved on sufficiently large vectors. For more information I encourage you to look at http://www.intel.com/software/products/mkl/data/vml/functions/_listfunc.html
(click on the function of interest and see the graphs of VML performance as a function of vector length).

"but achieved only marginal results compared to the conventional scalar compiler library functions"
I believe the improvement is marginal because the compiler was able to vectorize your loop. As soon as the compiler vectorizes a loop containing a math function, it calls the internal vectorized math library SVML (which is VML-like) rather than the conventional scalar compiler library. The vectorizer (and SVML in particular) gives a substantial speedup compared with a conventional (scalar) loop. In your case, the vectorizer was invoked when you compiled with the /QxW switch.

With that in mind, I can comment on the differences between SVML and VML. Because of different design requirements, SVML and VML performance may be comparable on moderately small vector lengths (loop counts). The peak VML performance is clearly better than the peak SVML performance (again because of the design requirements). For example, the high accuracy single precision VML logarithm takes 15.7 cycles per result, whereas the SVML logarithm takes 17 cycles. For reference, the VML low accuracy log takes 12.5 cycles. VML low accuracy functions are comparable in accuracy with SVML functions (the design requirement is 4 ulp, or roughly 2 incorrect least significant bits). Thus a comparison of 17 vs. 12.5 is fairer. By default VML uses the high accuracy flavor. To change the accuracy flavor you should call a special service routine. For details I refer you to the MKL Reference Manual: http://www.intel.com/software/products/mkl/docs/mklman.htm

So, to summarize: yes, in your particular case VML performance can be comparable with vectorized loop performance. You need to decide whether to modify your code so that the VML calls reach near-peak performance, or to continue using SVML.

kt_mic · 01-14-2006

Thank you, this is useful information. Typical vector lengths in my case are 10 to 30, and I am working on optimizing an application where around 20% of the time is spent on log or exp. The cost of MKL is not an issue, but the time needed to change the code is!
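
A rough way to judge whether a code change is worth the effort is Amdahl's law: if a fraction f of the run time is spent in log/exp and those calls become s times faster, the overall speedup is 1/((1 - f) + f/s). The sketch below only encodes that formula; the module and function names are hypothetical. Using the 20% figure from this post together with the 17 vs. 12.5 cycle counts quoted earlier in the thread is an assumption made for illustration, since those counts are peak figures for single precision log.

```fortran
module speedup_estimate
  implicit none
contains

  ! Overall speedup when a fraction f of the run time (0 <= f <= 1)
  ! is accelerated by a factor s (s > 0) and the rest is unchanged.
  pure function overall_speedup(f, s) result(r)
    double precision, intent(in) :: f, s
    double precision :: r
    r = 1.0d0 / ((1.0d0 - f) + f / s)
  end function overall_speedup

end module speedup_estimate
```

With f = 0.2 and s = 17/12.5 = 1.36 the function returns about 1.06, i.e. roughly a 6% gain for the whole application. No speedup of log and exp alone can push the overall figure beyond 1/(1 - 0.2) = 1.25.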

What I would like to know more about is where vectorization will be possible. In other words, will any exp or log be vectorized (provided the compilation options are set), or do I have to move these functions to a separate, short loop like

do i = 1,n
   b(i) = exp(a(i))
enddo

in order to obtain the desired result?

Michael

TimP · 01-14-2006

ifort and icc look for opportunities to vectorize an entire loop, and for opportunities to split ("distribute") loops to facilitate vectorization. So, if the split is easy to do automatically and analysis shows it is desirable, the compiler could do that bit of work for you.
Default vectorization options usually take the loop 8 iterations at a time, with scalar remainder loops to make up the difference at one or both ends. Adding the -O1 flag cuts unrolling back to the minimum consistent with vectorization, which may prove better for the loop lengths you mention. Also, with fairly short loops, it is important to help the compiler recognize when the data are aligned (on 16-byte boundaries), to avoid run-time alignment checks and adjustments.
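
To see what "8 iterations at a time" means for loop lengths of 10 to 30, here is a hand-written sketch of the same structure: a main loop over whole blocks of 8, followed by a scalar remainder loop. It is only an illustration, not what the compiler literally emits; it ignores any extra iterations peeled off at the start for alignment, and the names (strip_mine_sketch, exp_strip_mined, vblock) are hypothetical.

```fortran
module strip_mine_sketch
  implicit none
  integer, parameter :: vblock = 8   ! block size quoted in the post above
contains

  ! Compute b = exp(a) in blocks of vblock plus a remainder loop, and
  ! report how many iterations fell into each part.
  subroutine exp_strip_mined(n, a, b, nmain, nrem)
    integer,          intent(in)  :: n
    double precision, intent(in)  :: a(n)
    double precision, intent(out) :: b(n)
    integer,          intent(out) :: nmain, nrem
    integer :: i

    nmain = (n / vblock) * vblock    ! iterations covered by whole blocks
    nrem  = n - nmain                ! iterations left for the remainder

    ! Main part: each trip handles one whole block.
    do i = 1, nmain, vblock
      b(i:i+vblock-1) = exp(a(i:i+vblock-1))
    end do

    ! Remainder: leftover iterations, one at a time.
    do i = nmain + 1, n
      b(i) = exp(a(i))
    end do
  end subroutine exp_strip_mined

end module strip_mine_sketch
```

For n = 20 this gives 16 iterations in the main loop and 4 in the remainder; for n = 10 it gives 8 and 2. With loops this short a noticeable share of the work ends up outside the blocked part, which is one reason a smaller unroll factor can help.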
The help you get from compiler diagnostics about effectiveness of vectorization is minimal. If you got a LOOP VECTORIZED report for a given loop, that assures you that vectorized code or short vector library calls have been generated for everything in that loop, including math functions. A PARTIAL LOOP VECTORIZED report indicates that the loop has been distributed, with at least one portion vectorized.
If your information about the time spent in math functions comes from profiling, repeating the profiling with a vectorized build would show how much was gained by shifting work from scalar to short vector functions.
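
If the compiler does not distribute a loop on its own, the split described above can be done by hand. The sketch below is an assumed example, not code from this thread: the first routine mixes exp with a recurrence on c, which carries a dependence from one iteration to the next; the second moves exp into its own short loop, which has no such dependence. All names are hypothetical.

```fortran
module loop_distribution_sketch
  implicit none
contains

  ! Original form: exp and a recurrence share one loop. The recurrence
  ! on c(i-1) makes each iteration depend on the previous one.
  subroutine fused(n, a, b, c)
    integer,          intent(in)    :: n
    double precision, intent(in)    :: a(n)
    double precision, intent(out)   :: b(n)
    double precision, intent(inout) :: c(0:n)   ! c(0) is the start value
    integer :: i
    do i = 1, n
      b(i) = exp(a(i))
      c(i) = 0.5d0*c(i-1) + b(i)
    end do
  end subroutine fused

  ! Distributed form: the exp loop has independent iterations and is a
  ! candidate for vectorization; the recurrence stays in a scalar loop.
  subroutine distributed(n, a, b, c)
    integer,          intent(in)    :: n
    double precision, intent(in)    :: a(n)
    double precision, intent(out)   :: b(n)
    double precision, intent(inout) :: c(0:n)
    integer :: i
    do i = 1, n
      b(i) = exp(a(i))
    end do
    do i = 1, n
      c(i) = 0.5d0*c(i-1) + b(i)
    end do
  end subroutine distributed

end module loop_distribution_sketch
```

Both routines compute the same b and c. Whether the first loop of the distributed form is actually vectorized still has to be confirmed from the compiler's vectorization report for that loop.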

Andrey_N_Intel · 01-16-2006

A few more cents. If you are interested in building your expertise in the IA-32/EM64T Intel compilers (and in the vectorizer in particular), and in becoming more familiar with IA-32/EM64T optimizations, then I would recommend the book by Aart Bik, The Software Vectorization Handbook: http://www.intel.com/intelpress/sum_vmmx.htm. Aart is the author and architect of the vectorizer in the Intel C/Fortran compilers. Rest assured that by reading this book you will get the information first-hand.