The netlib version is a shar file. On Windows, you could use cygwin sharutils to unpack it.
I show examples of how to optimize, using compiler vectorization and OpenMP where appropriate, in Fortran, C, and C++ (all sharing the same Fortran driver).
In a few cases, threading is possible only at the expense of optimization of individual threads, and more than 4 threads may be needed to overcome the deficit, even on the longest loop length. Both versions are shown.
Makefiles are provided for Intel icpc/ifort (linux), ifort/ICL/MSVC (Windows), and gcc/gfortran. The Intel libiomp5 libraries are recommended, regardless of compiler chosen.
For testing on multiple last level cache machines, or with HyperThreading enabled, affinity environment variables KMP_AFFINITY or GOMP_CPU_AFFINITY are strongly recommended, so as to place adjacent threads on the same cache, and to set 1 thread per core. Note that libiomp5 implements either environment variable, gcc only the latter, and MSVC neither.
There is no base non-optimized C++. As I explained in more detail in my earlier paper, which was removed from the Intel sites (possibly on account of lack of interest), the C++ is derived from automatic translation of the original netlib Fortran. The translation itself already introduces some source optimizations, some of which aren't typical of usual C or C++ practice. Almost by definition, it fits the quip about writing Fortran in any language. Still, it doesn't make sense to compare in all cases the ability of C,C++ and Fortran compilers, nor to demonstrate OpenMP, on literal equivalents of the legacy Fortran, so I try to make the comparisons as fair as possible.
The purpose of the original, sometimes ugly, Fortran, was to test compilers of that era to see how much optimization they could perform. In several cases, you might say I have perverted that purpose, by "improving" style, and sometimes introducing OpenMP. The original authors were kind enough to agree to this use of their benchmark.
I don't have a personal position on whether the introduction of transform() could be considered an optimization. g++ does very well with it, and icpc does well with some cases which g++ misses. My personal position is not in favor of such use of transform() without vector classes, of which I present only a single example. The re-organization of loops so as to fit accumulate(), inner_product(), and transform() is an important optimization, so I present both C and C++ versions of it.
And is there non-optimized Fortran code? It's just relatively difficult to extract the essence of optimization (the reusable pattern) w/o non-optimized case. It's much easier when optimizations are applied incrementally to some base code.
The original Fortran is available at the netlib URL reference.
In a few cases, optimum varies with platform; for example, s112(), s211() and s261() may be optimized for Core i7 and Barcelona at the expense of Core 2.
You could make that comparison to see the general style of translation. I used 'f2c -aAR' for the original translation. f2c itself already applies some optimization during that translation. Addition of restrict keyword is the most often required subsequent optimization needed in the C code.
I am still a bit uncertain about this thing. Am I right in the following?
You get base-line fortran code (it can be seen in, for example, subroutine s211).
Then you convert it to C with f2c - this is the first source of optimization (as compared with situation when we are directly writing C code).
Then you apply some manual optimizations to it (vectorization, parallelization with OpenMP, something else?).
Resulted code can be seen in, for example, function s211_() in 'loopsv.c'.
Also you write: "In a few cases, threading is possible only at the expense of optimization of individual threads, and more than 4 threads may be needed to overcome the deficit, even on the longest loop length. Both versions are shown."
Please, point to some example where 2 versions are shown (function names).
Sorry for so many stupid questions, I just can't get integral view, but it looks worthwhile...
In several cases, the original f2c translation performs fairly well with auto-vector and auto-parallel compilation. While one of the purposes of the original vectors benchmark was to test whether compilers could interchange loops effectively, I have written in most of those optimizations. In particular, should one wish to use the STL optimizations, writing those in requires the loop interchange optimization. s125() and s141() require the explicit calculation of array subscripts to break sequential dependence in the outer loop.
s141(), in the original loop nesting, exhibited problems with reconciliation of single thread optimization and threading. With loop interchange, so as to make the inner loop stride 1, this is resolved.
In a majority of the omp parallel loops, performance with all cores in use does not gain much from vectorization, although the vectorization speedup is large for a single thread. Threaded performance scaling is much higher with MSVC compilation, using Intel OpenMP run-time library, as the single thread performance is much lower with MSVC (lacking auto-vectorization)than with gnu or Intel compilers.
In the case of s125(), this may be associated with a relatively high rate of last level cache misses, even though this benchmark is constructed so that most tests don't see a normal cache miss rate. s114() and s125() show significant gain from the tactic of initializing the arrays (subroutine cs2d) with the same static schedule as in the later timed use.
The loops with schedule(guided) are problematic with respect to affinity, particularly on multi-socket platforms. The static scheduled use of the arrays during initialization and use without guided in other loops doesn't match the uncontrolled affinity of schedule(guided).