On one of my tests, some OpenMP-parallelised loops are running at twice the speed of the near-identical serial code, even on one CPU! That rather implies there's an optimisation which would be better applied in the generic (serial) path as well.
I have attached the matrix generation program (just provide the size of the matrix as an argument - I used 1000) and the code, which just takes the matrix file name as an argument.
This is clearly a low-priority item :-)
I didn't keep that version of the compiler.
Checking with compiler version 13.1.192, I see that (apparently unneeded) auto-inlining can be more aggressive when -openmp is not set.
Even with -fno-inline-functions, the opt-report-file results look complicated. OpenMP prevents some apparently counter-productive loop transformations with -O3 at source line 171 and then helps the compiler recognize dot product optimization at line 181, which is more easily recognized in the style you used at line 163 (no aliasing analysis needed to optimize, and possibly better numerical properties).
Sometimes, inner_product() notation helps, but here it seems sufficient to define a scalar accumulator. The explicit scalar accumulator would be needed in order to apply the OpenMP sum reduction, which the compiler accomplishes without the pragma when -fp-model fast is set. I'll leave it to you to make recommendations.
Thanks. Ugh. That's definitely messy.
The code is actually just the standard LAPACK logic, converted (originally) to modern Fortran and thence to C, etc. I use it as a way of checking out and teaching coding paradigms and SIMD parallelism.