I have code for a BiCCG sparse matrix solver which I have tried to parallelise using OpenMP. The snippet can be found in this gist: https://gist.github.com/data-panda/079cfb076092a5289945c9b3b0881fa9
I profiled with Intel Advisor: about 55% of the total compute time is spent in the matrix solver, and most of its loops are not vectorized properly, which I believe is why I am getting bad scale-up. The most intensive loops that are not vectorized lie in the function iterate_hyd_p(): loop 3 (24%), loop 1 (20%), loop 4 (16%), loop 2 (16%), loop 6 (5%), and loop 5 (2.5%, the only loop that is auto-vectorized). Digging into the diagnostics, nearly all the loops share a common suggestion about underutilization of FMA instructions, which I guess can be addressed with the proper compiler flags; even the vectorized loop suffers from this. But the bigger problem is that all the non-vectorized loops report the following two issues, each referring to line 17 (#pragma omp parallel):
1. Scalar loop, outer loop was not autovectorized: consider using SIMD directives
2. Vector dependence prevents vectorization, loop was predicate optimized version 6
iterate_hyd_p() is called repeatedly until a certain convergence criterion is met, as you can see from the main calling function (convg_criteria).
Any pointers on what to look for?
Hello Aniruddha P.
I'm not sure, but it seems to me that to vectorize the loops you should replace #pragma omp parallel for and #pragma omp for with:
#pragma omp parallel for simd and #pragma omp for simd. Give it a try.