
Jayden_S_ · ‎12-22-2015

The loop is simple:

void loop(int n, double* a, double const* b)
{
#pragma ivdep
    for (int i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}

I am using the Intel C++ compiler and currently rely on #pragma ivdep for optimization. Is there any way to make it perform better, such as using multi-core and vectorization together, or other techniques?

TimP · ‎12-22-2015

If that loop is long enough (e.g. count > 10000) to benefit from multi-core parallelism as well as vectorization, you could try

#pragma omp parallel for simd

(with -qopenmp compile option), or equivalent auto-parallelization options including reducing par-threshold, along with setting appropriate OMP_NUM_THREADS and OMP_PLACES.
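As a minimal sketch of that suggestion, the combined construct can be applied to a subscripted version of the loop. The function name scale_parallel_simd is made up for illustration, and the parallel keyword is what actually creates the thread team; a bare omp for simd only splits work inside an already active parallel region.

```cpp
// Sketch only: multiply a[i] by b[i], splitting the iterations across
// threads and vectorizing each thread's chunk.
// Needs an OpenMP 4.0+ compiler with OpenMP enabled (-qopenmp for Intel,
// -fopenmp for GCC/Clang). Without the flag the pragma is ignored and
// the loop simply runs serially.
void scale_parallel_simd(int n, double* a, double const* b)
{
#pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}
```

Whether this beats the single-threaded vectorized loop depends on n and on memory bandwidth, so it is worth timing both.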

Ostensibly, cilk_for simd might do it, although it may not improve performance significantly.

In many realistic situations, nested loops with threaded parallel outer and simd vector inner loops are needed to take advantage of multi-core.
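A rough illustration of that nested pattern, assuming a hypothetical row-major matrix m of rows x cols doubles whose rows are each scaled by the same per-column factors in b (none of these names come from the original question):

```cpp
#include <cstddef>

// Sketch only: the outer loop is shared among threads, the inner loop is
// vectorized. Each thread works on whole rows, so no two threads write
// to the same elements.
void scale_rows(int rows, int cols, double* m, double const* b)
{
#pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        // Start of row r in the row-major storage.
        double* row = m + static_cast<std::ptrdiff_t>(r) * cols;
#pragma omp simd
        for (int c = 0; c < cols; ++c)
            row[c] *= b[c];
    }
}
```

Putting the threads on the outer loop means the fork/join cost is paid once for the whole matrix rather than once per row.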

Note that the loop you quote appears eligible for compiler substitution of fast_memcpy, involving run-time selection of aligned nontemporal stores where possible, which you won't see detailed in opt-report.

MKL dcopy() is a more ancient remedy that should do what you request (also using the OMP or MKL environment variables).

jimdempseyatthecove · ‎12-22-2015

Avoid incrementing pointers. Use [subscripts] instead.

void loop(int n, double* a, double const* b)
{
#pragma ivdep
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}

The compiler optimizer prefers this syntax.

If (when) n is large enough to amortize the thread pool setup, then you might consider using #pragma omp parallel for. The body of your loop has little complexity, so the benefit of parallelization may only show up for n > 10000.
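One way to express that cut-over, sketched here with OpenMP's if clause. The constant kParallelThreshold and the function name are assumptions for illustration, and the value 10000 is only the rough figure mentioned above, not a measured one.

```cpp
// Assumed cut-over point; measure on the target machine before trusting it.
constexpr int kParallelThreshold = 10000;

// Sketch only: run in parallel when n is above the threshold, otherwise
// execute the same loop on the calling thread without forking a team.
void scale_if_large(int n, double* a, double const* b)
{
#pragma omp parallel for if(n > kParallelThreshold)
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}
```

This keeps a single code path for both small and large n.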

Check the compiler's optimization reports to ensure you attain complete vectorization.

Note: if a and b are known to be aligned, telling the compiler so will open up additional optimization opportunities.
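A possible way to state the alignment, sketched with the portable OpenMP aligned clause. The 64-byte figure and the function name scale_aligned are assumptions; the Intel compiler also offers __assume_aligned(ptr, 64) for the same purpose.

```cpp
// Sketch only: promise the compiler that both arrays start on a 64-byte
// boundary. The caller is responsible for that promise; passing a
// misaligned pointer here is undefined behavior.
void scale_aligned(int n, double* a, double const* b)
{
#pragma omp simd aligned(a, b : 64)
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}
```

The buffers would need to come from an aligned source, for example arrays declared with alignas(64) or memory obtained from an aligned allocator.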

Jim Dempsey

Bernard · ‎12-24-2015

If you are sure that the arrays pointed to by a and b do not overlap, you can add the restrict qualifier. In your example b is declared as a pointer to const, so I am not sure how much adding restrict to b will matter for compiler optimization.
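A minimal sketch of what that could look like. Standard C++ has no restrict keyword, so this assumes the __restrict extension accepted by the Intel, GCC, Clang and MSVC compilers; the function name is made up. Note that const on b only forbids writes through b and says nothing about whether a and b overlap.

```cpp
// Sketch only: __restrict is a compiler extension, not standard C++.
// It promises that the memory written through a is not also reached
// through b, so the compiler need not assume a store to a[i] can change
// a later b[j]. Calling this with overlapping arrays breaks the promise
// and gives undefined behavior.
void scale_restrict(int n, double* __restrict a, double const* __restrict b)
{
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}
```

With the no-overlap promise in the signature, the loop no longer relies on #pragma ivdep to rule out the assumed dependence.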

How to optimize a simple loop?