Intel® oneAPI Math Kernel Library

Avoiding unnecessary OpenMP synchronization

Peter_B_9
Beginner

Say I wish to add a number of vectors:

cblas_daxpy(n, 1.0, a, 1, b, 1);
cblas_daxpy(n, 1.2, c, 1, d, 1);
cblas_daxpy(n, 1.4, e, 1, f, 1);
cblas_daxpy(n, 1.6, g, 1, h, 1);

MKL will use OpenMP to parallelize each of these vector additions internally. However, all of the OpenMP threads sync up between daxpy calls, which adds overhead. Since I know the calls are independent of each other, this synchronization is unnecessary.

I could do

#pragma omp parallel sections
{
    #pragma omp section
    cblas_daxpy(n, 1.0, a, 1, b, 1);
    #pragma omp section
    cblas_daxpy(n, 1.2, c, 1, d, 1);
    #pragma omp section
    cblas_daxpy(n, 1.4, e, 1, f, 1);
    #pragma omp section
    cblas_daxpy(n, 1.6, g, 1, h, 1);
}

which parallelizes the calls externally, but then I might not use all of my cores, and I won't be able to take advantage of any load balancing if, for example, the vectors aren't all the same size.
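
To illustrate the load-balancing point, here is a rough sketch (the names count, lens, alphas, xs, and ys are just placeholders for a larger set of independent updates, not real code of mine): with OpenMP tasks instead of fixed sections, the runtime could at least balance differently sized calls across threads, but each individual cblas_daxpy would still run on one thread, so the underlying question remains.

#include <mkl.h>
#include <omp.h>

/* Placeholder description of many independent updates:
 * lens[i], alphas[i], xs[i], ys[i] are hypothetical, for illustration only. */
void many_axpys(int count, const int *lens, const double *alphas,
                const double * const *xs, double * const *ys)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < count; ++i) {
            /* each daxpy becomes a task, so the runtime can balance
             * differently sized vectors across the available threads */
            #pragma omp task firstprivate(i)
            cblas_daxpy(lens[i], alphas[i], xs[i], 1, ys[i], 1);
        }
    }   /* all tasks finish at the implicit barrier closing the single */
}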

What's the recommended way to achieve maximum performance for code like this? Is the best practice the same on Phi?

TimP
Honored Contributor III

I would suppose the gain you can expect from running multiple MKL calls in parallel depends on n being small enough that your platform's performance doesn't scale linearly across all cores when running the calls individually. In that case you would want to divide your cores among the MKL instances and pin each instance to its own group of cores. This may be easier to accomplish with MPI than with nested OpenMP parallelism.
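
As a rough sketch only (the function name and the 4-way split are illustrative, and the actual pinning would be done outside the code through OMP_PLACES/OMP_PROC_BIND or KMP_AFFINITY), the nested-OpenMP variant might look something like this:

#include <mkl.h>
#include <omp.h>

/* Run the four independent daxpy calls in an outer parallel region and
 * give each outer thread a share of the cores for MKL's inner threading. */
void four_axpys(MKL_INT n,
                const double *a, double *b, const double *c, double *d,
                const double *e, double *f, const double *g, double *h)
{
    const double alpha[4] = { 1.0, 1.2, 1.4, 1.6 };
    const double *x[4] = { a, c, e, g };
    double       *y[4] = { b, d, f, h };

    int inner = omp_get_max_threads() / 4;   /* cores per MKL instance   */
    if (inner < 1) inner = 1;

    mkl_set_dynamic(0);            /* don't let MKL cut the thread count */
    omp_set_max_active_levels(2);  /* allow the nested parallel region   */

    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 4; ++i) {
        mkl_set_num_threads_local(inner);  /* this thread's MKL share    */
        cblas_daxpy(n, alpha[i], x[i], 1, y[i], 1);
        mkl_set_num_threads_local(0);      /* back to the global setting */
    }
}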

That can be a fairly effective scheme on MIC. Note that daxpy isn't among the MKL functions set up for automatic offload, as it's unlikely you could offset the cost of copying the data between the MIC and the host.
