Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP overhead

AndrewC
New Contributor III
661 Views
I am wondering what the best way is to handle problems with OpenMP overhead. I do lots of matrix operations, and I have OpenMP-threaded some of the low-level operations (not easily handled by MKL BLAS), such as
M1 += M2, M1 = M3 - M2, etc.
M1 += scalar

Unfortunately, some parts of the code deal with a few large matrices, while other parts work with lots of small matrices. Both cases are performance critical.

It all ends up in the same core matrix routines. I have found that the overhead of OpenMP has a disastrous effect on performance with N=6 (say). OMP_NUM_THREADS=1 is actually about twice as fast as OMP_NUM_THREADS=2 in one case (dual-core Intel Duo) with lots of the small matrices.

Naively, I would assume I will have to change my code to something relatively ugly like (say)
if (N < SOME_MAGIC_NUMBER)
    { serial block }
else
    { OpenMP block }

Is there any cleaner way, or are there guidelines for choosing the magic number?

Andrew
0 Kudos
10 Replies
TimP
Honored Contributor III
661 Views

You may be interested in the OpenMP if() clause, e.g.

i__2 = *n;
#pragma omp parallel for private(i__3,i__,j) schedule(guided) if(i__2 > 101)
for (j = 2; j <= i__2; ++j)
{
    i__3 = j - 1;
    for (i__ = 1; i__ <= i__3; ++i__)
        aa[i__ + j * aa_dim1] = aa[j + i__ * aa_dim1] + bb[i__ + j * bb_dim1];
}

so you can replace my 101 with your SOME_MAGIC_NUMBER. As you can see, I have found, in a number of cases, that the threshold for usefulness of parallel execution is fairly high.

When you set options which permit Intel compilers to vectorize sum reduction operations, the threshold where there is an advantage in applying OpenMP parallelism on top of vectorization could be high.

We note that Intel compiler option -parallel does not auto-parallelize a vectorizable loop of length 1000, with default par_threshold setting, and that decision is likely to be correct.

0 Kudos
jimdempseyatthecove
Honored Contributor III
661 Views

Andrew,

I too ran into similar problems, but I was not too quick to give up on OpenMP. The particular area of investigation I am interested in is a finite solution problem with multiple objects. Each object has a series of processing functions to pass through.

My first approach to OpenMP was to parallelize the inner loops of each function. This was relatively easy to introduce into the application. To my surprise, the performance improvement was marginal.

The second approach was to remove the inner-loop parallelization and modify the code to parallelize on a per-object basis. The performance improvement was significant.

On a 4 processor system the application now runs at about 3.75 x that of running with 1 processor.

I should mention that in the early phases of my investigation the number of points per object was small: on the order of 100 to 1000 points per characteristic (position, acceleration, velocity, gravitational force, tension vectors, and several other characteristics).

When running with these relatively small loop counts, parallelizing from the inside out was ineffective; parallelizing from the outside in was effective. Now that I am increasing the number of points per object, and because not all objects require the same number of points, I find idle time (i.e., only ~3x performance). To regain this I will be re-introducing the inner-loop parallelization.

Jim Dempsey

0 Kudos
AndrewC
New Contributor III
661 Views
Hi Jim,
That is exactly what I am looking for....
I will let you know the results

Andrew
0 Kudos
AndrewC
New Contributor III
661 Views
Jim,
Here are some timing results (lower is better) on a dual-core CPU running XP 64.

No OpenMP threading 18s
OpenMP threading, OMP_NUM_THREADS=2, 72s
OpenMP threading, OMP_NUM_THREADS=1, 26s
OpenMP threading, OMP_NUM_THREADS=2, using schedule(guided, n>1000) 28s
OpenMP threading, OMP_NUM_THREADS=1, using schedule(guided, n>1000) 22s

So schedule(guided) has helped a lot, but there is still a significant performance hit. Obviously some tuning is needed...
Will the Intel compiler still vectorize the loop if an OpenMP directive is active?

0 Kudos
jimdempseyatthecove
Honored Contributor III
661 Views

The performance hit is relative to the ratio of useful work vs. loop overhead. This is true even with single-threaded loops, and it is why either the programmer or the compiler will unroll a loop. It is a similar story with multi-threaded programming, except that the overhead to start/stop the loop is much higher, although depending on the OpenMP method used the per-iteration cost may not be so bad. The cost of the start/stop has to be weighed against the benefit of running multiple threads within the loop. Sometimes you win, sometimes you lose. The programmer needs to be cognizant of this, as well as of what else needs to run on the system while the application runs.

>>Will the Intel compiler still vectorize the loop if a openmp directive is active?

Vectorization is different from multi-threaded programming. That being said, vectorization needs to cooperate with OpenMP (with optional programmer help via directives) such that the vector operations performed in the iteration loop align in a friendly manner with the fragmentation of the loop into chunks for parallel processing. Then there are issues of temporal vs. non-temporal execution, e.g. whether A(i) has to be calculated before or after A(i+1). These too can be influenced with directives.

Jim Dempsey

0 Kudos
Intel_C_Intel
Employee
661 Views

Dear Andrew,

> Will the Intel compiler still vectorize the loop if a openmp directive is active?

The short answer is yes, if the cost model thinks this is beneficial. To give some more background, the biggest performance boost can be expected from the classical outer-parallel and inner-vector loops, since here the thread setup overhead is amortized over many more actual computations, as in:

#pragma omp parallel for shared(a) private(i,j) // line 22
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) { // line 24
        a[i][j] += 1;
    }
}

joho.c(22) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
joho.c(24) : (col. 3) remark: LOOP WAS VECTORIZED.

For single loops, the cost model usually rejects parallelization and vectorization of the same loop, as in:

#pragma omp parallel for shared(b) private(i)
for (i = 0; i < n; i++) { // line 23
    b[i] += 1;
}

joho.c(22) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
joho.c(23) : (col. 1) remark: loop was not vectorized: OpenMP parallelization already applied.

This is because the thread setup overhead, combined with the overhead for vectorization (like peeling for alignment), will typically outweigh any gains. If some expensive but still vectorizable operations appear in the loop body, the cost model may "change its mind", as in the example below:

#pragma omp parallel for shared(b) private(i)
for (i = 0; i < n; i++) { // line 23
    b[i] = sin( b[i] );
}

joho.c(22) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
joho.c(23) : (col. 1) remark: LOOP WAS VECTORIZED.

Hope this gives you some more background on the compiler's decision related to vectorization and parallelization.

Aart Bik

http://www.aartbik.com/

0 Kudos
AndrewC
New Contributor III
661 Views
Thanks - very helpful!
0 Kudos
Dny
Beginner
661 Views

Hi,

Thanks for explaining, with good examples, how the compiler combines OpenMP and vectorization.

Regards,
Digambar
0 Kudos
ionutg
Beginner
661 Views
Hi Jim,
I'll continue this thread in a slightly different direction. I'm benchmarking a parallelized code and I have realized that even the pure serial code, without any OpenMP instructions, is much slower when compiled with -openmp than without. Does anybody know why this can happen?
Happens with both 11.1.059 on Linux and 11.1.088 on OS X.
Thanks a lot,
Ionut
0 Kudos
jimdempseyatthecove
Honored Contributor III
661 Views
Check to see if other compiler options may implicitly come on with -openmp.
If you are compiling with makefiles, check to see if your makefile has option differences.

The code generation should be essentially the same, but there may be different rules with respect to inlining functions. My experience is that the change is less than +/- 2%, which is a deviation due to code and/or data alignment. If you are seeing more than 2%, check your options.

Note, your source code may have conditional statements that alter behavior of the code. Example:

#if defined(_OPENMP)
volatile int count;
#else
int count;
#endif

...
#if defined(_OPENMP)
LockLock lock(&ResourceXLock);
#endif

Stuff like that (but in Fortran as opposed to C++).

Jim Dempsey
0 Kudos
Reply