Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Q&A: OpenMP timing advantage examples


The following is a question received by Intel Software Network Support, followed by a response supplied by an expert at Intel:

I would like to use OpenMP in my project. I want to know how much of a timing advantage we get in a loop sequence that uses OpenMP over a thread that is not using OpenMP. A code example would be great.

While you've asked the compiler to split the loop into n threads, this code still has to run serially. The code has a single accumulator variable (p) that is used in all parallel copies of the loop. This forces synchronization between threads on every access to p, so your code can't run any faster than the serial case. It's about the same as if you added the clause "shared(p)" to the "#pragma omp parallel for" directive. In fact, the compiler surely knows that p is shared, and may decide there's no point in emitting OpenMP code for this loop. At any rate, you won't see any speedup with this code in its current form.
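To make that concrete, here is a hypothetical reconstruction of the pattern being described (the original code isn't shown, so the function and names here are mine, and I've assumed a sum-of-squares loop). The single shared accumulator means every iteration has to synchronize:

```c
/* Hypothetical reconstruction: one shared accumulator p, so the
   threads serialize on every update to it. */
double sum_squares_shared(const double *a, int n) {
    double p = 0.0;
    #pragma omp parallel for shared(p)
    for (int i = 0; i < n; i++) {
        /* Every thread must take this lock to touch p, so the
           iterations effectively run one at a time. */
        #pragma omp critical
        p += a[i] * a[i];
    }
    return p;
}
```

Compiled with OpenMP enabled (e.g. -fopenmp), this produces the right answer, but you'd time it at no better than the serial loop, because the critical section admits only one thread at a time.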

There are several ways to get around this, but it's probably tidiest to give each thread its own private accumulator and then sum the accumulators when the threads are done. It works like this:

Pick some variable to accumulate per-thread results, say "p_private". Set it to 0.

Add to the end of your "#pragma omp" line the following: "private(p_private)"

Everywhere you currently say "p+=" change the code to have "p_private+="

Add a critical section that sums up the accumulators, just after the for loop. One subtlety: this summation has to run once per thread, so it must still be inside the parallel region. That means splitting "#pragma omp parallel for" into a "#pragma omp parallel" block containing a "#pragma omp for" loop, and placing the following before the parallel region's closing brace:
#pragma omp critical
p += p_private;

This gives you fully parallel iterations for as many threads as OpenMP can create, across the whole loop body. Each thread has its own copy of p_private. Each separate p_private is added to the shared variable p once all threads have finished. There's a small performance penalty at the end to add your separate accumulators, but the critical section makes sure it's thread-safe.

To get the zero value to propagate into each thread's private copy, note that "private(p_private)" leaves the variable uninitialized inside each thread. Either change the clause to "firstprivate(p_private)", which copies in the initial value of zero, or simply set p_private to 0 at the top of the parallel region. (A "copyin" clause won't help here; it applies only to "threadprivate" variables.)
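Putting the steps above together, here is a sketch (again assuming a sum-of-squares loop; the names are illustrative, not from the original code). Declaring p_private inside the parallel region makes it automatically private to each thread, and lets you zero it right there, which sidesteps the initialization question entirely:

```c
/* Per-thread accumulators: the combined "parallel for" is split into
   "parallel" + "for" so the critical section still runs inside the
   parallel region, once per thread. */
double sum_squares_private(const double *a, int n) {
    double p = 0.0;
    #pragma omp parallel
    {
        double p_private = 0.0;  /* declared inside the region, so it is
                                    private to each thread, and zeroed here */
        #pragma omp for
        for (int i = 0; i < n; i++)
            p_private += a[i] * a[i];

        /* Still inside the parallel region: each thread folds its
           partial sum into the shared total exactly once. */
        #pragma omp critical
        p += p_private;
    }
    return p;
}
```

Now the iterations run fully in parallel, and the only serialization is one short critical section per thread at the end.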

You can see some OpenMP overhead measurements in another paper I wrote.

You can always find out more about OpenMP via the OpenMP spec, which has a number of good examples in it.

Paul Lindberg
Senior Software Engineer
Intel Developer Relations Division, Client Entertainment Enabling
