Superlinear speedup with OpenMP on one core.

dexter · ‎08-27-2009

The most time-consuming part of my program is a pair of nested loops (rendering an image).

All iterations are independent of each other, thus both loops can be parallelized.

Pseudocode:

-------------------------------------------------------------------

for (y = 0; y < yres; y++)
{
    for (x = 0; x < xres; x++)
    {
        // Time-consuming calculation of the value of one pixel
    }
}

-------------------------------------------------------------------

The code was parallelized with an OpenMP pragma:

-------------------------------------------------------------------

#pragma omp parallel for schedule(runtime)
for (y = 0; y < yres; y++)
{
    for (x = 0; x < xres; x++)
    {
        // Time-consuming calculation of the value of one pixel
    }
}

-------------------------------------------------------------------
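For readers who want to reproduce the comparison, here is a minimal compilable sketch of the loop above. `shade_pixel`, `render`, and the flat `image` buffer are hypothetical stand-ins, since the real per-pixel calculation is not shown in the post. The loop variables are declared inside the `for` statements so that each thread gets its own copy; when the file is built without OpenMP support, the pragma is ignored and the same function runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the time-consuming per-pixel calculation.
static double shade_pixel(int x, int y)
{
    double v = 0.0;
    for (int i = 1; i <= 100; ++i)
        v += static_cast<double>((x * 31 + y * 17) % i) / i;
    return v;
}

// Renders an xres-by-yres image into a row-major buffer.
// Every pixel is written by exactly one iteration, so rows are independent.
std::vector<double> render(int xres, int yres)
{
    assert(xres >= 0 && yres >= 0);
    std::vector<double> image(static_cast<std::size_t>(xres) * yres);

    // Schedule kind and chunk size are taken from OMP_SCHEDULE at run time.
    #pragma omp parallel for schedule(runtime)
    for (int y = 0; y < yres; y++)
    {
        for (int x = 0; x < xres; x++)
        {
            image[static_cast<std::size_t>(y) * xres + x] = shade_pixel(x, y);
        }
    }
    return image;
}
```

Calling `render(xres, yres)` from a build with and without the OpenMP flag should give identical images, which makes it a convenient base for timing the two builds against each other.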

The first (sequential) version was compiled without any flags.

The second version was compiled with only the "-openmp" flag.

The execution time of the second version, with the number of threads limited to one (OMP_NUM_THREADS=1) and dynamic scheduling (OMP_SCHEDULE="dynamic,4"), turned out to be less than the execution time of the sequential version. That gives a superlinear speedup on one core (1.009). How is that possible?

Execution times (averages over 10 measurements):

sequential code: 2470.428 s

parallelized code: 2448.693 s
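As a quick check of the arithmetic, the speedup is the sequential time divided by the parallel time. The sketch below reproduces the quoted figure from the two averages; the helper names `speedup` and `relative_gain` are illustrative and not from the original program.

```cpp
#include <cassert>

// Speedup = sequential time / parallel time (both in the same unit).
double speedup(double t_sequential, double t_parallel)
{
    assert(t_sequential > 0.0 && t_parallel > 0.0);
    return t_sequential / t_parallel;
}

// Fraction of the sequential run time saved by the second build.
double relative_gain(double t_sequential, double t_parallel)
{
    assert(t_sequential > 0.0);
    return (t_sequential - t_parallel) / t_sequential;
}
```

With the averages above, `speedup(2470.428, 2448.693)` comes out at about 1.009 and `relative_gain(2470.428, 2448.693)` at just under 0.9%, i.e. roughly 22 s out of about 41 minutes.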

I am using the Intel C++ Compiler for Linux (icpc (ICC) 11.0 20081105).

According to the documentation, the -openmp flag will "enable the compiler to generate multi-threaded code based on the OpenMP* directives".

Does the -openmp flag also enable some additional optimizations?

Or does the compiler somehow exploit the information that iterations of the outer loop can be executed in parallel?

Is it possible to say exactly what the compiler does with the information about possible parallelism (the omp pragmas) when only one thread is available?

Om_S_Intel · ‎08-27-2009

The -openmp compiler option is used to recognize OpenMP directives and to generate multithreaded code. The option will also trigger the parallel optimization phase in the compiler.

dexter · ‎08-27-2009

Quoting - Om Sachan (Intel)

The -openmp compiler option is used to recognize OpenMP directives and to generate multithreaded code. The option will also trigger the parallel optimization phase in the compiler.

Could this optimization (together with the omp pragmas) explain the superlinear speedup obtained on one core (one thread)?

srimks · ‎08-27-2009

Quoting - dexter

Could this optimization (together with the omp pragmas) explain the superlinear speedup obtained on one core (one thread)?

By one core, do you mean a single core of a processor (that is, 1/8, 1/4, or 1/2 of the chip, depending on the Intel processor)?

If that is what you mean, then the purpose of OpenMP is to perform COARSE-GRAINED parallelism across the cores of a processor in an SMP system. If you only wish to use a single core, I would not suggest going for OpenMP but rather VECTORIZATION or FINE-GRAINED PARALLELISM, to make better use of the SSE registers rather than the x87 stack, which can give you effective data parallelism.

Restricting the program to a single core of an SMP system can be done with the operating system's THREAD-AFFINITY mechanism.

~BR

srimks · ‎08-27-2009

As you wrote, "compiler utilises in some way the information that iterations of the outer loop can be executed in parallel?".

Could you share the complete compiler command given to execute this code?

~BR

jimdempseyatthecove · ‎08-27-2009

Pushing optimization issues aside:

A potential reason for a difference (though not in your circumstance) arises when a loop is forced to run with more threads than there are hardware threads (e.g. two threads on a system with one hardware thread): the outer loop may be cut in half before the inner loop(s) are run. Depending on the data access pattern, running portions of the larger loop on a machine with a single hardware thread may yield better cache locality.

In your case, I do not believe this to be so. Your outer loop is likely run through the OpenMP version of the code as one piece (the whole loop). Placement of the code may have more to do with differences (~1%).

Jim Dempsey

dexter · ‎08-27-2009

Quoting - srimks

If you only wish to use single core, I would not suggest going for OpenMP but rather perform VECTORIZATION or FINE-GRAINED PARALLELISM to have better use of SSE registers than x87 stack which can give you effective Data-parallelism.

I am aware that OpenMP is not suited for use with one thread/core (a single core in a processor). However, while measuring speedups (for one and more cores), I observed this strange behavior (superlinear speedup on one core) and I am wondering what the reason is.

Quoting - srimks

Could you share the complete command given to execute this code.

Here it is, but I am not sure if this is what you mean:

export OMP_NUM_THREADS=1

export OMP_THREAD_LIMIT=1

export OMP_SCHEDULE="dynamic,4"

./Program input-script.txt
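To confirm that these variables actually reach the program, the run-time settings can be queried from inside it. This is a sketch, not part of the original program: `RuntimeSettings` and `query_runtime_settings` are made-up names, and the OpenMP calls (`omp_get_max_threads`, `omp_get_schedule`, both from OpenMP 3.0) are guarded by `_OPENMP` so the same source also builds without OpenMP support.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Illustrative container for what the OpenMP run-time reports.
struct RuntimeSettings
{
    bool openmp_enabled;  // true if compiled with OpenMP support
    int  max_threads;     // upper bound on threads for the next parallel region
    int  schedule_kind;   // omp_sched_t value used by schedule(runtime), 0 if n/a
    int  chunk_size;      // chunk size used by schedule(runtime), 0 if n/a
};

RuntimeSettings query_runtime_settings()
{
    RuntimeSettings s = { false, 1, 0, 0 };
#ifdef _OPENMP
    omp_sched_t kind;
    int chunk = 0;
    omp_get_schedule(&kind, &chunk);            // reflects OMP_SCHEDULE
    s.openmp_enabled = true;
    s.max_threads    = omp_get_max_threads();   // reflects OMP_NUM_THREADS
    s.schedule_kind  = static_cast<int>(kind);
    s.chunk_size     = chunk;
#endif
    assert(s.max_threads >= 1);
    return s;
}
```

With the environment shown above, an OpenMP build would be expected to report one thread, the dynamic schedule kind, and a chunk size of 4; a build without the OpenMP flag reports `openmp_enabled == false`.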

Quoting - jimdempseyatthecove

Placement of the code may have more to do with differences (~1%)

Could you expand on this statement? What do you mean by "placement of the code"?

srimks · ‎08-27-2009

I mean the complete compiler command used to compile this code. If you are using a Makefile, could you check its CXXFLAGS or CPPFLAGS for icpc and icc respectively?

~BR

jimdempseyatthecove · ‎08-27-2009

"Placement of code": well, code and/or data placement matters.

The cache system typically reads and writes data in lines. Cache lines are (today) 64 bytes long and are transferred singly or in pairs (depending on the system). If an often-used small structure is aligned such that it crosses a cache-line boundary, reading and writing it will touch two cache lines and performance is diminished. A similar thing happens when data accesses cross page boundaries, because the system has only a small number of TLB (translation lookaside buffer) entries. Often a small code change (adding or removing a line or two) can cause a +/- 1% performance change, sometimes a much larger one.

Jim Dempsey
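The cache-line point can be made concrete with a little address arithmetic. This is a minimal sketch, assuming the 64-byte line size mentioned above and a C++11 or later compiler for `alignas`; `kCacheLine`, `crosses_cache_line`, and `Sample` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes (64 on the systems discussed above).
const std::size_t kCacheLine = 64;

// True if an object of `size` bytes starting at `addr` spans two or more lines.
bool crosses_cache_line(std::uintptr_t addr, std::size_t size)
{
    assert(size > 0);
    std::uintptr_t first_line = addr / kCacheLine;
    std::uintptr_t last_line  = (addr + size - 1) / kCacheLine;
    return first_line != last_line;
}

// A small, frequently used structure; alignas(64) makes every instance
// start on a line boundary, so its 32-byte payload never straddles two lines.
struct alignas(64) Sample
{
    double value[4];  // 32 bytes of payload
};
```

For example, an 8-byte value at offset 60 of a line touches two lines (`crosses_cache_line(60, 8)` is true), while the same value at offset 56 touches only one. Whether a given structure ends up in the first or the second situation can change when unrelated code or data is added, which is one way a small edit can shift timings by a percent or so.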