Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

A simple scalable example using OpenMP with tasks

Ami_Marowka
Beginner
Dear All,

I'm looking for a simple example that uses Intel C++ OpenMP tasks but
achieves a speedup of ~8 on an 8-core processor.

Thanks in advance
Ami
SergeyKostrov
Valued Contributor II
Ami_Marowka
Beginner
Dear Sergey,

The links you sent include a TON of examples but NONE of them
is what I'm looking for.

I'm looking for a SIMPLE and SCALABLE example that uses Intel C++ OpenMP TASKS
and achieves a speedup of ~8 on an 8-core processor.

Thanks in advance,
Ami
TimP
Honored Contributor III
There are excellent examples of omp task available by web search, but none making preposterous claims of performance scalability on useful applications.
Your best chance of reaching such a goal might be by avoiding vectorization and making an application which runs local to per-core cache.
Om_S_Intel
Employee
The portion of code executed in serial has a big impact on scaling. It is difficult to get 8x scaling using 8 cores. You may learn more about the impact of serial code from http://en.wikipedia.org/wiki/Amdahl's_law.

Only if the application has no serial code and no overhead is this kind of scaling possible.
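
For example, by Amdahl's law the best achievable speedup is S(N) = 1 / ((1 - p) + p/N), where p is the fraction of the work that runs in parallel. Even assuming p = 0.95, eight cores give at most 1 / (0.05 + 0.95/8) ≈ 5.9x.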

Om
Ami_Marowka
Beginner

Dear TimP and Om,

I'm aware of Amdahl's Law, etc.

However, it seems that the overhead incurred by using tasks is very high.

Even with a simple example (see below) I get only a speedup of 1.85 on a 2-core machine.

BTW, there are many examples out there that claim super-linear speedup, for example, Fibonacci implemented with OpenMP tasks and its artificial cutoff parameter.

int main(int argc, char *argv[]) {
    /* 'n' is a placeholder: the original loop bounds and bodies were truncated
       by the forum's handling of '<' */
    int n = 1000000;

#pragma omp parallel
    {
#pragma omp single
        {
#pragma omp task
            {
                for (int i = 0; i < n; i++) { /* first task's work */ }
            }

#pragma omp task
            {
                for (int i = 0; i < n; i++) { /* second task's work */ }
            }
        }
    } // End of parallel region

    return (0);
}
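
For reference, the Fibonacci-with-cutoff pattern I mentioned looks roughly like this (a sketch; the cutoff value of 20 is arbitrary):

long fib_serial(int n) { return (n < 2) ? n : fib_serial(n - 1) + fib_serial(n - 2); }

long fib(int n)
{
    if (n < 20) return fib_serial(n);   // cutoff: below this, task overhead dominates the work
    long x, y;
#pragma omp task shared(x)
    x = fib(n - 1);
#pragma omp task shared(y)
    y = fib(n - 2);
#pragma omp taskwait
    return x + y;
}

// called from a parallel region, e.g.:
// #pragma omp parallel
// #pragma omp single
//     result = fib(40);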

mecej4
Honored Contributor III
Super-linear scaling is an interesting curiosity, one that is easily explained when there is a memory hierarchy with different latencies and, perhaps, in other scenarios.

One should refrain from drawing lessons from the parallelization of trivial programs. In such trivial programs, the thread creation, synchronization, etc., can occupy the machine more than the trivial payload itself.

For a non-trivial calculation, I would be grateful for a speedup of 1.85 on a 2-core machine, instead of qualifying the gain by 'only', as you did.
Ami_Marowka
Beginner

Dear mecej4,

I agree with all you wrote. But

Parallelization of trivial programs is the best way (in my opinion) to describe a new language/model. Moreover, it is important to show not only a working example but a scalable one as well.

OpenMP, as a thread-based parallel programming model, offers such examples (the famous Pi example, etc., that scale well even with 8 or 12 cores).

Today, task-based parallel programming is the "bon ton" (Intel TBB, Microsoft TPL and OpenMP 3.0). However, I have not found "trivial examples", as you put it, that scale well with OpenMP tasks. Maybe that is because the overhead (creation, synchronization, etc.) incurred by tasks is higher than that of threads.
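
For reference, the Pi example I mean is roughly the following (a sketch; the number of steps is arbitrary):

#include <omp.h>
#include <stdio.h>

int main( void )
{
    const long num_steps = 100000000;        // arbitrary number of integration steps
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;

#pragma omp parallel for reduction( +:sum )
    for ( long i = 0; i < num_steps; i++ )
    {
        double x = ( i + 0.5 ) * step;       // midpoint of the i-th interval
        sum += 4.0 / ( 1.0 + x * x );        // integrand of 4/(1+x^2) on [0,1]
    }

    printf( "pi = %.10f\n", step * sum );
    return 0;
}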

mecej4
Honored Contributor III
Consider the most trivial example: parallelizing a do-nothing block

for(i=0; i<100; i++){}

and let us suppose that we prevent the compiler from optimizing this into zero bytes of code.

In this case, the time required for the serial program is entirely overhead: setting the counter i to 0; incrementing i, and comparing against the upper bound for i.

Now, we parallelize it. The creation of threads, managing their terminations, etc. can be more time consuming than the execution of the trivial loop.

For N threads, a plausible expression for the time consumed is

T = a·N + b/N + c

The first part represents the thread management, the second the parallelized loop counting, and the third part the fixed overhead. The time consumed varies non-monotonically with N. There is, in fact, an optimum number of threads, and more parallelism is no panacea.
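
A minimal sketch of how one could observe this trend (assuming an OpenMP compiler; the iteration count of 100 is arbitrary and the actual timings are machine dependent):

#include <omp.h>
#include <stdio.h>

int main( void )
{
    for ( int n = 1; n <= 16; n *= 2 )
    {
        int sum = 0;
        double t0 = omp_get_wtime();

#pragma omp parallel num_threads( n ) reduction( +:sum )
        {
#pragma omp for
            for ( int i = 0; i < 100; i++ )   // near-trivial payload
                sum += 1;
        }

        printf( "%2d threads: %g seconds (sum = %d)\n", n, omp_get_wtime() - t0, sum );
    }
    return 0;
}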
SergeyKostrov
Valued Contributor II

Hi Ami,

Some time ago I implemented a simple OpenMP-based test case. It doesn't use 'task', but it allows you to specify the number of threads ( 2, 4, or 8 ) to do some processing. You will need to replace all:

- 'CrtXxxxxx' \ 'SysXxxxx'-like functions with the corresponding CRT or Win32 functions;
- 'RTxxx'-like declarations with the corresponding built-in data types;
- the 'RTU(...)' wrapper with '_T(...)' for UNICODE or MBCS support;

because I'm providing the code as is. So, please take a look and I hope that it will help.

Best regards,
Sergey

///////////////////////////////////////////////////////////////////////////////
// OpenMpTestApp.cpp

#include "Stdphf.h"

///////////////////////////////////////////////////////////////////////////////

RTvoid RunTest001( RTint iNumThreads );   // Tests OpenMP functionality

///////////////////////////////////////////////////////////////////////////////

RTdword g_dwTicksStart = 0;
RTdword g_dwTicksEnd = 0;
RTdword g_dwRunTime = 0;

///////////////////////////////////////////////////////////////////////////////

//#define NUM_OF_ITERATIONS 1
//#define NUM_OF_ITERATIONS 2
//#define NUM_OF_ITERATIONS 4
//#define NUM_OF_ITERATIONS 8
//#define NUM_OF_ITERATIONS 16
#define NUM_OF_ITERATIONS 32

///////////////////////////////////////////////////////////////////////////////
// Test001 - Tests OpenMP functionality

RTvoid ( *g_pFunction2[8] )( RTvoid );

RTvoid Function20( RTvoid ){ CrtPrintf( RTU("Function20 executed\n") ); }
RTvoid Function21( RTvoid ){ CrtPrintf( RTU("Function21 executed\n") ); }
RTvoid Function22( RTvoid ){ CrtPrintf( RTU("Function22 executed\n") ); }
RTvoid Function23( RTvoid ){ CrtPrintf( RTU("Function23 executed\n") ); }
RTvoid Function24( RTvoid ){ CrtPrintf( RTU("Function24 executed\n") ); }
RTvoid Function25( RTvoid ){ CrtPrintf( RTU("Function25 executed\n") ); }
RTvoid Function26( RTvoid ){ CrtPrintf( RTU("Function26 executed\n") ); }
RTvoid Function27( RTvoid ){ CrtPrintf( RTU("Function27 executed\n") ); }

RTvoid RunTest001( RTint iNumThreads )
{
CrtPrintf( RTU("> Test001 Start <\n") );

g_dwTicksStart = g_dwTicksEnd = g_dwRunTime = 0;
g_dwTicksStart = SysGetTickCount();

g_pFunction2[0] = Function20;
g_pFunction2[1] = Function21;
g_pFunction2[2] = Function22;
g_pFunction2[3] = Function23;
g_pFunction2[4] = Function24;
g_pFunction2[5] = Function25;
g_pFunction2[6] = Function26;
g_pFunction2[7] = Function27;

g_dwTicksEnd = SysGetTickCount();
g_dwRunTime = ( g_dwTicksEnd - g_dwTicksStart );

if( iNumThreads == 2 )
{
#pragma omp parallel num_threads( 2 )
{
for( RTint i = 0; i < NUM_OF_ITERATIONS; i++ )
{
RTint iThreadNum = omp_get_thread_num();

CrtPrintf( RTU("Processing started for Thread %d - Iteration %d\n"), iThreadNum, i+1 );

if( g_pFunction2[ iThreadNum ] != RTnull )
( *g_pFunction2[ iThreadNum ] )();

#pragma omp for
for( RTint iN = 0; iN < ( 16777216 ); iN++ )
{
RTfloat fValue = 3.1415926f;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
}

CrtPrintf( RTU("Processing done for Thread %d\n"), iThreadNum );
}
}
}
else
if( iNumThreads == 4 )
{
#pragma omp parallel num_threads( 4 )
{
for( RTint i = 0; i < NUM_OF_ITERATIONS; i++ )
{
RTint iThreadNum = omp_get_thread_num();

CrtPrintf( RTU("Processing started for Thread %d - Iteration %d\n"), iThreadNum, i+1 );

if( g_pFunction2[ iThreadNum ] != RTnull )
( *g_pFunction2[ iThreadNum ] )();

#pragma omp for
for( RTint iN = 0; iN < ( 16777216 ); iN++ )
{
RTfloat fValue = 3.1415926f;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
}

CrtPrintf( RTU("Processing done for Thread %d\n"), iThreadNum );
}
}
}
else
if( iNumThreads == 8 )
{
#pragma omp parallel num_threads( 8 )
{
for( RTint i = 0; i < NUM_OF_ITERATIONS; i++ )
{
RTint iThreadNum = omp_get_thread_num();

CrtPrintf( RTU("Processing started for Thread %d - Iteration %d\n"), iThreadNum, i+1 );

if( g_pFunction2[ iThreadNum ] != RTnull )
( *g_pFunction2[ iThreadNum ] )();

#pragma omp for
for( RTint iN = 0; iN < ( 16777216 ); iN++ )
{
RTfloat fValue = 3.1415926f;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
fValue *= fValue;
}

CrtPrintf( RTU("Processing done for Thread %d\n"), iThreadNum );
}
}
}

CrtPrintf( RTU("\nCompleted in %d ticks\n"), g_dwRunTime );

CrtPrintf( RTU("> Test001 End <\n") );
}

///////////////////////////////////////////////////////////////////////////////

RTint CrtMain( RTint iArgc, RTtchar *pszArgv[] )
{
RTbool bOk = RTfalse;

CrtPrintf( RTU("Tests: Start\n") );

RTint iNumThreads = 1;

if( iArgc == 2 )
iNumThreads = CrtAtoi( pszArgv[1] );
else
// iNumThreads = 2;
//iNumThreads = 4;
iNumThreads = 8;

while( RTtrue )
{
if( iNumThreads == 0 )
break;

RunTest001( iNumThreads );   // Tests OpenMP functionality

bOk = RTtrue;
break;
}

CrtPrintf( RTU("Tests: Completed\n") );

return ( RTint )0;
}

///////////////////////////////////////////////////////////////////////////////
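
If you want to try the same inner loop with tasks, one possible variant (an untested sketch, splitting the 16777216 iterations into chunks) could look like this:

// Untested sketch: each task processes one chunk of the 16777216 iterations
const RTint iChunks = 8;
const RTint iChunkSize = 16777216 / iChunks;

#pragma omp parallel num_threads( iNumThreads )
{
    #pragma omp single
    {
        for( RTint iC = 0; iC < iChunks; iC++ )
        {
            #pragma omp task firstprivate( iC )
            {
                for( RTint iN = iC * iChunkSize; iN < ( iC + 1 ) * iChunkSize; iN++ )
                {
                    RTfloat fValue = 3.1415926f;
                    fValue *= fValue;
                }
            }
        }
        #pragma omp taskwait
    }
}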

