I'm looking for a simple example that uses Intel C++ OpenMP tasks and achieves a speedup of ~8 on an 8-core processor.
Thanks in advance
Ami
http://software.intel.com/en-us/articles/code-downloads/
OpenMP & Win32 example:
http://software.intel.com/en-us/articles/openmp-and-win32-threads-usage-example
The links you sent include a TON of examples but NONE of them
is what I'm looking for.
I'm looking for a SIMPLE and SCALABLE example that uses Intel C++ OpenMP TASKS
and achieves a speedup of ~8 on an 8-core processor.
Thanks in advance,
Ami
Your best chance of reaching such a goal might be to avoid vectorization and to build an application whose working set stays local to each core's cache.
This kind of scaling is possible only if the application has no serial code and no overhead.
Om
Dear TimP and Om,
I'm aware of Amdahl's Law, etc.
However, it seems that the overhead incurred by using tasks is very high.
Even with a simple example (see below) I get a speedup of only 1.85 on a 2-core machine.
BTW, there are many examples out there that claim super-linear speedup, for example Fibonacci implemented with OpenMP tasks and its artificial cutoff parameter.
int main(int argc, char *argv[]) {
#pragma omp parallel
    {
#pragma omp single
        {
#pragma omp task
            {
                for (int i = 0; i < N; i++) { /* loop bound N and body were truncated in the original post */ }
            }
#pragma omp task
            {
                for (int i = 0; i < N; i++) { /* truncated in the original post */ }
            }
        }
    } // End of parallel region
    return (0);
}
One should refrain from drawing lessons from the parallelization of trivial programs. In such trivial programs, the thread creation, synchronization, etc., can occupy the machine more than the trivial payload itself.
For a non-trivial calculation, I would be grateful for a speedup of 1.85 on a 2-core machine, instead of qualifying the gain by 'only', as you did.
Dear mecej4,
I agree with all you wrote, but parallelization of trivial programs is (in my opinion) the best way to introduce a new language or model. Moreover, it is important to show not only a working example but a scalable one as well.
OpenMP, as a thread-based parallel programming model, offers such examples (the famous Pi example, etc., which scales well even with 8 or 12 cores).
Today, task-based parallel programming is in fashion (Intel TBB, Microsoft TPL and OpenMP 3.0). However, I have not found "trivial examples", as you said, that scale well with OpenMP tasks. Maybe that is because the overhead (creation, synchronization, etc.) incurred by tasks is higher than for threads.
Consider a trivial loop such as
for(i=0; i<100; i++){}
and let us suppose that we prevent the compiler from optimizing this into zero bytes of code.
In this case, the time required by the serial program is entirely overhead: setting the counter i to 0, incrementing i, and comparing it against its upper bound.
Now we parallelize it. The creation of threads, the management of their termination, etc., can be more time-consuming than the execution of the trivial loop itself.
For N threads, a plausible expression for the time consumed is
T = a*N + b/N + c
The first term represents thread management, the second the parallelized loop counting, and the third the fixed overhead. The time consumed varies non-monotonically with N. There is, in fact, an optimum number of threads, and more parallelism is no panacea.
Hi Ami,
Some time ago I implemented a simple OpenMP-based test case. It doesn't use 'task', but it allows you to specify the number of threads (2, 4, or 8) used to do some processing. You will need to replace:
- 'CrtXxxxxx' / 'SysXxxxx'-like functions with the corresponding CRT or Win32 functions;
- 'RTxxx'-like declarations with the corresponding built-in data types;
- the 'RTU(...)' wrapper with '_T(...)' for UNICODE or MBCS support;
because I'm providing the code as is. So, please take a look and I hope it will help.
Best regards,
Sergey
///////////////////////////////////////////////////////////////////////////////
// OpenMpTestApp.cpp
#include "Stdphf.h"
///////////////////////////////////////////////////////////////////////////////
RTvoid RunTest001( RTint iNumThreads );// Tests OpenMP functionality
///////////////////////////////////////////////////////////////////////////////
RTdword g_dwTicksStart = 0;
RTdword g_dwTicksEnd = 0;
RTdword g_dwRunTime = 0;
///////////////////////////////////////////////////////////////////////////////
//#define NUM_OF_ITERATIONS 1
//#define NUM_OF_ITERATIONS 2
//#define NUM_OF_ITERATIONS 4
//#define NUM_OF_ITERATIONS 8
//#define NUM_OF_ITERATIONS 16
#define NUM_OF_ITERATIONS 32
///////////////////////////////////////////////////////////////////////////////
// Test001 - Tests OpenMP functionality
RTvoid ( *g_pFunction2[8] )( RTvoid );
RTvoid Function20( RTvoid ){ CrtPrintf( RTU("Function20 executed\n") ); }
RTvoid Function21( RTvoid ){ CrtPrintf( RTU("Function21 executed\n") ); }
RTvoid Function22( RTvoid ){ CrtPrintf( RTU("Function22 executed\n") ); }
RTvoid Function23( RTvoid ){ CrtPrintf( RTU("Function23 executed\n") ); }
RTvoid Function24( RTvoid ){ CrtPrintf( RTU("Function24 executed\n") ); }
RTvoid Function25( RTvoid ){ CrtPrintf( RTU("Function25 executed\n") ); }
RTvoid Function26( RTvoid ){ CrtPrintf( RTU("Function26 executed\n") ); }
RTvoid Function27( RTvoid ){ CrtPrintf( RTU("Function27 executed\n") ); }
RTvoid RunTest001( RTint iNumThreads )
{
    CrtPrintf( RTU("> Test001 Start <\n") );
    g_dwTicksStart = g_dwTicksEnd = g_dwRunTime = 0;
    g_pFunction2[0] = Function20;
    g_pFunction2[1] = Function21;
    g_pFunction2[2] = Function22;
    g_pFunction2[3] = Function23;
    g_pFunction2[4] = Function24;
    g_pFunction2[5] = Function25;
    g_pFunction2[6] = Function26;
    g_pFunction2[7] = Function27;
    g_dwTicksStart = SysGetTickCount();
    // One parallel region covers the 2-, 4- and 8-thread cases:
    // the num_threads clause accepts a runtime expression
    #pragma omp parallel num_threads( iNumThreads )
    {
        for( RTint i = 0; i < NUM_OF_ITERATIONS; i++ )
        {
            RTint iThreadNum = omp_get_thread_num();
            CrtPrintf( RTU("Processing started for Thread %d - Iteration %d\n"), iThreadNum, i+1 );
            if( iThreadNum < 8 )    // the function table has 8 entries
                ( *g_pFunction2[ iThreadNum ] )();
            #pragma omp for
            for( RTint iN = 0; iN < ( 16777216 ); iN++ )
            {
                // volatile prevents the compiler from removing the payload entirely
                volatile RTfloat fValue = 3.1415926f;
                fValue *= fValue;
                fValue *= fValue;
                fValue *= fValue;
                fValue *= fValue;
                fValue *= fValue;
            }
            CrtPrintf( RTU("Processing done for Thread %d\n"), iThreadNum );
        }
    }
    // Measure the parallel work, not the pointer assignments above
    g_dwTicksEnd = SysGetTickCount();
    g_dwRunTime = ( g_dwTicksEnd - g_dwTicksStart );
    CrtPrintf( RTU("\nCompleted in %d ticks\n"), g_dwRunTime );
    CrtPrintf( RTU("> Test001 End <\n") );
}
///////////////////////////////////////////////////////////////////////////////
RTint CrtMain( RTint iArgc, RTtchar *pszArgv[] )
{
RTbool bOk = RTfalse;
CrtPrintf( RTU("Tests: Start\n") );
RTint iNumThreads = 1;
if( iArgc == 2 )
iNumThreads = CrtAtoi( pszArgv[1] );
else
// iNumThreads = 2;
//iNumThreads = 4;
iNumThreads = 8;
while( RTtrue )
{
if( iNumThreads == 0 )
break;
RunTest001( iNumThreads );// Tests OpenMP functionality
bOk = RTtrue;
break;
}
CrtPrintf( RTU("Tests: Completed\n") );
return ( RTint )0;
}
///////////////////////////////////////////////////////////////////////////////