- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
we are trying to use TBB libs to raise performance of existing code. So, we wrote some lines of code to test improvements, but the results we obtained were not suitable.
In particular, we used parallel_for on a Core Duo Processor T2400, using both ICC and GCC compilers, and we always get high computing timings.
Here is the code we used for the test:
#include "parallel_for.h"
#include "blocked_range.h"
#include "task_scheduler_init.h"
#include "tick_count.h"
#include
#include
using namespace tbb;
using namespace std;
class Parallel
{
public:
float* input;
float* output;
void operator( )( const blocked_range
{
int a, randomNumber;
srand(0);
for ( int i=range.begin(); i!=range.end( ) ; ++i )
{
a = input;
randomNumber = rand();
a*= randomNumber;
}
}
};
inline void parallelOperation( float* output, float* input, size_t n )
{
Parallel par;
par.input = input;
par.output = output;
parallel_for( blocked_range
}
void doIteration(int nIterations)
{
}
int main(int argc, char * argv)
{
task_scheduler_init init;
double totalTimeSerial = 0;
double totalTimeParallel = 0;
int nIterations = 100;
const int nValues = 20000;
float pollo [nValues];
float pollame[nValues];
srand(0);
for (int i=0; i
pollo = i;
}
int a, randomNumber;
tick_count start = tick_count::now();
// call serial version nIterations times
for (int k = 0; k < nIterations; k++)
{
srand(0);
for (int i=0; i
a = pollo;
randomNumber = rand();
a*= randomNumber;
}
}
tick_count end = tick_count::now();
totalTimeSerial = (end - start).seconds();
start = tick_count::now();
// call parallel version nIterations times
for (int k = 0; k < nIterations; k++)
{
parallelOperation(pollame, pollo, nValues);
}
end = tick_count::now();
totalTimeParallel = (end - start).seconds();
// take mean times
cout << "mean time serial = " << totalTimeSerial/(double)nIterations << endl;
cout << "mean time parallel = " << totalTimeParallel/(double)nIterations << endl;
return 0;
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Result of execution is this:
mean time serial = 0.0014703 sec
mean time parallel = 0.0114537 sec
mean time parallel is 10 times higher!! Why?
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You are calling srand() for each piece of the range and rand() for every iteration. But unfortunately, they have global lock inside because them wasn't designed for concurrent execution. It serializes your parallel loop and adds additional overhead due to synchronizaion (and multiple srand() calls).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
#include "parallel_for.h"
#include "blocked_range.h"
#include "task_scheduler_init.h"
#include "tick_count.h"
#include
#include
using namespace tbb;
using namespace std;
class Parallel
{
public:
float* input;
float* output;
void operator( )( const blocked_range& range ) const
{
int a;
for ( int i=range.begin(); i!=range.end( ) ; ++i )
{
a = input;
a*= sqrt(double(i));
a/=i*(a*2.5);
a*= a + 3*i - 2*a*i;
a/= a*2 - 6.2*a*i;
a*=atan2(sqrt(double(a)), sqrt(double(i)));
a+=1000*sqrt(atan((float(a))*atan(float(i))) );
}
}
};
void parallelOperation( float* output, float* input, size_t n)
{
Parallel par;
par.input = input;
par.output = output;
parallel_for( blocked_range( 0, n, n/2), par);
}
int main(int argc, char * argv)
{
task_scheduler_init init;
double totalTimeSerial = 0;
double totalTimeParallel = 0;
int nIterations = 1000000;
const int nValues = 50000;
float pollo [nValues];
f loat pollame[nValues];
for (int i=0; i
{
pollo = i;
}
int a, randomNumber;
tick_count start = tick_count::now();
// call serial version
for (int i=0; i
{
a = pollo;
a*= sqrt(double(i));
a/=i*(a*2.5);
a*= a + 3*i - 2*a*i;
a/= a*2 - 6.2*a*i;
a*=atan2(sqrt(double(a)), sqrt(double(i)));
a+=1000*sqrt(atan((float(a))*atan(float(i))) );
}
tick_count end = tick_count::now();
totalTimeSerial = (end - start).seconds();
start = tick_count::now();
// call parallel version
parallelOperation(pollame, pollo, nValues);
end = tick_count::now();
totalTimeParallel = (end - start).seconds();
// measure timings
cout << "time serial = " << totalTimeSerial << endl;
cout << "time parallel = " << totalTimeParallel << endl;
return 0;
}
The output timings are these:
time serial = 2.375e-06
time parallel = 0.00048312
As you can see time parallel is about 200 times higher than time serial
We tried also to use autopartitioner,but the results get worse:
time serial = 1.816e-06
time parallel = 0.00147353
We use following compilation options:
CFLAGS = -O2 -fast -xT -msse3 -ip
We're not able to write a sample code that demonstrate that using tbb is more performant than using classic code. How can we explain to our boss that this technology is better, if we cannot demonstrate it? ;-)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You did calculations using a local variable, but never used the result of those calculations.The latest Intel 10.1 compiler I tried just optimizes the loop away, for both serial and parallel version. I would assume the compiler you used was able to optimize away the serial version but not the parallel one.
So change the test so that the result of the calculations is written somewhere. I assumed the pollame array (also referenced as output) was intended for that. Even better if you then do some math over pollame, such as calculating some aggregate function (e.g. sum, or average), and print the result.
Also 50000 iterations is too small a number to efficiently parallelize it, given the amount of calculations per iteration. This is possibly the reason why auto_partitioner showed worse numbers (and I expect that you removed explicit grainsize of n/2 when using auto_partitioner, didn't you?).
In the attachement, you could see the patch I made to your code. With these changes, I got the following result on a system with 2 quad-core processors:
time serial (run 0) = 0.066917
time serial (run 1) = 0.065347
time parallel (2 threads) = 0.038082
time parallel (3 threads) = 0.029991
time parallel (4 threads) = 0.024736
time parallel (5 threads) = 0.022397
time parallel (6 threads) = 0.024379
time parallel (7 threads) = 0.022241
time parallel (8 threads) = 0.016181
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page