Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

parallel_for performance problem

bucciolini
Beginner
Hi,
we are trying to use the TBB libraries to improve the performance of existing code, so we wrote a small test to measure the gain, but the results were not what we expected.
In particular, we used parallel_for on a Core Duo T2400 processor, compiling with both ICC and GCC, and the parallel timings are always high.
Here is the code we used for the test:


#include "parallel_for.h"
#include "blocked_range.h"
#include "task_scheduler_init.h"
#include "tick_count.h"
#include
#include


using namespace tbb;
using namespace std;

class Parallel
{
public:
    float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const
    {
        int a, randomNumber;
        srand(0);
        for ( int i = range.begin(); i != range.end(); ++i )
        {
            a = input[i];
            randomNumber = rand();
            a *= randomNumber;
        }
    }
};

inline void parallelOperation( float* output, float* input, size_t n )
{
    Parallel par;
    par.input = input;
    par.output = output;

    parallel_for( blocked_range<int>( 0, int(n) ), par, auto_partitioner() );
}

void doIteration(int nIterations)
{

}

int main(int argc, char* argv[])
{
task_scheduler_init init;
double totalTimeSerial = 0;
double totalTimeParallel = 0;
int nIterations = 100;
const int nValues = 20000;
float pollo [nValues];
float pollame[nValues];

srand(0);

for (int i=0; i < nValues; i++)
{
pollo[i] = (float)i;
}

int a, randomNumber;
tick_count start = tick_count::now();

// call serial version nIterations times
for (int k = 0; k < nIterations; k++)
{
srand(0);
for (int i=0; i < nValues; i++)
{
a = pollo[i];
randomNumber = rand();
a*= randomNumber;
}
}

tick_count end = tick_count::now();
totalTimeSerial = (end - start).seconds();
start = tick_count::now();

// call parallel version nIterations times
for (int k = 0; k < nIterations; k++)
{
parallelOperation(pollame, pollo, nValues);
}

end = tick_count::now();
totalTimeParallel = (end - start).seconds();

// take mean times
cout << "mean time serial = " << totalTimeSerial/(double)nIterations << endl;
cout << "mean time parallel = " << totalTimeParallel/(double)nIterations << endl;

return 0;

}




////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Result of execution is this:

mean time serial = 0.0014703 sec
mean time parallel = 0.0114537 sec

The mean parallel time is almost 8 times higher than the serial one!! Why?


Anton_M_Intel
Employee
Actually, the thread could be renamed to "rand() concurrency problem".
You are calling srand() for every piece of the range and rand() on every iteration. Unfortunately, these functions have a global lock inside because they were not designed for concurrent execution. That serializes your parallel loop and adds extra overhead due to the synchronization (and the repeated srand() calls).
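For illustration only (this sketch is not part of the original reply): one way to keep per-element pseudo-random numbers without touching the shared rand() state is to give each range its own generator. The class name ParallelNoLock and the LCG constants below are assumptions made for the sketch, not code from the thread:

class ParallelNoLock
{
public:
    float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const
    {
        // Private generator state per chunk: no global lock, no serialization.
        unsigned int state = (unsigned int)range.begin() * 2654435761u + 1u;
        for ( int i = range.begin(); i != range.end(); ++i )
        {
            state = state * 1664525u + 1013904223u;           // LCG step (Numerical Recipes constants)
            output[i] = input[i] * (float)(state & 0x7fffu);  // store the result so the work is not dead code
        }
    }
};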

bucciolini
Beginner
Thank you for the fast reply! We replaced the calls to rand() with some arithmetic operations, but the result is still the same. Here is the modified code:

#include "parallel_for.h"
#include "blocked_range.h"
#include "task_scheduler_init.h"
#include "tick_count.h"
#include 
#include 
using namespace tbb;
using namespace std;
class Parallel
{
public:
 float* input;
 float* output;
 void operator()( const blocked_range<int>& range ) const
 {
 int a;
 for ( int i=range.begin(); i!=range.end( ) ; ++i )
 {
 a = input[i];
 a*= sqrt(double(i));
 a/=i*(a*2.5);
 a*= a + 3*i - 2*a*i;
 a/= a*2 - 6.2*a*i;
 a*=atan2(sqrt(double(a)), sqrt(double(i)));
 a+=1000*sqrt(atan((float(a))*atan(float(i))) );
 }
 }
};
void parallelOperation( float* output, float* input, size_t n)
{
 Parallel par;
 par.input = input;
 par.output = output;
 parallel_for( blocked_range<int>( 0, int(n), n/2 ), par );
}
int main(int argc, char* argv[])
{
 task_scheduler_init init;
 double totalTimeSerial = 0;
 double totalTimeParallel = 0;
 int nIterations = 1000000;
 const int nValues = 50000;
 float pollo [nValues];
 float pollame[nValues];
 
 for (int i=0; i < nValues; i++)
 {
 pollo[i] = (float)i;
 }
 
 int a;
 tick_count start = tick_count::now();
 // call serial version
 for (int i=0; i < nValues; i++)
 {
 a = pollo[i];
 a*= sqrt(double(i));
 a/=i*(a*2.5);
 a*= a + 3*i - 2*a*i;
 a/= a*2 - 6.2*a*i;
 a*=atan2(sqrt(double(a)), sqrt(double(i)));
 a+=1000*sqrt(atan((float(a))*atan(float(i))) );
 }
 tick_count end = tick_count::now();
 totalTimeSerial = (end - start).seconds();
 start = tick_count::now();
 // call parallel version
 parallelOperation(pollame, pollo, nValues);
 end = tick_count::now();
 totalTimeParallel = (end - start).seconds();
 
 // measure timings
 cout << "time serial = " << totalTimeSerial << endl;
 cout << "time parallel = " << totalTimeParallel << endl;
 return 0;
}

The output timings are these:

time serial = 2.375e-06
time parallel = 0.00048312

As you can see, the parallel time is about 200 times higher than the serial time.

We also tried auto_partitioner, but the results got worse:
time serial = 1.816e-06
time parallel = 0.00147353

We use the following compilation options:
CFLAGS = -O2 -fast -xT -msse3 -ip

We're not able to write a sample that shows TBB outperforming plain serial code. How can we explain to our boss that this technology is better if we cannot demonstrate it? ;-)
Alexey-Kukanov
Employee

You did the calculations in a local variable but never used the result. The latest Intel 10.1 compiler I tried simply optimizes the loop away, for both the serial and the parallel version. I would assume the compiler you used was able to optimize away the serial version but not the parallel one.

So change the test so that the result of the calculations is written somewhere; I assume the pollame array (also referenced as output) was intended for that. Even better, then do some math over pollame, such as an aggregate function (e.g. sum or average), and print the result.
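For illustration only (the attached patch itself is not reproduced in the thread), the body could store its result per element and the program could print a checksum afterwards; the simplified arithmetic below is an assumption, not the exact patch:

// Sketch of a body that writes its result, so the compiler cannot discard the loop.
void operator()( const blocked_range<int>& range ) const
{
    for ( int i = range.begin(); i != range.end(); ++i )
    {
        float a = input[i];
        a *= (float)sqrt( double(i) + 1.0 );
        a += 1000.0f * (float)atan( (double)i );
        output[i] = a;   // the result is actually used now
    }
}

// After timing, consume pollame so the whole computation stays live:
float checksum = 0;
for ( int i = 0; i < nValues; i++ )
    checksum += pollame[i];
cout << "checksum = " << checksum << endl;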

Also, 50000 iterations is too small a number to parallelize efficiently, given the small amount of calculation per iteration. That is probably why auto_partitioner showed worse numbers (and I expect you removed the explicit grainsize of n/2 when you used auto_partitioner, didn't you?).
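For reference, the two invocation forms being compared look roughly like this; the grainsize value of 10000 is only an illustrative number, not a recommendation from this reply:

// Explicit grainsize: the default simple_partitioner splits the range
// until each chunk is no larger than 10000 elements.
parallel_for( blocked_range<int>( 0, int(n), 10000 ), par );

// No explicit grainsize: auto_partitioner chooses chunk sizes at run time.
parallel_for( blocked_range<int>( 0, int(n) ), par, auto_partitioner() );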

In the attachment you can see the patch I made to your code. With these changes, I got the following results on a system with two quad-core processors:

time serial (run 0) = 0.066917
time serial (run 1) = 0.065347
time parallel (2 threads) = 0.038082
time parallel (3 threads) = 0.029991
time parallel (4 threads) = 0.024736
time parallel (5 threads) = 0.022397
time parallel (6 threads) = 0.024379
time parallel (7 threads) = 0.022241
time parallel (8 threads) = 0.016181