Intel® oneAPI Threading Building Blocks

My TBB code is slower than the sequential code.


I'm a brand-new TBB user, and I am trying to parallelize some code so that it runs faster on a 48-core machine.

I'm currently parallelizing a sequential code with TBB, but the parallel version is much slower. I don't know whether the data I use is simply too small, but I doubt it: I tried other data sets and always got very poor performance compared to the sequential code.

I implemented the code with TBB in two ways: with parallel_reduce and with parallel_for. I get slightly better performance with parallel_for, but it is still far too slow compared to the sequential code.

Information about the two versions I implemented:

For the parallel_reduce version, no data is shared except a few references (no arrays are copied), and the results are merged in the join method (the merge costs almost nothing because I'm using std::list).

For the parallel_for version, many references to arrays are shared, so I have to use tbb::concurrent_vector to avoid conflicts.

Time spent :

I set the number of threads/cores manually, and after a quick check I noticed that I get the best performance with only one thread/core (out of 48 cores!), but even with one core the run is still much slower than the original sequential version of the code.

Typically I measure 7 ms with the original sequential version, 17 ms with the parallel_reduce version, and 12 ms with the parallel_for version.
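
Timings of a few milliseconds can be distorted by one-off costs, such as worker threads being started on the first parallel call, so it may help to repeat each version several times and keep the fastest run. Below is a minimal sketch using only std::chrono; the helper name bestOfMs is hypothetical and not part of TBB.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper (not a TBB API): runs `work` `repeats` times and
// returns the duration of the fastest run in milliseconds.
template <typename Work>
double bestOfMs(Work work, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        best = std::min(best, ms);  // keep the least disturbed measurement
    }
    return best;
}
```

The same wrapper can be put around the sequential call and around the tbb::parallel_reduce call, so that both numbers are measured the same way and the first (warm-up) run does not dominate.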

Classes and code:

Here are some parts of the code to help you find the problem:

For the parallel_reduce version this is my structure :

struct parallelReduceVersion
{
    int mx;
    int my;
    int mz;

    CenteredGrid3D &data;
    const float c;

    std::list<:MATRIX> > v;
    std::list<:MATRIX> > n;
    std::list t;

    std::vector<:MATRIX>* > &nPointers;
    std::vector<:MATRIX>* > &xVPointers;
    std::vector<:MATRIX>* > &yVPointers;
    std::vector<:MATRIX>* > &zVPointers;

    /* optimization */
    int x,y,z;
    float a[8];
    unsigned char lookupTableEntry;
    unsigned char case;
    unsigned char config;
    unsigned char subConfig;

    parallelReduceVersion(CenteredGrid3D &data, float c,
                          std::vector<:MATRIX>* > &nPointers,
                          std::vector<:MATRIX>* > &xVPointers,
                          std::vector<:MATRIX>* > &yVPointers,
                          std::vector<:MATRIX>* > &zVPointers)
        : data(data), c(c), nPointers(nPointers), xVPointers(xVPointers), yVPointers(yVPointers), zVPointers(zVPointers)
    {
        mx = data.geometry().nx();
        my = data.geometry().ny();
        mz = data.geometry().nz();
    }

    /* splitting constructor */
    parallelReduceVersion(const parallelReduceVersion& smc, tbb::split)
        : data(smc.data), c(smc.c), nPointers(smc.nPointers), xVPointers(smc.xVPointers), yVPointers(smc.yVPointers), zVPointers(smc.zVPointers)
    {
        mx = smc.mx;
        my = smc.my;
        mz = smc.mz;
    }

    /* methods */

    void operator()(const tbb::blocked_range& r);

    void join(parallelReduceVersion &smc)
    {
        t.splice(t.end(), smc.t);
    }
};
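
The split/process/join sequence that the join method relies on can be sketched without TBB. This is a sequential stand-in under stated assumptions, not the TBB scheduler: the names CollectEven and collectInTwoHalves are hypothetical. It shows that splice relinks the right-hand nodes onto the left-hand list without copying them.

```cpp
#include <cassert>
#include <list>

// Hypothetical body: collects the even indices of a range into a local list.
struct CollectEven {
    std::list<int> found;

    // Process the half-open index range [begin, end).
    void operator()(int begin, int end) {
        assert(begin <= end);
        for (int k = begin; k < end; ++k)
            if (k % 2 == 0) found.push_back(k);
    }

    // Merge the right-hand partial result; splice moves nodes, no copies.
    void join(CollectEven& rhs) {
        found.splice(found.end(), rhs.found);
    }
};

// Sequential stand-in for what a reduction does with two sub-ranges.
std::list<int> collectInTwoHalves(int n) {
    CollectEven left, right;
    const int mid = n / 2;
    left(0, mid);      // first half
    right(mid, n);     // second half, as if run by another worker
    left.join(right);  // right.found is empty afterwards
    return left.found;
}
```

Note that the join in the post splices only t. If v and n also hold per-chunk results, they would presumably need the same splice, otherwise the results of the split-off bodies would be lost.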

parallel_reduce is used like this:

parallelReduceVersion *PR = new parallelReduceVersion(data, c, nPointers, xVPointers, yVPointers, zVPointers);
tbb::parallel_reduce(tbb::blocked_range(0,nz,round(nz/nbCores)), *PR);
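
One thing worth checking in the call above is the grain size. If nz and nbCores are both integers, nz/nbCores is an integer division, so round() has nothing left to round, and the result is 0 whenever nz is smaller than nbCores, whereas blocked_range expects a positive grain size. Below is a small sketch of a safer computation; the helper name grainFor is hypothetical and not a TBB function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: ceil(n / workers), clamped so it is never below 1.
std::size_t grainFor(std::size_t n, std::size_t workers) {
    assert(workers > 0);
    return std::max<std::size_t>(1, (n + workers - 1) / workers);
}
```

With 48 workers and, say, 100 slices along z, this gives a grain of 3 slices per chunk. If each slice costs only microseconds, chunks that small may well be dominated by scheduling overhead, which could be one reason a 7 ms job does not speed up.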

the () operator :

void parallelReduceVersion::operator()(const tbb::blocked_range& r)
for(k = r.begin() ; k < r.end() && k < mz ; k++)
for(j = 0 ; j < my ; j++)
for(i = 0 ; i < mx ; i++)
for(k = r.begin() ; k < r.end() && k < mz-1 ; k++)
for(j = 0 ; j < my-1 ; j++)
for(i = 0 ; i < mx-1 ; i++)

Do you think I am using TBB the right way to get the best performance?

1 Reply
Do you think it is a good idea to have plenty of methods in the class? Should I reduce the number of methods as much as possible to avoid function-call overhead?

I'm performing a lot of push_backs locally on several std::lists. When I comment out the push_backs I get better performance; however, I'm pretty sure I got similarly poor performance using classic arrays ... so dead end.
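
One possible explanation for the push_back cost is that std::list allocates one node per element, and with the default allocator those allocations may contend when several threads do them at once. Below is a sketch of the alternative, a std::vector reserved once up front. The Triangle record and the function name collectTriangles are hypothetical stand-ins for whatever the lists actually store.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result record standing in for the real list elements.
struct Triangle { int a, b, c; };

// Collect results into a vector whose storage is allocated once.
std::vector<Triangle> collectTriangles(std::size_t expectedCount) {
    std::vector<Triangle> out;
    out.reserve(expectedCount);  // one allocation instead of one per element
    for (std::size_t i = 0; i < expectedCount; ++i) {
        const int base = static_cast<int>(3 * i);
        out.push_back(Triangle{base, base + 1, base + 2});
    }
    assert(out.size() == expectedCount);
    return out;
}
```

The trade-off is that merging two vectors in join copies elements, whereas list splice does not. Whether this helps here is an assumption that would have to be confirmed by measurement.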