Hello,
I'm a new user of TBB, and I am trying to parallelize a sequential code so that it runs faster on a 48-core machine.
However, the parallelized version is much slower than the sequential one. I don't know whether this is because the data I use is too small, but I doubt it: I tried other data sets and always got very poor performance compared to the sequential code.
I implemented the code with TBB in two ways: with parallel_reduce and with parallel_for. I get slightly better performance with parallel_for, but it is still much slower than the sequential code.
Information about the two versions I implemented :
------------------------------------------------------------
For the parallel_reduce version, no data is shared except a few references (no arrays are copied), and the results are merged
in the join method (the merge costs almost nothing because I'm using std::list).
For the parallel_for version, many references to arrays are shared, so I have to use tbb::concurrent_vector to avoid conflicts.
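A minimal sketch of the merge step described above, using only the standard library. The `PartialResult` type and the `int` element type are hypothetical stand-ins (the real element types are not shown in this post); the point is only that `std::list::splice` relinks nodes instead of copying them, which is why the join is cheap.

```cpp
#include <cassert>
#include <list>

// Hypothetical stand-in for the results collected by one worker.
struct PartialResult {
    std::list<int> values;

    // Same shape as a TBB join(): take every node from `other`.
    // splice() relinks the nodes; no element is copied or moved.
    void join(PartialResult& other) {
        values.splice(values.end(), other.values);
    }
};

// Builds two partial results, joins them and returns the merged list.
std::list<int> mergeExample() {
    PartialResult left;
    PartialResult right;
    left.values = {1, 2, 3};
    right.values = {4, 5};
    left.join(right);
    assert(right.values.empty()); // splice leaves the source list empty
    return left.values;
}
```

Calling `mergeExample()` yields the elements of `left` followed by those of `right`, in order.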
Time spent :
----------------
I set the number of threads manually, and after a quick check I noticed that I get the best performance with only one thread (out of 48 cores!), but even with one thread the run is still much slower than the original sequential version.
Typically it takes 7 ms with the original sequential version, 17 ms with the parallel_reduce version and 12 ms with the parallel_for version.
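Since the timings above are only a few milliseconds, a single run can be noisy. Below is a small sketch of how such a measurement could be repeated, assuming the work to be timed can be wrapped in a callable; `medianMillis` is a hypothetical helper, not something from TBB or from my code.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Runs `work` `repeats` times and returns the median wall-clock time in
// milliseconds. Returns 0.0 if `repeats` is not positive.
template <typename Work>
double medianMillis(Work&& work, int repeats) {
    if (repeats <= 0) {
        return 0.0;
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    // The median is less sensitive to a slow first run than the mean.
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

It would be called as `medianMillis([&] { /* run one version */ }, 20)` for each of the three versions.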
Classes and code :
-------------------------
Here are some parts of the code to help you find the problem :
For the parallel_reduce version, this is my structure :
struct parallelReduceVersion
{
int mx;
int my;
int mz;
CenteredGrid3D &data;
const float c;
std::list<:MATRIX> > v;
std::list<:MATRIX> > n;
std::list t;
std::vector<:MATRIX>* > &nPointers;
std::vector<:MATRIX>* > &xVPointers;
std::vector<:MATRIX>* > &yVPointers;
std::vector<:MATRIX>* > &zVPointers;
/* optimization */
int x,y,z;
float a[8];
unsigned char lookupTableEntry;
unsigned char case;
unsigned char config;
unsigned char subConfig;
parallelReduceVersion(CenteredGrid3D &data, float c,
std::vector<:MATRIX>* > &nPointers,
std::vector<:MATRIX>* > &xVPointers,
std::vector<:MATRIX>* > &yVPointers,
std::vector<:MATRIX>* > &zVPointers)
: data(data),c(c),nPointers(nPointers),xVPointers(xVPointers),yVPointers(yVPointers),zVPointers(zVPointers)
{
mx = data.geometry().nx();
my = data.geometry().ny();
mz = data.geometry().nz();
}
parallelReduceVersion(const parallelReduceVersion& smc, tbb::split)
: data(smc.data), c(smc.c),nPointers(smc.nPointers),xVPointers(smc.xVPointers),yVPointers(smc.yVPointers),zVPointers(smc.zVPointers)
{
mx = smc.data.geometry().nx();
my = smc.data.geometry().ny();
mz = smc.data.geometry().nz();
}
/* methods */
.....
.....
.....
.....
void operator()(const tbb::blocked_range& r);
void join(parallelReduceVersion &smc)
{
v.splice(v.end(),smc.v);
n.splice(n.end(),smc.n);
t.splice(t.end(), smc.t);
}
};
parallel_reduce is used like this :
parallelReduceVersion *PR = new parallelReduceVersion(data, c, nPointers, xVPointers, yVPointers, zVPointers);
tbb::parallel_reduce(tbb::blocked_range(0,nz,round(nz/nbCores)), *PR);
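A note on the grain size in the call above, assuming `nz` and `nbCores` are both integers (their declarations are not shown here): `nz/nbCores` is then an integer division, so it is truncated before `round` sees it, and it becomes 0 whenever `nz` is smaller than `nbCores`. A `blocked_range` grain size must be positive, so a sketch of a safer computation could look like this; `grainSizeFor` is a hypothetical helper.

```cpp
#include <cassert>

// Ceiling division of nz by nbCores, never smaller than 1, so the value
// is always usable as a blocked_range grain size.
int grainSizeFor(int nz, int nbCores) {
    assert(nbCores > 0);
    if (nz <= 0) {
        return 1; // empty range: any positive grain size is fine
    }
    return (nz + nbCores - 1) / nbCores;
}
```

With this, 10 slices on 48 cores give a grain size of 1 rather than 0, and 100 slices give 3.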
The () operator :
void parallelReduceVersion::operator()(const tbb::blocked_range& r)
{
for(k = r.begin() ; k < r.end() && k < mz ; k++)
{
for(j = 0 ; j < my ; j++)
{
for(i = 0 ; i < mx ; i++)
{
// CODE
}
}
}
for(k = r.begin() ; k < r.end() && k < mz-1 ; k++)
{
for(j = 0 ; j < my-1 ; j++)
{
for(i = 0 ; i < mx-1 ; i++)
{
// CODE
}
}
}
}
Do you think I am using TBB the right way to get the best performance?
Thanks!
1 Reply
Ok....
Do you think it is a good idea to have many methods in the class? Should I limit the number of methods as much as possible to avoid function-call overhead?
I'm performing a lot of push_backs on several local std::lists. When I comment out the push_backs I get good performance; however, I'm pretty sure I got similarly poor performance using plain arrays ... so it's a dead end.