My application runs a simulation of gravity between particles of different masses in a 2D field. The simplest way to calculate this is to take each particle and compute the forces exerted on it by every one of the other particles.
Anyway, I have a loop that used to look like this:
[cpp]for(int i = 0; i < CANTIDAD; i++){
    for(int j = 0; j < CANTIDAD; j++) {
        if(oM[i] != oM[j]){
            oM[i]->calculaPhys(oM[j]);
            veces++;
        }
    }
}[/cpp]
I modified that code to use a parallel_for like this:
[cpp]void funcTest(objectMass *obj);

class ApplyPhys {
    objectMass *const my_Obj;
public:
    ApplyPhys( objectMass obj[] ) : my_Obj(obj) {}
    void operator()( const blocked_range<size_t>& r ) const {
        objectMass *obj = my_Obj;
        for( size_t i = r.begin(); i != r.end(); i++ ){
            funcTest(&obj[i]);
        }
    }
};

void funcTest(objectMass *obj){
    for (int i = 0; i < CANTIDAD; i++){
        if(obj != oM[i]){
            if(obj->mass != 0){
                veces++;
                obj->calculaPhys(oM[i]);
            }
            if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5 ){
                obj->mass += oM[i]->mass;
                oM[i]->mass = 0;
            }
        }
    }
}[/cpp]
[cpp]parallel_for(blocked_range<size_t>(0, CANTIDAD, 1), ApplyPhys(*oM), auto_partitioner());[/cpp]
Anyway, I'm getting worse performance when using parallel_for than when using the serial approach.
What could I do to increase performance?
task_scheduler_init init (2);
and
task_scheduler_init init (1);
So why is there so much overhead when using parallel_for compared with a simple nested for?
The trick with parallel programming is isolating activities between the threads so as to keep the workers from interfering with each other. It's hard to tell what interactions may be going on inside the physics function:
[cpp]obj->calculaPhys(oM[i]);[/cpp]
But clearly, when the code collapses the sticky masses:
[cpp]if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5 ){ obj->mass += oM[i]->mass; oM[i]->mass = 0; }[/cpp]
a thread that may have been limited by the blocked_range to a subset of objects has the potential to touch any object, even one that is itself being operated on by another worker thread. Such interference with each other's data means that the caches for the various workers are probably thrashing, inducing a lot more memory pressure than would be exerted in the single-threaded run. Perhaps you might invest some time in learning about dependence analysis.
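For what it's worth, one way to remove that kind of cross-thread write is to let each task accumulate results only into its own particle, treat every other particle as read-only during the force pass, and defer the sticky-mass merge to a separate, non-concurrent pass. A minimal sketch of that shape follows; it is not the poster's code, it assumes calculaPhys can be arranged to modify only the object it is called on, and the all/n parameters are hypothetical stand-ins for the global particle array and CANTIDAD.
[cpp]// Sketch only: each task calls this for exactly one particle it owns, so all writes
// go to 'obj'; every other particle is read but never written during the force pass.
void accumulateForces(objectMass *obj, objectMass *const all[], int n) {
    for (int i = 0; i < n; i++) {
        objectMass *other = all[i];
        if (other == obj || other->mass == 0) continue;
        obj->calculaPhys(other);   // assumed here to update only 'obj'
        // The sticky-mass merge (obj->mass += other->mass; other->mass = 0;) writes
        // to 'other', so it belongs in a separate pass that does not run concurrently.
    }
}[/cpp]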
So you see a 2x speedup with the TBB version on 2 threads over the TBB version on 1 thread, and the TBB version with 1 thread is much slower than your serial version, right? Then data sharing might be less of a problem at the moment; rather, some compiler optimizations might be getting missed.
FYI, I was poking around on another task and ran across this bit on the Wikipedia page regarding parallel algorithms:
Most of the available algorithms to compute Pi, on the other hand, can not be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three body problem, are also algorithms which are inherently serial.
If, as claimed, the three body problem is inherently serial, restricting it to 2 dimensions is probably not sufficient to overcome that limitation. Thinking about the problem for a minute: while it is true that at any particular instant you can compute the forces on each mass as separate calculations that could be assigned to separate HW threads, at some point the system state needs to be updated and every particle's position needs to be changed. The time step size needs to be small enough to meet some error bound in the approximations, meaning there's some serious serialization required. The solution steps would permit the force computations to occur in parallel and the object position updates to occur in parallel, but switching between these two states would require all the threads to synchronize and discard any old state about position and velocity. And if you're considering real masses with volumes, densities and individual angular momenta, the state update gets even more complicated.
Since forces decrease with inverse square distance, I suppose that even more scientifically oriented computations would likely either ignore far-away objects (if there are far-away objects all around to largely cancel each other out), or only communicate aggregated information assumed to remain constant (or at best to follow extrapolated paths) over a number of steps. A scientific simulation also has the luxury to go back in time and use the previous approximation to get a better one. Well, that's how I would do it, anyway. :-)
You could also identify clusters, work out the forces within each cluster, then treat the clusters as composite objects with centers of mass and compute composite forces. Inverse-squared attenuation would reduce the forces between the clusters and also their impact as suggested by Raf, so perhaps you could get by with computing the inter-cluster forces less frequently than the intra-cluster forces.
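If it helps to make that concrete, here is a small sketch of the composite-object idea; it is not code from this thread, and it assumes objectMass exposes public x, y and mass members, as the posted snippets suggest.
[cpp]// Hypothetical helper: summarize a cluster by its total mass and center of mass so
// the inter-cluster pass can treat the whole cluster as one composite body.
void summarizeCluster(objectMass *const cluster[], int n,
                      double &totalMass, double &comX, double &comY) {
    totalMass = comX = comY = 0.0;
    for (int i = 0; i < n; i++) {
        totalMass += cluster[i]->mass;
        comX += cluster[i]->x * cluster[i]->mass;   // mass-weighted position sums
        comY += cluster[i]->y * cluster[i]->mass;
    }
    if (totalMass != 0.0) { comX /= totalMass; comY /= totalMass; }
}[/cpp]
The less frequent inter-cluster pass would then apply the pairwise force calculation to these composites instead of to the individual particles.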
dreamthtr,
The following is a potential solution using QuickThread; I will leave it as an exercise for you to rework it into TBB.
[cpp]/* // single threaded code
for(int i = 0; i < CANTIDAD; i++){
  for(int j = 0; j < CANTIDAD; j++) {
    if(oM[i] != oM[j]){
      oM[i]->calculaPhys(oM[j]);
      veces++;
    }
  }
}
*/

// parallel task performing independent slices of volume
// no atomic operations required
void CalcPhysTask(size_t i, size_t j, size_t nVolumes)
{
  size_t stride = CANTIDAD / nVolumes;
  size_t iBegin = stride * i;
  size_t iEnd   = iBegin + stride;
  if((i+1) == nVolumes) iEnd = CANTIDAD;
  size_t jBegin = stride * j;
  size_t jEnd   = jBegin + stride;
  if((j+1) == nVolumes) jEnd = CANTIDAD;
  for(i = iBegin; i < iEnd; i++){
    for(size_t j = jBegin; j < jEnd; j++) {
      if(oM[i] != oM[j]){
        oM[i]->calculaPhys(oM[j]);
        veces++;
      }
    }
  }
}

void CalcPhys()
{
  size_t nThreads = qt_get_num_threads();
  // nVolumes = arbitrary tuning parameter
  // Determine an optimal number of volumes.
  // The processing loops are
  //   for(i=0; i < nVolumes; ++i)
  //     for(j=i; j < nVolumes; j++)
  //
  // When nVolumes-i becomes less than the number of threads,
  // some cores become under-utilized.
  // However, if you create too many volumes,
  // you incur excessive task-enqueuing overhead.
  // This will have to be experimented with, depending on the system.
  size_t nVolumes = CANTIDAD / nThreads / 128;
  if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads / 16;
  if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads / 4;
  if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads;
  if(nVolumes < nThreads) {
    if(CANTIDAD <= 1) return;
    nVolumes = nThreads = CANTIDAD;
  }
  size_t i, j;
  for(i = 0; i < nVolumes; ++i) {
    { // scope of qtControl
      qtControl qtControl;
      for(j = i; j < nVolumes; j++) {
        parallel_task(&qtControl, CalcPhysTask, i, (i+j)%nVolumes, nVolumes);
      }
    } // end scope of qtControl (synchronization point)
  }
}[/cpp]
The general idea is to partition the iteration space into "volumes" (not geometric volumes, but rather subscript-range volumes), and then permute over the volumes in a manner that avoids collisions with other threads. By using this technique you can avoid the requirement of using atomic additions into the objects (i.e., into the gravitational acceleration/force components of the objects).
Jim Dempsey
Please excuse the typos (sorry about that).
Jim
I also think that the use of QuickThread to describe a solution on the TBB forum (leaving the TBB solution as an exercise to the reader), where that QT implementation offers no unique advantage (over, say, a TBB parallel_invoke), is gratuitous.
It turns out there is some literature on the subject, and several of the methods use ideas similar to the ones we've been kicking around here. For example, the Barnes-Hut algorithm uses an octree subdivision process to localize the force interactions. However, most of these methods seem to be focused on reducing the n² complexity rather than on partitioning for parallelization. Barnes-Hut, for example, subdivides until there's only a single mass in each partition and repartitions for every time step.
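For what it's worth, the usual way such codes decide when a distant group can be replaced by its aggregate is an opening-angle test; the sketch below is standard Barnes-Hut practice rather than anything from this thread, and the names are hypothetical.
[cpp]#include <cmath>

// Treat a whole cell/cluster as a single point mass at its center of mass when it is
// far enough away, i.e. when the cell's size subtends a small enough angle (theta)
// as seen from the particle at (px, py).
bool farEnough(double cellSize, double cellCx, double cellCy,
               double px, double py, double theta /* e.g. 0.5 */) {
    double dx = cellCx - px, dy = cellCy - py;
    double dist = std::sqrt(dx * dx + dy * dy);
    return dist > 0.0 && (cellSize / dist) < theta;   // far: use the aggregate mass
}[/cpp]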
Robert,
Some cleaned-up code and additional comments:
(BTW, someone should fix this forum's C++ code-paste feature so that it handles less-than signs.)
[cpp]// parallel task performing independent slices of volume
// no atomic operations required
void CalcPhysTask(size_t i, size_t j, size_t nVolumes)
{
  size_t stride = CANTIDAD / nVolumes;
  size_t iBegin = stride * i;
  size_t iEnd   = iBegin + stride;
  if((i+1) == nVolumes) iEnd = CANTIDAD;
  size_t jBegin = stride * j;
  size_t jEnd   = jBegin + stride;
  if((j+1) == nVolumes) jEnd = CANTIDAD;
  for(i = iBegin; i < iEnd; i++){
    for(size_t j = jBegin; j < jEnd; j++) {
      if(oM[i] != oM[j]){
        oM[i]->calculaPhys(oM[j]);
        veces++;
      }
    }
  }
}

void CalcPhys()
{
  size_t nThreads = qt_get_num_threads();
  // Decide on an optimal number of volumes.
  //
  // 4 threads, 4 fixed partitions
  //   p1   p2   p3   p4
  //   0,0  0,1  0,2  0,3
  //        1,1  1,2  1,3
  //             2,2  2,3
  //                  3,3
  // utilization 10/16 (62.5% - 10 x task schedules)
  //
  // 4 threads, 8 fixed partitions
  //   p1   p2   p3   p4   p5   p6   p7   p8
  //   0,0  0,1  0,2  0,3  0,4  0,5  0,6  0,7
  //        1,1  1,2  1,3  1,4  1,5  1,6  1,7
  //             2,2  2,3  2,4  2,5  2,6  2,7
  //                  3,3  3,4  3,5  3,6  3,7
  // utilization 26/32 (81.25% - 24 x task schedules)
  //
  // 4 threads, 16 fixed partitions
  //   ...
  // utilization 58/64 (90.625% - 58 x task schedules)
  //
  // 4 threads, 32 fixed partitions
  //   ...
  // utilization 122/128 (95.3125% - 122 x task schedules)
  //
  // An alternate way is to use variable-sized partitions:
  // 4 threads, 4 variable partitions (3 splits)
  //   p1              p2              p3             p4   p5   p6   p7
  //   0,0  split 0:2  A,a  split 0:1  A,a  split 0   A,a  A,b  A,c  A,d
  //   1,1  into A:D   B,b  into A:D   B,b  into A:D  B,b  B,c  B,d
  //   2,2  split 1:3  C,c  split 2:3  C,c  split 3   C,c  C,d
  //   3,3  into a:d   D,d  into a:d   D,d  into a:d  D,d
  // utilization 58/64 (90.625% - 22 x task schedules)
  // Note the reduction in the proportion of task schedules.
  //
  // The technique below uses fixed partitions.
  // Experiment with this:
  size_t nVolumes = nThreads * 8;
  size_t i, j;
  for(i = 0; i < nVolumes; ++i) {
    { // scope of qtControl
      qtControl qtControl;
      for(j = 0; ((j < nThreads) && ((i+j) < nVolumes)); j++) {
        parallel_task(&qtControl, CalcPhysTask, j, i+j, nVolumes);
      }
    } // end scope of qtControl (synchronization point)
  }
}[/cpp]
Additional notes:
calculaPhys is (should be) an interaction between two particles (it could be Coulomb or gravitational).
The above technique produces all particle-to-particle interactions. You can add optimizations such that when the particles are far away you do something different (to reduce the number of interaction calculations).
When the particles are sparsely distributed, as gravitational bodies are, you can collect nearby objects into barycenters and then use the effective aggregate mass at the barycenter; e.g., using the barycenter of Jupiter plus its moons may be sufficient for calculating its gravitational effects on an Earth-orbiting satellite. For evenly distributed particles (charged particles in a gas), a different technique can be applied to reduce the number of interaction calculations and yet maintain a specific level of accuracy in the calculations.
Jim Dempsey
dreamthtr,
Here is an OpenMP version using the diminishing-volumes technique (read the comments following the code sample); as always, test the code before assuming it is correct. You can rework this into TBB.
[cpp]#pragma omp parallel reduction(+: veces)
{
  int nThreads    = omp_get_num_threads();
  int myThreadNum = omp_get_thread_num();
  size_t skew = 0;
  while(skew < CANTIDAD)
  {
    size_t iBegin;
    size_t iEnd;
    size_t jBegin;
    size_t jEnd;
    size_t range = (CANTIDAD - skew) / nThreads;
    if(range == 0) range = 1;
    iBegin = range * myThreadNum;
    iEnd   = iBegin + range;
    if(iEnd > CANTIDAD) iEnd = CANTIDAD;
    jBegin = iBegin + skew;
    jEnd   = iEnd + skew;
    if(jEnd > CANTIDAD)
    {
      iEnd -= jEnd - CANTIDAD;
      jEnd  = CANTIDAD;
    }
    for(size_t i = iBegin; i < iEnd; i++){
      for(size_t j = jBegin; j < jEnd; j++)
      {
        if(oM[i] != oM[j]){
          oM[i]->calculaPhys(oM[j]);
          veces++;
        }
      }
    }
#pragma omp barrier
    // every thread computes the same range from the same skew,
    // so each thread advances its private copy of skew identically
    skew += range;
#pragma omp barrier
  } // while(skew < CANTIDAD)
}[/cpp]
An important note: the above will work well when all of the threads scheduled for execution are actually available.
There are several factors that can interfere with this.
a) Your application (process) is sharing the system with other applications.
b) Your application may have additional threads competing for available cores (HW threads)
c) You are on a NUMA system and some of the objects are in near memory while others are in distant memory
d) Thermal throttling of CPU (or parts thereof) may vary the runtime of some of the threads
When any interference occurs, the blockage time at the barriers will increase.
This can be mitigated by adding code to increase the partitioning from nThreads to some larger number (e.g., nThreads * 128). This is a trade-off between increasing the number of iterations and reducing the time wasted at the barriers.
Jim Dempsey
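A rough TBB analogue of that trade-off is sketched below; it reuses the ApplyPhys functor from the original post, and it assumes both that the functor body has been made race-free and that oM is a plain array of objectMass. Over-decomposing the range with a small, experimentally tuned grainsize and a simple_partitioner creates many more chunks than threads, so idle workers can steal work when another thread is delayed instead of waiting at a barrier. (This only addresses load balancing, not the write conflicts discussed earlier.)
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

// Sketch only: many more chunks than threads lets TBB's work stealing absorb
// interference from other processes, NUMA effects, or throttling.
void runForcePass(objectMass oM[]) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, CANTIDAD, /*grainsize*/ 64),
                      ApplyPhys(oM),
                      tbb::simple_partitioner());
}[/cpp]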
Close, but no cigar. This latest reformulation of the blocking does reduce the potential races but it does not eliminate them.
[cpp]// p1   p2   p3   p4   p5   p6   p7   p8
// 0,0  0,1  0,2  0,3  0,4  0,5  0,6  0,7
//      1,1  1,2  1,3  1,4  1,5  1,6  1,7
//           2,2  2,3  2,4  2,5  2,6  2,7
//                3,3  3,4  3,5  3,6  3,7
// utilization 26/32 (81.25% - 24 x task schedules)[/cpp]
In the first pass of this modified algorithm, the individual partitions are truly disjoint and concurrent calculations can proceed without any cross-thread interactions. However, the same is not true of the second pass, ((0,1),(1,2),(2,3),(3,4)). In this case we have the potential for threads competing for access to masses on either side of the calculaPhys call. Likewise in passes 3 and 4: given Jim's assumption that calculaPhys updates status on both masses, there are still a lot of potential races even with this organization.
There's also the underlying concern about how these updates get staged so that early updates on one time step avoid affecting later updates on the same time step as position/velocity/acceleration changes propagate through the array. But this is an issue with the original calculaPhys, whose details have not been shared in this thread but whose behavior can be inferred from the context.
There exists in fact a cute way to compute the N² interactions that avoids races, but the solution is not completely trivial. I have written a short article at our blog at Cilk Arts that explains this technique. See
http://www.cilk.com/multicore-blog/bid/9328/A-cute-technique-for-avoiding-certain-race-conditions
Ah, yes! That's a very interesting approach, with lots of recursive parallelism. There's just enough synchronization to avoid the races. It seems you should be able to do roughly the same thing in TBB using a two-function parallel_invoke in interact() and rect() (which would impose a few more synchronization points, so it might not scale quite as well). There is a bit of overkill in rect(). Since it starts with a square and always partitions equally in x and y, all the triangles should be isosceles right triangles and all the rectangles should be square, meaning that the else condition should only occur when di == dj == 1, thus obviating the need for the nested fors.
You're right. It's a very cute technique. Thanks.
Actually, rect() starts with a square, but things become rectangular after the first cut if the square has an odd size. So you need to check both di and dj. (E.g., the base case may end up being 1x2 or 2x1.)
Oh, of course! You end up with two (on-diagonal) squares, one smaller than the other and two (off-diagonal) rectangles whose shapes are transposes of each other. Thanks for clarifying that.
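To make the preceding exchange concrete, here is a rough sketch of the recursive decomposition using tbb::parallel_invoke, along the lines Robert describes; interact() and the CUTOFF grain size are placeholders rather than code from this thread or from the linked article, and interact(i, j) is assumed to update both particles. Within each parallel_invoke pair the index sets are disjoint, which is what avoids the races, and the base case handles the non-square (e.g., 1x2) blocks Matteo mentions.
[cpp]#include <cstddef>
#include "tbb/parallel_invoke.h"

void interact(size_t i, size_t j);      // assumed: symmetric update of particles i and j

const size_t CUTOFF = 32;               // grain size below which plain loops are used

// All cross pairs (i, j) with i in [ilo, ihi) and j in [jlo, jhi); the two index
// ranges are disjoint by construction.
void rect(size_t ilo, size_t ihi, size_t jlo, size_t jhi) {
    size_t di = ihi - ilo, dj = jhi - jlo;
    if (di <= CUTOFF && dj <= CUTOFF) {                 // base case, possibly non-square
        for (size_t i = ilo; i < ihi; ++i)
            for (size_t j = jlo; j < jhi; ++j)
                interact(i, j);
        return;
    }
    size_t im = ilo + di / 2, jm = jlo + dj / 2;
    // Opposite quadrants touch disjoint particle sets, so each pair may run in
    // parallel; the two pairs run one after the other.
    tbb::parallel_invoke([=]{ rect(ilo, im, jlo, jm); }, [=]{ rect(im, ihi, jm, jhi); });
    tbb::parallel_invoke([=]{ rect(ilo, im, jm, jhi); }, [=]{ rect(im, ihi, jlo, jm); });
}

// All pairs (i, j) with lo <= i < j < hi.
void triangle(size_t lo, size_t hi) {
    if (hi - lo < 2) return;                            // fewer than two particles: no pairs
    if (hi - lo <= CUTOFF) {
        for (size_t i = lo; i < hi; ++i)
            for (size_t j = i + 1; j < hi; ++j)
                interact(i, j);
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    // The two half-triangles touch disjoint particle sets and can run in parallel;
    // the cross interactions between the halves are handled afterwards by rect().
    tbb::parallel_invoke([=]{ triangle(lo, mid); }, [=]{ triangle(mid, hi); });
    rect(lo, mid, mid, hi);
}[/cpp]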
Should my suspicions have any merit, I would expect the main concern to instead be how to recycle data between tasks in the same thread. That is, perhaps just subdividing down the middle of the longest side in blocked_range2d might not be optimal in this case; maybe it should be a blocked_range on only one dimension anyway, with a carefully chosen strip width (based on cache size, not on the number of elements and cores, i.e., using a simple_partitioner), evaluated by sweeping a perpendicular row/column of accumulators along its length.
So, am I mistaken, or not really, or really not? :-)