Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

New to Intel TBB - Question regarding performance.

dreamthtr
Beginner
Hi, I'm new to Intel TBB and am trying to get more performance out of my app.

My application runs a simulation of gravity between particles of different mass in a 2D field. The simplest way to calculate this is to take each particle and calculate the forces that are exerted on it by every one of the other particles.

Anyway, I have a loop that used to look like this:

[cpp]// note: the loop bounds and the [i]/[j] subscripts below were stripped by
// the forum's code paste; restored here as presumably intended
for(int i = 0; i < CANTIDAD; i++){
    for(int j = 0; j < CANTIDAD; j++)
    {
        if(oM[i] != oM[j]){
            oM[i]->calculaPhys(oM[j]);
            veces++;
        }
    }
}[/cpp]


I modified that code to use a parallel_for like this:

[cpp]class ApplyPhys {
    objectMass *const my_Obj;
public:
    ApplyPhys( objectMass obj[] ) : my_Obj(obj) {}
    void operator()( const blocked_range<size_t>& r ) const {
        objectMass *obj = my_Obj;
        for( size_t i = r.begin(); i != r.end(); i++ ){
            funcTest(&obj[i]);
        }
    }
};

// note: the loop bound and the [i] subscripts below were stripped by the
// forum's code paste; restored here as presumably intended
void funcTest(objectMass * obj){
    for (int i = 0; i < CANTIDAD; i++){
        if(obj != oM[i]){
            if(obj->mass != 0){
                veces++;
                obj->calculaPhys(oM[i]);
            }
            if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){
                obj->mass += oM[i]->mass;
                oM[i]->mass = 0;
            }
        }
    }
}[/cpp]



[cpp]parallel_for(blocked_range<size_t>(0, CANTIDAD, 1), ApplyPhys(*oM), auto_partitioner());[/cpp]



Anyway, I'm getting worse performance using parallel_for than when using a serial approach.

What could I do to increase performance?
dreamthtr
Beginner
Actually, I can see an improvement of 2X when using:

task_scheduler_init init(2);

versus

task_scheduler_init init(1);

So why is there so much overhead when using parallel_for compared with a simple nested for?

robert-reed
Valued Contributor II
Quoting - dreamthtr
So why is there so much overhead when using parallel_for compared with a simple nested for?

The trick with parallel programming is isolating activities between the threads so as to keep the workers from interfering with each other. It's hard to tell what interactions may be going on inside the physics function:

[cpp]obj->calculaPhys(oM[i]);[/cpp]

But clearly when the code collapses the sticky masses:

[cpp]if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){
    obj->mass += oM[i]->mass;
    oM[i]->mass = 0;
}[/cpp]

A thread that may have been limited by the blocked_range to a subset of objects has the potential for touching any object, even if that object is itself being operated upon by another worker thread. Such interference with each other's data means that the caches for the various workers are probably thrashing about, inducing a lot more memory pressure than might be exerted in the single-threaded run. Perhaps you might invest some time learning about dependence analysis.
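For the sticky-mass merge quoted above, one blunt way to make just that update safe is to serialize it behind a lock, at the cost of some contention; the force updates inside calculaPhys would need the same kind of care. A minimal sketch, assuming only the fields the original code uses:

[cpp]#include <cmath>
#include "tbb/spin_mutex.h"

// Minimal stand-in for the poster's objectMass; only the fields used here.
struct objectMass { double x, y, mass; };

static tbb::spin_mutex mergeMutex;   // one coarse lock guarding all merges

void mergeIfClose(objectMass* obj, objectMass* other) {
    if (std::fabs(obj->x - other->x) < 0.5 && std::fabs(obj->y - other->y) < 0.5) {
        tbb::spin_mutex::scoped_lock lock(mergeMutex);   // serialize the merge
        obj->mass += other->mass;
        other->mass = 0;
    }
}[/cpp]

A single global lock serializes every merge, so it only pays off if merges are rare compared to the force calculations; restructuring the work so threads do not touch each other's data scales better.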

dreamthtr
Beginner
Yes, you are right: in the physics function I'm accessing the variables of the object, and probably that same object is being calculated in another thread.

Quoting - Robert Reed (Intel)

[snip]
Alexey-Kukanov
Employee
Quoting - dreamthtr

[snip]


So you see a 2x speedup of the TBB version with 2 threads over the TBB version with 1 thread, and the TBB version with 1 thread is much slower than your serial version, right? Then data sharing might be less of a problem at the moment; rather, some compiler optimizations might be getting missed.
robert-reed
Valued Contributor II
Quoting - dreamthtr
Yes, you are right: in the physics function I'm accessing the variables of the object, and probably that same object is being calculated in another thread.

FYI, I was poking around on another task and ran across this bit on the Wikipedia page regarding parallel algorithms:

Most of the available algorithms to compute Pi, on the other hand, can not be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three body problem, are also algorithms which are inherently serial.

If, as claimed, the three-body problem is inherently serial, restricting it to 2 dimensions is probably not sufficient to overcome that limitation. Thinking about the problem for a minute: while it is true that at any particular instant you can compute the forces on each mass as separate calculations that could be assigned to separate HW threads, at some point the system state needs to be updated and every particle's position needs to be changed. Time step size needs to be small enough to meet some error bound in the approximations, meaning there's some serious serialization required. The solution steps would permit the force computations to occur in parallel and the object position updates to occur in parallel, but switching between these two states would require all the threads to synchronize and discard any old state about position and velocity. And if you're considering real masses with volumes, densities and individual angular momenta, state update gets even more complicated.
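To make that structure concrete, one common shape for a time step is two parallel passes, with the return from each parallel_for acting as the synchronization point between the force phase and the update phase; each task writes only the particles in its own range and treats every other particle as read-only. This is only a rough sketch with a hypothetical Particle layout, not the original poster's classes:

[cpp]#include <cmath>
#include <cstddef>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using std::size_t;

// Hypothetical particle layout for illustration only.
struct Particle { double x, y, vx, vy, mass, fx, fy; };

// Phase 1: accumulate forces. Each task writes only p[i] for i in its own
// range; every other particle is read, never written, so there are no races.
class AccumulateForces {
    Particle* const p; const size_t n;
public:
    AccumulateForces(Particle* p_, size_t n_) : p(p_), n(n_) {}
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            p[i].fx = p[i].fy = 0.0;
            for (size_t j = 0; j < n; ++j) {
                if (j == i || p[j].mass == 0.0) continue;
                double dx = p[j].x - p[i].x, dy = p[j].y - p[i].y;
                double d2 = dx * dx + dy * dy + 1e-9;   // softening term
                double f  = p[i].mass * p[j].mass / d2; // G omitted for brevity
                double d  = std::sqrt(d2);
                p[i].fx += f * dx / d;
                p[i].fy += f * dy / d;
            }
        }
    }
};

// Phase 2: integrate. Again, each task touches only its own particles.
class UpdatePositions {
    Particle* const p; const double dt;
public:
    UpdatePositions(Particle* p_, double dt_) : p(p_), dt(dt_) {}
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            if (p[i].mass == 0.0) continue;
            p[i].vx += dt * p[i].fx / p[i].mass;
            p[i].vy += dt * p[i].fy / p[i].mass;
            p[i].x  += dt * p[i].vx;
            p[i].y  += dt * p[i].vy;
        }
    }
};

void simulation_step(Particle* p, size_t n, double dt) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n), AccumulateForces(p, n));
    // parallel_for does not return until every task has finished: this is the
    // synchronization point between the force phase and the update phase.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n), UpdatePositions(p, dt));
}[/cpp]

The price is that every pairwise force is computed twice (once from each side), and the sticky-mass merge from the original code would still need separate treatment, since it mutates two particles at once.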

robert_jay_gould
Beginner
Quoting - Robert Reed (Intel)

[snip]

I'm assuming your goal here is accuracy, in which case, as Robert Reed points out, true parallelization will be hard/impossible(?). However, we do this sort of calculation all the time in games (where we can be quite sloppy with our simulations), and the standard solution is to break space up into a grid/hash and calculate a step for each sub-system independently. When calculated in this fashion, parallelization is trivial. Of course it's not scientifically accurate (at all), but good enough for entertainment purposes.
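For what it's worth, the bucketing itself is only a few lines; each cell (plus its neighbours, if you need cross-cell forces) can then be stepped as its own task. A rough sketch, assuming coordinates lie in [0, worldSize) and a hypothetical Particle type:

[cpp]#include <cmath>
#include <cstddef>
#include <vector>

using std::size_t;

// Hypothetical particle type for illustration only.
struct Particle { double x, y, mass; };

// Bucket particle indices into a uniform grid over a square field.
// Each cell becomes an independent sub-system that can be updated by its
// own task, which is what makes the parallelization trivial.
std::vector<std::vector<size_t> >
buildGrid(const std::vector<Particle>& p, double worldSize, double cellSize)
{
    size_t side = (size_t)std::ceil(worldSize / cellSize);
    std::vector<std::vector<size_t> > cells(side * side);
    for (size_t i = 0; i < p.size(); ++i) {
        size_t cx = (size_t)(p[i].x / cellSize);
        size_t cy = (size_t)(p[i].y / cellSize);
        if (cx >= side) cx = side - 1;   // clamp particles on the far edge
        if (cy >= side) cy = side - 1;
        cells[cy * side + cx].push_back(i);
    }
    return cells;
}[/cpp]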

RafSchietekat
Valued Contributor III
"it's not scientifically accurate"
Since forces decrease with inverse square distance, I suppose that even more scientifically oriented computations would likely either ignore far-away objects (if there are far-away objects all around to largely cancel each other out), or only communicate aggregated information assumed to remain constant (or at best to follow extrapolated paths) over a number of steps. A scientific simulation also has the luxury to go back in time and use the previous approximation to get a better one. Well, that's how I would do it, anyway. :-)
robert-reed
Valued Contributor II
Quoting - Raf Schietekat
"it's not scientifically accurate"
[snip]

You could also identify clusters, work out the forces within each cluster, then treat the clusters as composite objects with centers of mass and compute composite forces. Inverse-squared attenuation would reduce the forces between the clusters and also their impact as suggested by Raf, so perhaps you could get by with computing the inter-cluster forces less frequently than the intra-cluster forces.
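Building the composite is cheap: it is just the cluster's total mass placed at its centre of mass. A small sketch, assuming a hypothetical Particle type and a cluster given as a list of member indices:

[cpp]#include <cstddef>
#include <vector>

using std::size_t;

// Hypothetical particle type for illustration only.
struct Particle { double x, y, mass; };

// Collapse a cluster (given by member indices) into one composite point mass
// located at the cluster's centre of mass. Far-apart clusters can then
// interact through these composites instead of particle by particle.
Particle makeComposite(const std::vector<Particle>& p,
                       const std::vector<size_t>& members)
{
    Particle c = { 0.0, 0.0, 0.0 };
    for (size_t k = 0; k < members.size(); ++k) {
        const Particle& m = p[members[k]];
        c.x    += m.mass * m.x;
        c.y    += m.mass * m.y;
        c.mass += m.mass;
    }
    if (c.mass > 0.0) {   // guard against an all-zero-mass cluster
        c.x /= c.mass;
        c.y /= c.mass;
    }
    return c;
}[/cpp]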

jimdempseyatthecove
Honored Contributor III

dreamthtr,

The following is a potential solution using QuickThread; I will leave it as an exercise for you to rework it into TBB.

[cpp]/*
// single threaded code
for(int i = 0; i < CANTIDAD; i++){
  for(int j = 0; j < CANTIDAD; j++)
  {
     if(oM[i] != oM[j]){
        oM[i]->calculaPhys(oM[j]);
        veces++;
     }
  }
}
*/

// parallel task performing independent slices of volume
// no atomic operations required
void CalcPhysTask(size_t i, size_t j, size_t nVolumes)
{
	size_t	stride = CANTIDAD / nVolumes;
	size_t	iBegin = stride * i;
	size_t	iEnd = iBegin + stride;
	if((i+1) == nVolumes) iEnd = CANTIDAD;
	size_t	jBegin = stride * j;
	size_t	jEnd = jBegin + stride;
	if((j+1) == nVolumes) jEnd = CANTIDAD;
	for(i = iBegin; i < iEnd; i++)
	{
	  for(j = jBegin; j < jEnd; j++)
	  {
	     if(oM[i] != oM[j]){
	        oM[i]->calculaPhys(oM[j]);
	        veces++;
	     }
	  }
	}
}

void CalcPhys()
{
	size_t	nThreads = qt_get_num_threads();
	// nVolumes = arbitrary tuning parameter
	// Determine an optimal number of volumes
	// The processing loops are
	// for(i=0; i < nVolumes; ++i)
	//		for(j=i; j < nVolumes; j++)
	//
	// When nVolumes-i becomes less than the number of threads
	// then some cores become under-utilized
	// However, if you create too many volumes
	// then you incur excessive task enqueuing overhead
	// This will have to be experimented with, depending on the system
	size_t	nVolumes = CANTIDAD / nThreads / 128;
	if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads / 16;
	if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads / 4;
	if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads;
	if(nVolumes < nThreads)
	{
		if(CANTIDAD <= 1)
			return;
		nVolumes = nThreads = CANTIDAD;
	}
	size_t	i, j;
	for(i=0; i < nVolumes; ++i)
	{
		{ // scope of qtControl
			qtControl	qtControl;
			for(j=i; j < nVolumes; j++)
			{
				parallel_task(&qtControl, CalcPhysTask, i, (i+j)%nVolumes, nVolumes);
			}
		} // end scope of qtControl (synchronization point)
	}
}
[/cpp]

(The forum's paste-C++-code feature has problems with the left angle bracket.)

The general idea is to partition the iteration space into "volumes" (not geometric volumes, but rather subscript-range volumes), then permute over the volumes in a manner that avoids collisions with other threads. By using this technique you can avoid the need for atomic additions into the objects (i.e. the gravitational acceleration/force components of the object).

Jim Dempsey
jimdempseyatthecove
Honored Contributor III

Please excuse the typos (sorry about that).

Jim
robert-reed
Valued Contributor II
The following is a potential solution using QuickThread; I will leave it as an exercise for you to rework it into TBB.


[snip]


The general idea is to partition the iteration space into "volumes" (not geometric volumes, but rather subscript-range volumes), then permute over the volumes in a manner that avoids collisions with other threads. By using this technique you can avoid the need for atomic additions into the objects (i.e. the gravitational acceleration/force components of the object).
As I understand this suggestion, the idea here is to band the interactions into regions of the array to provide some separation of the dependencies. However, what's going on inside calculaPhys? As I interpret this "qt" stuff, it appears to me that the "parallel_task" call is launching multiple workers, all working on a particular i-band, each managing the calculations for a separate j-band. The j-bands may all be read-only but multiple threads all working on the same i-band are going to have the same problems with update that the original parallel implementation would have. Some sort of locking or atomic update will still be required to manage the accesses of the workers operating on a particular oM.

I also think that the use of QuickThreads to describe a solution on the TBB forum (leaving the TBB solution as an exercise to the reader) where that QT implementation offers no unique advantage (over, say a TBB parallel_invoke) is gratuitous.
robert-reed
Valued Contributor II

Turns out there is some literature on the subject, and several of the methods use ideas similar to the ones we've been kicking around here. For example, the Barnes-Hut algorithm uses an octree subdivision process to localize the force interactions. However, most of these methods seem to be focused on reducing the O(n^2) complexity rather than partitioning for parallelization. Barnes-Hut, for example, subdivides until there's only a single mass in each partition and repartitions for every time step.

jimdempseyatthecove
Honored Contributor III

Robert,

Some cleaned-up code and additional comments
(BTW, someone should fix this forum's paste-C++-code feature so that it handles less-than signs)

[cpp]// parallel task performing independent slices of volume
// no atomic operations required
void CalcPhysTask(size_t i, size_t j, size_t nVolumes)
{
	size_t	stride = CANTIDAD / nVolumes;
	size_t	iBegin = stride * i;
	size_t	iEnd = iBegin + stride;
	if((i+1) == nVolumes) iEnd = CANTIDAD;
	size_t	jBegin = stride * j;
	size_t	jEnd = jBegin + stride;
	if((j+1) == nVolumes) jEnd = CANTIDAD;
	for(i = iBegin; i < iEnd; i++)
	{
	  for(j = jBegin; j < jEnd; j++)
	  {
	     if(oM[i] != oM[j]){
	        oM[i]->calculaPhys(oM[j]);
	        veces++;
	     }
	  }
	}
}

void CalcPhys()
{
	size_t	nThreads = qt_get_num_threads();
	// Decide on optimal number of volumes
	//
	// 4 threads, 4 fixed partitions
	// p1   p2   p3   p4
	// 0,0  0,1	 0,2  0,3
	// 1,1  1,2  1,3  
	// 2,2  2,3  
	// 3,3  
	// utilization 10/16 (62.5% - 10 x task schedules)
	//
	// 4 threads, 8 fixed partitions
	// p1   p2   p3   p4   p5   p6   p7   p8
	// 0,0  0,1	 0,2  0,3  0,4  0,5  0,6  0,7
	// 1,1  1,2  1,3  1,4  1,5  1,6  1,7
	// 2,2  2,3  2,4  2,5  2,6  2,7
	// 3,3  3,4  3,5  3,6  3,7
	// utilization 26/32 (81.25% - 24 x task schedules)
	//
	// 4 threads, 16 fixed partitions
	// ...
	// utilization 58/64 (90.625% - 58 x task schedules)
	// 4 threads, 32 fixed partitions
	// ...
	// utilization 122/128 (95.3125% - 122 x task schedules)
	//
	// An alternate way is to use variable sized partitions
	// 4 threads, 4 variable partitions (3 splits)
	// p1             p2              p3              p4   p5   p6   p7
	// 0,0  split 0:2 A,a  split 0:1  A,a  split 0    A,a  A,b  A,c  A,d
	// 1,1  into  A:D B,b  into  A:D  B,b  into  A:D  B,b  B,c  B,d
	// 2,2  split 1:3 C,c  split 2:3  C,c  split 3    C,c  C,d
	// 3,3  into  a:d D,d  into  a:d  D,d  into  a:d  D,d  
	// utilization 58/64 (90.625% - 22 x task schedules)
	// Note, reduction in proportion of task schedules
	// below technique uses fixed partitions
	// Experiment with this
	size_t	nVolumes = nThreads * 8;
	size_t	i,j;
	for(i=0; i < nVolumes; ++i)
	{
		{ // scope of qtControl
			qtControl	qtControl;
			for(j=0; ((j < nThreads) && ((i+j) < nVolumes)); j++)
			{
				parallel_task(&qtControl, CalcPhysTask, j, i+j, nVolumes);
			}
		} // end scope of qtControl (synchronization point)
	}
}

[/cpp]

Additional notes,

calculaPhys is (should be) an interaction between two particles (could be Coulomb or gravitational).

The above technique produces all particle-to-particle interactions. You can add optimizations such that when the particles are far away you do something different (to reduce the number of interaction calculations).

When the particles are sparsely distributed, such as gravitational bodies, you can collect nearby objects into barycenters and then use the effective aggregate mass at the barycenter; e.g. using the barycenter of Jupiter plus its moons may be sufficient for calculating its gravitational effect on an Earth-orbiting satellite. For evenly distributed particles (charged particles in a gas), a different technique can be applied to reduce the number of interaction calculations and yet maintain a specific level of accuracy in the calculations.


Jim Dempsey
jimdempseyatthecove
Honored Contributor III

dreamthtr,

Here is an OpenMP version using the diminishing-volumes technique (read the comments following the code sample); as always, test the code before assuming it is correct. You can rework this into TBB.

[cpp]#pragma omp parallel reduction(+: veces)
{
	int nThreads = omp_get_num_threads();
	int myThreadNum = omp_get_thread_num();
	size_t	skew = 0;
	while(skew < CANTIDAD)
	{
		size_t	iBegin;
		size_t	iEnd;
		size_t	jBegin;
		size_t	jEnd;
		size_t	range = (CANTIDAD - skew) / nThreads;
		if(range == 0) range = 1;
		iBegin = range * myThreadNum;
		iEnd = iBegin + range;
		if(iEnd > CANTIDAD) iEnd = CANTIDAD;
		jBegin = iBegin + skew;
		jEnd = iEnd + skew;
		if(jEnd > CANTIDAD)
		{
			iEnd -= jEnd - CANTIDAD;
			jEnd = CANTIDAD;
		}
		for(size_t i = iBegin; i < iEnd; i++)
		{
		  for(size_t j = jBegin; j < jEnd; j++)
		  {
		     if(oM[i] != oM[j]){
		        oM[i]->calculaPhys(oM[j]);
		        veces++;
		     }
		  }
		}
#pragma omp barrier
		// skew is private to each thread; all threads compute the same
		// range, so each advances its own copy in lockstep
		skew += range;
#pragma omp barrier
	} // while(skew < CANTIDAD)
}[/cpp]

An important note to make is that the above will work well when all the threads scheduled for execution are available.
There are several factors that can interfere with this.

a) Your application (process) is sharing the system with other applications.
b) Your application may have additional threads competing for available cores (HW threads)
c) You are on a NUMA system and some of the objects are in near memory while others are in distant memory
d) Thermal throttling of CPU (or parts thereof) may vary the runtime of some of the threads

When any interference occurs, the blockage time at the barriers will increase.

This can be mitigated by adding code to increase the partitioning from nThreads to some larger number (e.g. nThreads * 128). This is a trade-off between increasing the number of iterations and reducing the time wasted at the barriers.

Jim Dempsey

robert-reed
Valued Contributor II
Some cleaned-up code and additional comments

[snip]


Close, but no cigar. This latest reformulation of the blocking does reduce the potential races but it does not eliminate them.

// 4 threads, 8 fixed partitions
// p1   p2   p3   p4   p5   p6   p7   p8
// 0,0  0,1  0,2  0,3  0,4  0,5  0,6  0,7
// 1,1  1,2  1,3  1,4  1,5  1,6  1,7
// 2,2  2,3  2,4  2,5  2,6  2,7
// 3,3  3,4  3,5  3,6  3,7
// utilization 26/32 (81.25% - 24 x task schedules)

In the first pass of this modified algorithm, the individual partitions are truly disjoint and concurrent calculations can proceed without any cross-thread interactions. However, the same is not true of the second pass, ((0,1),(1,2),(2,3),(3,4)). In this case we have the potential for threads competing for access to masses on either side of the calculaPhys call. Likewise in passes 3 and 4: given Jim's assumption that calculaPhys updates status on both masses, there are still a lot of potential races even with this organization.

There's also the underlying concern about how these updates get staged so that early updates on one time step avoid affecting later updates on the same time step as position/velocity/acceleration changes propagate through the array. But this is an issue with the original calculaPhys, whose details have not been shared in this thread but whose behavior can be inferred from the context.

matteocilk_com
Beginner

Close, but no cigar. This latest reformulation of the blocking does reduce the potential races but it does not eliminate them.


There exists in fact a cute way to compute the N^2 interactions that avoids races, but the solution is not completely
trivial. I have written a short article at our blog at Cilk Arts that explains this technique. See
http://www.cilk.com/multicore-blog/bid/9328/A-cute-technique-for-avoiding-certain-race-conditions

robert-reed
Valued Contributor II
Quoting - matteocilk.com
There exists in fact a cute way to compute the N^2 interactions that avoids races, but the solution is not completely trivial. I have written a short article at our blog at Cilk Arts that explains this technique.

Ah, yes! That's a very interesting approach, with lots of recursive parallelism. There's just enough synchronization to avoid the races. It seems you should be able to do roughly the same thing in TBB using a two-function parallel_invoke in interact() and rect() (which would impose a few more synchronization points, so it might not scale quite as well). There is a bit of overkill in rect(): since it starts with a square and always partitions equally in x and y, all the triangles should be isosceles right triangles and all the rectangles should be square, meaning that the else condition should only occur when di == dj == 1, thus obviating the need for the nested fors.

You're right. It's a very cute technique. Thanks.
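For reference, here is roughly the shape that decomposition takes with parallel_invoke. This is an untested sketch (it assumes a compiler with lambda support), and interact(i, j) is only a placeholder for applying calculaPhys to both particles of the pair:

[cpp]#include <cstddef>
#include "tbb/parallel_invoke.h"

using std::size_t;

// Placeholder for the pairwise interaction: update both particles i and j.
// The decomposition below relies only on it touching those two particles.
void interact(size_t i, size_t j) { /* calculaPhys on both sides */ }

const size_t GRAIN = 32;   // below this, just run the loops serially

// All interactions between two disjoint index ranges [i0,i1) and [j0,j1).
void rect(size_t i0, size_t i1, size_t j0, size_t j1)
{
    if (i0 == i1 || j0 == j1) return;
    size_t di = i1 - i0, dj = j1 - j0;
    if (di <= GRAIN && dj <= GRAIN) {          // base case: serial loops
        for (size_t i = i0; i < i1; ++i)
            for (size_t j = j0; j < j1; ++j)
                interact(i, j);
        return;
    }
    size_t im = i0 + di / 2, jm = j0 + dj / 2;
    // Quadrants are paired so that the two halves of each invoke touch
    // disjoint particle sets: no locking is needed.
    tbb::parallel_invoke([=]{ rect(i0, im, j0, jm); },
                         [=]{ rect(im, i1, jm, j1); });
    tbb::parallel_invoke([=]{ rect(i0, im, jm, j1); },
                         [=]{ rect(im, i1, j0, jm); });
}

// All interactions among the particles of [i0,i1), each unordered pair once.
void triangle(size_t i0, size_t i1)
{
    size_t d = i1 - i0;
    if (d < 2) return;
    if (d <= GRAIN) {                          // base case: serial triangle
        for (size_t i = i0; i < i1; ++i)
            for (size_t j = i + 1; j < i1; ++j)
                interact(i, j);
        return;
    }
    size_t m = i0 + d / 2;
    // The two sub-triangles touch disjoint particles, so they run in
    // parallel; the rectangle between the halves runs after both finish.
    tbb::parallel_invoke([=]{ triangle(i0, m); },
                         [=]{ triangle(m, i1); });
    rect(i0, m, m, i1);
}[/cpp]

Calling triangle(0, CANTIDAD) covers each unordered pair exactly once, so the force can be applied to both particles of a pair in one interact() call.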

matteocilk_com
Beginner

Since it starts with a square and always partitions equally in x and y, all the triangles should be isosceles right triangles and all the rectangles should be square, meaning that the else condition should only occur when di == dj == 1, thus obviating the need for the nested fors.


Actually, rect() starts with a square, but things become rectangular after the first cut if the square has an odd size. So you need to check both di and dj. (E.g., the base case may end up being 1x2 or 2x1.)

robert-reed
Valued Contributor II
Quoting - matteocilk.com
Actually, rect() starts with a square, but things become rectangular after the first cut if the square has an odd size. So you need to check both di and dj. (E.g., the base case may end up being 1x2 or 2x1.)

Oh, of course! You end up with two (on-diagonal) squares, one smaller than the other and two (off-diagonal) rectangles whose shapes are transposes of each other. Thanks for clarifying that.

RafSchietekat
Valued Contributor III
I would just use a parallel_for(blocked_range2d) without first folding the domain along the diagonal, speculating that it is cheaper to do the work twice than to suffer all the synchronisation barriers implied by the fancy algorithm shown above. Has that algorithm's performance been explored for different combinations of problem size, number of cores, amount of data/work per pair? Or is that superfluous and can reasoning alone decide? Should I be embarrassed for not seeing the light as well as jealous for not having discovered it myself, or can I breathe easy? :-)

Should my suspicions have any merit, I would expect the main concern to instead be how to recycle data between tasks in the same thread, i.e., perhaps just subdividing down the middle of the longest side in blocked_range2d might not be optimal in this case, maybe it should be a blocked_range on only one dimension anyway, with a carefully chosen strip width (based on cache size, not number of elements and cores, i.e., using a simple_partitioner), and evaluated by sweeping a perpendicular row/column of accumulators along its length.

So, am I mistaken, or not really, or really not? :-)