Intel® oneAPI Threading Building Blocks

parallel_reduce missing join?

azmodai
Beginner
Hello,

I'm currently working on a parallel marching cubes implementation. Everything works nicely except sometimes when I am using a certain grainsize to compute the mesh (I am using a 46-core machine). I've put some std::cout in my join method to be able to understand why I got badly built meshes when using certain grainsizes.

The size of the data to split is 64; with a grainsize of 32 and 16 I've got no problem (using respectively 2 and 4 cores). When I am using a grainsize of 8, strange problems occur: indeed, the join from 0 to 8 is missed!

Here is what I got for a grainsize of 8:
Join : 48 / 64 nrml/vertexNbr : 1685 / 1757(3442) trgNumber : 3332 / 3050(6382)
Join : 32 / 48 nrml/vertexNbr : 1003 / 1424(2427) trgNumber : 2052 / 2842(4894)
Join : 16 / 32 nrml/vertexNbr : 1670 / 1163(2833) trgNumber : 3300 / 2260(5560)
Join : 32 / 64 nrml/vertexNbr : 2427 / 3442(5869) trgNumber : 4894 / 6382(11276)
Join : 8 / 32 nrml/vertexNbr : 946 / 2833(3779) trgNumber : 2016 / 5560(7576)
Join : 8 / 64 nrml/vertexNbr : 3779 / 5869(9648) trgNumber : 7576 / 11276(18852)

As you can see, there is no join from 0 to 8!

What I got with a grainsize of 16:

Join : 0 / 32 nrml/vertexNbr : 946 / 2833(3779) trgNumber : 2016 / 5560(7576)
Join : 32 / 64 nrml/vertexNbr : 2427 / 3442(5869) trgNumber : 4894 / 6382(11276)
Join : 0 / 64 nrml/vertexNbr : 3779 / 5869(9648) trgNumber : 7576 / 11276(18852)

Have you ever experienced something similar? If so, what did you do to fix such behavior?

Thanks a lot!
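For context, here is a minimal, compilable sketch of the kind of setup I mean (SliceReducer is just an illustrative placeholder, not my actual marching cubes Body): with a simple_partitioner, the grainsize passed to blocked_range bounds the chunk size, so a grainsize of 8 over 64 slices yields eight chunks of eight slices.

#include <iostream>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

struct SliceReducer {
    int vertexNumber;

    SliceReducer() : vertexNumber(0) {}

    // splitting constructor: the new Body starts empty (the identity element)
    SliceReducer(SliceReducer &, tbb::split) : vertexNumber(0) {}

    // must aggregate onto existing state: the scheduler may run several
    // consecutive chunks through one Body without an intervening join
    void operator()(const tbb::blocked_range<int> &r) {
        for(int k = r.begin() ; k != r.end() ; k++)
            vertexNumber += 1; // pretend each slice yields one vertex
    }

    void join(const SliceReducer &rhs) { vertexNumber += rhs.vertexNumber; }
};

int main() {
    SliceReducer body;
    tbb::parallel_reduce(tbb::blocked_range<int>(0, 64, 8),
                         body, tbb::simple_partitioner());
    std::cout << body.vertexNumber << std::endl; // 64, regardless of chunking
}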


RafSchietekat
Valued Contributor III
If you also trace the splits and executions, you should see that the joins match the splits, but also that there can be several chunks (executed subranges) between split/join divisions. You should therefore write the Body with that in mind (it can be executed with state resulting from a subrange immediately to the left of the argument subrange), as described in the Reference Manual.

As effective granularity gets coarser, the opportunity for this optimisation diminishes as the likelihood for adjacent chunks to be executed by different threads increases.

Correct the Body's operator() to aggregate onto the current state.
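A minimal demonstration of that failure mode (toy names, a plain sum standing in for the mesh data): if operator() replaces the Body's state instead of accumulating onto it, any Body that executes two adjacent chunks back to back silently drops the first chunk's result.

#include <iostream>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

struct BrokenSum {
    int sum;

    BrokenSum() : sum(0) {}
    BrokenSum(BrokenSum &, tbb::split) : sum(0) {}

    void operator()(const tbb::blocked_range<int> &r) {
        int local = 0;
        for(int k = r.begin() ; k != r.end() ; k++)
            local += k;
        sum = local; // BUG: discards a previous chunk's result; should be sum += local;
    }

    void join(const BrokenSum &rhs) { sum += rhs.sum; }
};

int main() {
    BrokenSum body;
    tbb::parallel_reduce(tbb::blocked_range<int>(0, 64, 8),
                         body, tbb::simple_partitioner());
    // correct answer is 0+1+...+63 = 2016; whenever one Body ran adjacent
    // chunks without a split/join in between, the bug loses part of the sum
    std::cout << body.sum << std::endl;
}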
RafSchietekat
Valued Contributor III
"As effective granularity gets coarser, the opportunity for this optimisation dimishes as the likelihood for adjacent chunks to be executed by different threads increases."
That sounds worse than I meant it: the optimisation is only relevant to decrease the opportunity cost (right?) for parallelism (in a way specific to this algorithm), so it is less likely to kick in when this cost has already been lowered another way (though hopefully not by removing too much opportunity). And "likelihood" is relative to the number of chunk boundaries, because presumably it should stay about constant in absolute number until granularity starts getting too coarse.

"Correct the Body's operator() to aggregate onto the current state."
Was this helpful?
azmodai
Beginner
Sorry, I think what you said is helpful, but I haven't tried to solve this problem yet because I discovered some stuff I had to fix first :-/. I'll keep you posted, don't worry ;-) Thanks for the answer anyway.
azmodai
Beginner
Hi,

Here is the code of my join method:

void join(const pMarchingCubes &smc)
{
    std::cout << "####################" << std::endl;
    std::cout << "####### JOIN #######" << std::endl;
    std::cout << "kfirst : " << kfirst << " smc.kfirst : " << smc.kfirst << std::endl;
    std::cout << "klast : " << klast << " smc.klast : " << smc.klast << std::endl;
    std::cout << "offset : " << offset << " smc.offset : " << smc.offset << std::endl;
    std::cout << "triangleNumber : " << triangleNumber << " smc.triangleNumber : " << smc.triangleNumber << std::endl;
    std::cout << "vertexNumber : " << vertexNumber << " smc.vertexNumber : " << smc.vertexNumber << std::endl;
    std::cout << "####################" << std::endl;

    // append the right-hand side's vertices and normals after our own
    for(int i = 0 ; i < smc.vertexNumber ; i++)
    {
        // interesting lines 1 & 2
        vertices[i+vertexNumber+offset] = vertices[i+smc.offset];
        normals[i+vertexNumber+offset] = normals[i+smc.offset];
    }

    //! get the right indices of triangles crossing the border
    for(int i = trianglesToCheck+offset ; i < triangleNumber+offset ; i++)
    {
        if(triangles[i].v[0] <= 0) triangles[i].v[0] = - triangles[i].v[0] + vertexNumber;
        if(triangles[i].v[1] <= 0) triangles[i].v[1] = - triangles[i].v[1] + vertexNumber;
        if(triangles[i].v[2] <= 0) triangles[i].v[2] = - triangles[i].v[2] + vertexNumber;
    }

    // rebase the right-hand side's triangle indices, then append its triangles
    for(int i = 0 ; i < smc.triangleNumber ; i++)
    {
        triangles[i+smc.offset].v[0] += vertexNumber;
        triangles[i+smc.offset].v[1] += vertexNumber;
        triangles[i+smc.offset].v[2] += vertexNumber;

        // interesting line 3
        triangles[i+triangleNumber+offset] = triangles[i+smc.offset];
    }

    triangleNumber += smc.triangleNumber;
    vertexNumber += smc.vertexNumber;
}

"kfirst" equal to r.begin() and "klast" to r.end()

"offset" permits to know where to get the triangles (or the vertices, or the normals).

Indeed, I'm using one big array to store all the computed triangles, one big array to store all the computed vertices, and one big array to store all the computed normals, in order not to allocate any memory in the join() method, because that would slow everything down a lot.

The three interesting lines are the ones commented as "interesting line X".
I am using an "offset" to be sure I do not overwrite a triangle (or a vertex, or a normal) computed in one thread with triangles computed by other threads. That's why I take care of moving the vertices, the normals, and the triangles so that there are no more "holes" between values.

I've got these constraints:
--------------------------------

- Because each triangle contains the indices of its vertices, I need the vertices to be stored in the right order.
- Furthermore, I've got dependencies between triangles calculated in one chunk and the neighboring chunk (some triangles are "crossing the border" between chunks).

Results so far:
-------------------

(I am running this program with 2 cores.)

I've got no problem when running the program with 2 threads, but with more than 2 threads data is kind of missing...

The output I've got with 2 threads:
####################
####### JOIN #######
kfirst : 0 smc.kfirst : 32
klast : 32 smc.klast : 64
offset : 0 smc.offset : 73728
triangleNumber : 7576 smc.triangleNumber : 11276
vertexNumber : 3779 smc.vertexNumber : 5869
####################
triangles = 18852
vertices / normals = 9648
average duration = 0.00638916
Bad trgs : 0

With 4 threads, I've got:
kfirst : 16 smc.kfirst : 32
klast : 32 smc.klast : 64

A chunk is missing... the chunk from 0 to 16.
I'm sure this is very easy to solve, but I haven't found it yet because I had only been coding with 2 threads so far (do I need to do something special in the splitting constructor maybe?)
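My current understanding of what the splitting constructor should do (a hypothetical skeleton; the member names follow my join code above, but the types are guesses): share the big preallocated arrays and start with empty counts, i.e. the identity element for join().

#include "tbb/blocked_range.h"

struct Triangle { int v[3]; };

struct pMarchingCubes {
    float (*vertices)[3];   // shared preallocated arrays
    float (*normals)[3];
    Triangle *triangles;
    int vertexNumber, triangleNumber, offset;

    // splitting constructor: share the output arrays, start with empty
    // counts; the per-chunk write offset is best derived in operator()
    // from r.begin(), so that every chunk writes to a disjoint region
    // no matter how the range gets split
    pMarchingCubes(pMarchingCubes &other, tbb::split)
        : vertices(other.vertices), normals(other.normals),
          triangles(other.triangles),
          vertexNumber(0), triangleNumber(0), offset(0) {}
};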
RafSchietekat
Valued Contributor III
I didn't study your code because I still think the most likely source of the problem is in the operator(): when it executes, the Body may already contain state from a previous execution, and you have to aggregate to it, not replace it. Please check this first.
azmodai
Beginner
Oh ok, indeed. I am going to correct this... Thanks.
Btw, that explains the splitting constructor.
azmodai
Beginner
Ok, thanks for the answer; it is exactly the case I'm facing: the same Body's operator() can be called several times (mostly when nbThreads > nbCores), which obliges me to take that into account to manage the data computed by the marching cubes algorithm (it is a slightly different case compared to a simple addition, for instance; marching cubes is highly "chunk-dependent"). Thanks again.