Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

About nbodies parallel

vu64
Beginner
I have been following Robert Reed's nbodies parallel blog and have read the discussion on this forum. I tried the suggestion of computing everything twice, simply using blocked_range to iterate over the rows of the interaction matrix. The result is as fast as the unsafe parallel implementation by Robert. Perhaps you can explain why.

Also, because addAcc(i,j) now only accumulates the effect of body j on body i, I removed some temporaries from the function and saw a more than twofold increase in speed. I think this is due to less memory allocation in this frequently called function.

void addAcc(int i, int j) {

    // Accumulate onto body i the acceleration due to body j

    // compute the distance between them
    double dx = body[i].pos[0] - body[j].pos[0];
    double dy = body[i].pos[1] - body[j].pos[1];
    double dz = body[i].pos[2] - body[j].pos[2];

    double distsq = dx*dx + dy*dy + dz*dz;
    if (distsq < MINDIST) distsq = MINDIST;
    double dist = sqrt(distsq);

    double ai = GFORCE * body[j].mass / (dist * distsq);

    body[i].acc[0] -= ai*dx;
    body[i].acc[1] -= ai*dy;
    body[i].acc[2] -= ai*dz;
}
I would be grateful if you could explain this or point out any mistake.
4 Replies
robert-reed
Valued Contributor II
Quoting - vu64
I have been following Robert Reed's nbodies parallel blog and have read the discussion on this forum. I tried the suggestion of computing everything twice, simply using blocked_range to iterate over the rows of the interaction matrix. The result is as fast as the unsafe parallel implementation by Robert. Perhaps you can explain why.

Also, because addAcc(i,j) now only accumulates the effect of body j on body i, I removed some temporaries from the function and saw a more than twofold increase in speed. I think this is due to less memory allocation in this frequently called function.

I would be grateful if you could explain this or point out any mistake.

Since the allocation of local function variables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should not affect performance (within reasonable limits). However, running addAcc(i,j) as a function that updates only the i-body, while the loop runs over the whole interaction square so that every body still gets updated, seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series is at the moment just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make on vu64's experiment is to verify that, in modifying addAcc to update only body i, the nested loop that calls addAcc is modified likewise to feed all the required i,j pairs to the function. This means the loop that was described in the serial version as

// Compute the accelerations of the bodies
for (i = 0; i < n - 1; ++i)
    for (j = i + 1; j < n; ++j)
        addAcc(i, j);


will need to be modified to look something like this:

// Compute the accelerations of the bodies
for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        if (i != j) addAcc(i, j);

Otherwise, not all the j bodies will be considered in computing the accelerations for an i body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.

vu64
Beginner

Quoting - robert-reed
Since the allocation of local function variables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should not affect performance (within reasonable limits). However, running addAcc(i,j) as a function that updates only the i-body, while the loop runs over the whole interaction square so that every body still gets updated, seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series is at the moment just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make on vu64's experiment is to verify that, in modifying addAcc to update only body i, the nested loop that calls addAcc is modified likewise to feed all the required i,j pairs to the function. This means the loop that was described in the serial version as

// Compute the accelerations of the bodies
for (i = 0; i < n - 1; ++i)
    for (j = i + 1; j < n; ++j)
        addAcc(i, j);


will need to be modified to look something like this:

// Compute the accelerations of the bodies
for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        if (i != j) addAcc(i, j);

Otherwise, not all the j bodies will be considered in computing the accelerations for an i body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.


Here is my parallel version:
parallel_for(blocked_range<int>(0, n),
    [&] (const blocked_range<int>& r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < i; ++j)
                addAcc(i, j);
            for (int j = i + 1; j < n; ++j)
                addAcc(i, j);
        }
    }, auto_partitioner());
robert-reed
Valued Contributor II
Quoting - vu64
Here is my parallel version:

parallel_for(blocked_range<int>(0, n),
    [&] (const blocked_range<int>& r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < i; ++j)
                addAcc(i, j);
            for (int j = i + 1; j < n; ++j)
                addAcc(i, j);
        }
    }, auto_partitioner());

Yeah, that should produce the correct result. I'll add it to my topic set and give it a try when my blog series gets there. Thanks!
vu64
Beginner

Quoting - robert-reed
Yeah, that should produce the correct result. I'll add it to my topic set and give it a try when my blog series gets there. Thanks!

I have also tried using blocked_range2d over the interaction matrix, but it seems to run slower. I hope you can look into that as well. Thanks.
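For reference, the blocked_range2d attempt was roughly along these lines; this is a sketch rather than my exact code, assuming the same body array, n, and addAcc(i,j) from above (it also needs tbb/blocked_range2d.h in addition to tbb/parallel_for.h):

// Sketch of the blocked_range2d variant (illustrative, not the exact code above).
// Caution: once the j (column) dimension is split, two tasks can hold the same i
// with different j sub-ranges and update body[i].acc concurrently, so this form
// needs atomic updates or per-task accumulation to be safe.
parallel_for(blocked_range2d<int>(0, n, 0, n),
    [&] (const blocked_range2d<int>& r) {
        for (int i = r.rows().begin(); i != r.rows().end(); ++i)
            for (int j = r.cols().begin(); j != r.cols().end(); ++j)
                if (i != j) addAcc(i, j);
    }, auto_partitioner());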