2467 Discussions

Beginner
403 Views
I have been following Robert Reed's nbodies parallel blog and I have read about the discussion on this forum. I have tried the suggestion of computing everything twice, just use blocked_range to iterate over the rows of the matrix. The result is as fast as the unsafe parallel implementation by Robert. I think you may have an explanation.

Also, because I just use addAcc(i,j) to compute for j, I remove some temporaries from this function and I see an increase more than two times in speed. I think this is due to less memory allocation of this frequently called function.

void addAcc(int i, int j) {

// Compute the force between bodies and apply to each as an acceleration

// compute the distance between them
double dx = body.pos[0]-body.pos[0];
double dy = body.pos[1]-body.pos[1];
double dz = body.pos[2]-body.pos[2];

double distsq = dx*dx + dy*dy + dz*dz;
if (distsq < MINDIST) distsq = MINDIST;
double dist = sqrt(distsq);

double ai = GFORCE*body.mass/(dist*distsq);

body.acc[0] -= ai*dx;
body.acc[1] -= ai*dy;
body.acc[2] -= ai*dz;
}
I am very grateful if you can explain about this or point out any mistake if there is.
1 Solution
Valued Contributor II
403 Views
Quoting - vu64
I have been following Robert Reed's nbodies parallel blog and I have read about the discussion on this forum. I have tried the suggestion of computing everything twice, just use blocked_range to iterate over the rows of the matrix. The result is as fast as the unsafe parallel implementation by Robert. I think you may have an explanation.

Also, because I just use addAcc(i,j) to compute for j, I remove some temporaries from this function and I see an increase more than two times in speed. I think this is due to less memory allocation of this frequently called function.

I am very grateful if you can explain about this or point out any mistake if there is.

Since the allocation of local functionvariables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should notaffectperformance (within reasonable limits), but the question of running addAcc(i,j) as a function that only updates i-bodies but runs the whole interaction square to update all the bodies seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series at the moment is just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make in vu64's experimentis to verify that in modifying addAcc to update only body i, the nested loop that calls addAcc be modified likewise to feed all the required ij pairs to the function. This means the loop that was described in the serial version as

// Compute the accelerations of the bodies
for (i = 0; i < n - 1; ++i)
for (j = i + 1; j < n; ++j)

will need to be modified to look something like this:

// Compute the accelerations of the bodies
for (i = 0; i < n; ++i)
for (j = 1; j < n; ++j)
if (i != j) addAcc(i, j);

Otherwise, not all the j bodies will be considered in computing the accelerations for an i body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.

4 Replies
Valued Contributor II
404 Views
Quoting - vu64
I have been following Robert Reed's nbodies parallel blog and I have read about the discussion on this forum. I have tried the suggestion of computing everything twice, just use blocked_range to iterate over the rows of the matrix. The result is as fast as the unsafe parallel implementation by Robert. I think you may have an explanation.

Also, because I just use addAcc(i,j) to compute for j, I remove some temporaries from this function and I see an increase more than two times in speed. I think this is due to less memory allocation of this frequently called function.

I am very grateful if you can explain about this or point out any mistake if there is.

Since the allocation of local functionvariables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should notaffectperformance (within reasonable limits), but the question of running addAcc(i,j) as a function that only updates i-bodies but runs the whole interaction square to update all the bodies seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series at the moment is just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make in vu64's experimentis to verify that in modifying addAcc to update only body i, the nested loop that calls addAcc be modified likewise to feed all the required ij pairs to the function. This means the loop that was described in the serial version as

// Compute the accelerations of the bodies
for (i = 0; i < n - 1; ++i)
for (j = i + 1; j < n; ++j)

will need to be modified to look something like this:

// Compute the accelerations of the bodies
for (i = 0; i < n; ++i)
for (j = 1; j < n; ++j)
if (i != j) addAcc(i, j);

Otherwise, not all the j bodies will be considered in computing the accelerations for an i body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.

Beginner
403 Views

Since the allocation of local functionvariables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should notaffectperformance (within reasonable limits), but the question of running addAcc(i,j) as a function that only updates i-bodies but runs the whole interaction square to update all the bodies seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series at the moment is just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make in vu64's experimentis to verify that in modifying addAcc to update only body i, the nested loop that calls addAcc be modified likewise to feed all the required ij pairs to the function. This means the loop that was described in the serial version as

// Compute the accelerations of the bodies
for (i = 0; i < n - 1; ++i)
for (j = i + 1; j < n; ++j)

will need to be modified to look something like this:

// Compute the accelerations of the bodies
for (i = 0; i < n; ++i)
for (j = 1; j < n; ++j)
if (i != j) addAcc(i, j);

Otherwise, not all the j bodies will be considered in computing the accelerations for an i body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.

Here is my parallel version:
parallel_for(blocked_range (0, n),
[&] (const blocked_range& r) {
for (int i = r.begin(); i != r.end(); ++i) {
for (int j = 0; j < i; ++j)
for (int j = i+1; j < n; ++j)
}
}, auto_partitioner());
Valued Contributor II
403 Views
Quoting - vu64
Here is my parallel version:

parallel_for(blocked_range (0, n),
[&] (const blocked_range& r) {
for (int i = r.begin(); i != r.end(); ++i) {
for (int j = 0; j < i; ++j)