Also, because I use addAcc(i,j) only to add body j's contribution to body i, I removed some temporaries from this function and the code ran more than twice as fast. I think this is due to less memory allocation in this frequently called function.
void addAcc(int i, int j) {
    // Compute the force body j exerts on body i and apply it to body i as an acceleration
    // compute the distance between them
    double dx = body[i].pos[0] - body[j].pos[0];
    double dy = body[i].pos[1] - body[j].pos[1];
    double dz = body[i].pos[2] - body[j].pos[2];
    double distsq = dx*dx + dy*dy + dz*dz;
    if (distsq < MINDIST) distsq = MINDIST;
    double dist = sqrt(distsq);
    double ai = GFORCE*body[j].mass/(distsq*dist);
    body[i].acc[0] -= ai*dx;
    body[i].acc[1] -= ai*dy;
    body[i].acc[2] -= ai*dz;
}
I would be very grateful if you could explain this or point out any mistake, if there is one.
Since the allocation of local function variables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should not affect performance (within reasonable limits). However, the idea of turning addAcc(i,j) into a function that updates only i-bodies and then running over the whole interaction square to update all the bodies seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series is just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make of vu64's experiment is to verify that, in modifying addAcc to update only body i, the nested loop that calls addAcc is modified likewise to feed all the required i,j pairs to the function. This means the loop that was described in the serial version as
for (i = 0; i < n - 1; ++i)
for (j = i + 1; j < n; ++j)
addAcc(i, j);
will need to be modified to look something like this:
for (i = 0; i < n; ++i)
for (j = 0; j < n; ++j)
if (i != j) addAcc(i, j);
Otherwise, not all the j bodies will be considered in computing the accelerations for an i body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.
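For contrast, here is a minimal sketch of the symmetric form that the triangular loop assumes, written to mirror the structure of vu64's snippet above rather than quoting my blog code exactly (the name addAccBoth is just for illustration):
void addAccBoth(int i, int j) {
    // One call accumulates the interaction into both bodies (Newton's third law),
    // which is why the serial version only needs the j > i half of the interaction square.
    double dx = body[i].pos[0] - body[j].pos[0];
    double dy = body[i].pos[1] - body[j].pos[1];
    double dz = body[i].pos[2] - body[j].pos[2];
    double distsq = dx*dx + dy*dy + dz*dz;
    if (distsq < MINDIST) distsq = MINDIST;
    double dist = sqrt(distsq);
    double ai = GFORCE*body[j].mass/(distsq*dist);
    double aj = GFORCE*body[i].mass/(distsq*dist);
    body[i].acc[0] -= ai*dx; body[i].acc[1] -= ai*dy; body[i].acc[2] -= ai*dz;
    body[j].acc[0] += aj*dx; body[j].acc[1] += aj*dy; body[j].acc[2] += aj*dz;
}
The writes to body[j].acc are exactly what makes this form awkward to parallelize over i, which is what motivates the i-only variant and the full-square loop above.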
Here is my parallel version:
parallel_for(blocked_range<int>(0, n),
    [&] (const blocked_range<int>& r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            // accumulate the contribution of every j != i onto body i only
            for (int j = 0; j < i; ++j)
                addAcc(i,j);
            for (int j = i+1; j < n; ++j)
                addAcc(i,j);
        }
    }, auto_partitioner());
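For completeness, the fragment relies on the usual TBB headers, and it assumes the accumulators start each step at zero, as an accumulation-style addAcc needs. Roughly how I wrap it (the function name computeAccelerations and the zeroing loop are just my framing here):
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>
using namespace tbb;

void computeAccelerations(int n) {
    // clear the accumulators before the parallel pass adds contributions into them
    for (int i = 0; i < n; ++i)
        body[i].acc[0] = body[i].acc[1] = body[i].acc[2] = 0.0;
    // each task owns a block of i values, so no two tasks write the same body's acc
    parallel_for(blocked_range<int>(0, n),
        [&] (const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                for (int j = 0; j < i; ++j) addAcc(i,j);
                for (int j = i+1; j < n; ++j) addAcc(i,j);
            }
        }, auto_partitioner());
}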
Yeah, that should produce the correct result. I'll add it to my topic set and give it a try when my blog series gets there. Thanks!
I have also tried using blocked_range2d over the interaction matrix, but it seems to run slower. I hope you can also look into that. Thanks.
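Roughly, what I tried looks like this (a sketch only; the exact code I ran may differ in details such as grain size):
// split both dimensions of the interaction square across tasks;
// unlike the row-only version, tasks that share rows but not columns
// can now update the same body[i].acc concurrently
parallel_for(blocked_range2d<int>(0, n, 0, n),
    [&] (const blocked_range2d<int>& r) {
        for (int i = r.rows().begin(); i != r.rows().end(); ++i)
            for (int j = r.cols().begin(); j != r.cols().end(); ++j)
                if (i != j) addAcc(i,j);
    }, auto_partitioner());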
