- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Also, because I just use addAcc(i,j) to compute for j, I remove some temporaries from this function and I see an increase more than two times in speed. I think this is due to less memory allocation of this frequently called function.

void addAcc(int i, int j) {

// Compute the force between bodies and apply to each as an acceleration

// compute the distance between them

double dx = body

*.pos[0]-body*.pos[0];

double dy = body

double dy = body

*.pos[1]-body*.pos[1];

double dz = body

double dz = body

*.pos[2]-body*.pos[2];

double distsq = dx*dx + dy*dy + dz*dz;

if (distsq < MINDIST) distsq = MINDIST;

double dist = sqrt(distsq);

double ai = GFORCE*body.mass/(dist*distsq);

body

double distsq = dx*dx + dy*dy + dz*dz;

if (distsq < MINDIST) distsq = MINDIST;

double dist = sqrt(distsq);

double ai = GFORCE*body

body

*.acc[0] -= ai*dx;*

body

body

*.acc[1] -= ai*dy;*

bodybody

*.acc[2] -= ai*dz;*

}

I am very grateful if you can explain about this or point out any mistake if there is.}

I am very grateful if you can explain about this or point out any mistake if there is.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

*I have been following Robert Reed's nbodies parallel blog and I have read about the discussion on this forum. I have tried the suggestion of computing everything twice, just use blocked_range to iterate over the rows of the matrix. The result is as fast as the unsafe parallel implementation by Robert. I think you may have an explanation.*

Also, because I just use addAcc(i,j) to compute for j, I remove some temporaries from this function and I see an increase more than two times in speed. I think this is due to less memory allocation of this frequently called function.

I am very grateful if you can explain about this or point out any mistake if there is.

Also, because I just use addAcc(i,j) to compute for j, I remove some temporaries from this function and I see an increase more than two times in speed. I think this is due to less memory allocation of this frequently called function.

I am very grateful if you can explain about this or point out any mistake if there is.

Since the allocation of local functionvariables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should notaffectperformance (within reasonable limits), but the question of running

*addAcc*(

*i,j*) as a function that only updates

*i-*bodies but runs the whole interaction square to update all the bodies seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series at the moment is just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make in vu64's experimentis to verify that in modifying

*addAcc*to update only body

*i*, the nested loop that calls

*addAcc*be modified likewise to feed all the required

*ij*pairs to the function. This means the loop that was described in the serial version as

for (i = 0; i < n - 1; ++i)

for (j = i + 1; j < n; ++j)

addAcc(i, j);

will need to be modified to look something like this:

for (i = 0; i < n; ++i)

for (j = 1; j < n; ++j)

if (i != j) addAcc(i, j);

Otherwise, not all the *j* bodies will be considered in computing the accelerations for an *i* body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

*I have been following Robert Reed's nbodies parallel blog and I have read about the discussion on this forum. I have tried the suggestion of computing everything twice, just use blocked_range to iterate over the rows of the matrix. The result is as fast as the unsafe parallel implementation by Robert. I think you may have an explanation.*

Also, because I just use addAcc(i,j) to compute for j, I remove some temporaries from this function and I see an increase more than two times in speed. I think this is due to less memory allocation of this frequently called function.

I am very grateful if you can explain about this or point out any mistake if there is.

Also, because I just use addAcc(i,j) to compute for j, I remove some temporaries from this function and I see an increase more than two times in speed. I think this is due to less memory allocation of this frequently called function.

I am very grateful if you can explain about this or point out any mistake if there is.

Since the allocation of local functionvariables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should notaffectperformance (within reasonable limits), but the question of running

*addAcc*(

*i,j*) as a function that only updates

*i-*bodies but runs the whole interaction square to update all the bodies seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series at the moment is just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make in vu64's experimentis to verify that in modifying

*addAcc*to update only body

*i*, the nested loop that calls

*addAcc*be modified likewise to feed all the required

*ij*pairs to the function. This means the loop that was described in the serial version as

for (i = 0; i < n - 1; ++i)

for (j = i + 1; j < n; ++j)

addAcc(i, j);

will need to be modified to look something like this:

for (i = 0; i < n; ++i)

for (j = 1; j < n; ++j)

if (i != j) addAcc(i, j);

Otherwise, not all the *j* bodies will be considered in computing the accelerations for an *i* body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Since the allocation of local functionvariables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should notaffectperformance (within reasonable limits), but the question of running

// Compute the accelerations of the bodies

for (i = 0; i < n - 1; ++i)

for (j = i + 1; j < n; ++j)

addAcc(i, j);
// Compute the accelerations of the bodies

for (i = 0; i < n; ++i)

for (j = 1; j < n; ++j)

if (i != j) addAcc(i, j);

Since the allocation of local functionvariables is usually accomplished by a single stack adjustment on entry to the function, having more local temporaries should notaffectperformance (within reasonable limits), but the question of running

*addAcc*(*i,j*) as a function that only updates*i-*bodies but runs the whole interaction square to update all the bodies seems like a worthy alternative to consider as I explore the parallelization space of this problem. Since my blog series at the moment is just at the point of examining my first stab at parallelization, it will be very easy for me to explore this alternative in a future post. The one sanity check I would make in vu64's experimentis to verify that in modifying*addAcc*to update only body*i*, the nested loop that calls*addAcc*be modified likewise to feed all the required*ij*pairs to the function. This means the loop that was described in the serial version as

for (i = 0; i < n - 1; ++i)

for (j = i + 1; j < n; ++j)

addAcc(i, j);

will need to be modified to look something like this:

for (i = 0; i < n; ++i)

for (j = 1; j < n; ++j)

if (i != j) addAcc(i, j);

Otherwise, not all the *j* bodies will be considered in computing the accelerations for an *i* body. I'll plan to explore this alternative as I move the blog series forward. Thank you for the question.

Here is my parallel version:

parallel_for(blocked_range

[&] (const blocked_range

for (int i = r.begin(); i != r.end(); ++i) {

for (int j = 0; j < i; ++j)

addAcc(i,j);

for (int j = i+1; j < n; ++j)

addAcc(i,j);

}

}, auto_partitioner());

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

*Here is my parallel version:*

parallel_for(blocked_range (0, n),

[&] (const blocked_range& r) {

for (int i = r.begin(); i != r.end(); ++i) {

for (int j = 0; j < i; ++j)

addAcc(i,j);

for (int j = i+1; j < n; ++j)

addAcc(i,j);

}

}, auto_partitioner());

[&] (const blocked_range

for (int i = r.begin(); i != r.end(); ++i) {

for (int j = 0; j < i; ++j)

addAcc(i,j);

for (int j = i+1; j < n; ++j)

addAcc(i,j);

}

}, auto_partitioner());

Yeah, that should produce the correct result. I'll add it to my topic set and give it a try when my blog seriesgets there. Thanks!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Yeah, that should produce the correct result. I'll add it to my topic set and give it a try when my blog seriesgets there. Thanks!

Yeah, that should produce the correct result. I'll add it to my topic set and give it a try when my blog seriesgets there. Thanks!

I have also try using blocked_range2d over the matrix but it seems to run slower. Hope that you can also look into it. Thanks.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page