num_local_octants = 8 / num_processors
do nlo = 1, num_local_octants
   call crunch_numbers_for_octant(nlo)   ! argument list abbreviated here
end do
The contents of "crunch_numbers_for_octant" are pretty dense, with about 5 layers of nested loops and fairly intense computation. When I drop the optimization level back to -O0, I see the expected behavior, something along the lines of
1 processor: 14.25 seconds
However, with -O2, I see:
1 processor: 1.02 seconds
Edit: -O1 shows the same behavior.
Previous versions of the code showed much more linear scaling. The differences between the current version and the previous ones mostly consist of re-ordering loops and arrays for maximum efficiency (column-major order, etc.). Also, in the previous version the computations for each of the 8 octants were hard-coded, while the new version codes the behavior for a single octant and uses variables for the loop parameters. That's about it. The computations being performed by the two codes are identical, and the results they produce are identical.
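To illustrate the kind of re-ordering I mean, here is a minimal sketch (not the actual code; array names and bounds are made up) where the innermost loop runs over the first array index, so consecutive iterations touch contiguous memory in Fortran's column-major layout:

program column_major_sketch
   implicit none
   integer, parameter :: nx = 64, ny = 64, nz = 64
   real :: phi(nx,ny,nz), src(nx,ny,nz), dt
   integer :: i, j, k

   phi = 0.0
   src = 1.0
   dt  = 0.1

   ! Innermost loop over the FIRST index: stride-1 access through memory.
   do k = 1, nz
      do j = 1, ny
         do i = 1, nx
            phi(i,j,k) = phi(i,j,k) + dt*src(i,j,k)
         end do
      end do
   end do

   print *, 'phi(1,1,1) =', phi(1,1,1)
end program column_major_sketch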
I've tried previous versions of the compiler, but they also show the same drop in efficiency. Does anyone have any suggestions about what may be going on, or how to identify the problem, or how to fix it?
Thanks,
I guess that by "processor" you mean the number of MPI processes, probably on an 8-core platform. It may be important to know what kind of platform this is.
If you are using an MPI that doesn't enable core affinity by default, try turning that option on. taskset can be used explicitly if the MPI doesn't have such an option.
As you've seen, it's easier to get linear speedup when you don't optimize: bottlenecks you didn't notice before become prominent.
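For example, if the MPI has no built-in pinning option, restricting the job to the physical cores with taskset is one rough way to do it on Linux (the core list and executable name here are only placeholders):

taskset -c 0-3 mpirun -np 4 ./a.out

Per-rank binding through the MPI library's own affinity option is usually the better choice when it is available.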
4 processes: 0.304 seconds
This is much closer to what I expect.
Thanks guys!
