Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Decrease in multi-processor efficiency with optimized code

gregfi04
Beginner
Hello,
I'm seeing a pretty drastic drop in efficiency moving from 2 processors to 4 processors in one of our MPI codes with Version 12.0.3. The pertinent parts of the code are compiled with -O2. The subroutine in question has no dependence on inter-process communication of any kind, so the speedup should be darn-near linear, which is why this behavior is so puzzling. The code chunk looks something like:

num_local_octants = 8 / num_processors

do nlo = 1, num_local_octants
   global_octant = npid*num_local_octants + nlo
   call crunch_numbers_for_octant(global_octant)
enddo
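
Here npid is the rank of the MPI process and num_processors is the number of ranks. To make the setup concrete, a stripped-down, self-contained version would look roughly like the sketch below, with the real computation replaced by a stub (the loop matches my snippet above; everything else is illustrative):

program octant_sweep
   use mpi
   implicit none
   integer :: ierr, npid, num_processors
   integer :: num_local_octants, nlo, global_octant

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, npid, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, num_processors, ierr)

   ! Static decomposition: each rank owns 8/num_processors octants and
   ! does no communication inside the loop.
   num_local_octants = 8 / num_processors
   do nlo = 1, num_local_octants
      global_octant = npid*num_local_octants + nlo
      call crunch_numbers_for_octant(global_octant)
   enddo

   call MPI_Finalize(ierr)

contains

   subroutine crunch_numbers_for_octant(octant)
      integer, intent(in) :: octant
      ! Stand-in for the real computation.
      print *, 'rank', npid, 'handled octant', octant
   end subroutine crunch_numbers_for_octant

end program octant_sweep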

The contents of "crunch_numbers_for_octant" are pretty dense, with about 5 layers of nested loops and fairly intense computation. When I dial the optimization level back to -O0, the expected behavior is observed, and I see something along the lines of:

1 processor: 14.25 seconds
2 processors: 7.15 seconds
4 processors: 3.64 seconds
8 processors: 1.92 seconds

However, with -O2, I see:

1 processor: 1.02 seconds
2 processors: 0.57 seconds
4 processors: 0.49 seconds
8 processors: 0.27 seconds

Edit: -O1 shows the same behavior.

Previous versions of the code exhibited much more linear behavior. The differences between the current version of the code and the previous versions mostly consist of re-ordering loops and arrays for maximum efficiency (column-major order, etc.). Also, in the previous version, the computations for each of the 8 octants were hard-coded. The new version only codes the behavior for a single octant, and uses variables for the loop parameters. That's about it. The computations being performed by both codes are identical, and the results they produce are identical.
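
By column-major order I just mean arranging the loop nests so that the leftmost array index varies fastest in the innermost loop. An illustrative fragment (not our actual code):

do k = 1, nz
   do j = 1, ny
      do i = 1, nx            ! leftmost index innermost: unit-stride access
         flux(i,j,k) = flux(i,j,k) + source(i,j,k)
      enddo
   enddo
enddo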

I've tried previous versions of the compiler, but they also show the same drop in efficiency. Does anyone have any suggestions about what may be going on, or how to identify the problem, or how to fix it?

Thanks,
Greg
4 Replies
Michael_J_Slater__In
New Contributor I
I have a few suggestions, but someone else may have more specific answers.

Try compiling your code with -opt-report -opt-report-phase=all while you are testing out different optimization flags. This will display a lot of messages that highlight all of the optimizations happening (memory, vectorization, ipo).

It's possible that your loops aren't being vectorized, so compare the output of those reports when you compile again with:
-ipo -opt-report -opt-report-phase=all
-ipo permits function inlining among multiple files. Your inner loop can't vectorize if it calls a non-inlined function, so this may also help.

Also, -fast sets many pre-defined compiler flags such as -O3 and -ipo, so you could try this first. -O3 should help even more if your loops also have floating-point calculations or large data sets.

Also, if you are running this code on the same machine you are compiling on, use -xhost and the compiler will generate the optimal instruction set for that machine.

You can see all these options on the man page, and we also have a user guide on optimization in the compiler documentation.
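
For example (file and program names are placeholders; substitute your MPI compiler wrapper for ifort if you use one):

ifort -O2 -opt-report -opt-report-phase=all -c crunch_numbers.f90
ifort -O2 -ipo -opt-report -opt-report-phase=all -c crunch_numbers.f90
ifort -O3 -xhost -ipo -o solver *.f90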
TimP
Honored Contributor III

I guess by "processor" you mean the number of MPI processes, probably on an 8-core platform. It may be important to know what kind of platform this is.

If you are using an MPI which doesn't default to core affinity, try turning on that option. Taskset could be used explicitly, if the MPI doesn't have the option.
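
For example, with Open MPI the binding can be requested on the mpirun command line; the option spelling varies between versions, so treat these as illustrative:

mpirun --bind-to-core -np 4 ./your_app
mpirun -mca mpi_paffinity_alone 1 -np 4 ./your_app
taskset -c 0 ./your_app

(taskset pins whatever it launches to the listed cores; with MPI you would wrap each rank in a small script.)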

As you've seen, it's easier to get linear speedup when you don't optimize. Bottlenecks which you didn't see before become prominent.

gregfi04
Beginner
Tim and Michael,
Thanks. Yes, I mean MPI processes, sorry for the sloppy nomenclature. The blades in the cluster each contain two Xeon X5260s and are connected via InfiniBand; we're running Open MPI.
I guess the best way to test whether I'm hitting a memory or cache bottleneck is to split the 4-process job across two different blades and see if performance improves. I'll see if I can figure out how to do that tomorrow.
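
(For anyone who finds this later: with Open MPI I believe something along these lines spreads 4 processes across two blades, two per blade; blades.txt is a made-up host file listing the blades.)

mpirun -np 4 -npernode 2 -hostfile blades.txt ./solver
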
Greg
gregfi04
Beginner
Yep, it appears that was the answer: the system memory and/or cache were being swamped. The 4-process and 8-process results above were being generated using 4 cores per blade. When I back that off to 2 cores per blade, I see:

4 processes: 0.304 seconds
8 processes: 0.175 seconds

This is much closer to what I expect.

Thanks guys!
Greg