I've mostly seen 'how' questions in this forum, so if I should be asking this somewhere else, tell me where to go.
I have a 1980s F77/VAX code for acoustics that, in principle, should spend almost all of its time executing floating-point work. Instead it runs 8% FP and does something else the other 92% of the time. It has a lot of branches and a lot of loops.
I ran the Basic and Advanced Performance ratios and got these numbers.
Which, if any, of these points to the issue I should look for?
Branch Misprediction Per Micro-Op Retired
Branch Misprediction Performance Impact
Clocks per Instructions Retired - CPI
Data Bus Utilization
Floating Point Instructions Ratio
L1 Data Cache Miss Performance Impact
L1 Data Cache Miss Rate
L2 Cache Demand Miss Rate
L2 Cache Miss Rate
Store Block by Snoop Ratio
Store Order Block
TLB miss penalty
The demonstration that you are spending so much time on mispredicted branches, and have no vectorized execution, is good confirmation of what you said about the code.
Results are from an Intel Core 2 T7400 at 2.16 GHz, but the target is a rack of dual-CPU blades - and I don't think we should need a rack of dual-CPU blades. As it is, the quality of the solution is usually compromised to make the app run in the time available.
The application computes a bunch of eigenrays - it should be all REAL*8 math. It should also be trivially parallel in three dimensions (frequency, direction, eigenray). Right now it is single-threaded (CVF 6.0 compiler, minimal optimization). So I view the 8% FP utilization as the 'smoking gun'.
There are about 900 files in the source, which contains several different models. For the results shown, about 5 subroutines are important. (One finds the maximum value in elements 1:I of an array of length N using an algorithm of cyclomatic complexity 14!) So some routines I might tune, but I need to consider global approaches as well. And globally, I want to understand what is happening the other 92% of the time.
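For concreteness, the kind of routine I mean should in principle collapse to something like the sketch below - one compare per element and no other control flow, so the compiler can vectorize it. The name and interface here are hypothetical, not the actual code:

```fortran
! Hypothetical replacement for a branch-heavy prefix-maximum
! routine: scan elements 1:I of an array of length N and return
! the largest value and its index.
      SUBROUTINE PREMAX(A, N, I, AMAX, IMAX)
      INTEGER N, I, IMAX, K
      REAL*8 A(N), AMAX
      AMAX = A(1)
      IMAX = 1
      DO K = 2, I
         IF (A(K) .GT. AMAX) THEN
            AMAX = A(K)
            IMAX = K
         END IF
      END DO
      END
```

With Fortran 90 (which CVF and the Intel compiler both accept) the body reduces further to the MAXLOC intrinsic over A(1:I).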
You have estimates showing that a large majority of your execution cycles appear to be spent on mispredicted branches. You may be able to show, by function or basic block, where most of this time is spent. To do that, you may have to build and analyze with interprocedural optimization disabled, but with debug symbols and normal optimization specified. The obvious way to work on those stalls is to remove as many conditionals as possible from the loops where they are costing time. The best result is to make the important inner loops vectorizable and parallelize the outer loops.
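As a sketch of what "removing conditionals" can look like (hypothetical names, not taken from the actual code): a data-dependent IF in an inner loop can often be rewritten as branch-free arithmetic, trading the unpredictable branch for a multiply by 0 or 1 that the compiler can vectorize:

```fortran
! Before (hypothetical): data-dependent branch in the inner loop.
!     DO K = 1, N
!        IF (X(K) .GT. 0.D0) THEN
!           Y(K) = Y(K) + C * X(K)
!        END IF
!     END DO
! After: SIGN(1.D0, X(K)) is +1 for X(K) >= 0 and -1 otherwise,
! so MASK is 1 or 0 and the update is branch-free. (At X(K) = 0
! the mask is 1, but the added term is zero anyway.)
      SUBROUTINE AXPOS(X, Y, C, N)
      INTEGER N, K
      REAL*8 X(N), Y(N), C, MASK
      DO K = 1, N
         MASK = 0.5D0 * (1.D0 + SIGN(1.D0, X(K)))
         Y(K) = Y(K) + MASK * C * X(K)
      END DO
      END
```

Whether this pays off depends on how often the branch actually mispredicts; profile before and after rather than applying it blindly.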
>... Right now it is single-threaded (CVF 6.0 compiler, minimal optimization). ...
Hi - try the Intel Fortran Compiler. Old CVF lacks many of the optimizations for modern microarchitectures that the Intel compiler excels at - there is no point in tuning a car's engine when you are feeding it the wrong fuel ...
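For illustration only, an ifort invocation targeting a Core 2 might look like the line below - the exact flag names vary by compiler version, and the file name here is made up, so check your version's documentation rather than copying this verbatim:

```shell
# Hypothetical compile line: full optimization, Core 2 (SSSE3)
# code generation, a vectorization report, and debug symbols
# so the profiler can attribute time to source lines.
ifort -O3 -xT -vec-report1 -g -c eigenray.f
```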