I've mostly seen 'how' questions in this forum, so if I should be asking this somewhere else, tell me where to go.
I have a 1980s F77/VAX code for acoustics that in principle should be almost entirely executing FP. But instead it is running 8% FP and doing something else 92% of the time. It has a lot of branches and a lot of loops.
I ran Basic and Advanced Performance ratios, and got these numbers.
Which of these, if any, point to the issue I should be looking for?
Thanks!
Process: model
Process Path: model
Process ID: 9828

Event                          Samples  %        Events
CPU_CLK_UNHALTED.CORE          56939    91.08%   1.23E+11
INST_RETIRED.ANY               67203    95.58%   1.45E+11
RESOURCE_STALLS.BR_MISS_CLEAR  116073   88.03%   3.27E+09
BR_INST_RETIRED.MISPRED        109578   80.12%   2.17E+08
L1D_REPL                       111397   94.64%   1.95E+09
X87_OPS_RETIRED.ANY            116590   99.93%   1.22E+10
UOPS_RETIRED.ANY               94615    90.32%   1.78E+11
BUS_TRANS_ANY.ALL_AGENTS       48932    39.13%   46240740
CPU_CLK_UNHALTED.BUS           99844    75.32%   9.48E+09
BUS_DRDY_CLOCKS.ALL_AGENTS     63929    37.27%   1.28E+08
L2_LINES_IN.SELF.DEMAND        29064    18.47%   4621176
L2_LINES_IN.SELF.ANY           23743    14.78%   9568429
STORE_BLOCK.SNOOP              0        0.00%    0
STORE_BLOCK.ORDER              50874    51.01%   3.6E+08
PAGE_WALKS.CYCLES              100909   87.77%   5.19E+09

Ratio                                      Value
Branch Misprediction Per Micro-Op Retired  0.001
Branch Misprediction Performance Impact    2.66%
Bus Utilization                            0.98%
Clocks per Instructions Retired - CPI      0.847
Data Bus Utilization                       1.35%
Floating Point Instructions Ratio          8.38%
L1 Data Cache Miss Performance Impact      12.66%
L1 Data Cache Miss Rate                    0.013
L2 Cache Demand Miss Rate                  0
L2 Cache Miss Rate                         0
Store Block by Snoop Ratio                 0.00%
Store Order Block                          0.29%
TLB miss penalty                           4.21%
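For anyone wanting to check how the ratios relate to the raw event counts, here is a minimal C sketch that recomputes several of them from the "events" column. The formulas are inferred from the posted numbers, not taken from the VTune documentation, so treat them as assumptions. This applies in particular to the factor of 8 cycles per L1D_REPL, which is simply the value that reproduces the posted 12.66%. The type and function names (`EventCounts`, `Ratios`, `compute_ratios`, `posted_counts`) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Raw counts from the "events" column of the posted data. */
typedef struct {
    double clk_core;       /* CPU_CLK_UNHALTED.CORE */
    double inst_retired;   /* INST_RETIRED.ANY */
    double br_miss_clear;  /* RESOURCE_STALLS.BR_MISS_CLEAR */
    double br_mispred;     /* BR_INST_RETIRED.MISPRED */
    double l1d_repl;       /* L1D_REPL */
    double x87_ops;        /* X87_OPS_RETIRED.ANY */
    double uops_retired;   /* UOPS_RETIRED.ANY */
    double page_walk_cyc;  /* PAGE_WALKS.CYCLES */
} EventCounts;

/* Derived ratios, all expressed as plain fractions (0.0266 means 2.66%). */
typedef struct {
    double cpi;
    double fp_ratio;
    double mispred_per_uop;
    double br_mispred_impact;
    double l1d_miss_rate;
    double l1d_miss_impact;
    double tlb_miss_penalty;
} Ratios;

/* Division that returns 0 instead of dividing by zero. */
static double safe_div(double num, double den)
{
    return den != 0.0 ? num / den : 0.0;
}

/* ASSUMPTION: these formulas are inferred from the posted figures. */
Ratios compute_ratios(const EventCounts *e)
{
    assert(e != NULL);
    Ratios r;
    r.cpi               = safe_div(e->clk_core, e->inst_retired);
    r.fp_ratio          = safe_div(e->x87_ops, e->inst_retired);
    r.mispred_per_uop   = safe_div(e->br_mispred, e->uops_retired);
    r.br_mispred_impact = safe_div(e->br_miss_clear, e->clk_core);
    r.l1d_miss_rate     = safe_div(e->l1d_repl, e->inst_retired);
    /* Assumed cost of 8 core cycles per L1 data cache line replacement. */
    r.l1d_miss_impact   = safe_div(8.0 * e->l1d_repl, e->clk_core);
    r.tlb_miss_penalty  = safe_div(e->page_walk_cyc, e->clk_core);
    return r;
}

/* The event counts exactly as posted (three significant figures). */
EventCounts posted_counts(void)
{
    EventCounts e = {
        .clk_core      = 1.23e11,
        .inst_retired  = 1.45e11,
        .br_miss_clear = 3.27e9,
        .br_mispred    = 2.17e8,
        .l1d_repl      = 1.95e9,
        .x87_ops       = 1.22e10,
        .uops_retired  = 1.78e11,
        .page_walk_cyc = 5.19e9
    };
    return e;
}
```

Calling `compute_ratios` on `posted_counts()` gives a CPI of roughly 0.85, an x87 share of about 8.4% of retired instructions, about 2.7% of core cycles in branch-misprediction stalls, about 4.2% in page walks, and about 12.7% for the assumed L1D miss impact. These agree with the posted ratios to within the rounding of the inputs.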
First let me explain a few things. The data you posted shows that your application is executing floating point operations about 8% of the time, as you identified, but it does not show what your code was doing the rest of the time.
Thanks,
Shannon
The demonstration that you are spending so much time on mispredicted branches, and have no vectorized execution, is good confirmation of what you said about the code.
Results are from Intel Core 2 T7400 2.16 GHz, but target is a rack of dual CPU blades. And I don't think we need a rack of dual CPU blades. And the quality of the solution is usually compromised to make the app run in the time available.
The application computes a bunch of eigenrays - should be all R*8 math. Should also be trivially parallel in three dimensions (frequency, direction, eigenray). Right now it is single-threaded (CVF 6.0 compiler, minimal optimization). So I view the 8% FP utilization as the 'smoking gun'.
There are about 900 files in the source, which contains several different models. For the results shown, about 5 subroutines are important. (One finds the maximum value in elements 1:I of an array of length N using an algorithm of cyclomatic complexity 14!) So there are some routines I might tune, but I also need to consider global approaches. And globally, I want to understand what is happening the other 92% of the time.
> The demonstration that you are spending so much time on mispredicted branches, and have no vectorized execution, is good confirmation of what you said about the code.
Thanks. Both of your conclusions fit what I expected, but can you tell me which metrics quantify these problems as something that I need to examine?
> Thanks. Both of your conclusions fit what I expected, but can you tell me which metrics quantify these problems as something that I need to examine?
You have estimates showing that a large majority of your execution cycles appear to be spent on mispredicted branches. You may be able to show, by function or basic block, where most of this time is spent. To do that, you may have to build and analyze with interprocedural optimization disabled, and with debug symbols and normal optimization specified. The evident way to work on those stalls is to remove as many conditionals as possible from the loops where they are costing time. The best result would be to make the important inner loops vectorizable and to parallelize the outer loops.
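As an illustration of removing conditionals from an inner loop, here is a minimal sketch of the kind of routine mentioned earlier in the thread: finding the maximum of the first I elements of an array. It is written in C rather than Fortran purely for illustration, and the name `max_prefix` is hypothetical, not anything from the actual source. The loop body is a single select with no early exits or nested tests. Compilers can usually turn this into a max instruction or a conditional move, but whether a given compiler goes further and vectorizes it depends on the compiler and its floating-point flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest value among the first `count` elements of `a`.
 * The caller must pass count >= 1 and count no larger than the array length. */
double max_prefix(const double *a, size_t count)
{
    assert(a != NULL && count > 0);
    double best = a[0];
    for (size_t k = 1; k < count; ++k) {
        /* One select per element; no branches other than the loop test. */
        best = (a[k] > best) ? a[k] : best;
    }
    return best;
}
```

For example, with `double v[] = {3.0, -1.5, 7.25, 2.0, 9.0};`, `max_prefix(v, 3)` returns 7.25. In Fortran 90 and later, the `MAXVAL` intrinsic applied to the array section `A(1:I)` expresses the same operation in one line.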
> First let me explain a few things. The data you posted shows that your application is executing floating point operations about 8% of the time, as you identified, but it does not show what your code was doing the rest of the time.
> Thanks,
> Shannon
Thanks. It will take me some time to make the measurements you suggested, but knowing which numbers I can safely ignore is also a big help!
Art
>... Right now it is single-threaded (CVF 6.0 compiler, minimal optimization). ...
Hi, try the Intel Fortran Compiler. The old CVF lacks many of the optimizations for modern microarchitectures that the Intel compiler excels at. There is no point in tuning a car's engine when you are feeding it the wrong fuel.
-Max