I've mostly seen 'how' questions in this forum, so if I should be asking this somewhere else, tell me where to go.
I have a 1980s F77/VAX code for acoustics that, in principle, should spend almost all of its time executing floating-point work. Instead it runs 8% FP and does something else the other 92% of the time. It has a lot of branches and a lot of loops.
I ran the Basic and Advanced Performance ratios and got these numbers.
Which, if any, of these points to the issue I should look for?
Branch Misprediction Per Micro-Op Retired
Branch Misprediction Performance Impact
Clocks per Instructions Retired - CPI
Data Bus Utilization
Floating Point Instructions Ratio
L1 Data Cache Miss Performance Impact
L1 Data Cache Miss Rate
L2 Cache Demand Miss Rate
L2 Cache Miss Rate
Store Block by Snoop Ratio
Store Order Block
TLB miss penalty
The demonstration that you are spending so much time on mispredicted branches, and have no vectorized execution, is good confirmation of what you said about the code.
Results are from an Intel Core 2 T7400 at 2.16 GHz, but the target is a rack of dual-CPU blades - and I don't think we should need a rack of dual-CPU blades. As it is, the quality of the solution is usually compromised to make the app run in the time available.
The application computes a bunch of eigenrays - it should be all REAL*8 math. It should also be trivially parallel in three dimensions (frequency, direction, eigenray). Right now it is single-threaded (CVF 6.0 compiler, minimal optimization). So I view the 8% FP utilization as the 'smoking gun'.
There are about 900 files in the source, which contains several different models. For the results shown, about 5 subroutines are important. (One finds the maximum value in elements 1:I of an array of length N using an algorithm of cyclomatic complexity 14!) So some routines I might tune, but I need to consider global approaches as well. And globally, I want to understand what is happening the other 92% of the time.
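For concreteness, the kind of routine I mean should in principle collapse to something like the sketch below - one compare per element and no other control flow, so the compiler can vectorize it. The name and interface here are hypothetical, not the actual code:

```fortran
! Hypothetical replacement for a branch-heavy prefix-maximum
! routine: scan elements 1:I of an array of length N and return
! the largest value and its index.
      SUBROUTINE PREMAX(A, N, I, AMAX, IMAX)
      INTEGER N, I, IMAX, K
      REAL*8 A(N), AMAX
      AMAX = A(1)
      IMAX = 1
      DO K = 2, I
         IF (A(K) .GT. AMAX) THEN
            AMAX = A(K)
            IMAX = K
         END IF
      END DO
      END
```

With Fortran 90 (which CVF and the Intel compiler both accept) the body reduces further to the MAXLOC intrinsic over A(1:I).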
You have estimates showing that a large majority of your execution cycles appear to be spent on mispredicted branches. You may be able to show, by function or basic block, where most of this time is spent. To do that, you may have to build and analyze with interprocedural optimization disabled, but with debug symbols and normal optimization specified. The obvious way to work on those stalls is to remove as many conditionals as possible from the loops where they are costing time. The best result is to make the important inner loops vectorizable and parallelize the outer loops.
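As a sketch of what "removing conditionals" can look like (hypothetical names, not taken from the actual code): a data-dependent IF in an inner loop can often be rewritten as branch-free arithmetic, trading the unpredictable branch for a multiply by 0 or 1 that the compiler can vectorize:

```fortran
! Before (hypothetical): data-dependent branch in the inner loop.
!     DO K = 1, N
!        IF (X(K) .GT. 0.D0) THEN
!           Y(K) = Y(K) + C * X(K)
!        END IF
!     END DO
! After: SIGN(1.D0, X(K)) is +1 for X(K) >= 0 and -1 otherwise,
! so MASK is 1 or 0 and the update is branch-free. (At X(K) = 0
! the mask is 1, but the added term is zero anyway.)
      SUBROUTINE AXPOS(X, Y, C, N)
      INTEGER N, K
      REAL*8 X(N), Y(N), C, MASK
      DO K = 1, N
         MASK = 0.5D0 * (1.D0 + SIGN(1.D0, X(K)))
         Y(K) = Y(K) + MASK * C * X(K)
      END DO
      END
```

Whether this pays off depends on how often the branch actually mispredicts; profile before and after rather than applying it blindly.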
>... Right now it is single-threaded (CVF 6.0 compiler, minimal optimization). ...
Hi - try the Intel Fortran Compiler. Old CVF lacks many of the optimizations for modern microarchitectures that the Intel compiler excels at - there is no point in tuning a car's engine when you are feeding it the wrong fuel ...
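For illustration only, an ifort invocation targeting a Core 2 might look like the line below - the exact flag names vary by compiler version, and the file name here is made up, so check your version's documentation rather than copying this verbatim:

```shell
# Hypothetical compile line: full optimization, Core 2 (SSSE3)
# code generation, a vectorization report, and debug symbols
# so the profiler can attribute time to source lines.
ifort -O3 -xT -vec-report1 -g -c eigenray.f
```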