I'm trying to do time-based sampling on an Itanium 2 system (Bandera). VTune greys out the time-based sampling option, so the only choice is event-based sampling. Which event should I select to get the equivalent of time-based sampling? Is it CPU Cycles?
Also, the event ratio IA64-IPC is selected by default. Should I delete it for a clockticks analysis?
Hi Paulina, using the CPU_CYCLES event on an Itanium 2 will give you the same results as time-based sampling. If you are collecting just CPU_CYCLES and Instructions Retired, you can disable calibration so you don't have multiple runs. The overhead of collecting both CPU_CYCLES and Instructions Retired is negligible. Collecting both is useful because it lets you see how efficiently your code is executing on the CPU.
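As a rough illustration of what "how efficiently" means here, the two event totals can be combined into instructions per cycle (IPC) or its reciprocal, cycles per instruction (CPI). This is only a sketch in C: the function names are made up for this post, and it assumes you already have the total counts for both events over the same region of code. It is not how VTune itself computes its IA64-IPC ratio.

```c
#include <stdint.h>

/* Instructions per cycle from two raw event totals.
   Returns 0.0 when no cycles were recorded, to avoid dividing by zero. */
double ipc(uint64_t instructions_retired, uint64_t cpu_cycles)
{
    if (cpu_cycles == 0)
        return 0.0;
    return (double)instructions_retired / (double)cpu_cycles;
}

/* Cycles per instruction, the reciprocal view of the same two totals.
   Returns 0.0 when no instructions were recorded. */
double cpi(uint64_t instructions_retired, uint64_t cpu_cycles)
{
    if (instructions_retired == 0)
        return 0.0;
    return (double)cpu_cycles / (double)instructions_retired;
}
```

For example, ipc(300, 100) gives 3.0. Since the Itanium 2 can issue up to two bundles (six instructions) per cycle, a value far below that for a hot function suggests stalls worth investigating.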
So I did CPU_CYCLES event sampling accordingly. However, I found the Itanium 2 spending a long time (132 out of 194 total cycles for the procedure foo) retrieving the address of a global variable, var2, in my program. Is it normal to spend so much time in a register arithmetic operation like the one shown below? If so, why? Please see the attached file for details. The procedure has only 2 lines of code, which update 2 global variables.
I have also run into similar weird things, but that was because I compiled with maximum-speed optimization and without debug info, so VTune lost the symbol and line number information for my binary. Be sure to include debug info in your compiler settings if you want to drill down to the hotspots and source code correctly in VTune.
I compile the program with /O2 and /Zi, so I do have a debug info file (a .pdb file). I think the Microsoft 64-bit compiler supports only one type of debug info file, i.e. pdb, and the debug info cannot be embedded in the binary since /pdb:none is no longer supported.
Are you saying that to drill down in VTune we should not compile with /O2?
One thing to keep in mind is that the CPU_CYCLES event is not an "exact" event, which means that the assembly language instruction that shows up as the performance bottleneck is usually a few instructions beyond the one that is the real culprit. The interrupt mechanism that the VTune Analyzer uses to gather performance data has some built-in delays that result from completing the currently executing instructions and acknowledging the hardware interrupt. So when you look at performance data for assembly language instructions, you should consider the instructions immediately prior to the indicated instruction as candidates for the performance issue. Also note that on Intel's 64-bit processors most cache events and TLB events are "exact", in that there are hardware mechanisms for the VTune Analyzer to retrieve the exact instruction address.
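To make the "look a few instructions back" advice concrete, here is a rough sketch in C. The window size SKID_WINDOW is an illustrative guess, not a documented Itanium 2 figure, and both function names are hypothetical. The idea is only that the instruction a sample is attributed to marks the end of a range of candidates, and that the range is cut short when the sample lands near the top of a function.

```c
#include <stddef.h>

/* Hypothetical: how many instructions before the reported one to inspect.
   This is an illustrative guess, not a documented skid value. */
#define SKID_WINDOW 4

/* Given the index of the instruction a sample was attributed to within a
   function's instruction listing, return the index of the earliest
   instruction worth inspecting. The window is clamped at index 0, the
   function entry. */
size_t first_candidate(size_t reported_index)
{
    return reported_index > SKID_WINDOW ? reported_index - SKID_WINDOW : 0;
}

/* Returns nonzero when the window was cut off by the function entry, in
   which case the instruction really responsible may be in the caller. */
int window_reaches_caller(size_t reported_index)
{
    return reported_index < SKID_WINDOW;
}
```

With these assumptions, a sample attributed to instruction 10 means instructions 6 through 10 are candidates, while a sample attributed to instruction 0 leaves nothing earlier in the same function to inspect.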
Is there documentation on which events are "exact"? I thought the Itanium had fixed the instruction skid problem. Now it seems it is only partially fixed.
Also, the line taking a lot of CPU cycles is the first assembly line of the procedure, so there is no prior line to analyze. Basically it is trying to get the address of var2. The compiler output looks like this:
and the VTune disassembly is this:
foo: addl r30 = -2077280, r1
Could it ever be expensive to find the value for @gprel(var2#)? What's involved in getting @gprel(var2#)?
The "exact" events on the Itanium 2 processor are L1 data and instruction cache misses, and instruction and data TLB misses. This is described in more detail in section 10.2.2.2 of the "Intel Itanium 2 Processor Reference Manual For Software Development and Optimization", available at http://www.intel.com/design/itanium2/manuals/251110.htm.
In the case of your code, the prior line would probably be the call instruction that was executed prior to the first line of your procedure.
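On the @gprel question: as far as I know, @gprel(var2#) is just the link-time constant offset of var2 from the global pointer (gp, which is r1), so the addl in the listing above is a single add of an immediate to r1 with no memory access. That would be consistent with the cycles really belonging to something earlier. Below is a small C model of that arithmetic. The function names are made up for illustration, and only the -2077280 offset comes from the listing.

```c
#include <stdint.h>

/* addl takes a 22-bit signed immediate, so the gp-relative offset must
   fit in the range [-2^21, 2^21 - 1]. */
#define ADDL_IMM_MIN (-(INT64_C(1) << 21))
#define ADDL_IMM_MAX ((INT64_C(1) << 21) - 1)

/* Nonzero when the offset can be encoded directly in one addl. */
int fits_addl_imm22(int64_t offset)
{
    return offset >= ADDL_IMM_MIN && offset <= ADDL_IMM_MAX;
}

/* Models "addl r30 = offset, r1": the variable's address is simply the
   global pointer plus a constant (unsigned wraparound handles negative
   offsets). */
uint64_t gprel_address(uint64_t gp, int64_t offset)
{
    return gp + (uint64_t)offset;
}
```

With these definitions, fits_addl_imm22(-2077280) is true, so the offset in the listing fits the immediate field and the address of var2 needs nothing more than that one add.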