It depends on your test scenario:
1. If you launch the target application and then run lightweight hotspots profiling, most samples will fall in your target process, which includes your application modules, system libraries, and the kernel [vmlinux]. Usually a few samples fall in kernel-mode code that belongs to your process if your application's activity is high.
2. The other situation: some applications were already running on the system but their CPU utilization is not high, so many of the samples are captured in kernel code. This kernel code serves as the task scheduler, driver services, power management, etc. Note that if overall CPU utilization is low, the kernel services are not busy. That is why you saw only a few retired instructions in that area but still saw some hotspots.
Another point: if your CPU frequency is 2 GHz and SAV (sample after value) = 2,000,000, a sample is captured only after 2,000,000 instructions have executed in kernel code. You can adjust the SAV value (right-click lightweight-hotspots and select "Copy from here..." to create a new custom analysis type); changing SAV to a smaller number will show more samples for instructions retired.
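To make the effect of SAV concrete, here is a minimal Python sketch. It assumes the simple model that one sample is taken each time the event counter reaches the SAV. The function name and the instruction count are hypothetical illustrations, not VTune output.

```python
def expected_samples(event_count: int, sav: int) -> int:
    """Approximate number of samples for an event that fired event_count times,
    assuming one sample is captured every `sav` occurrences."""
    if sav <= 0:
        raise ValueError("sav must be positive")
    return event_count // sav


# Hypothetical count: 10,000,000 instructions retired in kernel code.
kernel_instructions = 10_000_000

print(expected_samples(kernel_instructions, 2_000_000))  # default SAV from the post
print(expected_samples(kernel_instructions, 200_000))    # smaller SAV, more samples
```

With a lightly loaded kernel, the default SAV yields only a handful of samples, which is why lowering it makes the kernel activity easier to see.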
Regards, Peter
1. For Nehalem processors: http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-on-intel-core-i7-processors/
2. For Sandy Bridge processors: http://software.intel.com/en-us/articles/using-intel-vtune-amplifier-xe-to-tune-software-on-the-2nd-generation-intel-core-processor-family/
Here are my thoughts on monitoring system performance with VTune Amplifier XE:
1. Understand the documents, and identify your thresholds for the events.
2. You may use the tool from the command line and periodically collect performance data (system-wide).
3. Find which module consumes a high number of CPU clocks, and ask whether that is reasonable.
4. Check whether the event data shows a critical issue. For example, are there DTLB misses or high latency, and in which module?
5. Generate a report (log file) when you hit a strange situation, for post-analysis.
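Steps 1 and 3 can be sketched as a small post-processing check in Python. This is only an illustration: the module names, clocktick counts, and the 50% threshold are hypothetical, and the sketch assumes you have already exported per-module clockticks from a report by some means.

```python
def modules_over_threshold(clockticks_by_module, threshold):
    """Return (module, share) pairs whose share of total clockticks
    is at least `threshold`, largest share first."""
    total = sum(clockticks_by_module.values())
    if total == 0:
        return []
    flagged = [
        (name, ticks / total)
        for name, ticks in clockticks_by_module.items()
        if ticks / total >= threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-module clockticks, not measured data.
sample = {"myapp": 600, "libc.so": 250, "vmlinux": 150}

for name, share in modules_over_threshold(sample, threshold=0.5):
    print(f"{name}: {share:.0%} of clockticks")
```

Whether a flagged module is "reasonable" still needs a human judgment; the check only narrows down where to look.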
Regards, Peter
To reduce finalization time, or the time to open a result in the GUI, you may:
1. Reduce the collection duration if possible. -OR-
2. Copy an existing analysis type to add or remove events, and change the SAV value of the events. -OR-
3. Zoom in and filter the data by selecting a small time range to find the critical function, then open the source file. Is that quick enough for your expectations?
Regards, Peter
You can refer to the optimization manual: Intel 64 and IA-32 Architectures Optimization Reference Manual.
You can also look up in the help file which performance counters are used to measure DTLB misses and page walks on SNB. After installing the product, look in VTune Amplifier XE 2011\documentation\en\help\snb.chm:
DTLB_LOAD_MISSES.MISS_CAUSE_A_WALK ; misses that cause a page walk
DTLB_LOAD_MISSES.STLB_HIT ; misses that hit in the second-level TLB (STLB)
DTLB_LOAD_MISSES.WALK_DURATION ; cycles spent in page walks
So you can find the formula in the Reference Manual, Appendix B.3.4.4.
Cost of page walks:
%STLB.LOAD.MISS.WALK.COST =
100 * DTLB_LOAD_MISSES.WALK_DURATION / CPU_CLK_UNHALTED.THREAD;
The other performance counters listed in the help file are explained in the manual as well.
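The page-walk cost formula quoted above can be evaluated with a few lines of Python. This is a minimal sketch: the function name is made up for illustration, and the counts are hypothetical numbers rather than measured data.

```python
def page_walk_cost_percent(walk_duration_cycles: int,
                           unhalted_thread_cycles: int) -> float:
    """100 * DTLB_LOAD_MISSES.WALK_DURATION / CPU_CLK_UNHALTED.THREAD"""
    if unhalted_thread_cycles <= 0:
        raise ValueError("unhalted_thread_cycles must be positive")
    return 100.0 * walk_duration_cycles / unhalted_thread_cycles


# Hypothetical event counts taken from a collection result.
walk_duration = 5_000_000        # DTLB_LOAD_MISSES.WALK_DURATION
unhalted_cycles = 200_000_000    # CPU_CLK_UNHALTED.THREAD

print(f"{page_walk_cost_percent(walk_duration, unhalted_cycles):.1f}% of cycles in page walks")
```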
Regards, Peter
If you choose your own preferred events to create an analysis type, the report in the GUI has NO pink tags, since there is no formula to compute them.
What indicators can you use, based on the result?
1. CPI, average cycles per instruction: 0.4 - 0.8 is comfortable (if the code has no SSE/AVX or x87 instructions).
2. Know the total latency of specific events. Calculate it as event_num * latency. For example, for 100,000 L2 memory load misses at a latency of 40 cycles, total latency = 40 * 100,000 = 4,000,000 cycles.
If CPU unhalted clocks = 200,000,000, the rate is 2%, and you need not care.
If CPU unhalted clocks = 20,000,000, the rate is 20%, and you may want to optimize it.
Usually the threshold is 0.2 (20%), based on my experience. Also, the penalty of each branch misprediction is 20 cycles, so you can evaluate how it impacts performance in the same way.
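The arithmetic above can be written as a short Python sketch. The function name is hypothetical; the event count, latency, clock counts, and the 0.2 threshold are the ones from the example in this post.

```python
THRESHOLD = 0.2  # rule-of-thumb threshold from the post


def latency_share(event_count: int, latency_cycles: int, unhalted_cycles: int) -> float:
    """Fraction of unhalted clocks spent on an event: event_num * latency / clocks."""
    if unhalted_cycles <= 0:
        raise ValueError("unhalted_cycles must be positive")
    return event_count * latency_cycles / unhalted_cycles


# 100,000 L2 load misses at 40 cycles each, against two clock totals.
for cycles in (200_000_000, 20_000_000):
    share = latency_share(100_000, 40, cycles)
    verdict = "worth optimizing" if share >= THRESHOLD else "can be ignored"
    print(f"{share:.0%} of unhalted clocks: {verdict}")
```

The same function works for branch mispredictions by passing the misprediction count and the 20-cycle penalty.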
Regards, Peter
