Number of Inst. retired is zero and it is in hot spot

prokash
Beginner
544 Views
Hello,
I'm trying to capture lightweight hotspots (the analysis that is allowed when doing a system-wide profile). This is using the free download version of the latest VTune.
I see there are hotspots (bottlenecks) in my kernel-mode code, but a few of them show zero instructions retired. Why is that, and what does it tell me?
Thanks
-pro
12 Replies
Peter_W_Intel
Employee
This is a good question.

It depends on your test scenario:
1. If you launched the target application and then did lightweight hotspots profiling, most of the samples will fall in your target process, which includes your app modules and the system library [vmlinux]. Usually only a few samples land in kernel-mode code that belongs to your process, if your app's activity is high.

2. The other situation: if you already had some applications running in the system but their CPU utilization is not high, many samples will be captured in kernel code. This kernel code serves as the task scheduler, driver services, power management, etc. Note that if overall CPU utilization is low, the kernel services are not busy. That is why you saw few retired instructions in that area but still saw some hotspots.

Another thing: if your CPU frequency is 2 GHz and SAV (sample after value) = 2,000,000, it means that the kernel code has to execute 2,000,000 instructions BEFORE one sample is captured. You can adjust the SAV value (right-click on lightweight-hotspots and select "Copy from here..." to create a new custom analysis type). Changing SAV to a smaller number will show more samples for instructions retired.
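To make the SAV arithmetic concrete, here is a minimal C sketch. It is not VTune code, and the helper names are made up for illustration. It assumes the usual event-based sampling model, in which one sample is recorded each time the event counter reaches the SAV. Under that assumption, a function that retires fewer than SAV instructions while being profiled can show zero INST_RETIRED samples and still collect samples for other events.

```c
#include <stdint.h>

/* Hypothetical helper: how many samples a counter would record for
 * `events` occurrences when one sample is taken every `sav` events. */
uint64_t expected_samples(uint64_t events, uint64_t sav)
{
    if (sav == 0)
        return 0;            /* guard against division by zero */
    return events / sav;     /* a partial interval yields no sample */
}

/* Hypothetical helper: event count estimated back from the sample count. */
uint64_t estimated_events(uint64_t samples, uint64_t sav)
{
    return samples * sav;
}
```

For example, with SAV = 2,000,000 a code path that retires 1,500,000 instructions yields `expected_samples(1500000, 2000000) == 0`, while the same path with SAV = 500,000 would yield 3 samples.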

Regards, Peter
prokash
Beginner
Thanks Peter,
For (1), that is the way I can really test. I thought that if my app executes, say, some fixed code path, then the event counters should capture that (and show up as perf counters for the app, when I'm interested in app perf). On the other hand, when I am looking at a system profile, it should really tell me that this component's code path took that many events... I may be lacking knowledge here...
(2) Not sure, but I will look...
(3) This makes sense, so I will take a couple more samples to see...
BTW, thanks much for explaining...
-pro
prokash
Beginner
Also, I'm looking for documentation about some (if not all) of the HARDWARE events. I tried Google and found some. The Architectures Optimization Manual does have some of the explanations, but perhaps not all!
Since my main task is to analyze system profiles, I would appreciate it if someone could point me to some resources that I can look up.
I know there is an outdated VTune book on Amazon. I looked at some of the sample topics. It all looks geared toward application perf analysis. Correct me if I'm wrong.
Also, please provide me some links to docs, etc.
thanks
-pro
Peter_W_Intel
Employee
Here are docs (guidelines) for using VTune Amplifier XE to tune software at the microarchitecture level:

1. For Nehalem processors: http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-on-intel-core-i7-processors/

2. For Sandy Bridge processors: http://software.intel.com/en-us/articles/using-intel-vtune-amplifier-xe-to-tune-software-on-the-2nd-generation-intel-core-processor-family/

Here are my thoughts on monitoring system performance using VTune Amplifier XE:
1. Understand the docs, and identify your thresholds for events.
2. You may use the tool from the command line and periodically collect performance data (system-wide).
3. Find which module consumes a lot of CPU clocks, and ask whether that is reasonable.
4. Is there some critical issue in the event data? For example, DTLB misses or high latency, and in which module?
5. Generate a report (log file) for post-analysis if you hit a strange situation.

Regards, Peter
prokash
Beginner
Thanks once again, Peter...
Is there a way to exclude some of the system modules when collecting a system profile? I was trying to run some PCMark system storage tests, and in 10 minutes it captured a lot of data. I did not change the sampling frequency as you mentioned earlier; the defaults are mostly 2 million, so I thought it would be fine...
The problem is that 10 to 20 minutes of data puts the whole of VTune into a not-responding state. So when collection is done, analyzing is just impossible. It takes 15 to 20 minutes just to load the collected data...
-pro
prokash
Beginner
Now to the details: fixing a baseline in terms of throughput or latency is not always possible, but yes, we can easily set the max. For example, we don't expect to have 10GB data throughput on a 10GB Enet...
So my objective is to take a cross-sectional view to weed out some of the most demanding functions and/or instructions or blocks of instructions. To capture that cross-sectional data, as an example, I will run the hotspot, lock & contention, memory, etc. types of analysis, and take the intersections of the cross-sectional data. But for all of these, I need a typical representative sample app to run. The representative sample in our case is a fairly full set of PCMark (we cannot use SYSMark, since it has several reboots, but PCMark is already more than good enough for my case here).
Perhaps I will have to bring down the sampling frequency by bumping up those counters when I make customized configurations by copying from the existing configurations.
-pro
prokash
Beginner
Forgot to mention that I see some of the functions in my module having quite a few DTLB misses ("quite a few" is a relative term, but they are at the top of the chart among all the functions of this module). Yes, page walking is expensive... That is why I was trying to get to the instruction(s) that cause this, but while I was trying to load the source, VTune became excessively slow and I could not get to that point... It may be something else, so I will have to try a couple more times. But then, even when I try to load the collected data into VTune, it seems to take forever... that is, more than 10 minutes.
-pro
Peter_W_Intel
Employee

To reduce finalization time or the time to open a result in the GUI, you may:
1. Reduce the duration if possible. -OR-
2. Copy an existing analysis type to add/remove events, and change the SAV value of events. -OR-
3. Zoom in/filter the data by selecting a small time range to find the critical function, then open the source file; it may be quicker than you expect.

Regards, Peter

prokash
Beginner
Thanks once again, Peter!
I could not use option (1) above, because even the elementary set of trial runs is sequenced, meaning that it exercises, say, four types of test runs one after another. For example, under user productivity: text editing, web browsing, emailing, etc., and they run in sequence...
So I used the other alternative and knocked off a couple of HW events that I can capture later. Sure enough, the result is perfect in terms of the time to pull the stats into the VTune analyzer and to resolve src/sym...
Now as you are well aware, in the system space we are getting some automatic analysis, so there are no pink tabs etc. Looking at the Sandy Bridge doc, I see the old-style way of computing some stats, like Branch Mispredictions etc.
I was wondering if there is a place (or doc) I can refer to for some of these calculations. Most of the HW events and their meanings are defined in the help file. But I'm looking for some of the pertinent formulas as well as some thresholds to compare against.
(a) I am currently referring to the Software Optimization Cookbook from Intel
(b) I do have the software optimization PDF from your website.
As an example: DTLB and page walks. Sure, there will be some page walks, but how could I get a hard number to compare with some reference number (if available), so that it might spark the thought: "Hmm, we need to look here first, then there, then some other places"!
-pro
Peter_W_Intel
Employee

You can refer to the optimization manual: Intel 64 and IA-32 Architectures Optimization Reference Manual

You can also see in the help file which performance counters are used for measuring DTLB misses and page walks on SNB. After installing the product, look into VTune Amplifier XE 2011\documentation\en\help\snb.chm:

DTLB_LOAD_MISSES.MISS_CAUSE_A_WALK ; miss which causes a walk
DTLB_LOAD_MISSES.STLB_HIT ; miss that hits in the 2nd-level TLB (STLB)
DTLB_LOAD_MISSES.WALK_DURATION ; cycles during a walk

Then you can find the formula in the Ref Manual, Appendix B.3.4.4
Cost of page walks:
%STLB.LOAD.MISS.WALK.COST =
100 * DTLB_LOAD_MISSES.WALK_DURATION / CPU_CLK_UNHALTED.THREAD;
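As a minimal sketch, the formula above can be written as a small C function. The function and parameter names are illustrative only; the two inputs are the event counts reported for DTLB_LOAD_MISSES.WALK_DURATION and CPU_CLK_UNHALTED.THREAD over the same function or module.

```c
/* Percentage of unhalted core cycles spent in page walks caused by
 * load DTLB misses. Function and parameter names are hypothetical. */
double stlb_load_miss_walk_cost_pct(unsigned long long walk_duration,
                                    unsigned long long clk_unhalted_thread)
{
    if (clk_unhalted_thread == 0)
        return 0.0;  /* no cycles sampled, nothing to report */
    return 100.0 * (double)walk_duration / (double)clk_unhalted_thread;
}
```

With made-up counts of 5,000,000 walk cycles out of 100,000,000 unhalted cycles, the function returns 5.0, meaning 5% of the cycles went to page walks.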

The other performance counters from the help file are interpreted in the manual as well.

Regards, Peter

prokash
Beginner
Thanks,
I've those manuals ...
Yet few more questions --
1) Is there any base number or percentage for some stats that can be used as a threshold? Then once I compute it, I will know whether it would get the pink tag or something, so that I can concentrate there!
2) These are all experimental at this stage -
What I am seeing is that, out of 2,000,000 INST_RETIRED.ANY in one C language macro, there are 1,200,000 BR_MISP_RETIRED.ALL_BRANCHES_PS and 600,000 ICACHE.MISSES. This is on Sandy Bridge (x64).
Now the macro adds to a shared variable, so it is defined like the following -
#define ADD32(S, V) if (V != 0) InterlockedExchangeAdd (S, V)
Clearly there is a branch, and at first look it is evident that this is an effort to optimize by not going out to lock the bus if the value to be added is zero...
So one try would be to flat-out add the value, but I was wondering if the lock is going to beat us...
Any thoughts?
Of course, we don't want to use assembler level constructs ...
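The two alternatives being weighed can be sketched in portable C without assembler. This is only an illustration: it uses C11 `<stdatomic.h>` as a stand-in for the Windows `InterlockedExchangeAdd` call, and the function names are made up. Writing the operation as an inline function also avoids a hazard of the macro form, where the bare `if` would pair with any `else` that follows the macro at the call site.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Variant 1: same idea as ADD32. Skip the locked operation when v is 0,
 * at the cost of a conditional branch. */
static inline void add32_skip_zero(_Atomic int32_t *s, int32_t v)
{
    if (v != 0)
        atomic_fetch_add(s, v);
}

/* Variant 2: no branch. Always perform the locked add, even for v == 0. */
static inline void add32_always(_Atomic int32_t *s, int32_t v)
{
    atomic_fetch_add(s, v);
}
```

Both variants leave the shared variable with the same value. Which one is cheaper depends on how often the value is zero, how predictable that is, and how contended the variable is, so it is something to measure on the target rather than assume.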
-pro
Peter_W_Intel
Employee
If you use the tool's predefined analysis types, no threshold is needed; the pink tag is in the report directly :-)
If you choose your preferred events to create your own analysis type, the report in the GUI has NO pink tag, since there is no formula to compute them.

What indicators can you use, based on the result?
1. CPI, average cycles per instruction: 0.4 - 0.8 is comfortable (if the code has no SSE/AVX or x87 instructions)
2. Know the total latency of specific events. Calculate it as event_num * latency. For example, for 100,000 L2 memory load misses with a latency of 40 cycles: 40 * 100,000 = 4,000,000 cycles.
If CPU unhalted clocks = 200,000,000, rate = 2%: you don't care
If CPU unhalted clocks = 20,000,000, rate = 20%: you may want to optimize it

Usually the threshold is 0.2, based on my experience. Also, the penalty of each branch misprediction is 20 cycles, so you can evaluate how it impacts performance.
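The arithmetic above can be captured in two small helpers. This is a rough sketch with made-up function names; the per-event latencies (40 cycles per L2 load miss, 20 cycles per branch misprediction) are the figures quoted in this reply, not values measured on any particular machine.

```c
/* Average cycles per instruction. Names are hypothetical. */
double cpi(unsigned long long clocks, unsigned long long instructions)
{
    if (instructions == 0)
        return 0.0;
    return (double)clocks / (double)instructions;
}

/* Fraction of unhalted clocks attributed to one event type:
 * event_count * cycles_per_event / clocks. */
double event_cost_ratio(unsigned long long event_count,
                        unsigned int cycles_per_event,
                        unsigned long long clocks)
{
    if (clocks == 0)
        return 0.0;
    return (double)event_count * (double)cycles_per_event / (double)clocks;
}
```

Using the numbers from the example, `event_cost_ratio(100000, 40, 200000000)` gives about 0.02 (2%) and `event_cost_ratio(100000, 40, 20000000)` gives about 0.2 (20%), which is where the 0.2 threshold starts to apply. The same helper works for branch mispredictions by passing 20 as `cycles_per_event`.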

Regards, Peter