Number of Inst. retired is zero and it is in hot spot

prokash · ‎03-02-2012

Hello,

I'm trying to collect a lightweight hotspots profile (the analysis that is allowed when doing a system profile). This is using the free download version of the latest VTune.

I see there are hotspots (bottlenecks) in my kernel-mode code, but a few of them show zero retired instructions. Why is that, and what does it tell me?

Thanks

-pro

Peter_W_Intel · ‎03-02-2012

This is a good question.

It depends on your test scenario:
1. If you ran (launched) the target application and then did lightweight hotspots profiling, most of the samples will fall in your target process, which includes your app modules and system libraries [vmlinux]. Usually only a few samples land in kernel-mode code that belongs to your process, if your app's activity is high.

2. The other situation: if you already had some applications running in the system but their CPU utilization is not high, many samples will be captured in kernel code. This kernel code serves as task scheduler, driver services, power management, etc. Note that if overall CPU utilization is low, the kernel code is not busy; that is why you saw few retired instructions in that area but still saw some hotspots.

Another thing: if your CPU frequency is 2 GHz and the SAV (sample after value) is 2,000,000, the kernel code has to execute 2,000,000 instructions before one sample is captured. You can adjust the SAV (right-click on lightweight-hotspots and select "Copy from here??" to create a new custom analysis type); changing the SAV to a smaller number will give you more samples for instructions retired.

Regards, Peter
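
The SAV arithmetic above can be sketched in C. This is only an illustration under a simplifying assumption (one sample each time the event counter reaches the SAV); the function name expected_samples is hypothetical and is not part of VTune.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rough number of samples collected for one event,
 * given the raw event count and the sample-after value (SAV).
 * Assumes one sample is taken each time the counter reaches the SAV,
 * so the integer quotient is the estimate. */
uint64_t expected_samples(uint64_t event_count, uint64_t sav)
{
    assert(sav > 0); /* a SAV of zero would be meaningless */
    return event_count / sav;
}
```

For a code path that retires 1,500,000 instructions, expected_samples(1500000, 2000000) is 0, while expected_samples(1500000, 100000) is 15. That is one way a short kernel path can show zero retired instructions and still pick up samples for other events.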

prokash · ‎03-07-2012

Thanks Peter,

For (1), that is the way I can really test. I thought that if my app executes, say, some fixed code path, then the event counter should capture that (and show it as perf counters for the app, when I'm interested in app perf). On the other hand, when I am looking at a system profile, it should really tell me that this path of this component took that many events... I may be lacking knowledge here...

(2) Not sure, but will look...

(3) This makes sense, so I will take a couple more samples to see...

BTW, thanks much for explaining...

-pro

prokash · ‎03-08-2012

Also, I'm looking for documentation about some (if not all) of the HARDWARE events. I tried to google and found some. The Architectures Optimization Manual does have some of the explanations, but perhaps not all!!!

Since my main task is to analyze system profiles, I would appreciate it if someone could point me to some resources that I can look up.

I know there is an outdated VTune book on Amazon. I looked up some of the sampled topics. It all looks geared toward application perf analysis. Correct me if I'm wrong.

Also, please provide some links to docs, etc.

thanks

-pro

Peter_W_Intel · ‎03-08-2012

Here are docs (guidelines) for using VTune Amplifier XE to tune software at the microarchitecture level:

1. For Nehalem processors: http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-on-intel-core-i7-processors/

2. For Sandy Bridge processors: http://software.intel.com/en-us/articles/using-intel-vtune-amplifier-xe-to-tune-software-on-the-2nd-generation-intel-core-processor-family/

Here are my thoughts for monitoring system performance by using VTune Amplifier XE:
1. Understand the docs, and identify your thresholds for events.
2. You may use the tool from the command line and periodically collect performance data (system-wide).
3. Find which module consumes high CPU clocks, and ask whether that is reasonable.
4. Is there some critical issue in the event data? For example, DTLB misses or high latency, and in which module?
5. Generate a report (log file) for post-analysis if you hit a strange situation.

Regards, Peter

prokash · ‎03-13-2012

Thanks once again, Peter...

Is there a way to eliminate some of the system modules when collecting a system profile? I was trying to run some PCMark system storage tests, and in 10 minutes it was capturing a lot of data. I did not change the sampling frequency as you mentioned earlier; the defaults are mostly 2 million, so I thought it would be fine...

The problem is that 10 to 20 minutes of data puts the whole of VTune into a not-responding state. So when done collecting, analyzing is just impossible. It takes 15 to 20 minutes just to load the collected data...

-pro

prokash · ‎03-13-2012

Now to the detail: fixing a baseline in terms of throughput or latency is not always possible, but yes, we can easily set the max. For example, we don't expect to have 10GB data throughput on a 10GB Enet...

So my objective is to take a cross-sectional view to weed out some of the most demanding functions and/or instructions or blocks of instructions. To capture that cross-sectional data, as an example, I will run hotspot, lock & contention, memory, etc. types of analysis and take the intersections of the cross-sectional data. But for all of these, I need a typical representative sample app to run. The representative sample in our case is a fairly full set of PCMark (we cannot use SYSMark, since it involves several reboots, but PCMark is already more than good enough for my case here).

Perhaps I will have to bring down the sampling frequency by bumping up those counters when I make customized configurations by copying from the existing configurations.

-pro

prokash · ‎03-13-2012

Forgot to mention that I see some of the functions in my module having quite a few DTLB misses ("quite a few" is a relative term, but they are at the top of the chart among all the functions of this module). Yes, page walking is expensive... So that is why I was trying to get to the instruction(s) that cause this, but while I was trying to load the source, VTune became excessively slow and I could not get to that point... Maybe it is something else, so I will have to try a couple more times. But then, even when I try to load the collected data into VTune, it seems to take forever... That is, more than 10 minutes.

-pro

Peter_W_Intel · ‎03-13-2012

To reduce the finalization time or the time to open the result in the GUI, you may:
1. Reduce the duration if possible. -OR-
2. Copy an existing analysis type to add/remove events, and change the SAV value of events. -OR-
3. Zoom in/filter the data by selecting a small time range to find the critical function, then open the source file; that should be as quick as you expect.

Regards, Peter

prokash · ‎03-15-2012

Thanks once again, Peter!

I could not use option (1) above, because even the elementary set of trial runs is sequenced, meaning that it would exercise, say, four types of test runs, one after another. For example, under user productivity: text editing, web browsing, emailing, etc., and they run in sequence...

So I used the other alternative and knocked off a couple of HW events that I can capture later, and sure enough the result is perfect in terms of the time to pull the stats into the VTune analyzer and to resolve src/sym...

Now, as you are well aware, in the system space we are getting some automatic analysis, so there are no pink tabs, etc. Looking at the Sandy Bridge doc, I see the old-style way of computing some stats, like branch mispredictions.

I was wondering if there is a place (or doc) I can refer to for some of these calculations. Most of the HW events and their meanings are defined in the help file, but I'm looking for some of the pertinent formulas as well as some thresholds to compare against.

(a) I am currently referring to Software Optimization CookBook from Intel

(b) I do have the software optimization PDF from your website.

As an example: DTLB and page walks. Sure, there will be some page walks, but how could I get a hard number to compare with some reference number (if available), so that it might spark the thought: "Hmm, we need to look here first, then there, then some other places"!!!

-pro

Peter_W_Intel · ‎03-15-2012

You can refer to the optimization manual: Intel 64 and IA-32 Architectures Optimization Reference Manual.

You can also see in the help file which performance counters are used for measuring DTLB misses and page walks on SNB. After installing the product, look in VTune Amplifier XE 2011\documentation\en\help\snb.chm:

DTLB_LOAD_MISSES.MISS_CAUSE_A_WALK ; misses that cause a page walk
DTLB_LOAD_MISSES.STLB_HIT ; misses that hit in the second-level TLB (STLB)
DTLB_LOAD_MISSES.WALK_DURATION ; cycles spent in page walks

You can find the formula in the Reference Manual, Appendix B.3.4.4.
Cost of page walks:
%STLB.LOAD.MISS.WALK.COST =
100 * DTLB_LOAD_MISSES.WALK_DURATION / CPU_CLK_UNHALTED.THREAD;

The other performance counters from the help file are explained in the manual as well.

Regards, Peter
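
The page-walk cost formula quoted above is simple enough to sketch in C. The function name page_walk_cost_percent is hypothetical; the two arguments are the raw values of DTLB_LOAD_MISSES.WALK_DURATION and CPU_CLK_UNHALTED.THREAD for the function or module you are looking at.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper implementing the formula from the post:
 *   %STLB.LOAD.MISS.WALK.COST =
 *       100 * DTLB_LOAD_MISSES.WALK_DURATION / CPU_CLK_UNHALTED.THREAD */
double page_walk_cost_percent(uint64_t walk_duration_cycles,
                              uint64_t unhalted_thread_cycles)
{
    assert(unhalted_thread_cycles > 0); /* avoid dividing by zero */
    return 100.0 * (double)walk_duration_cycles
                 / (double)unhalted_thread_cycles;
}
```

With made-up counts (not measurements from this thread) of 5,000,000 walk cycles out of 200,000,000 unhalted cycles, page_walk_cost_percent(5000000, 200000000) gives 2.5, that is, 2.5% of the cycles went to page walks.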

prokash · ‎03-16-2012

Thanks,

I've those manuals ...

Yet few more questions --

1) Is there any base number or percentage for some stats that can be used as a threshold? Then, once I compute a value, I will know whether it would get the pink tag, so that I can concentrate on it!!!

2) These are all experimental at this stage -

What I am seeing is that, out of 2,000,000 INST_RETIRED.ANY in one of the C-language macros, there are 1,200,000 BR_MISP_RETIRED.ALL_BRANCHES_PS and 600,000 ICACHE.MISSES. This is on Sandy Bridge (x64).

Now the macro adds to a shared variable, so it is defined like the following -

#define ADD32(S, V) if (V != 0) InterlockedExchangeAdd (S, V)

Clearly I see that there is a branch, and at first look it is evident that there is an effort to optimize by not going out to lock the bus if the value to be added is zero...

So one try would be to flat-out add the value, but I was wondering whether the lock is going to beat us...

Any thought?

Of course, we don't want to use assembler level constructs ...

-pro
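
The two alternatives discussed above can be written side by side. This is a sketch only: it uses C11 atomic_fetch_add as a portable stand-in for InterlockedExchangeAdd (an assumption, since the original is Windows kernel code), and the function names add32_checked and add32_always are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Variant A: mirrors the ADD32 macro. Skips the locked instruction when
 * v == 0, at the price of a conditional branch that can mispredict. */
void add32_checked(atomic_int_least32_t *s, int32_t v)
{
    if (v != 0)
        atomic_fetch_add(s, v);
}

/* Variant B: always performs the atomic add, so there is no branch,
 * but the locked read-modify-write is paid even when v == 0. */
void add32_always(atomic_int_least32_t *s, int32_t v)
{
    atomic_fetch_add(s, v);
}
```

Both variants leave the shared variable with the same value; they differ only in cost. Which one is cheaper depends on how often V is zero and on how predictable the branch is, so it has to be measured on the real workload rather than guessed.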

Peter_W_Intel · ‎03-16-2012

If you use the tool's predefined analysis types, no threshold is needed; the pink tag is in the report directly :-)
If you choose your preferred events to create your own analysis type, the report in the GUI has NO pink tag, since there is no formula to compute them.

What indicators can you use, based on the result?
1. CPI, average cycles per instruction: 0.4 - 0.8 is comfortable (if the code has no SSE/AVX or x87 instructions).
2. Know the total latency of specific events. Calculate it as event_num * latency. For example, for 100,000 L2 memory load misses with a latency of 40 cycles, total latency = 40 * 100,000 = 4,000,000 cycles.
   If CPU unhalted clocks = 200,000,000, rate = 2% - you don't care
   If CPU unhalted clocks = 20,000,000, rate = 20% - you may optimize it

Usually the threshold is 0.2, based on my experience. Also, the penalty of each branch misprediction is 20 cycles, so you can evaluate how it impacts performance.

Regards, Peter
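
The latency arithmetic above can be sketched in C as follows. The function names event_cost_ratio and worth_optimizing are hypothetical, and the 0.2 cut-off is only the rule of thumb quoted in the post, not a documented VTune threshold.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: share of unhalted cycles attributable to one event,
 * computed as event_count * cycles_per_event / unhalted_cycles. */
double event_cost_ratio(uint64_t event_count, uint64_t cycles_per_event,
                        uint64_t unhalted_cycles)
{
    assert(unhalted_cycles > 0); /* avoid dividing by zero */
    return (double)(event_count * cycles_per_event)
         / (double)unhalted_cycles;
}

/* Rule of thumb from the post: a ratio of about 0.2 or more deserves a look. */
int worth_optimizing(double ratio)
{
    return ratio >= 0.2;
}
```

With the numbers from the post, event_cost_ratio(100000, 40, 200000000) is 0.02 (2%) and event_cost_ratio(100000, 40, 20000000) is 0.2 (20%). The same helper can be used for branch mispredictions by passing 20 as cycles_per_event.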