How to collect useful assembly instruction profiles?
I've been trying to use VTune for Linux for a while now, and it doesn't seem to be of much help to me.
I write highly-tuned Itanium 2 assembly code, and I am interested in finding the slightest imperfections in the code -- a branch which gets mispredicted often, a load which stalls because of a bank conflict, an EXE stall due to suboptimal prefetching, OzQ recirculates, DTLB misses, etc.
So far when running VTune, I get almost no useful data. I run my "microbenchmark" which exercises the assembly code repeatedly, for about 30 minutes on a dedicated machine.
I've tried giving VTune all kinds of different events to collect, such as IA64_INST_RETIRED, BE_EXE_BUBBLE-ALL, and DEAR_LATENCY_GE_16.
But VTune provides no useful statistics at the assembly instruction level which will help me find my code's inefficiencies and correct them.
In the aggregate statistics, it says that one of my assembly routines makes up 60% of IA64_INST_RETIRED events. That is what I would expect. Independent analysis confirms that around 60% of realtime is spent in that routine.
But if I try to zoom into that assembly code routine for more details, I find that only about 4 instructions have any data beside them. The one marked red is usually the first instruction in the routine, or the first load/store in the routine. But even so, it only accounts for .05% of the stalls (at least, that's the percentage displayed), and I know it's not a true hotspot.
Most of the 4 or so instructions with data beside them display percentages which I did not choose myself. I would guess that they are percentage contributions to the total events obtained during the collection. From what I've heard, I cannot specify my own ratios in Linux.
But in any case, why does it spend 60% of IA64_INST_RETIRED events in my assembly routine, and yet there are less than .05% stalls, and these are only listed for about 4 instructions in the whole routine, instructions which have nothing to do with the main loops in the routine? I know there are other stalls, but VTune never displays them.
I've tried collecting more samples -- 30 minutes worth at 1 ms, .1 ms, .001 ms intervals, with my routine being called thousands of times in loops. I've tried it both with and without calibration. I've tried :sa=1, :sa=10, :sa=100, etc. The aggregate counts certainly increase as the samples increase, but it does not improve the resolution at the assembly instruction level. I still see only 4 or so instructions with any data in them, and their event counts are on the order of .05% of the total events collected, even though my routine accounts for 60% of the IA64_INST_RETIRED events. Certainly more than that percentage of real time is being spent in my routine.
Could the fact that this is handwritten assembly and not compiler-generated code, and doesn't have unwind or debug structures, have anything to do with the lack of detailed instruction profiling?
If this is related to event skidding, then how can it be reduced when profiling highly-tuned assembly code? Is there an optimum value of sa? Is calibration good for finding this out? (I've always been told to turn calibration off.) The skidding, if that's what is happening, seems to be happening rather predictably, because it always ends up counting 4 instructions in each assembly routine, and these 4 instructions are not even inside instruction groups that you would expect to have any stalls in them. The locations recorded for the events seem random for each routine, but are reproducible across multiple collection runs.
How can I profile events and have them recorded at the assembly instruction level, for detecting stalls and memory events, in code which is already running at 90% or better of machine peak, and hence for which such events are rare but important to identify, and for which there is no single "hotspot" which stands out from the rest? How can I show event counts for ALL assembly instructions which ever got sampled, not just the "hotspots"? (If the answer to these questions is to use a CLI viewer instead of a GUI one, I'm all for it, but I need docs.)
EAR events are the only ones which show up "normally", scattered across many assembly instructions, but even they are not much help for tuning, because their counts are so uncertain -- call it Gallagher's Uncertainty Principle :-) The more precise you are in tracking down addresses with EAR, the more uncertain you are of the counts. The more precise you are in counting events (non-EAR), the more uncertain you are of the addresses. In the routine I profiled which takes 60% of the runtime, there were 24 instructions with DEAR_LATENCY_GE_16 events listed next to them, but only 4 non-EAR event instructions were listed when I sampled those.
The routine itself is composed of over 5000 instructions, and despite showing 24 DEAR_LATENCY_GE_16 event instructions, none of them stood out as being hotspots over the others, and, all combined, they still represented less than 1% of the total DEAR_LATENCY_GE_16 events, in a routine which takes 60% of the runtime. I tried many different values of sa (1,10,100,1000) but it did not improve the resolution enough to clearly identify the bottlenecks in the code -- it could not, for example, tell me that a particular load was stalling more often than others. Basically, I'm at a loss as to where to go next in tuning my code.
Can there be code which is "too efficient" for VTune -- code which has so few stalls that the infrequency of these stalls, when combined with the statistical uncertainties inherent in event skidding and EAR sampling, makes VTune unable to provide any meaningful data on the hotspots at the assembly instruction (or even bundle pair) level? In other words, do the statistical uncertainties of event skidding and EAR sampling make pinpointing assembly instruction hotspots impossible in code which experiences less than 1% stalls, because the error of uncertainty, of either the location or of the count, is greater than the frequency of the events themselves? I hope this isn't true.
The main documentation is for Windows, not Linux, so it's not much help -- I don't want to wade through pages of walk-throughs of the Windows GUI. Is there a reference guide out there for Linux users, one which doesn't spend excessive time on sampling tutorials or GUI howto instructions? I've already seen and read the Linux VTune FAQ.
VTune merely displays the data where it occurs, so perhaps we should review how this works first.
1) For non-exact events, which are most of them on Itanium 2 processors:
You program a counter to count a particular event and initialize the counter to the overflow value minus the SAV (sample-after value), so that it overflows after SAV events. Then the events start getting counted. On overflow an interrupt is raised, and the OS services the interrupt when it feels like it and records the IIP (interrupt instruction pointer), which is NOT the instruction pointer at the time the event occurred. Thus there is "skid".
Skid has a few sources, among them:
1) The instruction that causes the event may have retired dozens of cycles before the event is counted. For example, an L3 cache miss occurs about a dozen cycles after the load, meaning the IP in the pipeline when the counter overflows will not correspond to the load instruction.
2) OS response time in servicing the interrupt.
The net result is that the events are only accurate to around a basic block.
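To make the mechanism above concrete, here is a toy simulation of overflow-based sampling with skid. It is only a sketch under stated assumptions: the loop length, the event location, the SAV, and the uniform skid distribution are all made up for illustration and do not model the real Itanium 2 PMU or VTune.

```python
import random
from collections import Counter

def simulate_sampling(num_iterations, loop_length, event_slot, sav, max_skid, seed=0):
    """Toy model of sampling one event inside a loop (all parameters hypothetical).

    Each loop iteration raises the event once, at instruction index
    `event_slot`. After `sav` events the counter overflows and a sample is
    recorded, but the recorded slot lands 0..max_skid instructions later,
    wrapping around the loop. That offset stands in for skid.
    """
    rng = random.Random(seed)
    samples = Counter()
    counter = 0
    for _ in range(num_iterations):
        counter += 1              # one event per iteration
        if counter == sav:        # counter overflow: take a sample
            counter = 0
            skid = rng.randint(0, max_skid)
            samples[(event_slot + skid) % loop_length] += 1
    return samples

if __name__ == "__main__":
    hist = simulate_sampling(num_iterations=100_000, loop_length=16,
                             event_slot=3, sav=100, max_skid=8)
    for slot in sorted(hist):
        print(f"slot {slot:2d}: {hist[slot]} samples")
```

Every event in this model happens at slot 3, yet the samples are spread over slots 3 through 11. Raising or lowering `sav` changes how many samples you get, not where they land, which is consistent with the aggregate counts growing while the per-instruction picture stays blurry.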
The exceptions are the EAR events, where the hardware records the address of the instruction responsible (i.e. data EAR events are caused by loads).
The Branch_Event is also exact. However, for historical reasons VTune displays the event at the branch target address, and by default it collects data in the mode where the PMU captures the branch (source) address and the "branch details", so there is no target address. To correct this, edit the event and set the "ds" bit to zero (uncheck the box).
But the bottom line is that skid will move the majority of events all over, and all you can do is characterize the performance of a basic block.
For loops this is not bad, because the events that occur in the loop are at least displayed somewhere in the loop.
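If per-instruction counts are only trustworthy at basic-block granularity, one way to read them is to sum the samples per block. The sketch below assumes you have already exported two things by hand: a mapping of instruction address to sample count, and the start addresses of your basic blocks or loops. Both are hypothetical inputs, not something VTune is known to emit in this form.

```python
from bisect import bisect_right
from collections import Counter

def aggregate_by_block(samples, block_starts):
    """Sum per-address sample counts into the block containing each address.

    samples:      dict of instruction address -> sample count (hypothetical export)
    block_starts: iterable of basic-block start addresses (hypothetical, hand-made)
    Returns a Counter keyed by block start address.
    """
    starts = sorted(block_starts)
    totals = Counter()
    for addr, count in samples.items():
        idx = bisect_right(starts, addr) - 1
        if idx < 0:
            continue  # address precedes the first known block; ignore it
        totals[starts[idx]] += count
    return totals

if __name__ == "__main__":
    # Made-up addresses and counts, for illustration only.
    samples = {0x100: 2, 0x110: 3, 0x200: 1, 0x50: 4}
    for start, total in sorted(aggregate_by_block(samples, [0x100, 0x200]).items()):
        print(f"block at {start:#x}: {total} samples")
```

For handwritten assembly with no debug information, the block start list would have to come from your own knowledge of the loop labels.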
There is a tendency for be_exe_bubble to line up with the instruction causing the stall when the underlying reason is a load that caused an L3 cache miss. This can be identified by using the dear_latency_gt_64 event in conjunction with be_exe_bubble. Loads with latency > 64 are pretty much all L3 misses, so there are sufficient stalls at the consuming instruction to get the stall events to pile up there.
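One possible way to combine the two collections is sketched below. It assumes you have per-address sample counts for both events available as plain dictionaries (a hypothetical hand-made export), and it assumes the consuming instruction sits within a small address window after the load. The window size and all addresses are invented for illustration.

```python
def pair_loads_with_stalls(dear_samples, bubble_samples, window_bytes=64):
    """Pair long-latency loads with nearby downstream stall samples.

    dear_samples:   dict of load address -> long-latency data EAR sample count
    bubble_samples: dict of address -> be_exe_bubble sample count
    window_bytes:   assumed maximum distance from a load to its consumer
    Returns (load_addr, dear_count, stall_count) tuples, largest stall_count first.
    """
    results = []
    for load_addr, dear_count in dear_samples.items():
        stall_count = sum(
            count for addr, count in bubble_samples.items()
            if load_addr <= addr <= load_addr + window_bytes
        )
        if stall_count:
            results.append((load_addr, dear_count, stall_count))
    results.sort(key=lambda row: row[2], reverse=True)
    return results

if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    dear = {0x1000: 5, 0x2000: 2}
    bubbles = {0x1020: 40, 0x2010: 3, 0x3000: 7}
    for load_addr, dear_count, stall_count in pair_loads_with_stalls(dear, bubbles):
        print(f"load {load_addr:#x}: {dear_count} EAR samples, {stall_count} stall samples nearby")
```

A load that shows up in both lists is a reasonable first suspect. A load with EAR samples but no nearby stall samples may simply be well covered by other work.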