static timing estimates of basic blocks - how?

jikel · ‎06-23-2009

I would like to determine static timing estimates for basic blocks in a product I'm developing. By "static" I mean that I'd like the analysis done on a DLL or object file without having to run the code - so that I get estimates for all basic blocks (not just those covered by some specific application). I know that any such estimates will be very rough due to the dynamic timing nature of the Intel instruction model, but for my application, that would still be useful. I am evaluating VTune now, but it doesn't appear to provide any such estimates (I thought it might based on what little I read about its static analysis capabilities). Does anyone know of a product that does this?

robert-reed · ‎06-23-2009

Quoting - jikel

I would like to determine static timing estimates for basic blocks in a product I'm developing. By "static" I mean that I'd like the analysis done on a DLL or object file without having to run the code - so that I get estimates for all basic blocks (not just those covered by some specific application). I know that any such estimates will be very rough due to the dynamic timing nature of the Intel instruction model, but for my application, that would still be useful. I am evaluating VTune now, but it doesn't appear to provide any such estimates (I thought it might based on what little I read about its static analysis capabilities). Does anyone know of a product that does this?

Something else you might look into is the Intel Performance Tuning Utility (PTU), available from Intel's Whatif web site and runnable on your VTune analyzer license. It does not offer static timing analysis of assembly code (at least it did not the last time I checked). It does have some static analyzers, though none that I know of like the one you want, and it also does dynamic analysis by basic block. If you're not able to find a static analyzer that meets your needs, perhaps you can get some useful work out of PTU, though it will require running your code with enough test cases to cover all basic blocks.

Thomas_W_Intel · ‎06-28-2009

For a large application, a static performance analysis will be of limited use because there is a high chance that it guides you in the wrong direction, i.e. towards a cold code block. However, if you already know where the hot spots are (by using a profiler), I found it sufficient as a first estimate to sum the instructions weighted by their latencies. Actually, we did this before implementing a new variant of a function to get a first idea of how the new variant might perform. Unfortunately, I am not aware of any tool that does this other than paper and pencil :).
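As a rough illustration of the paper-and-pencil estimate described above, here is a minimal Python sketch. The latency table, its cycle values, and the function name are placeholders invented for the example, not measured data for any real processor; real latencies would have to come from the vendor's optimization manual or a similar source.

```python
# Hypothetical per-mnemonic latencies in cycles (placeholder values only).
ASSUMED_LATENCY = {"mov": 1, "add": 1, "imul": 3, "div": 20, "load": 4}
DEFAULT_LATENCY = 1  # assumed fallback for mnemonics missing from the table


def latency_weighted_sum(mnemonics):
    """Sum the assumed latency of every instruction in a basic block."""
    return sum(ASSUMED_LATENCY.get(m, DEFAULT_LATENCY) for m in mnemonics)


# A made-up basic block, given as a list of mnemonics.
block = ["load", "imul", "add", "mov"]
print(latency_weighted_sum(block))
```

This treats every instruction as if it waited for the previous one to finish, so it tends to overestimate blocks with a lot of independent work.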

Kind regards

Thomas

TimP · ‎06-28-2009

Quoting - Thomas Willhalm (Intel)

if you already know where the hot spots are (by using a profiler) I found it sufficient as a first estimate to sum the instructions weighted by their latencies.

Does your method analyze context so as to determine whether throughput or serial latency is relevant for each instruction?

jikel · ‎06-29-2009

Quoting - tim18

Does your method analyze context so as to determine whether throughput or serial latency is relevant for each instruction?

My goal is to process data collected about the number of times basic blocks in my product's DLLs were run at customer sites, and to use that data to determine how to modify my internal benchmarks so that they more accurately represent the typical customer usage of my DLLs that matters for performance. Collecting counts instead of actual timings is a good compromise: I cannot easily get my own running copies of the customers' applications, and I have to minimize the instrumentation overhead in my product's DLLs because of the time sensitivity of my application domain. The ultimate goal is to have a set of internal benchmarks that I can rely on during development to gauge the effect that changes to my DLLs will have on performance at customer sites.

My product is sufficiently complicated to make it very likely that customers use it in very diverse and unexpected ways. So, given a set of basic block counts collected from customers, I'd like to apply a rough weighting to those counts based on their relative timings, so that I can find the basic blocks that are likely contributing the most to the customer's perceived performance of my DLLs and make sure my internal benchmarks cover those cases, producing count ratios similar to the customer data for the most important (based on timing) basic blocks.

My product is too big for me to develop a set of benchmarks that run all basic blocks in the ratios that my customers do (a much harder problem than merely getting 100% coverage, which is itself hard), so I'd use the timing data to prioritize which basic blocks to work on first, and then iteratively make my benchmarks more accurate as resources permit. The timings will also give me a gauge of how accurate the benchmarks are at any point in this process, allowing me to decide whether the effort to make them more accurate is worth the cost.
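The weighting step described above could be sketched roughly as follows. The block identifiers, counts, and per-block cycle estimates are invented for illustration, and `rank_blocks` is a hypothetical helper name; the cycle estimates are assumed to come from whatever static estimator is eventually used.

```python
def rank_blocks(counts, est_cycles, default_cycles=1.0):
    """Rank basic blocks by count * estimated cycles.

    Returns (block_id, share_of_total_estimated_time) pairs, largest first.
    Blocks without an estimate fall back to default_cycles (an assumption).
    """
    weighted = {b: n * est_cycles.get(b, default_cycles) for b, n in counts.items()}
    total = sum(weighted.values())
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    return [(b, (w / total if total else 0.0)) for b, w in ranked]


# Made-up customer counts and made-up static cycle estimates per block.
customer_counts = {"bb1": 1000, "bb2": 10, "bb3": 200}
cycle_estimates = {"bb1": 2.0, "bb2": 500.0, "bb3": 10.0}

for block_id, share in rank_blocks(customer_counts, cycle_estimates):
    print(f"{block_id}: {share:.1%} of estimated time")
```

In this made-up data the rarely executed `bb2` ranks first because of its large estimated cost, which is the kind of block that raw counts alone would hide.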

So, I don't really want to know independent instruction latency or throughput - I want instead to know basic block execution time, which is a complex interplay of the latency and throughput of the instructions in the basic block (as well as some of the instructions that precede it).
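One crude way to combine the two quantities, offered only as a sketch and not as how any existing tool works, is to take the larger of two bounds: the longest dependency chain through the block (latency-bound) and the sum of reciprocal throughputs (throughput-bound). The `Instr` type, the numbers, and the dependency lists below are all assumptions made for the example, and the sketch ignores the instructions preceding the block, cache misses, and branch prediction.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    latency: float            # assumed cycles until the result is available
    recip_throughput: float   # assumed cycles per instruction when pipelined
    deps: list = field(default_factory=list)  # indices of earlier instrs in the block


def critical_path(block):
    """Length of the longest dependency chain, in cycles."""
    finish = []
    for ins in block:
        start = max((finish[d] for d in ins.deps), default=0.0)
        finish.append(start + ins.latency)
    return max(finish, default=0.0)


def throughput_bound(block):
    """Cycles needed if the instructions only compete for issue bandwidth."""
    return sum(ins.recip_throughput for ins in block)


def estimate_cycles(block):
    """Rough lower bound: whichever of the two limits dominates."""
    return max(critical_path(block), throughput_bound(block))


# Made-up block: two independent loads feed a multiply, which feeds an add.
block = [
    Instr("load", 4, 0.5),
    Instr("load", 4, 0.5),
    Instr("imul", 3, 1.0, deps=[0, 1]),
    Instr("add", 1, 0.5, deps=[2]),
]
print(estimate_cycles(block))
```

For this block the dependency chain dominates (8 cycles against a throughput bound of 2.5), while a plain sum of latencies would give 12.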