What is the best way to measure how many cycles a portion of code takes to execute?
I found this paper explaining it [http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf], but is this approach still the one I should pursue today? Furthermore, I'm profiling a function that is part of a large application; it just does not seem feasible to turn that application into a kernel module in order to measure the cost of executing one function.
What do you guys advise?
If you just want to measure the time spent in the routine, using a timer based on rdtsc is fine, but it depends on what accuracy you need.
If the time spent in the function is relatively large compared to the timing routine, you can use almost any timing routine, such as gettimeofday() on Linux or QueryPerformanceCounter on Windows (http://support.microsoft.com/kb/815668). Or you can use VTune (if you want to sample the performance) or perf.
I guess it depends on exactly what you are trying to do.
Thank you for your answers.
The functions that I need to measure have no more than 20 instructions each and are called from inside a loop body. Thus, given this short granularity and what you have just said, I believe that using RDTSCP (I'm on an i7) to calculate the average cycles per invocation is my best option.
On a recent Core i7 processor I would expect the RDTSCP instruction to have an overhead of >20 cycles and a repeat rate of no less than the CPU frequency divided by 100 MHz (plus a few cycles). Then you will need to include the overhead of saving the results.
Repeated calls to RDTSCP most commonly (92 of 100 tests) take 36 cycles on my Sandy Bridge Xeon E5-2680 systems (2.7 GHz nominal, actually running at 3.1 GHz all the time) when I only save the low-order 32 bits of the result (into an array that is pre-loaded to be dirty in the L1 data cache). A few times (7 out of 100) I measured 32 cycles.... Dunno why.
When I added the extra code to combine the 2 32-bit registers and perform a 64-bit save, the most commonly observed interval was 40 cycles (68%), with a 31% showing the slightly smaller interval of 36 cycles. (I was amused to note that gcc removed my code that combined the two 32-bit registers into a 64-bit register and simply saved the two 32-bit register values in adjacent 32-bit memory locations.)
It has been a while since I did these experiments, so I am not sure whether I unrolled the loop and/or whether I attempted to compensate for the loop control code. It does not really matter unless you want to try to subtract the overhead, and that is always a very tricky proposition. It is much safer to use this minimum delta as a guide to how large a code region needs to be in order to be measured accurately. These values correspond to almost exactly 10 ns on my system, so I try to set the code up to give me regions taking at least 1 microsecond so that the measurement overhead will be O(1%).

This is especially important if you want to also read the hardware performance counters -- each of those reads takes a bit longer to execute than the RDTSCP. Agner Fog quotes a count of 35 uops for RDPMC on Sandy Bridge and Ivy Bridge cores, with maximum repeat rates of 39-42 cycles. If you are reading all 8 hardware performance counters on a Sandy Bridge or Ivy Bridge core, these reads add up to about 100 ns, suggesting a 10 microsecond minimum measurement interval to keep the overhead in the 1% range.
I suppose that on a Haswell Core i7 the loop-control instructions (machine code) will execute in parallel with the timed function body. In such a situation I am not sure how much the result could be skewed.
In such situations, I usually execute the function many times in a loop and use statistical profiling (e.g. with VTune or perf) to get the breakdown. This might not be 100% precise, but it is often good enough.
The Intel Architecture Code Analyzer may be helpful as well: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
It gives you a first-order approximation of performance of your code.
I was surprised to find that I could add "repeat loops" to the STREAM benchmark and the compiler did not eliminate them. This enabled me to reduce the size of the benchmark so that it would fit in the L1 cache (or L2, or L3), but still run long enough to make the overhead of the performance counter reads negligible.
In this case the repeat loops do not even change the result -- since each STREAM kernel reads from locations that it does not update, you can repeat each kernel loop any number of times and the output is the same.
In one case I got tripped up by the compiler. When I compiled with the "-no-vec" option, the compiler decided to unroll the "repeat loop" once. It then noticed that the two copies of the STREAM kernel inside the unrolled "repeat loop" wrote to the same memory locations, so it eliminated the first copy as redundant ("dead code"). This is perfectly correct, but it surprised me that it eliminated only half of the repetitions, since the compiler could have eliminated *all* of the kernel invocations before the final one. In any case, I was able to work around this issue by inserting a "#pragma nounroll" before the "repeat loop", restoring the desired behavior.

Of course it is important to monitor such "tricks" closely -- the code is redundant, so the compiler would be correct in removing all the extra iterations, and it can be hard to predict what minor changes in compiler flags or compiler revisions could cause the behavior to change. Since I am doing this to enable high-accuracy hardware performance counter measurements, I always include at least one counter event that lets me confirm that the code is executing all of the "repeat loop" iterations. (A count of the total number of retired load uops is convenient for this purpose.)
I used to measure short-function performance (my own implementations of trigonometric and special functions) with rdtsc inline assembly and with the Win API GetTickCount. In order to defeat compiler optimization, random arguments were passed to those functions. I suppose the rdtsc-based measurements were more accurate than those based on GetTickCount.