Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Problems interpreting timing results for code with microsecond duration

Porter__Andrew
Beginner

Hello, I am attempting to understand the performance of a small piece of Fortran code that is run within a larger program. The issue I have is that, for small problem sizes, this piece of code only takes between 2 and 3 microseconds to run. However, I have hit problems determining exactly what this duration is. Since the code is so quick to execute, I have tried doing the obvious thing of repeating it a number of times and measuring how long that takes. Unfortunately, when I do this, the mean time to execute the code systematically reduces until it converges upon some value (generally at about 100 repeats). In my case the time that it converges to is 2.33us. I was expecting that this process would simply reduce the error/noise in my measurement of the elapsed time, not give a systematic reduction in the value obtained.
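For concreteness, the repeat-and-average measurement I am describing has roughly the following form (a minimal sketch in C rather than Fortran for brevity; kernel() is just a stand-in for the real code under test, and the names are mine):

/* build with something like: gcc -O2 -fopenmp repeat.c */
#include <stdio.h>
#include <omp.h>

static volatile double sink;

/* Stand-in for the real code under test. */
static void kernel(void)
{
    double s = 0.0;
    for (int i = 0; i < 1000; i++)
        s += (double)i;
    sink = s;                     /* keep the compiler from removing the loop */
}

int main(void)
{
    for (int nrep = 1; nrep <= 128; nrep *= 2) {
        double t0 = omp_get_wtime();
        for (int r = 0; r < nrep; r++)
            kernel();
        double t1 = omp_get_wtime();

        /* report the mean time per repeat */
        printf("nrep=%4d  mean=%.3f us\n", nrep, 1.0e6 * (t1 - t0) / nrep);
    }
    return 0;
}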

I have experimented with using gettimeofday(), omp_get_wtime(), POSIX clock_gettime() and _rdtsc() (appropriately wrapped so that I can call them from Fortran).
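(For anyone interested, such a wrapper can be as simple as the following C function, shown purely as an illustration; wall_time_ns is an arbitrary name, and the Fortran side declares it through an INTERFACE block with BIND(C) from ISO_C_BINDING.)

#include <stdint.h>
#include <time.h>

/* Return wall-clock time in nanoseconds from the POSIX monotonic clock. */
int64_t wall_time_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}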

To add to the mystery, I have some code that measures the effective granularity of the particular clock in use. In pseudo-code this looks like:

do i=1,10000
  time(i) = time_now()
end do

I then look for the smallest non-zero value of time(i+1)-time(i). For the POSIX timer this gives 0.5us, for the others it is 1us, and for rdtsc it is approximately 30 counts. This suggests that the timers are behaving as expected.

I have stored and then plotted the values of the individual time periods that my code measures as it runs. This shows that no individual period is ever less than 3us when I have no repeats of the timed region. Given that the mean time in the region is 2.3us when I average over 100 repeats, I do not understand why I never measure an individual duration of around 2us. If anybody can shed any light on this I would be very grateful.

My system has an Intel Xeon E5-1620 v2 CPU. I have switched off Turbo Boost, SpeedStep and Hyper-Threading (doing so reduced the noise in the timings).

Many thanks,

Andy.

 

TimP
Honored Contributor III

Based on the Livermore Fortran Kernels, on a typical platform I see an overhead of less than 0.1 microseconds for timing calls wrapped around _rdtsc. I don't expect satisfactory repeatability for code segments under 100 microseconds. I use _rdtsc mainly because it works as well on Windows as on Linux.

I suppose you would need to time within a thread pinned to a single core to have a chance with intervals under 10 microseconds. Guessing that you're running Linux, you have several fairly easy ways to pin, including taskset.
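As an illustration (my own sketch, not a prescription), you can either launch under taskset, e.g. "taskset -c 0 ./a.out", or pin programmatically with sched_setaffinity(); the function name pin_to_core below is just an example:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single core (Linux-specific). */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = calling process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void)
{
    if (pin_to_core(0) == 0)
        printf("pinned to core 0\n");
    return 0;
}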

McCalpinJohn
Honored Contributor III

For timings in the range of 2000-3000 ns, RDTSC or RDTSCP is the preferred timer.  The extra instruction ordering required by the RDTSCP instruction should have a negligible impact at these time scales.

Two things to consider:

  1. Is the process pinned to a single core to prevent process migration? 
    • TSC values are typically quite uniform across cores, but they don't have to be.
      • RDTSCP can be used to get the time stamp counter and the core number atomically on Linux systems.  (See code below)
    • Even if the TSCs are uniform, process migration incurs a significant overhead in OS code and a modest overhead in re-loading any required cache state on the new core.
  2. Does the code under test use 256-bit registers?   Does the surrounding code use 256-bit registers?
    • It appears that Sandy Bridge and Ivy Bridge cores "turn off" the upper 128 bits of the vector pipelines when they are not in use.
    • When the processor begins using 256-bit vector instructions, they are run at half speed for a few hundred to a few thousand cycles before transitioning to full speed. 
    • This is consistent with the hypothesis that the upper 128 bits of each 256-bit register are shifted to the lower (active) 128-bit lane, which does all the work (taking twice as many cycles), until the upper 128-bit lane has been turned back on and performance returns to the expected values.
    • The upper 128 bits of the pipelines are turned off relatively quickly (<1 millisecond) on Sandy Bridge and Ivy Bridge if they are not in use, but I have not tried to pin down a specific value.
    • This topic is discussed in a number of postings at http://www.agner.org/optimize/blog/read.php?i=142. A small measurement sketch follows this list.
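A crude way to look for this warm-up is to time consecutive blocks of 256-bit adds with RDTSC: if the upper lanes start powered down, the first block(s) should be noticeably slower than the later ones. This is only a sketch (block sizes and names are arbitrary); compile with something like "gcc -O2 -mavx":

#include <stdio.h>
#include <immintrin.h>
#include <x86intrin.h>      /* __rdtsc() */

#define BLOCK   1000        /* 256-bit adds per timed block (arbitrary) */
#define NBLOCKS 20

int main(void)
{
    __m256d acc = _mm256_set1_pd(0.0);
    __m256d inc = _mm256_set1_pd(1.0e-9);
    double out[4];

    for (int b = 0; b < NBLOCKS; b++) {
        unsigned long long t0 = __rdtsc();
        for (int i = 0; i < BLOCK; i++)
            acc = _mm256_add_pd(acc, inc);      /* dependent 256-bit adds */
        unsigned long long t1 = __rdtsc();
        printf("block %2d: %llu TSC counts\n", b, t1 - t0);
    }

    _mm256_storeu_pd(out, acc);                 /* keep the result live */
    printf("checksum %f\n", out[0]);
    return 0;
}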

 

 

Appendix A:  An analogous behavior is observed in Haswell (Xeon E5 v3), but with some important differences:

  • Performance is about 1/4 of the nominal performance during the start-up period. 
    • This is consistent with all the work being done by the lower 128 bits of the pipeline, but with a 3-cycle latency for transfers between the upper 128 bits of the YMM registers and the lower 128 bits of the pipeline.  (This transfer latency is only 1 cycle in Sandy Bridge and Ivy Bridge.)
  • On the Xeon E5 v3 parts that I have tested there is also a ~30,000 cycle stall whenever the upper 128 bits of the pipeline is turned on.
    • When the processor frequency is pinned to the nominal frequency, this stall can be seen in the difference between the TSC counts (which don't stop incrementing during the stall) and the CPU_CLK_UNHALTED.THREAD_P performance counter event (which does stop incrementing during the stall).  
  • The upper 128 bits of the pipeline are turned back off if the core has used no 256-bit register instructions in the last millisecond.
    • I have not looked to see if there is also a core stall when the upper 128-bit pipelines are turned off.
  • Some of these issues are discussed in Agner's blog (above), and I recently ran across a paper that includes some useful measurements and analysis: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7284406&tag=1

 

 

Appendix B:  RDTSCP code to get TSC, Socket, and Core number atomically in C:

unsigned long tacc_rdtscp(int *chip, int *core)
{
   unsigned a, d, c;

   /* RDTSCP returns the TSC in EDX:EAX and the IA32_TSC_AUX value in ECX.
      Linux initialises TSC_AUX so that the low 12 bits hold the logical
      core number and the next 12 bits hold the socket/node number. */
   __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
   *chip = (c & 0xFFF000) >> 12;
   *core = c & 0xFFF;

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}
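A possible usage pattern (an illustrative sketch, assuming the function above is in the same file) is to bracket the code under test and discard any sample where the chip or core number changed:

#include <stdio.h>

unsigned long tacc_rdtscp(int *chip, int *core);   /* defined above */

int main(void)
{
    int chip0, core0, chip1, core1;

    unsigned long t0 = tacc_rdtscp(&chip0, &core0);
    /* ... code under test ... */
    unsigned long t1 = tacc_rdtscp(&chip1, &core1);

    if (chip0 == chip1 && core0 == core1)
        printf("elapsed %lu TSC counts on chip %d, core %d\n", t1 - t0, chip0, core0);
    else
        printf("process migrated during the measurement; discard this sample\n");
    return 0;
}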

 

Porter__Andrew
Beginner

Many thanks for your detailed replies. It turns out that the answer is actually quite simple (but it took my post and your replies for me to realise it). I was treating the (measured) timer granularity as a random error, but it is in fact systematic: if I time a region that contains absolutely nothing, I always get a duration that is greater than or equal to that 'granularity'. I therefore either need to subtract my estimate of this error from my results or ensure that the timed region is long enough that the error is not significant.
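For the record, the kind of correction I mean looks something like the following sketch (my own, with the POSIX clock; now_s is just my name for the wrapper): time an empty region many times, keep the smallest non-zero value as the systematic overhead, and either subtract it from real measurements or make the timed region much longer than it.

#include <stdio.h>
#include <time.h>

static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

int main(void)
{
    double overhead = 1.0e9;

    /* Time an "empty" region many times and keep the smallest non-zero value;
       this is the systematic cost every short measurement carries. */
    for (int i = 0; i < 10000; i++) {
        double t0 = now_s();
        double t1 = now_s();
        double dt = t1 - t0;
        if (dt > 0.0 && dt < overhead)
            overhead = dt;
    }

    printf("empty-region overhead: %.3f us\n", 1.0e6 * overhead);
    return 0;
}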

I may have a play with rdtscp - many thanks for the C fragment.

Thanks for your help,

Andy.
