
Timing methods

zhangxiuxia
Beginner
1,384 Views
How do I measure the precise clock cycle count of a piece of code?

Agner Fog uses the following method:

_rdpmc
{
    cpuid
    rdpmc
    cpuid
}

First, I test the overhead of these instructions by finding the minimum time interval between two _rdpmc calls.

But I find it varies greatly. I tested 1000 times. In most cases the minimum overhead is 420 cycles, while sometimes it is 188 cycles. What causes the difference?

I use taskset -c when testing, to pin the process to one core and reduce migration between processors.
0 Kudos
21 Replies
TimP
Honored Contributor III
1,297 Views
I never saw anyone recommend cpuid after rdtsc as well as before it.
I see that VS2010 added the __rdtscp() builtin. Note that, unlike __rdtsc(), it doesn't take an empty parameter list. I've never seen an experimental study of the pros and cons of the two.
I agree that I never saw the advantage of using cpuid to serialize __rdtsc, but that's a personal preference. I simply put __rdtsc in a separately compiled function, but I don't expect to get repeatable timing closer than 80 nanoseconds, or to measure the pipeline flush time. It may appear to work to smaller tolerances with one combination of compiler/OS on one machine, but the same code will not do so after a change of platform.
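For reference, a minimal sketch of the two intrinsics, assuming MSVC (VS2010 or later for __rdtscp); __rdtscp also writes the IA32_TSC_AUX value (typically a core ID) through its out parameter:

#include <intrin.h>
#include <stdio.h>

int main()
{
    unsigned int aux;                      // receives IA32_TSC_AUX
    unsigned __int64 t0 = __rdtsc();       // plain read, no serialization
    unsigned __int64 t1 = __rdtscp(&aux);  // waits for prior instructions to finish
    printf("delta = %llu cycles\n", t1 - t0);
    return 0;
}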
0 Kudos
zhangxiuxia
Beginner
1,297 Views
I saw cpuid used before rdtsc on Agner Fog's homepage and in one of Intel's reports.
Since processors are out-of-order, rdtsc may report a false time (it can execute before or after the code being tested).
So I think using cpuid is more accurate.
Sandy Bridge supports rdtscp. I used rdtscp as well, and found the same problem.
The time varies greatly. The overhead of rdtscp is 76 cycles in most cases, while after testing hundreds of times a minimum of 31 clock cycles appears.

"cpuid" and "rdtscp" may need to flush the pipeline. Since I test the same code, shouldn't the pipeline state be the same? If so, the time it takes to flush the pipeline should not change a lot.



0 Kudos
styc
Beginner
1,297 Views
I wonder what exactly you are tuning. At least to me, you seem to be fighting for the last few cycles of improvement. A cache miss easily makes that indistinguishable from noise.
0 Kudos
SergeyKostrov
Valued Contributor II
1,297 Views
>>...
>>In most cases the minimum overhead is 420 cycles, while sometimes it is 188 cycles. It varies greatly.
>>What causes the difference?
>>...

In order to measure time intervals or overheads with the best possible accuracy, I raise the priority of an application, or a thread, to Realtime. I admit that an overhead of 420 cycles looks too big.

I just ran a test, and here is an example of measuring a time interval of 0.5 seconds at Normal priority for an application:


Sub-Test 1 - [ CrtClock ]
RTuint64 Value: 501.389 ms

Sub-Test 2 - [ CrtClock ]
RTuint64 Value: 501.374 ms

Sub-Test 3 - [ CrtClock ]
RTuint64 Value: 501.372 ms

Note: I didn't measure the overhead; I simply wanted to verify accuracy.

Best regards,
Sergey

0 Kudos
SergeyKostrov
Valued Contributor II
1,297 Views
Please take a look at this thread:

http://software.intel.com/en-us/forums/showthread.php?t=101379&o=a&s=lr

for different implementations of time-interval measurement helper functions.

Best regards,
Sergey
0 Kudos
SergeyKostrov
Valued Contributor II
1,297 Views
Quoting TimP (Intel)
...like __rdtsc()...


By the way, I wanted to make a feature request for an intrinsic function "around" the assembly instruction 'rdtsc', and it turns out this is already done by Microsoft.

All three widely used Visual Studios, 2005, 2008 and 2010, declare the '__rdtsc' intrinsic function in the 'intrin.h' header file:

...
__MACHINEI( unsigned __int64 __rdtsc( void ) )
...

0 Kudos
zhangxiuxia
Beginner
1,297 Views
I want to measure the pipeline performance of a segment of code. It has no cache misses.
It executes in about 3000 cycles each time because the data is small in scale. That's why I care so much about clock cycles.
0 Kudos
Max_L
Employee
1,297 Views
You can indeed measure cycles extremely precisely, and you are on the right track. Do this:

1. Disable EIST and Turbo in the BIOS - otherwise you will get a variable bus multiplier, while the TSC always ticks at the bus clock times the default multiplier.

2. Use XOR EAX, EAX; CPUID; RDTSC to get the timing - back to back it should take ~20 cycles on Nehalem and later CPUs, and that is a constant overhead you will need to subtract. I think you missed zeroing EAX and are thus getting variable results (CPUID can be very slow with some arguments).

3. Run the timing many times and take the _minimum_ to get the best-case pipeline timing.

4. Run a sample kernel with a roughly known number of cycles to validate - like a LEA EAX, [EAX+1]; LEA EAX, [EAX+1]; ... LEA EAX, [EAX+1] dependency chain in a loop, which cannot run faster than 1 instruction/cycle. A sketch of steps 2-4 follows below.
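A minimal sketch of steps 2-4, assuming x86-64 and GCC inline assembly; the helper name tsc_serialized() and the 1000-LEA validation kernel are illustrative choices, not code from this thread:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t tsc_serialized(void)
{
    uint32_t lo, hi;
    /* Step 2: zero EAX, serialize with CPUID, then read the TSC. */
    __asm__ __volatile__(
        "xorl %%eax, %%eax \n\t"
        "cpuid             \n\t"
        "rdtsc             \n\t"
        : "=a"(lo), "=d"(hi) : : "%rbx", "%rcx");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t overhead = UINT64_MAX, best = UINT64_MAX;

    /* Step 3: measure the constant back-to-back overhead, keeping the minimum. */
    for (int i = 0; i < 1000; i++) {
        uint64_t t0 = tsc_serialized();
        uint64_t t1 = tsc_serialized();
        if (t1 - t0 < overhead) overhead = t1 - t0;
    }

    /* Step 4: validation kernel - 125 x 8 = 1000 dependent LEAs cannot
       run faster than 1 instruction/cycle, so expect roughly 1000 cycles. */
    for (int i = 0; i < 1000; i++) {
        uint64_t t0 = tsc_serialized();
        for (int k = 0; k < 125; k++)
            __asm__ __volatile__(
                "leaq 1(%%rax), %%rax \n\t"
                "leaq 1(%%rax), %%rax \n\t"
                "leaq 1(%%rax), %%rax \n\t"
                "leaq 1(%%rax), %%rax \n\t"
                "leaq 1(%%rax), %%rax \n\t"
                "leaq 1(%%rax), %%rax \n\t"
                "leaq 1(%%rax), %%rax \n\t"
                "leaq 1(%%rax), %%rax \n\t"
                : : : "%rax");
        uint64_t t1 = tsc_serialized();
        if (t1 - t0 - overhead < best) best = t1 - t0 - overhead;
    }

    printf("overhead: %llu cycles, LEA kernel: %llu cycles\n",
           (unsigned long long)overhead, (unsigned long long)best);
    return 0;
}

Build with gcc and run pinned to one core (e.g. taskset -c 0 ./a.out); with EIST and Turbo disabled, the LEA kernel should come out at roughly 1000 cycles if the harness is sound.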


feel free to send me an email at maxim.locktyukhin intel com if you cannot make it work - I may share some sample code too ...

-Max
0 Kudos
SergeyKostrov
Valued Contributor II
1,297 Views
I've measured the overhead of the 'rdtsc' instruction, and depending on the computer the range of values was
from 31 cycles to 84 cycles. Even if I used intrinsic calls or inline assembler, there are always a couple of
instructions between the 'rdtsc' calls. Here is a piece of code I used:

...
RTuint64 uiOverhead = 0xffffffffffffffff;
RTuint64 uiDelta = 0;

union ClockV
{
    RTuint64 uiClockV;
    RTuint32 uiClockV32[2];
};

ClockV cvS = { 0 };
ClockV cvE = { 0 };
RTuint32 uiCvSL = 0;
RTuint32 uiCvSH = 0;
RTuint32 uiCvEL = 0;
RTuint32 uiCvEH = 0;

CrtPrintf( RTU("REAL TIME\n") );
SysSetPriorityClass( ::GetCurrentProcess(), REALTIME_PRIORITY_CLASS );
CrtPrintf( RTU("TIME CRITICAL\n") );
SysSetThreadPriority( ::GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );

for( RTint t = 0; t < 10000000; t++ )
{
    // Version 1
    // cvS.uiClockV = __rdtsc();
    // cvE.uiClockV = __rdtsc();

    // Version 2
    _asm rdtsc                            // start stamp in EDX:EAX
    _asm push eax
    _asm push edx

    _asm rdtsc                            // end stamp in EDX:EAX
    _asm mov dword ptr [ uiCvEL ], eax
    _asm mov dword ptr [ uiCvEH ], edx

    _asm pop edx                          // recover the start stamp
    _asm pop eax
    _asm mov dword ptr [ uiCvSL ], eax
    _asm mov dword ptr [ uiCvSH ], edx

    cvS.uiClockV32[0] = uiCvSL;
    cvS.uiClockV32[1] = uiCvSH;
    cvE.uiClockV32[0] = uiCvEL;
    cvE.uiClockV32[1] = uiCvEH;

    uiDelta = cvE.uiClockV - cvS.uiClockV;

    if( uiOverhead > uiDelta )            // keep the smallest delta seen
        uiOverhead = uiDelta;
}

SysSetPriorityClass( SysGetCurrentProcess(), NORMAL_PRIORITY_CLASS );
SysSetThreadPriority( SysGetCurrentThread(), THREAD_PRIORITY_NORMAL );

CrtPrintf( RTU("Overhead Value: %10.3f cycles\n"), ( RTfloat )uiOverhead );
...

0 Kudos
Max_L
Employee
1,297 Views

The lowest granularity, and hence the smallest number you can get, is the _default_ CPU bus multiplier. For example, if you are running an Intel Core i5 2400 Processor (3.10 GHz), you get 31, since the bus is at 100 MHz; on a laptop you may see a number around 20. The reason is that the TSC produces its value by multiplying bus cycles by the _default_ (aka "fused") CPU multiplier, regardless of the effective multiplier manipulated by EIST or Turbo. This way the TSC stays isotropic in time, which supports legacy software that relied on this behavior; it is also the reason you need to disable Turbo in the BIOS to be able to measure real core cycles with RDTSC properly.
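Spelled out with the i5 2400 numbers above: granularity = nominal core frequency / bus clock = 3100 MHz / 100 MHz = 31, so the smallest nonzero delta RDTSC can report on that part is 31 cycles.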

-Max

0 Kudos
zhangxiuxia
Beginner
1,297 Views
Thanks very much, Max. You gave me the answer to why the clock varies, and I know how to implement it now.
0 Kudos
zhangxiuxia
Beginner
1,297 Views
I use inline assembly to test.

for( i = 0; i < times; i = i + 2 )
{
    __asm__ __volatile__(
        "pushq %%rbx \n\t"
        "pushq %%rcx \n\t"
        "xorl %%eax, %%eax \n\t"
        "cpuid \n\t"
        :::"%rax","%rbx","%rcx");

    __asm__ __volatile__(
        "rdtsc \n\t"
        :"=a"(saxp[i]),"=d"(sdxp[i]));

    __asm__ __volatile__(
        "popq %%rcx \n\t"
        "popq %%rbx \n\t"
        :::"%rbx","%rcx");

    // your test codes

    __asm__ __volatile__(
        "pushq %%rbx \n\t"
        "pushq %%rcx \n\t"
        "xorl %%eax, %%eax \n\t"
        "cpuid \n\t"
        :::"%rax","%rbx","%rcx");

    __asm__ __volatile__(
        "rdtsc \n\t"
        :"=a"(saxp[i+1]),"=d"(sdxp[i+1]));

    __asm__ __volatile__(
        "popq %%rcx \n\t"
        "popq %%rbx \n\t"
        :::"%rbx","%rcx");
}

This time it varies only a little.
The overhead is 84 clock cycles on Sandy Bridge.

Without cpuid, the overhead is 27 clock cycles on Sandy Bridge.
0 Kudos
SergeyKostrov
Valued Contributor II
1,297 Views
Quoting zhangxiuxia
I use inline assembly to test.

for( i = 0; i < times; i = i + 2 )

[SergeyK] How many times did you execute your test to get these numbers?

...
Overhead is 84 clock cycles on Sandy Bridge.
Without cpuid, the overhead is 27 clock cycles on Sandy Bridge.
...


Best regards,
Sergey

0 Kudos
drMikeT
New Contributor I
1,297 Views
Two things, among others, that may introduce timing variation at this level:

1) Read or write cache misses: use a startup phase where you read all your input variables, so that measurement starts only after the input values have been fetched; if you have to store output values, make sure you do at least one write beforehand, so that this core gets "ownership" of the cache line that will receive the data and no other core can write to that line in the meantime (see the sketch after this list).

2.1) Interrupts from the OS for timekeeping: every so often (10 milliseconds in Linux, 100 in UNIX, not sure in Windows) the kernel interrupts the running code to make sure that the right process is using the processor; it is conceivable that the OS dispatcher removes your code from the core to switch in another runnable task.

2.2) Hardware external interrupts handled by this particular core.

To deal with 2) you should, as other people mentioned, at least use a real-time priority and, if possible, not allow this core to handle interrupts.
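A minimal warm-up sketch of suggestion 1), in C; the names (warm_up, in, out, n) are illustrative, not from any code in this thread:

#include <stddef.h>

volatile long sink;   /* keeps the warm-up reads from being optimized away */

/* Touch every input cache line, and write once per output line to take
   ownership (RFO), before the timed region begins; assumes 64-byte lines. */
void warm_up( const long *in, long *out, size_t n )
{
    long s = 0;
    for( size_t i = 0; i < n; i++ )
        s += in[i];                          /* fetch input lines */
    sink = s;

    for( size_t i = 0; i < n; i += 64 / sizeof( long ) )
        out[i] = 0;                          /* one write per line */
}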

Some suggestions; maybe more to come.

Michael
0 Kudos
Max_L
Employee
1,297 Views

Michael - yes, one should run the code being measured 100 or 1000 times and take the _minimum_ timing out of all the results (as I suggested above, among the other minimum necessary recommendations); one simply cannot rely on a single run, due to a number of potential effects like the ones you pointed out. One can also take the geomean of the 1000 timings instead of the minimum, excluding results that were obviously affected by external events, e.g. those more than 10X the minimum. Still, the minimum timing out of 1000 usually represents the ideal/best possible pipeline timing of the sequence one is looking at.
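A small sketch of that post-processing, in C, assuming the raw timings have already been collected into an array; the function name summarize() is illustrative:

#include <math.h>
#include <stdint.h>

/* Returns the geomean of the non-outlier runs (<= 10x the minimum) and
   stores the minimum itself through min_out. Link with -lm. */
double summarize( const uint64_t *t, int n, uint64_t *min_out )
{
    uint64_t min = t[0];
    for( int i = 1; i < n; i++ )
        if( t[i] < min ) min = t[i];

    double log_sum = 0.0;
    int kept = 0;
    for( int i = 0; i < n; i++ )
        if( t[i] <= 10 * min ) { log_sum += log( (double)t[i] ); kept++; }

    *min_out = min;
    return exp( log_sum / kept );
}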

-Max

0 Kudos
SergeyKostrov
Valued Contributor II
1,297 Views
Quoting drMikeT
...some suggestions maybe more to come...

Michael


I remember that somebody recommended setting the thread affinity mask of the test process to a 2nd CPU (in the case of a multi-core system).
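For example, a sketch using the Win32 call (the same API family as the code above):

#include <windows.h>

int main()
{
    // Pin the current thread to the 2nd logical CPU (affinity mask bit 1).
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    DWORD_PTR dwPrevMask = ::SetThreadAffinityMask( ::GetCurrentThread(),
                                                    DWORD_PTR( 1 ) << 1 );
    return ( dwPrevMask != 0 ) ? 0 : 1;
}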

Best regards,
Sergey

0 Kudos
drMikeT
New Contributor I
1,297 Views
Hi Max,

if the minimum duration is what is sought, then repeated trials give you an idea of the minimal time. My question was: can one fight the non-determinism by minimizing interference from unrelated events?

In an ideal environment one uses one core running nothing else except the code under test. I am not sure how easy this is to do with stock Linux kernels (it was doable with UNIX on certain platforms).

One viable approach which reduces (but does not eliminate) interference is to let the code under test run as a thread in the RT scheduling class, with a priority higher than the other kernel threads in that class. This makes sure that the kernel will not switch this thread off the core. To avoid thread migration, this thread should also be bound to a specific core. Data interference can be stopped at L3 if one makes sure that no shared variables in this core's cache lines are written to by other cores. This approach eliminates everything but the hardware interrupts.
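A minimal sketch of that setup, assuming Linux and glibc (the RT class needs root or CAP_SYS_NICE); the function name pin_and_raise() is illustrative:

#define _GNU_SOURCE
#include <sched.h>

/* Bind the calling thread to one core, then move it into the SCHED_FIFO
   real-time class at the highest priority. Returns 0 on success. */
int pin_and_raise( int core )
{
    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( core, &set );
    if( sched_setaffinity( 0, sizeof( set ), &set ) != 0 )
        return -1;

    struct sched_param sp;
    sp.sched_priority = sched_get_priority_max( SCHED_FIFO );
    return sched_setscheduler( 0, SCHED_FIFO, &sp );
}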

The question is how stringent the timing requirements are and how much confidence one wants in the accuracy of the measurements. Another question is whether the initial phases of the code's processing have to be included. With Sandy Bridge I think there is an initial decoding into micro-ops, which are then stored in the trace cache. Does it make sense to throw this time away?

I guess Intel has low-level utilities for high-resolution timings like these? I am just wondering... :)

regards
michael
0 Kudos
drMikeT
New Contributor I
1,297 Views
Yes, you should bind this thread to a core AND not forget to bind all other threads (including the OS's) to the other cores...

michael
0 Kudos
drMikeT
New Contributor I
1,297 Views
One question is how you plan to use these measurements, and in which context/environment. How accurate would you like to be?

Exactly determining the critical path of a sequence of instructions in clock cycles can quickly get very unwieldy, for at least two reasons: 1) Nehalem/Westmere/SB processors are superscalar and as such allow several operations to be in progress within their execution units at a time, often predicting the execution path and issuing data prefetch operations; and 2) there may be interference from external events (such as hardware interrupts, or the OS checking whether it has to schedule another process on this core).

You can minimize 2) by letting the thread under observation use the RT scheduling mode and assigning it the highest priority among all threads in that class. As you mentioned, you should also bind this thread to a specific core to avoid thread migration.

On a single-core system using the RT class you always run the risk of making the system unresponsive... :( but nowadays nothing is single-core anymore.

As for 1), one thing you could try (I have not tried this myself) is to fill the pipeline with NOPs and then issue rdtscp; your instruction sequence; rdtscp; cpuid; rdtscp.

What is it that you are trying to get out of these measurements?
0 Kudos
SergeyKostrov
Valued Contributor II
1,149 Views
Quoting drMikeT
...if the minimum duration is what is sought, then repeated trials give you an idea of the minimal time. My question was: can one fight the non-determinism by minimizing interference from unrelated events?

[SergeyK] This is a really hard task, and it is almost impossible to achieve in modern preemptive operating systems.

In an ideal environment one uses one core running nothing else except the code under test...

[SergeyK] Did you try doing some performance tests in the MS-DOS operating system? :)

Best regards,
Sergey
0 Kudos