Agner Fog uses the following method:
_rdpmc
{
cpuid
rdpmc
cpuid
}
First I test the overhead of these instructions by finding the minimum time interval between two _rdpmc calls, but I find it varies greatly. I tested 1000 times. In most cases the minimum overhead is 420 cycles, while sometimes it is 188 cycles. What causes the difference?
I use taskset -c when testing, to reduce migration between processors.
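For reference, a minimal sketch of that measurement loop (illustrative, not Agner Fog's exact code; it uses rdtsc rather than rdpmc, since rdpmc needs OS support to run in user space, and the helper name is mine):

#include <stdio.h>
#include <stdint.h>

static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi;
    /* cpuid (with eax = 0) acts as a serializing fence before the read */
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"
        "rdtsc\n\t"
        : "=a"(lo), "=d"(hi) :: "%rbx", "%rcx");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t min = UINT64_MAX;
    for (int i = 0; i < 1000; i++) {
        uint64_t t0 = serialized_rdtsc();
        uint64_t t1 = serialized_rdtsc();
        if (t1 - t0 < min)
            min = t1 - t0;
    }
    printf("minimum overhead: %llu cycles\n", (unsigned long long)min);
    return 0;
}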
I see that VS2010 added the __rdtscp() builtin. Note that it doesn't take an empty parameter list like __rdtsc() does. I've never seen an experimental study of the pros and cons of the two.
I agree that I never saw the advantage of using cpuid to serialize __rdtsc, but that's a personal preference. I simply put __rdtsc in a separately compiled function, but I don't expect to get repeatable timing closer than 80 nanoseconds or to measure the pipeline flush time. It may appear to work to smaller tolerances with one combination of compiler/OS on one machine, but the same code will not do so after a change of platform.
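To illustrate how the two intrinsics differ in use (a sketch; variable names are mine):

#include <intrin.h>   /* MSVC: declares __rdtsc and __rdtscp */
#include <stdio.h>

int main(void)
{
    unsigned int aux;                     /* receives IA32_TSC_AUX */
    unsigned __int64 t0 = __rdtsc();      /* empty parameter list */
    /* ... code under test ... */
    unsigned __int64 t1 = __rdtscp(&aux); /* needs the out-parameter */
    printf("%llu cycles, aux=%u\n", t1 - t0, aux);
    return 0;
}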
Since processors execute out of order, rdtsc may report a false time (it may execute before or after the code under test). So I think using cpuid makes it accurate.
Sandy Bridge now supports rdtscp. I used rdtscp as well, and found the same problem:
the time varies greatly. The overhead of rdtscp is 76 cycles in most cases, but when I test hundreds of times, a minimum of 31 clock cycles appears.
"cpuid" and "rdtscp" may need to flush the pipeline. But since I test the same code, shouldn't the pipeline state be the same? If so, the time it takes to flush the pipeline should not change much.
>>In most cases the minimum overhead is 420 cycles, while sometimes it is 188 cycles. It varies greatly.
>>What causes the difference?
>>...
In order to measure time intervals or overheads with the best possible accuracy, I raise the priority of the application, or of a thread, to Realtime. I admit that an overhead of 420 cycles looks too big.
I just ran a test; here is an example of measuring a time interval of 0.5 seconds at Normal priority
for the application:
Sub-Test 1 - [ CrtClock ]
RTuint64 Value: 501.389 ms
Sub-Test 2 - [ CrtClock ]
RTuint64 Value: 501.374 ms
Sub-Test 3 - [ CrtClock ]
RTuint64 Value: 501.372 ms
Note: I didn't measure the overhead; I simply wanted to verify accuracy.
Best regards,
Sergey
See http://software.intel.com/en-us/forums/showthread.php?t=101379&o=a&s=lr for different implementations of time-interval measuring helper functions.
Best regards,
Sergey
By the way, I wanted to file a Feature Request for an intrinsic function "around" the assembly
instruction 'rdtsc', but it has actually already been done by Microsoft.
All three widely used Visual Studios, 2005, 2008 and 2010, declare the '__rdtsc' intrinsic
function in the 'intrin.h' header file:
...
__MACHINEI( unsigned __int64 __rdtsc( void ) )
...
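As a quick illustration of using it from a separately compiled function, as suggested earlier in the thread (a sketch; the file and function names are mine):

/* timer.c - keeping the read in its own translation unit helps prevent
   the compiler from reordering it into the measured code (absent
   link-time optimization) */
#include <intrin.h>

unsigned __int64 read_tsc( void )
{
    return __rdtsc();
}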
The code under test executes in about 3000 cycles each run, because the data scale is small. That's why I care so much about clock cycles.
1. Disable EIST and Turbo in the BIOS - otherwise you will get a variable bus multiplier (while the TSC always advances at the _default_ multiplier; more on that below).
2. Use XOR EAX, EAX; CPUID; RDTSC to get timing - back to back it should take ~20 cycles on Nehalem and later CPUs, and it is a constant overhead that you will need to subtract. I think you missed zeroing EAX and thus got variable results (CPUID can be very slow with some arguments).
3. Run the timing many times and take the _minimum_ to get the best-case pipeline timing.
4. Run a sample kernel with a roughly known number of cycles to validate - like a LEA EAX, [EAX+1]; LEA EAX, [EAX+1]; ... LEA EAX, [EAX+1] dependency chain in a loop; it cannot execute at more than 1 instruction/cycle (see the sketch below).
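A sketch of the validation kernel from step 4 (my illustration, untested): a dependent LEA chain retires at most 1 instruction per cycle, so the loop below should cost roughly 8 cycles per iteration.

#include <stdio.h>
#include <stdint.h>

static inline uint64_t tsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"
        "rdtsc\n\t"
        : "=a"(lo), "=d"(hi) :: "%rbx", "%rcx");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    const int iters = 100000;
    uint64_t x = 0;
    uint64_t t0 = tsc_serialized();
    for (int i = 0; i < iters; i++) {
        /* eight dependent LEAs - each one needs the previous result */
        __asm__ __volatile__(
            "lea 1(%0), %0\n\t" "lea 1(%0), %0\n\t"
            "lea 1(%0), %0\n\t" "lea 1(%0), %0\n\t"
            "lea 1(%0), %0\n\t" "lea 1(%0), %0\n\t"
            "lea 1(%0), %0\n\t" "lea 1(%0), %0\n\t"
            : "+r"(x));
    }
    uint64_t t1 = tsc_serialized();
    printf("x=%llu, cycles per LEA: %.2f (expect ~1.0 with Turbo off)\n",
           (unsigned long long)x, (double)(t1 - t0) / (8.0 * iters));
    return 0;
}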
Feel free to send me an email at maxim.locktyukhin intel com if you cannot make it work - I may share some sample code too ...
-Max
The overhead I measured varies from 31 cycles to 84 cycles. Even if I use intrinsic calls or inline assembler, there are always a couple of
instructions between the 'rdtsc' calls. Here is a piece of code I used:
...
RTuint64 uiOverhead = 0xffffffffffffffff;
RTuint64 uiDelta = 0;
union ClockV
{
RTuint64 uiClockV;
RTuint32 uiClockV32[2];
};
ClockV cvS = { 0 };
ClockV cvE = { 0 };
RTuint32 uiCvSL = 0;
RTuint32 uiCvSH = 0;
RTuint32 uiCvEL = 0;
RTuint32 uiCvEH = 0;
CrtPrintf( RTU("REAL TIME\n") );
SysSetPriorityClass( ::GetCurrentProcess(), REALTIME_PRIORITY_CLASS );
CrtPrintf( RTU("TIME CRITICAL\n") );
SysSetThreadPriority( ::GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );
for( RTint t = 0; t < 10000000; t++ )
{
    // Version 1
    // cvS.uiClockV = __rdtsc();
    // cvE.uiClockV = __rdtsc();
    // Version 2
    _asm rdtsc
    _asm push eax
    _asm push edx
    _asm rdtsc
    _asm mov dword ptr [ uiCvEL ], eax
    _asm mov dword ptr [ uiCvEH ], edx
    _asm pop edx
    _asm pop eax
    _asm mov dword ptr [ uiCvSL ], eax
    _asm mov dword ptr [ uiCvSH ], edx
    cvS.uiClockV32[0] = uiCvSL;
    cvS.uiClockV32[1] = uiCvSH;
    cvE.uiClockV32[0] = uiCvEL;
    cvE.uiClockV32[1] = uiCvEH;
    uiDelta = cvE.uiClockV - cvS.uiClockV;
    if( uiOverhead > uiDelta )
        uiOverhead = uiDelta;
}
SysSetPriorityClass( SysGetCurrentProcess(), NORMAL_PRIORITY_CLASS );
SysSetThreadPriority( SysGetCurrentThread(), THREAD_PRIORITY_NORMAL );
CrtPrintf( RTU("Overhead Value: %10.3f cycles\n"), ( RTfloat )( cvE.uiClockV - cvS.uiClockV ) );
...
The lowest granularity, and hence the smallest number you can get, is the _default_ CPU bus multiplier. For example, if you are running an Intel Core i5 2400 Processor (3.10 GHz), you get 31, since the bus is at 100 MHz; you can see a 20-ish number on a laptop. The reason is that the TSC produces its number by multiplying bus cycles by the _default_ (aka "fused") CPU multiplier, regardless of the effective multiplier manipulated by EIST or Turbo. This way the TSC stays isotropic in time, which enables legacy software that relied on this behavior. It is also the reason you need to disable Turbo in the BIOS to be able to measure real core cycles with RDTSC properly.
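To make the arithmetic explicit (using the i5 2400 numbers from the post above; just an illustration of the claim, not new data):

#include <stdio.h>

int main(void)
{
    /* fused multiplier = default core clock / bus clock */
    double core_mhz = 3100.0, bus_mhz = 100.0;
    int fused_multiplier = (int)(core_mhz / bus_mhz);   /* = 31 */
    printf("TSC advances %d ticks per bus cycle,\n", fused_multiplier);
    printf("so back-to-back RDTSC deltas are multiples of %d.\n",
           fused_multiplier);
    return 0;
}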
-Max
for(i=0; i<times; i=i+2)
{
    /* cpuid with eax=0 serializes the pipeline; it overwrites
       eax, ebx, ecx and edx, so all four are clobbered */
    __asm__ __volatile__(
        "xorl %%eax, %%eax \n\t"
        "cpuid \n\t"
        ::: "%rax", "%rbx", "%rcx", "%rdx");

    __asm__ __volatile__(
        "rdtsc \n\t"
        : "=a"(saxp[i]), "=d"(sdxp[i]));

    // your test codes

    __asm__ __volatile__(
        "xorl %%eax, %%eax \n\t"
        "cpuid \n\t"
        ::: "%rax", "%rbx", "%rcx", "%rdx");

    __asm__ __volatile__(
        "rdtsc \n\t"
        : "=a"(saxp[i+1]), "=d"(sdxp[i+1]));
}
This time it varies only a little.
The overhead is 84 clock cycles on Sandy Bridge.
Without cpuid, the overhead is 27 clock cycles on Sandy Bridge.
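For completeness, a sketch of how the saved EDX:EAX halves can be turned back into 64-bit timestamps and a minimum delta (my illustration, assuming saxp/sdxp are uint32_t arrays):

#include <stdint.h>

/* combine the 32-bit halves saved by rdtsc into 64-bit counts,
   then take the minimum start-to-end delta */
uint64_t min_delta(const uint32_t *saxp, const uint32_t *sdxp, int times)
{
    uint64_t min = UINT64_MAX;
    for (int i = 0; i + 1 < times; i += 2) {
        uint64_t t0 = ((uint64_t)sdxp[i]   << 32) | saxp[i];
        uint64_t t1 = ((uint64_t)sdxp[i+1] << 32) | saxp[i+1];
        if (t1 - t0 < min)
            min = t1 - t0;
    }
    return min;
}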
>>for(i=0; i<times; i=i+2)
[SergeyK] How many times did you execute your test to get these numbers?
...
>>The overhead is 84 clock cycles on Sandy Bridge.
>>Without cpuid, the overhead is 27 clock cycles on Sandy Bridge.
...
Best regards,
Sergey
1) Read or write cache misses: use a startup phase where you read all your input variables first, to avoid read cache misses; you can proceed to measure after the input values have already been fetched. If you have to save values of output data, make sure that you do at least one write, so that this core gets "ownership" of the cache line that receives the data and no other core can write to this cache line in the meantime; see the warm-up sketch after this list.
2.1) Interrupts from the OS to do timekeeping: every so often (10 milliseconds in Linux, 100 in UNIX, not sure in Windows) the kernel interrupts the running code to make sure that the right process is using the processor; it is conceivable that your code may be removed from the core by the OS dispatcher to switch in another runnable task.
2.2) H/W external interrupts handled by this particular core.
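A minimal warm-up sketch for point 1 (illustrative; the function name, `input`, `output`, and sizes are placeholders):

#include <stddef.h>

/* touch every input cache line before timing starts, and write once
   to each output line so this core owns it in the cache */
volatile char sink;

void warm_up(const char *input, size_t in_bytes,
             volatile char *output, size_t out_bytes)
{
    for (size_t i = 0; i < in_bytes; i += 64)   /* 64-byte cache lines */
        sink = input[i];                        /* read: fetch the line */
    for (size_t i = 0; i < out_bytes; i += 64)
        output[i] = output[i];                  /* write: take ownership */
}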
To deal with 2) you should, as other people mentioned, at least use a real-time priority and, if possible, not allow this core to handle interrupts.
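For example, on Linux (a sketch; the core number and priority value are arbitrary choices of mine):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* pin the process to core 1 to avoid migrations */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* raise it to the real-time FIFO scheduling class */
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler (needs root/CAP_SYS_NICE)");

    /* ... run the timing loop here ... */
    return 0;
}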
Some suggestions; maybe more to come.
Michael
Michael - yes, one should run the code that's being measured 100 or 1000 times and take the _minimum_ timing out of all the results (as I suggested above, among other minimum necessary recommendations); one simply cannot rely on a single run due to a number of potential effects like the ones you pointed out. One can also take the geomean of the 1000 timings instead of the minimum, excluding results that were obviously affected by external events, e.g. those >10x greater than the minimum. Still, the minimum timing out of 1000 usually represents the ideal/best-possible pipeline timing of the sequence one is looking for.
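A sketch of that aggregation (my illustration; compile with -lm):

#include <math.h>
#include <stdint.h>

/* geometric mean of n timings, excluding samples more than 10x the
   minimum (those were likely hit by external events) */
double robust_geomean(const uint64_t *t, int n)
{
    uint64_t min = t[0];
    for (int i = 1; i < n; i++)
        if (t[i] < min) min = t[i];

    double log_sum = 0.0;
    int kept = 0;
    for (int i = 0; i < n; i++) {
        if (t[i] > 10 * min)
            continue;
        log_sum += log((double)t[i]);
        kept++;
    }
    return exp(log_sum / kept);
}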
-Max
Michael,
I remember that somebody recommended setting the thread affinity mask of the test process to a 2nd CPU (in the case of a multi-core system).
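On Windows that would look something like this (a sketch):

#include <windows.h>

int main(void)
{
    /* bind the current thread to the 2nd logical CPU (mask bit 1) */
    if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << 1) == 0)
        return 1;  /* call failed */
    /* ... run the timing test here ... */
    return 0;
}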
Best regards,
Sergey
If the minimum duration is sought, then repeated trials give you an idea of the minimal time. My question was: can one fight the non-determinism by minimizing interference from unrelated events?
In an ideal environment one uses one core running nothing except the code under test. I am not sure how easy it is to do this with stock Linux kernels (it was doable with UNIX on certain platforms).
One viable approach which reduces (but does not eliminate) interference is to let the code under test run as a thread in the RT scheduling class with a priority higher than other kernel threads in the RT class. This makes sure that the kernel will not switch this thread off this core. And to avoid thread migrations, this thread should be bound to a specific core. Data interference could stop at L3 if one makes sure that no shared variables in cache lines of that core are written to by other cores. This approach eliminates all but the h/w interrupts.
The question is how stringent the timing requirements are and how confident one wants to be in the accuracy of the measurements. Another question is whether the initial phases of the code processing have to be included. With Sandy Bridge I think there is an initial decoding into micro-ops which are then stored in the µop cache. Does it make sense to throw this time away?
I guess Intel has low-level utilities for high-resolution timings like these? I am just wondering .... :)
regards
michael
The exact determination of the critical path of a sequence of instructions in clock cycles can quickly get very unwieldy, for at least two reasons: 1) Nehalem/Westmere/SB processors are superscalar and as such allow several operations to be in progress at a time within their execution units, often predicting the execution path and issuing data prefetch operations, and 2) there may be interference from external events (such as h/w interrupts, or the OS checking whether it has to schedule another process on this core).
You can minimize 2) by letting the thread under observation use the RT scheduling mode and assigning it the highest priority among all threads in this class. As you mentioned, you should also bind this thread to a specific core to avoid thread migration.
On a single-core system using the RT class you always run the risk of making the system unresponsive ... :( but nowadays nothing is single-core anymore.
As for 1), one thing you could try (I have not tried this myself) is to fill the pipeline with NOPs and then issue rdtscp; your instruction sequence; rdtscp; cpuid; rdtscp.
What is it that you are trying to get out of these measurements?
>>fight the non-determinism by minimizing interference of unrelated events?
[SergeyK] This is a really hard task and it is almost impossible to achieve in modern preemptive
operating systems.
>>In an ideal environment one uses one core running nothing except the code under test...
[SergeyK] Did you try to do some performance tests in the MS-DOS operating system? :)
Best regards,
Sergey