Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Simple Question

drMikeT
New Contributor I
Hello,

I have a "simple" question concerning Intel64 microarchitectures (Nehalem and newer) :

I would like to know precisely the number of clock cycles a particular sequence of machine instructions requires from start to finish in the core's pipeline.

I understand that these cores are superscalar and thus the particular instruction sequence may get mixed with other possibly unrelated instructions in the out-of-order execution engine.

How precisely can I collect timestamps, from the clock cycle at which the 1st instruction enters the pipeline for decoding to the clock cycle at which the last instruction commits to architecturally visible state?

I am assuming the following: only a particular thread runs on the core under observation (say I have bound it there, and all other threads, including kernel ones, are bound to other cores).

Can I prevent this core from handling external interrupts, to minimize external interference / contamination of the pipeline by unrelated instructions? There are platforms on which h/w interrupts can be routed to specific cores only, avoiding the others. As for dispatching the single thread on that core, I could use the SCHED_FIFO RT scheduling class with a high enough priority and let the thread run to completion.
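For what it's worth, the binding described above can be done from the thread itself. Here is a minimal, Linux-specific sketch; the core number and priority in the usage note are arbitrary examples, and SCHED_FIFO needs root/CAP_SYS_NICE, so that call may fail on an unprivileged run:

```c
/* Sketch: pin the calling thread to one core and request SCHED_FIFO.
 * Linux-specific (_GNU_SOURCE for the cpu_set_t macros). */
#define _GNU_SOURCE
#include <sched.h>

/* Bind the calling thread (pid 0 = caller) to a single core.
 * Returns 0 on success, -1 on failure. */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Request real-time FIFO scheduling at the given priority.
 * Typically requires root or CAP_SYS_NICE; check the return value. */
int try_set_fifo(int priority)
{
    struct sched_param sp = { .sched_priority = priority };
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

Usage would be e.g. `pin_to_core(3); try_set_fifo(50);` early in the thread, checking both return values.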

Under the above conditions, I could supply a long enough sequence of NOPs to completely fill the pipeline, then:

....
nop
nop
getTimeStamp_1  ;; TS: beginning of instruction sequence
ins1; ins2; ins3; ...; insk
getTimeStamp_2a ;; TS: end of instruction sequence, pre-flush
CPUID           ;; flush micro-ops out
getTimeStamp_2b ;; TS: end of instruction sequence, post-flush
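One way to sketch the timestamping above in C (x86 + GCC only; the function names are mine, standing in for getTimeStamp_1/getTimeStamp_2b) is the common CPUID/RDTSC ... RDTSCP/CPUID bracketing pattern:

```c
/* Sketch of the timestamping idea above (x86, GCC intrinsics).
 * CPUID serializes the pipeline before the first TSC read; RDTSCP
 * waits for all prior instructions to retire, and the trailing CPUID
 * keeps later instructions from being reordered into the timed region. */
#include <stdint.h>
#include <cpuid.h>
#include <x86intrin.h>

static inline uint64_t ts_begin(void)
{
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);   /* serialize: drain in-flight uops */
    return __rdtsc();
}

static inline uint64_t ts_end(void)
{
    unsigned aux, a, b, c, d;
    uint64_t t = __rdtscp(&aux);      /* waits for prior instrs to retire */
    __get_cpuid(0, &a, &b, &c, &d);   /* fence against later reordering */
    return t;
}

/* Time a stand-in instruction sequence; ins1..insk would go here. */
uint64_t time_sequence(void)
{
    uint64_t t0 = ts_begin();
    __asm__ volatile("nop; nop; nop; nop");
    return ts_end() - t0;
}
```

Note the CPUID calls themselves cost hundreds of cycles and have variable latency, so this measures an upper bound for short sequences.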

Any suggestions or comments would be appreciated

thanks
michael

Patrick_F_Intel1
Employee
Hello Michael,
Your strategy is sound, but it may be difficult to avoid ever getting some sort of interrupt.
Increasing the priority of the thread will keep other threads from running, provided the high-priority thread doesn't do something that blocks it (like I/O).
Are you using Linux?
You can monitor which cpu is getting interrupts on Linux using 'cat /proc/interrupts'.
You can control the interrupt affinity with the info in http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux .
I don't know if you've worked with Linux ftrace or Windows ETW, but with these tracing tools you could (with some work) see whether your threads are being swapped out, taking interrupts, etc. They come at the cost of extra overhead, though, and they aren't too easy to use.
It might be simpler to run the sequence multiple times and take the smallest 'TSC_end-TSC_begin' difference.
Hopefully the variation in 'TSC_end-TSC_begin' is small.
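That min-over-repetitions idea could be sketched like this (x86 + GCC; measure_once() is just a placeholder for the timed sequence):

```c
/* Sketch: repeat the measurement and keep the minimum TSC delta,
 * which discards runs inflated by interrupts or migrations. */
#include <stdint.h>
#include <x86intrin.h>

/* Placeholder for the instruction sequence under test. */
static uint64_t measure_once(void)
{
    uint64_t t0 = __rdtsc();
    __asm__ volatile("nop");
    return __rdtsc() - t0;
}

/* Run the measurement 'runs' times and return the smallest delta. */
uint64_t min_cycles(int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t d = measure_once();
        if (d < best)
            best = d;
    }
    return best;
}
```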
Pat
drMikeT
New Contributor I
Hi Patrick, good pointer to the Linux interrupt-handling article. Yes, I am using Linux.

I am not quite up to speed on how much control one has over this on Intel Xeon platforms. If the interference from irrelevant h/w events could be reduced as much as possible, there would be less noise to deal with.

The question was motivated by an interest in measuring, as precisely as possible, the critical path length of code through the pipeline. Due to the out-of-order, superscalar nature of modern Xeons this is already hard enough to measure; when you add interference from external events, counting clock cycles to completion could become really imprecise.

Thanks for the reply

Mike


