reliable time-stamp counter (TSC) in multicore environment
I am working on a project to measure the execution time of a program in a multicore environment. I realized that the TSC of one core is not synchronized with another core. How can I get a uniform and reliable TSC across, let's say, 8 cores?
From my understanding, some of the newer processors derive the TSC from the Front Side Bus clock. Although this improves synchronization of the counters between cores, on larger NUMA systems you likely cannot assume synchronization across sockets. So you may have to accept that timestamp counting will be approximate.
You can improve TSC-based timing by pinning threads to cores. But this still does not assure accuracy, because the pinned thread may experience pre-emption by the O/S and/or interrupts and/or cache interference (either positive or negative) due to other activities on the system (shared cache).
Try to time relatively long-running sections of code, with threads pinned to a logical processor. Have each thread gather its own statistics rather than relying on a monitoring thread. You can then sum the individual threads' execution times (TSC ticks). Then re-run the test several times and reject timings that deviate significantly from the norm. However, you may also want to examine the cause of the deviation. If it is within your control (e.g. sequencing of operations between threads), you might be able to use that to your advantage. When it is out of your control (pre-emption of a thread), you may wish to discard the timing data.
As Jim said, the cores on a CPU share the same TSC, which counts FSB or QPI ticks and multiplies by the nominal clock ratio. On dual CPU motherboards, the BIOS is responsible for synchronizing the CPUs at power-on, but the TSCs will differ by a few FSB ticks. The minimum number of CPU clock ticks to access TSC varies from near 100 on early P4 down to less than 10 on certain current CPUs, where it may even be possible to verify the built-in multiplier by checking the granularity of TSC results. While the recommended procedure on multi-socket systems would be to use affinity settings and compare only TSC counts obtained by the same thread, this doesn't necessarily improve results on current systems.
It may be that the arguably "best" technique is NOT to make extremely precise measurements of many very small time sequences (code sequences). Instead, make coarse measurements of the aggregate of those sequences over high iteration counts.
IOW: make a change to a function, run the complete application for a substantial number of iterations, and check the wall-clock difference. When the wall-clock time is less, you made an improvement.
Due to cache interactions, the application with the fastest run time is not necessarily the application containing the sum of the minimums of each part. Parts interact with each other without explicit communication.