Software Tuning, Performance Optimization & Platform Monitoring

TSCs per logical processor? Per socket?

Corey_R_ — Thu, 22 Dec 2016 09:54:56 GMT

Given that one can use WRMSR to adjust the TSC per logical processor, each logical processor must have at least its own counter. Are the increments of these counters completely synchronized on a single socket? The case I'm worried about is TSC drift (which is small but not non-existent) causing the TSCs on separate cores to slowly diverge.

A related question is whether the TSCs across sockets are also always synchronized, assuming no one uses WRMSR to futz with them. I wouldn't expect separate sockets to share an ART, but I might be wrong.

This is all assuming a recent chip with an invariant TSC.
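One way to sanity-check the "invariant TSC" assumption on Linux is to look at the CPU flags the kernel reports. The sketch below is only a heuristic and is not from the thread: it assumes a Linux system that exposes `/proc/cpuinfo`, and it assumes that the `constant_tsc` and `nonstop_tsc` flags together indicate an invariant TSC. The function names are made up for illustration.

```python
def tsc_flags(cpuinfo_text):
    """Return TSC-related flags found on the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            flags = set(value.split())
            return {f: f in flags for f in ("tsc", "constant_tsc", "nonstop_tsc")}
    return {}  # no 'flags' line (non-x86 or non-Linux input)


def looks_invariant(cpuinfo_text):
    """Heuristic (assumption): constant_tsc plus nonstop_tsc suggests an invariant TSC."""
    flags = tsc_flags(cpuinfo_text)
    return bool(flags) and flags["constant_tsc"] and flags["nonstop_tsc"]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("invariant TSC (heuristic):", looks_invariant(f.read()))
    except OSError:
        print("no /proc/cpuinfo on this platform")
```

This says nothing about synchronization between cores or sockets; it only checks the precondition stated above.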


McCalpinJohn — Fri, 23 Dec 2016 00:10:26 GMT

I have not seen a lot of definitive statements on this topic. (Opinion: Intel generally likes to keep their descriptions as fuzzy as possible to maximize their options for future implementations....)

In recent processors I have not seen any evidence of TSC drift across cores on a single socket. My current (tentative) interpretation is that reading the TSC involves retrieving a value from the "master" timer in the uncore and then doing an interpolation on that value. If this is the case, then there is no mechanism to generate drift.
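The "no mechanism to generate drift" argument can be illustrated with a toy model. This is not a description of Intel's implementation: the 10 ns master period, the 25 ticks per period, and both function names are hypothetical values chosen for illustration. The point is only that a counter re-anchored to a shared master value on every read has a bounded error, while a free-running counter with the same rate error diverges linearly.

```python
def interpolated_tsc(t_ns, rate_error, master_period_ns=10.0, ticks_per_period=25):
    """TSC read at true time t_ns by a core whose local clock is fast by rate_error.

    Toy model: whole periods come from a shared master counter; only the
    fraction since the last master tick uses the (imperfect) local clock.
    """
    periods = int(t_ns // master_period_ns)
    since_update = t_ns - periods * master_period_ns
    local = since_update * (1.0 + rate_error) * ticks_per_period / master_period_ns
    # Clamp so interpolation never runs past the next master tick.
    return periods * ticks_per_period + min(int(local), ticks_per_period - 1)


def free_running_tsc(t_ns, rate_error, master_period_ns=10.0, ticks_per_period=25):
    """Same core, but counting purely from its own local clock."""
    return int(t_ns * (1.0 + rate_error) * ticks_per_period / master_period_ns)


t = 1e9 + 3.7  # about one second of true time, in ns
for model in (interpolated_tsc, free_running_tsc):
    gap = model(t, 1e-4) - model(t, 0.0)  # 100 ppm fast core vs. a perfect core
    print(model.__name__, "gap in ticks:", gap)
```

In this model the interpolated gap never exceeds one tick regardless of elapsed time, whereas the free-running gap grows with `t`.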

I have also not seen evidence of TSC drift across sockets in 2-socket systems. These should be receiving the same 100 MHz reference clock, so again there is no obvious mechanism to generate drift.

It is challenging to design experimental methodologies to say anything about simultaneity in parallel computing systems, but I routinely look at TSC data across tightly synchronized threads and have not seen any systematic variation by core or socket that would lead me to be concerned about drift.


Corey_R_ — Fri, 23 Dec 2016 07:20:00 GMT

Thanks again John! I guess that's good enough for me. I did come up with an experiment to measure TSC drift using multiple systems and some stats, but this one was stumping me. If I had anything approximating a test bench I could measure the drift on the BCLK pins as an approximation of the TSC drift (hopefully it wouldn't be *worse*), but alas, I am but a poor software person.


McCalpinJohn — Fri, 23 Dec 2016 17:57:50 GMT

Clock drift across nodes is a completely different topic, with a long history.... In the early 1990s, while I was an assistant professor at the University of Delaware, I worked a bit with David Mills (also at U.Del.), who designed the Network Time Protocol. That protocol has to deal with extremely large latency variability, with minimum latencies that are not particularly small. It is a lot easier in a single machine room, where the variability is (usually) modest, and where the minimum latencies can be quite low.

MPI has an option for "globally synchronized" output from MPI_Wtime(). It is not completely clear how well synchronized these timers will be, especially given the typical MPI_Wtime() resolution of 1 microsecond. The MPI 2.2 standard says:

A collection of clocks is considered synchronized if explicit effort has been taken to synchronize them. The expectation is that the variation in time, as measured by calls to MPI_WTIME, will be less than one half the round-trip time for an MPI message of length zero. If time is measured at a process just before a send and at another process just after a matching receive, the second time should be always higher than the first one.

One half of the round trip time for modern high-performance fabrics is in the 1 microsecond range -- about the same as the MPI_Wtime() resolution that I typically see. If the target is achieved, then one can treat results from MPI_Wtime() calls as accurate to about one "unit" (1 microsecond), no matter which node made the measurement.
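The "one half the round-trip time" figure is the same bound that falls out of a simple ping-pong offset estimate of the kind NTP uses. The sketch below shows that arithmetic with four timestamps; it is not a claim about how any MPI library synchronizes its clocks, and the function name and the example timings are invented for illustration.

```python
def offset_and_bound(t1, t2, t3, t4):
    """Estimate the remote clock's offset from one request/reply exchange.

    t1: request sent (local clock)     t2: request received (remote clock)
    t3: reply sent (remote clock)      t4: reply received (local clock)
    Returns (estimated offset, worst-case error), in the units of the inputs.
    """
    round_trip = (t4 - t1) - (t3 - t2)         # time spent on the wire, both ways
    offset = ((t2 - t1) + (t3 - t4)) / 2.0     # exact only if both paths are equal
    return offset, round_trip / 2.0            # asymmetry can hide at most RTT/2


# Hypothetical exchange, in microseconds: the remote clock is really 5.0 ahead,
# the forward path takes 0.8, the remote side takes 0.1, the return path takes 1.2.
offset, bound = offset_and_bound(100.0, 105.8, 105.9, 102.1)
print("estimated offset:", offset, "+/-", bound)
```

With a round trip of about 2 microseconds, the estimate cannot be trusted to better than about 1 microsecond, which matches the target described above.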

If I recall correctly, the SGI Origin2000 had hardware support for clock synchronization that enabled 1 microsecond consistency across the entire NUMA system, but this was designed into the NUMA interconnect and did not depend on external or third-party interconnects.


Corey_R_ — Sat, 24 Dec 2016 03:02:13 GMT

I haven't been trying to synchronize nodes (yet), certainly not at the TSC level. I don't think it's feasible, from software, to do better than a few hundred ns. My experiment was merely designed to estimate how much the TSC frequency varies over time across machines. I haven't run it yet, because I haven't written the software it needs.
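The experiment itself is not described in the thread, so the following is only a guess at one possible analysis step, not the poster's method. It assumes each machine can log pairs of (reference time in ns, TSC count) against some trusted reference clock, fits the TSC rate over a window by least squares, and reports how far one window's rate is from another's in parts per million. All names and the synthetic data are hypothetical.

```python
def ticks_per_ns(samples):
    """Least-squares slope of TSC count against reference time.

    samples: list of (ref_ns, tsc_count) pairs from one measurement window.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    num = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def ppm_difference(rate, reference_rate):
    """Relative difference between two rates, in parts per million."""
    return (rate - reference_rate) / reference_rate * 1e6


# Synthetic windows: a nominal 2.5 ticks/ns, then the same clock running 2 ppm fast.
window_a = [(t, 2.5 * t) for t in range(0, 1_000_000, 100_000)]
window_b = [(t, 2.5 * (1 + 2e-6) * t) for t in range(0, 1_000_000, 100_000)]
print("drift in ppm:", ppm_difference(ticks_per_ns(window_b), ticks_per_ns(window_a)))
```

In practice the result is only as good as the reference clock and the latency of reading both clocks together, which is the hard part the thread is discussing.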


TimP — Sat, 24 Dec 2016 20:35:32 GMT

Not mentioned above is the mechanism for starting the TSC clocks on multiple socket systems. The OS startup should send signals nearly simultaneously to all CPUs to restart their TSC clocks. In my experience with Intel platforms, this appears to keep them within 150 ns of synchronization, but they tell us not to rely on this. On past AMD platforms, there didn't appear to be any such synchronization among sockets.

Methods for keeping the nodes of a cluster synchronized don't appear to take care of TSC timers. I'm not sure of the mechanism, but within a single node the MPI_Wtime() and omp_get_wtime() timers have appeared in my tests to be synchronized to within the resolution of those timers, which didn't appear to be as good as 1 microsecond (and not nearly as good on Windows as on Linux). Both the current Intel and GNU implementations of omp_get_wtime() for Windows use the QueryPerformanceCounter API, which won't be synchronized across nodes. Windows gfortran also uses it for the system_clock intrinsic.


Corey_R_ — Mon, 26 Dec 2016 05:57:30 GMT

Tim P. wrote:

Not mentioned above is the mechanism for starting the TSC clocks on multiple socket systems. The OS startup should send signals nearly simultaneously to all CPUs to restart their TSC clocks. In my experience with Intel platforms, this appears to keep them within 150 ns of synchronization, but they tell us not to rely on this. On past AMD platforms, there didn't appear to be any such synchronization among sockets.

I'm woefully unaware of multisocket issues. Is this the INIT IPI?