I need to profile an HPC application across multiple nodes with very low overhead. In the application code, I need to monitor MPI synchronization points (barrier, alltoall, etc.). I'm using the invariant TSC (RDTSC/RDTSCP instructions) because I cannot rely on clock_gettime() due to the high overhead of syscalls. I know that TSCs should be synchronized among cores and sockets on the same node, so intra-node timing synchronization should not be a problem.
But I have the following concerns:
1) How can I synchronize TSCs among different nodes with very fine-grained (sub-microsecond) accuracy? I think the developers of "Intel Trace Analyzer and Collector" must have faced similar problems.
2) I suppose that TSCs on different nodes always increment at a fixed nominal frequency. Do you think that invariant clock oscillators can exhibit small drifts? I suppose they can, but in that case, for long application runs, profilers on different nodes would produce inconsistent inter-node timing information. Moreover, if TSCs are affected by clock drift, I cannot reliably convert timestamps into seconds.
My target system is an HPC machine composed of dual-socket Broadwell nodes interconnected with an Omni-Path network.
Thanks to all in advance,
The IEEE 1588-2002 standard (revised as 1588-2008) was defined to provide this kind of high-resolution clock synchronization for LAN-connected systems. It uses IP as the transport layer, since this is an Internet protocol, so it should be available as yet another IP-over-OPA protocol. Intel could already have implemented it more efficiently directly over OPA, but that is something only the Intel developers can answer.
Clock synchronization is one of the well-known problems in distributed control systems. I hope that Intel can provide more specifics on this protocol over the OPA fabric.
TSCs are definitely subject to drift, and every node is going to generate its own estimate of the TSC frequency. The drift is large enough that TSC cannot be used as a direct replacement for wall-clock time for cross-node computations on nodes that have been up for more than a few hours.
Although "clock_gettime()" reports results in seconds and nanoseconds, I have not seen any implementations that actually track the time at resolutions below one microsecond. The IEEE "Precision Time Protocol" (1588-2002 and 1588-2008) is intended to enable sub-microsecond synchronization on local area networks, but this can be challenging to test. I have seen evidence that the clocks within each of our clusters (which have PTP enabled) are synchronized at approximately 1 microsecond resolution.
The good news is that you can probably get most of what you need by using RDTSC for the fine-resolution sampling, while also collecting paired TSC and "gettimeofday()" (or "clock_gettime()") readings every few seconds. The gettimeofday overhead should be in the microsecond range, so it causes no overhead trouble at one-second intervals, and the TSC should not drift much over a few seconds.