The KNL book says (pg. 335) "Care must be taken with RDTSC to ensure it gets executed on the given core by pinning the thread to the corresponding core using taskset or some other mechanism." Then on pg. 336 it says "The values of TSC on each core are synchronized, so it is possible to compare TSC values generated by different cores (e.g, if you want to know if one event happened before another in a parallel program)." It seems to me that only one of these statements should be true? Educate me please?
It seems indefinite enough that you may need to test it. The usual method for synchronizing TSC counters is to send them all a reset signal during boot and then rely on the base clock rates to stay synchronized. If that doesn't work correctly, the discrepancy could grow day by day in the absence of a reboot. If there is overhead in burying your rdtsc in an omp master region, that may account for more than the difference between the various clocks.
Intel processors with an "invariant TSC" need to compute the TSC value based on a clock that does not vary. Recent systems (Sandy Bridge and newer) use a 100 MHz reference clock to derive all other clocks, and this is one of the inputs to the TSC computation.
One possible implementation of an invariant TSC would be to simply multiply the number of 100 MHz reference clock ticks by the TSC ratio (bits 15:8 of MSR_PLATFORM_INFO, MSR 0xCE). This appears to be how the "fixed-function" performance counter IA32_FIXED_CTR2 "CPU_CLK_UNHALTED.REF" (a.k.a., MSR_PERF_FIXED_CTR2, MSR 0x30B) is implemented on the systems I have tested recently. The results are always exactly divisible by the TSC ratio, so differences are also exact multiples of the TSC ratio. (This counts at the same rate as the TSC, but only counts while the processor is not halted, so it serves a different purpose.)
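To make the arithmetic of this simple scheme concrete, here is a minimal sketch in C. The function names are hypothetical, the 100 MHz reference clock is the figure quoted above, and the raw MSR_PLATFORM_INFO value is assumed to have been read separately (for example with a tool such as rdmsr, which needs root access).

```c
#include <stdint.h>

/* Bits 15:8 of MSR_PLATFORM_INFO (MSR 0xCE) hold the TSC ratio. */
static inline unsigned tsc_ratio_from_platform_info(uint64_t msr_value)
{
    return (unsigned)((msr_value >> 8) & 0xFFu);
}

/* Nominal TSC frequency in Hz, ASSUMING a 100 MHz reference clock. */
static inline uint64_t nominal_tsc_hz(unsigned tsc_ratio)
{
    return (uint64_t)tsc_ratio * 100000000ULL;
}

/* Hypothetical model of the simple scheme: reference ticks times the ratio.
 * Every value it produces, and every difference between two values,
 * is an exact multiple of tsc_ratio. */
static inline uint64_t modeled_count(uint64_t ref_ticks, unsigned tsc_ratio)
{
    return ref_ticks * (uint64_t)tsc_ratio;
}
```

With a ratio of 26, for instance, this model gives a nominal rate of 2.6 GHz, and the counter advances in steps of 26 per reference tick.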
A different implementation is used for the TSC as accessed by the RDTSC and RDTSCP instructions. The TSC values returned by these instructions are not exactly divisible by the TSC ratio. Differences between consecutive reads are a function of the TSC ratio, the current core ratio, and perhaps also the uncore clock ratio (I don't recall if I have tested this). My interpretation is that the RDTSC/RDTSCP instructions compute an interpolated value based on a combination of the external clock and the internal CPU core clock, but it may be more complex than this.
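If you want to repeat the divisibility check yourself, something along these lines should work. It is only a sketch: the helper names are made up, it assumes GCC or Clang on x86 (for the __rdtsc() intrinsic in x86intrin.h), and the TSC ratio has to be supplied from MSR_PLATFORM_INFO as described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 only */

/* Fill out[0..n-1] with back-to-back RDTSC readings; returns n. */
static inline size_t sample_tsc(uint64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = __rdtsc();
    return n;
}

/* Count how many of the n values are exact multiples of tsc_ratio.
 * Returns 0 if tsc_ratio is 0 (no meaningful answer in that case). */
static inline size_t count_exact_multiples(const uint64_t *values, size_t n,
                                           unsigned tsc_ratio)
{
    size_t hits = 0;
    if (tsc_ratio == 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (values[i] % tsc_ratio == 0)
            hits++;
    return hits;
}
```

If every sample came back as an exact multiple of the ratio, that would point to the simple "reference ticks times ratio" scheme; a mix of multiples and non-multiples is consistent with the interpolation described above. The same counting can be applied to the differences between consecutive samples.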
Since these interpolated values are based on the same external reference clock, they do stay in sync.
BUT, most Intel processors also support a "per-logical-processor" TSC adjustment MSR (IA32_TSC_ADJUST, MSR 0x3B), which allows an operating system to adjust the value returned by the RDTSC/RDTSCP instructions on that core without modifying the underlying invariant TSC value.
So if the IA32_TSC_ADJUST values are all the same (typically all zero), then the RDTSC/RDTSCP values should remain synchronized across cores in a single-socket system. Pinning is still recommended to prevent the OS from migrating threads, since migration has significant direct costs (the kernel rescheduling event) and indirect costs (pulling the thread's state from the cache of the core it used to run on into the cache of the core it now runs on), but it should not be required for RDTSC/RDTSCP differences to be valid timing values.
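The condition above can be written down as a small check. This is a sketch under a stated assumption, namely a simple additive model in which the value returned by RDTSC on a logical processor is the shared invariant count plus that processor's IA32_TSC_ADJUST. The function names are hypothetical, and the adjust values are assumed to have been read from MSR 0x3B on each logical processor beforehand.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every logical processor reports the same IA32_TSC_ADJUST value,
 * in which case raw RDTSC values can be compared directly. */
static inline bool tsc_adjust_uniform(const int64_t *adjust, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (adjust[i] != adjust[0])
            return false;
    return true;
}

/* Signed tick difference (b minus a) between readings taken on two logical
 * processors, after removing each processor's offset.
 * ASSUMES the additive model: reading = invariant count + adjust. */
static inline int64_t corrected_delta(uint64_t tsc_a, int64_t adjust_a,
                                      uint64_t tsc_b, int64_t adjust_b)
{
    return (int64_t)((tsc_b - (uint64_t)adjust_b) - (tsc_a - (uint64_t)adjust_a));
}
```

When tsc_adjust_uniform() holds, the offsets cancel and corrected_delta() reduces to the plain difference of the two readings, which is the case described above.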
Pinning is, of course, required for RDPMC instructions that read local core performance counters, since these are intrinsically independent on each logical processor.