Is it possible to monitor another process using RDTSC or RDTSCP?

Zirak · 02-16-2017

The RDTSC and RDTSCP instructions are widely used for performance measurement and in security research, for example in side-channel and covert-channel attacks. They can be used to extract information about another process's use of the CPU caches or main memory when the two independent processes access a range of shared memory addresses, for example through shared libraries. My question is: can process A monitor process B when the two have no shared resources between them?

McCalpinJohn · 02-17-2017

The RDTSC/RDTSCP instructions can only execute on the local core and can't provide any direct information about any other core. However, there is always some indirect information leakage, because no Intel processor is free of shared resources!

If two logical processors are in the same hardware SMP, then they are sharing the coherence fabric (via the "uncore" and QPI resources) and (generally) DRAM resources. If two logical processors are in the same package, then they are (in most recent products) also sharing an L3 cache.

As an example of information leakage, in typical operation the "uncore frequency" is determined by the type and amount of uncore traffic present in the last millisecond. The uncore frequency is a component of the L3 cache hit latency and of the memory latency for both local and remote accesses. So a process A that is measuring memory latency (using very low rates of memory access) should be able to detect whether process B is doing enough L3/memory accesses to cause the uncore frequency to increase. This can be the case even if process A and process B are running on different chips! From an extensive set of measurements, I created a model of memory latency on Xeon E5-2xxx (Sandy Bridge) processors that showed that the local memory latency includes 12 cycles in the uncore clock domain of the other socket. A shift of the remote uncore from the minimum 1.2 GHz frequency to the maximum (typically about 3 GHz, depending on the model) therefore shifts the local memory latency by about 6 ns, which is relatively easy to detect under lightly loaded conditions.
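The 6 ns figure follows directly from the numbers above: 12 uncore cycles take 10 ns at 1.2 GHz and 4 ns at 3 GHz. A minimal sketch of that arithmetic in C is below; the function names are made up for illustration, and 3 GHz is only the "typical" maximum quoted above, so the result will differ by model.

```c
#include <stdio.h>

/* Time in nanoseconds needed for `cycles` clock cycles at `freq_ghz` GHz.
 * (1 GHz = 1 cycle per nanosecond, so ns = cycles / GHz.) */
static double cycles_to_ns(double cycles, double freq_ghz)
{
    return cycles / freq_ghz;
}

/* Change in local memory latency when the remote uncore moves from its
 * minimum to its maximum frequency, assuming (as in the model described
 * above) that a fixed number of remote-uncore cycles sits in the latency. */
static double remote_uncore_latency_shift_ns(double cycles,
                                             double f_min_ghz,
                                             double f_max_ghz)
{
    return cycles_to_ns(cycles, f_min_ghz) - cycles_to_ns(cycles, f_max_ghz);
}

/* Hypothetical helper that prints the worked example from the post. */
static void print_example(void)
{
    printf("shift = %.1f ns\n",
           remote_uncore_latency_shift_ns(12.0, 1.2, 3.0));
}
```

Calling `print_example()` evaluates the case in the post (12 cycles, 1.2 GHz to 3 GHz). A shift of this size is what process A would be looking for in its own latency measurements.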

The implementation of the "non-stop", "invariant" TSC may require access to the uncore (since the core clocks don't increment when a core is halted, and the TSC must continue incrementing). I know that the latency and repeat rate of the RDTSC/RDTSCP instructions are dependent on the local core frequency, but it is possible that they are also dependent on the uncore frequency. This is relatively easy to test on any processor that controls the uncore frequency using MSR 0x620. (https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/600913#comment-1872473)
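A rough sketch of how such a test might be set up in C follows. It is not a validated benchmark. It assumes GCC or Clang on x86 (for `__rdtsc()` from `<x86intrin.h>`), and it assumes the MSR 0x620 layout used by the Linux uncore-frequency driver: maximum ratio in bits 6:0, minimum ratio in bits 14:8, both in units of 100 MHz. Check that layout against the documentation for your processor. The function names are made up for illustration, and reading or writing the MSR itself (for example through `/dev/cpu/N/msr` on Linux, which needs root) is left out.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */
#endif

/* Decode the maximum uncore frequency in MHz from a raw MSR 0x620 value.
 * Assumed layout: bits 6:0 = max ratio, in units of 100 MHz. */
static unsigned uncore_max_mhz(uint64_t msr_0x620)
{
    return (unsigned)(msr_0x620 & 0x7F) * 100u;
}

/* Decode the minimum uncore frequency in MHz from a raw MSR 0x620 value.
 * Assumed layout: bits 14:8 = min ratio, in units of 100 MHz. */
static unsigned uncore_min_mhz(uint64_t msr_0x620)
{
    return (unsigned)((msr_0x620 >> 8) & 0x7F) * 100u;
}

/* Average number of TSC ticks per loop iteration over n back-to-back
 * RDTSC reads. The figure includes loop overhead and the store to `sink`,
 * so only compare values measured the same way. Returns -1.0 when n is 0
 * or when not compiled for x86. */
static double rdtsc_avg_ticks(unsigned n)
{
#if defined(__x86_64__) || defined(__i386__)
    volatile uint64_t sink;   /* keeps the compiler from dropping the reads */
    uint64_t start, end;
    unsigned i;

    if (n == 0)
        return -1.0;
    start = __rdtsc();
    for (i = 0; i < n; i++)
        sink = __rdtsc();
    end = __rdtsc();
    (void)sink;
    return (double)(end - start) / (double)n;
#else
    (void)n;
    return -1.0;
#endif
}
```

The idea would be to pin the measuring thread to one core, fix the core frequency, set the minimum and maximum uncore ratios to the same value, record `rdtsc_avg_ticks()` for a large `n`, and then repeat at a different uncore ratio. If the average changes with the uncore setting while the core frequency stays fixed, that suggests RDTSC does depend on the uncore.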

Zirak · 05-06-2017

Thank you very much