How to monitor LLC cache occupancy inline with fast performance?

Gong__Junzhi · 12-17-2018

Hi All,

I want to record LLC occupancy inline by adding some code to my DPDK application. I tried using the rdmsr/rdpmc instructions to read performance counters, but I cannot find a way to record LLC occupancy. Part of the reason, I guess, is that I am new to this field and the Intel programming manual is a little complicated for me: I cannot find the event number for LLC occupancy in it. My CPU is an Intel(R) Xeon(R) CPU E5-2650 v4.

An alternative, I think, is to use the PCM API: https://github.com/opcm/pcm. Since it provides a C++ API, I have to write a separate program to monitor the cache occupancy. The code is here:

#include <cstdint>
#include <vector>
#include "cpucounters.h" // PCM API

std::vector<int> cores; // a vector of cores to be monitored

PCM * m = PCM::getInstance();

// Two snapshots per core: b[cid][t] is the previous sample, b[cid][t^1] the new one
CoreCounterState b[12][2];

for (int cid : cores)
	b[cid][0] = getCoreCounterState(cid);

for (int t = 0; running; t ^= 1) {
	// get data
	for (int cid : cores) {
		b[cid][t^1] = getCoreCounterState(cid);
	}

	uint64_t inst, l3miss, l2hit, l3hit, l3occupancy, local_mem_bw;

	for (size_t i = 0; i < cores.size(); i++) {
		int cid = cores[i];
		// deltas between the previous snapshot and the new one
		inst = getInstructionsRetired(b[cid][t], b[cid][t^1]);
		l3miss = getL3CacheMisses(b[cid][t], b[cid][t^1]);
		l2hit = getL2CacheHits(b[cid][t], b[cid][t^1]);
		l3hit = getL3CacheHits(b[cid][t], b[cid][t^1]);
		// occupancy is an instantaneous value, so only the new snapshot is used
		l3occupancy = getL3CacheOccupancy(b[cid][t^1]);
		local_mem_bw = getLocalMemoryBW(b[cid][t], b[cid][t^1]);
	}
}

However, this program is quite slow: it produces about 1000 records/second, which is far below the DPDK processing rate. Could anyone please tell me why, or whether there is a problem in this code?

I also looked at the Intel CMT/CAT monitoring code, but there is little documentation on how to use its API, and I don't know how well it performs.

In short, can I monitor LLC occupancy inline with low overhead? Or, alternatively, can I monitor it from a separate program at a high sampling rate?

Thank you!

Junzhi Gong

McCalpinJohn · 12-19-2018

(1) It may or may not be possible to measure what you are interested in, depending on exactly what you mean by "cache occupancy". The uncore performance counters can measure the "queue occupancy" of transactions accessing the L3 cache. In this case "occupancy" means that the counter is incremented by the number of active entries in the buffer each cycle. This can be measured for either the "ingress queue" (RxR_OCCUPANCY) or for the "Table of Requests" (TOR_OCCUPANCY). Most transactions should spend more time in the TOR than in the ingress queue, but you should probably measure both to see what the typical ratios look like. Sections 2.3.1.1 and 2.3.5 of the Xeon E5 v4 Uncore Performance Monitoring Reference Manual (document 334291-001, April 2016) do a pretty good job of explaining how to compute the average number of entries in a queue and the average duration of transactions in the queue.

(2) The RDMSR instruction can only be executed in kernel space, so a user space application has to cross into the kernel to obtain MSR values. The /dev/cpu/*/msr device driver interface only provides one MSR result per call, so reading all four CBo counters in all twelve CBos is going to require 48 system calls (per socket). The overhead of the kernel calls will depend on whether they are local or remote, as well as how busy the system is when you make the calls. "Typical" values range between ~4000 cycles and >10,000 cycles, or (roughly) 2 microseconds to 5 microseconds per call. My code that uses a single thread to read all of the core and uncore performance counters on a 2-socket Xeon Platinum 8160 system takes just over 1 millisecond, so your sample rate of 1000 per second is not surprising. The project https://github.com/LLNL/msr-safe includes a function ("msr_batch.c") that allows the user to request many MSRs in a single kernel call. I think it is limited to MSRs from a single core at a time, but the uncore MSRs can be read from any core on the chip, so it should be possible to get all of the CBo counters in a single call.
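
To make the arithmetic in (1) concrete, here is a minimal sketch of how the two averages can be derived from counter deltas. It assumes you have already collected, over the same interval, the delta of an occupancy event, the delta of the matching insert (allocation) event, and the delta of the uncore clock ticks. The `QueueSample` struct and both function names are hypothetical, made up for illustration; they are not part of any Intel API, and the manual remains the authority on which events to pair.

```cpp
#include <cstdint>

// Hypothetical container for three counter deltas taken over one interval.
struct QueueSample {
    uint64_t occupancy_sum; // delta of an occupancy event (e.g. TOR_OCCUPANCY)
    uint64_t inserts;       // delta of the matching insert/allocation event
    uint64_t cycles;        // delta of the uncore clock-tick counter
};

// Average number of entries in the queue: the occupancy counter accumulates
// "active entries" every cycle, so dividing by cycles gives the mean depth.
double average_entries(const QueueSample& s) {
    return s.cycles ? static_cast<double>(s.occupancy_sum) / s.cycles : 0.0;
}

// Average time (in uncore cycles) a transaction spends in the queue:
// total entry-cycles divided by the number of transactions that entered.
double average_latency_cycles(const QueueSample& s) {
    return s.inserts ? static_cast<double>(s.occupancy_sum) / s.inserts : 0.0;
}
```

With made-up deltas of 4800 entry-cycles, 120 inserts, and 2400 cycles, this gives an average of 2 entries in the queue and 40 cycles per transaction. Note that this is queue occupancy, not the number of cache lines a core holds in the L3.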
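
For the per-call cost described in (2), a rough sketch of what one read through the /dev/cpu/*/msr interface looks like from user space may help. Each `pread` of 8 bytes at a file offset equal to the MSR address is one kernel crossing and returns one MSR value. The helper names `msr_device_path` and `read_msr` are hypothetical; the sketch assumes the Linux `msr` driver is loaded and the process has permission to open the device, and it does not list any specific CBo register addresses (take those from the uncore manual).

```cpp
#include <cstdint>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: path of the MSR device for a given logical CPU.
std::string msr_device_path(unsigned cpu) {
    return "/dev/cpu/" + std::to_string(cpu) + "/msr";
}

// Hypothetical helper: read one 64-bit MSR through an already-open
// descriptor. The file offset is the MSR address. Returns false on a
// failed or short read. Every call is a separate system call.
bool read_msr(int fd, uint32_t reg, uint64_t& value) {
    ssize_t n = pread(fd, &value, sizeof(value), static_cast<off_t>(reg));
    return n == static_cast<ssize_t>(sizeof(value));
}
```

A caller would open `msr_device_path(cpu)` once with `open(..., O_RDONLY)` and then call `read_msr` once per counter. Using the figures above, 48 such calls at roughly 2 to 5 microseconds each come to about 96 to 240 microseconds per socket per sample, which is why batching the reads into a single kernel call is attractive.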