I have a Skylake server.
As the title suggests, I am aiming to find out CHA a physical address is mapped to. For this purpose what I do as follows:
1) I allocate cache aligned (cache size is 64 bytes on my system) memory via posix_memalign() syscall. (this memory space will contain 8 long long variables, 64 bytes in total.)
2) I then set uncore performance counters for each CHA to read LLC_LOOKUP event with umask set so that I measure only LOCAL lookups. (Note that my system is single-socket. So, I guess reading REMOTE lookups would not make sense. I am saying this based on my assumption that REMOTE lookups are cross-socket.)
3) I pin my main (and only) thread to a core.
4) I read LLC_LOOKUP counters for each CHA and record them. Let's call recorded value in each CHA CHAx_RECORD_START, x representing the CHA number.
5) Increment the value of the first long long variable stored in allocated space 100 million times via a for loop. In each iteration, I am flushing the cache line via _mm_clflush call right after incrementing the variable so that in the next iteration it will be necessary to be fetch it from DRAM again.
6) I read LLC_LOOKUP counters for each CHA again and record them. Let's call recorded value in each CHA CHAx_RECORD_END, x representing the CHA number.
7) I compare the differences CHAx_RECORD_END - CHAx_RECORD_START for each CHA and record them.
I thought that while incrementing the value in the variable, in each iteration, core will see that address is not present in its cache (Because we flushed it from cache in the previous iteration). So, the core will query the distributed directory (I believe it will query ONLY THE CHA that is responsible for managing coherency of the allocated address, by running a hash function on the physical address of the allocated space). Is my thinking correct here? Since the managing CHA does not have the address in its cache as well (again, because we flushed it in the previous iteration) I expect that CHA to emit LLC_LOOKUP event.
Is my flow correct? Am I using the correct event (LLC_LOOKUP) for this purpose? Is there any other events I can make use of like SF_EVICTION, DIR_LOOKUP or an event related with LLC misses?
Thank you for contacting Intel Customer Support.
In order to better assist you can you please post your question here:
Intel Customer Support Technician
For firmware updates and troubleshooting tips, visit :https://intel.com/support/serverbios
This looks like the right approach. I have found that the code is much more reliable if MFENCE instructions are added before and after the CLFLUSH instruction. I typically use 1000 iterations of load/mfence/clflush/mfence for each address, but I have some additional "sanity checks" on the results, with the code re-trying the address if the data looks funny. The technical report linked below describes this in more detail.
You can find the address to LLC mapping equations for SKX/CLX processors with 14, 16, 18, 20, 22, 24, 26, and 28 LLC slices at https://repositories.lib.utexas.edu/handle/2152/87595 -- the technical report is the second entry in the list of links of the left side of the page -- the other links provide "base sequence" and "permutation selector masks" for the various processor configurations. The equations are valid for any address below 256 GiB, except for the 18-slice SKX/CLX processors, where the masks are only good for the first 32 GiB. (I have updated the masks for the 18-slice processors to be valid for the first 256 GiB, but have not pushed it out to a public place yet.)
The report also includes results for the Xeon Phi 7250 (KNL, 68-core, with all 38 CHAs active), and for a 28-slice Ice Lake Xeon (Xeon Gold 6330). Unlike previous product transitions which have left the mapping unchanged for fixed slice counts, the Ice Lake Xeon 28-slice processor uses a different hash than the SKX/CLX 28-slice processors. I am still gathering data on the ICX 40-slice Xeon Platinum 8380 processor.