For my research project, I will be developing a tool that assists in proving the correctness of algorithms and data structures in persistent (non-volatile) memory. Since CPU caches are volatile and write back dirty lines in an arbitrary order, a crash can leave persistent memory in an inconsistent state; the tool I will be writing therefore needs to be able to determine, at any particular point in time, what in the cache has not yet been written back to persistent memory.
To determine whether a program is in an inconsistent state, I would like to instrument a user program such that it periodically enforces a 'stop-the-world' effect, where each thread dumps the contents of (or at least some identification of) the cache lines it is currently holding, not to memory, but to some separate buffer that can be analyzed later. I don't mind going super low-level here, but if there exists some kind of DMA that can read this out without flushing the CPU cache, that would be ideal and make my job a lot easier.
I was thinking that if there isn't a way to use DMA (not just from user-space via DMAEngine, but even from kernel level), there might be some way to 'trick' the CPU into giving this information away. One way I thought of is to create a new system call that marks all pages as read-only (the spinning stop-the-world threads perform no writes, so they will not trigger page faults); then each subsequent write triggers a page fault whose address can be recorded, and as I remember from working on my hobby OS, the page-fault handler is given the address the process faulted on.
However, before I go further, I would like to know whether there is a simpler approach to this problem.
There are a couple of approaches that can obtain the *quantity* of dirty data in the caches, but obtaining the specific addresses is more challenging. It would probably be easiest to do in simulation, as long as you don't expect the answers to be exact. (There are lots of details of the Intel cache hierarchy that are not documented and not feasible to reverse-engineer.)
If your workload is small enough to fit on the DIMMs of a single channel, and you have a few hundred thousand dollars for lab equipment, you could put a DRAM logic analyzer on the channel and trigger it to start sampling writes just before the kernel executes a WBINVD instruction. The DRAM analyzer can then capture the addresses of the writes, which will presumably be dominated by writebacks of dirty cache lines. (If you have very large piles of money, you could put logic analyzers on multiple DRAM channels.) With the WBINVD instruction there is no indication of when the writebacks have finished, so some heuristic would be required to decide when to stop recording the DRAM traffic.
I have not been tracking Intel's instruction set extensions in preparation for non-volatile memory, so there could be other useful features, but I would be surprised if these include an easy way to get addresses of dirty data in the caches.