Software Archive
Read-only legacy content

Using L1/L2 cache as a scratchpad memory

Al-jawaheri__Hayder
2,723 Views

Dear all,

Explicit cache control is one of the important features in the Xeon Phi (MIC). How could I use the L1 or L2 cache as scratchpad memory, and also share data between the cores through it?

In addition, is there any way to hack the MESI state of a cache line in the distributed tag directory (DTD)?

Thanks in advance.

Regards

7 Replies
McCalpinJohn
Honored Contributor III

As in most microprocessor-based systems with caches, the best you can do is attempt to control the cache so that it acts more like a scratchpad memory.   There are no documented mode switches for Xeon Phi that would allow the cache to be directly controlled as a scratchpad memory.

For the DTDs there is almost no documentation of the functionality, and certainly no indication that there is any way to change the protocol.  It is possible that some aspects can be modified via firmware, but most of the functionality is probably hard-coded into the silicon.

Al-jawaheri__Hayder

Dear John D. McCalpin,

Thanks for fast response.

But my question remains: how can a programmer use explicit cache-control instructions to map some part of the L1 or L2 cache on the MIC as a shareable scratchpad memory? (An example would be appreciated.)

Actually, I don't want to change the cache-coherency protocol. I need to track the changes in tag status for some shared variables in each thread and record their states, to help me design a runtime that works well with cache coherency.

I am looking forward to hearing from you.

jimdempseyatthecove
Honored Contributor III

There is no way to map, for example, a specific address range into the cache.

The cache system sits between your program and the memory system. In a general sense, it (they: L1, L2, L3) keeps the most recently used (read or written) cache lines' worth of data that the program reads or writes. Stored in the cache along with the data is its address. When your program issues a fetch for data at a given address, (in gross terms) the address is passed to the memory controller, which simultaneously checks whether the data resides in the L1, L2, or L3 (or other cache); if none reports a hit, the request is passed on to the memory bus. The actual procedure is quite complicated, and varies from processor to processor.

Many embedded processors have different memory types and address ranges in the memory system, and may also have a rudimentary means of performing such a mapping. This may be what you are thinking of. In a high-end processor such as the Xeon Phi, the cache(s) may hold arbitrary addresses of physical memory. The user does not program these directly.

There are some instructions that interact with the cache. For example, you can instruct the processor to flush a cache line to RAM (provided it is in the cache and modified). On Xeon Phi you can also specify that a cache line (by way of its address) is to be evicted from the cache system, being written back to RAM if necessary. You can also instruct that memory stores not be written into the cache (you do this when you do not intend to re-reference the data a short time later).

Jim Dempsey

McCalpinJohn
Honored Contributor III

I suppose I was not clear -- there is no (documented) mechanism for controlling the cache behavior.  Caches are deliberately made invisible to the user, so the best you can do is use "hints" associated with load, store, and prefetch instructions to attempt to keep the desired data in the cache and to attempt to minimize the displacement of data you want to keep by data that you don't want to keep. 

In general you can influence retention of data in the cache(s) with a "hint" on a memory reference (i.e. before use) or by explicitly evicting the data from the cache when you know that you are finished using it.   Xeon Phi has both mechanisms, in limited forms.

The Xeon Phi architecture allows each memory reference to contain cache control "hints".  These are described in section 3.7 of the Xeon Phi Coprocessor Instruction Set Architecture Reference Manual (document 327364-001, September 2012).   It is important to read that section carefully, as the cache control hints are interpreted quite differently for prefetch instructions, for load instructions, and for store instructions.

As Jim Dempsey noted, the Xeon Phi also has explicit instructions to evict cache lines from either the L1 or the L2.  (Note that evicting from the L2 will also evict from the L1 because the L2 is inclusive of the L1 on this processor.)    In principle these should be easier to use on Xeon Phi than on a strongly out-of-order Xeon processor because ordering between data use and data eviction is expensive to force in out-of-order processors.  That does not mean that these will be easy to use -- there is insufficient documentation of the specifics of the interaction between the L2 cache and the DTDs to know when these CLEVICT instructions will actually be helpful for a specific purpose.   At least the instructions are documented, so experimentation is possible, and there are core performance counters for L1 and L2 accesses and hits that should provide guidance on whether the processor is doing what you expect.

For the DTDs, I am not aware of any (documented) mechanism to get information from these.  I can't think of any system that allowed the user to read tag information directly.  (That is not exactly true -- while working on processor development teams I have worked with a number of debugging mechanisms that allowed access to all the tag bits, but these mechanisms were disabled on production parts.)


None of this is easy, and it gets harder with every processor generation.   As a basic example, I would not start out with the assumption that an invalid line in the cache will automatically be chosen as the victim in the event of a cache miss to the same set.   "Invalid Victim Select" is an implementation option, not a law of nature.  Similarly, I would not start out assuming that I understood which pseudo-LRU algorithm a particular cache uses, and I would certainly not assume that the  pseudo-LRU replacement algorithms used by the L1 cache and the L2 cache are the same, or that the interaction between the two would be straightforward to understand.  All of this is made much more complex on Xeon Phi because the coherence protocol is not clearly documented and because there are no performance counters for the DTDs.

Al-jawaheri__Hayder

Dear Jim Dempsey and John D. McCalpin,

Thanks again for the feedback.

Actually, I know how the cache system works, and also how scratchpad memory is used in embedded systems.

When I was reading some documents about memory management on the Intel Xeon Phi (MIC), I found some hints about using the L1/L2 cache explicitly as scratchpad memory. Now it is clear to me how the system works.

May I ask you for more details or documents about the core performance counters for L1 and L2 accesses?

Thanks

jimdempseyatthecove
Honored Contributor III

If you really want to use the L1/L2 as scratchpad memory, then consider selecting a processor design that supports a transactional memory system. On Intel processors this feature is called TSX, with the RTM instruction interface.

This feature was expressly designed to handle (relatively) large transactions to shared memory by multiple threads in an atomic manner.

In your case, "scratchpad" implies you want fast exclusive use of the memory (residing in some cache). IOW you would not be interested in shared memory, so why would you look at transactional memory features? The reason is that when you start a transaction, data written within the transactional region is written to cache, just as it would be outside the transactional region; however, all writes to RAM are deferred until you exit the transactional region. Thus it seems feasible that you could start a transactional region and perform many R/M/W operations on data that does not fit in registers.

Caution: if the hardware thread is interrupted during the transaction, the entire transaction is voided.

Xeon Phi (Knights Corner) does not have transactional memory support. I do not know about the next generations; John may know.

John, do you have a CPU with TSX/RTM? If so, would you mind constructing a simple "scratchpad" test program.

Jim Dempsey

McCalpinJohn
Honored Contributor III

I don't know if any of my processors support TSX/RTM -- not enough brain cells to look into that right now.

Xeon Phi performance counters are documented in the Intel Xeon Phi Performance Monitoring Units guide (document 327357-001).  

A few additional events are used by Intel's VTune product and are mentioned in a forum discussion at https://software.intel.com/en-us/forums/topic/515073.
