I looked around but couldn't find any cache management instructions for Xeon processors (I am working on a Xeon E7-8860 v4). I found that _mm_clevict can be used on the MIC architecture.
Is there a similar way to do this on the Xeon E7-8860 v4? What I want to do is reduce the priority of a cache line so that it will be one of the first to be evicted. For instance:
int* arr = new int[ length ];
for ( int i = 0; i < length; ++i )
{
    // use arr[ i ]
    if ( ( i - 1 ) % CACHE_LINE_SIZE == 0 )
        reduce_priority( arr[ i - 1 ] ); // reduces the priority of the cache line in which arr[ i - 1 ] resides
}
If not, can I achieve this by different means?
Any suggestion will be greatly appreciated.
Thanks a lot.
Matara Ma Sukoy
The CLEVICT instructions only exist on the first generation Xeon Phi processors.
For most other processors the only explicit user-mode cache management instruction is CLFLUSH.
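Since CLFLUSH is the one instruction available here, the closest approximation to the requested behavior is to flush the lines that are no longer needed. The following is a minimal sketch, not a drop-in answer: the helper name evict_range and the 64-byte line size are assumptions made for illustration. Note that CLFLUSH is stronger than a priority hint, because it writes the line back (if dirty) and invalidates it from every cache level.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (SSE2)

// Assumed line size; CPUID.01H:EBX[15:8] * 8 reports the real value.
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: write back and invalidate every cache line that
// overlaps [p, p + bytes). Returns the number of lines flushed.
std::size_t evict_range(const void* p, std::size_t bytes) {
    if (bytes == 0) return 0;
    // Round the start address down to a line boundary.
    const std::uintptr_t first =
        reinterpret_cast<std::uintptr_t>(p) & ~(kCacheLine - 1);
    const std::uintptr_t last =
        reinterpret_cast<std::uintptr_t>(p) + bytes;  // one past the end
    std::size_t lines = 0;
    for (std::uintptr_t a = first; a < last; a += kCacheLine, ++lines) {
        _mm_clflush(reinterpret_cast<const void*>(a));
    }
    _mm_mfence();  // order the flushes with later loads and stores
    return lines;
}
```

In the loop from the question, a call such as evict_range( &arr[ i - 16 ], 64 ) after finishing a line would take the place of reduce_priority. Whether this helps or hurts depends on the workload, since a flushed line that is needed again must come back from memory.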
The PREFETCH instructions include support for locality hints, but I don't recall if there is evidence that these hints actually change the behavior.
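For reference, this is roughly how a locality hint is attached to a prefetch from C++. It is a sketch under assumptions: sum_streaming and the prefetch distance are made up for illustration, and a 64-byte line with 4-byte int is assumed. The non-temporal hint (_MM_HINT_NTA) asks the hardware to minimize cache pollution, but as noted above, whether it changes replacement behavior on a given processor would have to be measured.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA

// Hypothetical tuning knob: how many ints ahead of the current one to
// prefetch (8 lines of 64 bytes, assuming 4-byte int).
constexpr std::size_t kPrefetchDistance = 128;

// Sums data[0..n) once, requesting each upcoming line with the
// non-temporal hint. What the hint actually does is up to the CPU.
long long sum_streaming(const int* data, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // One prefetch per 16 ints (one 64-byte line), kept inside the array.
        if (i % 16 == 0 && i + kPrefetchDistance < n) {
            _mm_prefetch(
                reinterpret_cast<const char*>(data + i + kPrefetchDistance),
                _MM_HINT_NTA);
        }
        total += data[i];
    }
    return total;
}
```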
Starting with Haswell processors there is a variant of PREFETCHW called PREFETCHWT1 that adds a "T1" locality hint to the "prefetch with intent to write" operation. I don't know if the new version behaves any differently than the older version.
There are some new cache management instructions in the manual, but they are not supported in the Sandy Bridge to Broadwell and/or KNL processors that I have access to. I assume that support for these instructions starts in Skylake, but I don't have any client Skylake platforms to test and the Skylake Xeon platforms are not yet released. The instructions include:
- CLFLUSHOPT -- the same as CLFLUSH, but with weaker ordering restrictions.
- CLWB -- forces a cache line to be written back to main memory if it is dirty, but leaves the subsequent state of the line up to the implementation. (It is not allowed to stay in the "Modified" state, but an implementation may invalidate the line or leave it in any of the available "clean" states.)
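Rather than guessing from the processor generation, support for these instructions can be checked at run time with CPUID. The sketch below assumes GCC or Clang (it relies on their cpuid.h header); the struct and function names are made up for illustration. The feature bits are the ones enumerated in the Intel SDM: CLFLUSH in CPUID.01H:EDX[19], CLFLUSHOPT in CPUID.(EAX=07H,ECX=0):EBX[23], and CLWB in CPUID.(EAX=07H,ECX=0):EBX[24].

```cpp
#include <cpuid.h>  // __get_cpuid, __get_cpuid_count (GCC and Clang only)

struct FlushSupport {
    bool clflush = false;     // CPUID.01H:EDX[19]
    bool clflushopt = false;  // CPUID.(EAX=07H,ECX=0):EBX[23]
    bool clwb = false;        // CPUID.(EAX=07H,ECX=0):EBX[24]
    unsigned line_bytes = 0;  // CPUID.01H:EBX[15:8] * 8, valid if clflush
};

// Hypothetical helper: report which flush instructions this CPU enumerates.
FlushSupport query_flush_support() {
    FlushSupport s;
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        s.clflush = (edx >> 19) & 1u;
        if (s.clflush) s.line_bytes = ((ebx >> 8) & 0xffu) * 8u;
    }
    // Returns 0 (and leaves the flags false) if leaf 7 is not available.
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        s.clflushopt = (ebx >> 23) & 1u;
        s.clwb = (ebx >> 24) & 1u;
    }
    return s;
}
```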
The AVX512PF instruction set (supported only on second-generation or later Xeon Phi processors) includes locality hints in some of its gather prefetch and scatter prefetch instructions. The versions with the "T0" hint should bring the data into the L1 cache, while the versions with the "T1" hint should bring the data into the L2 cache. I have not tested these.
Hello, regarding CLWB: most articles say it is not supposed to invalidate the cache line, but Intel's documentation says that the hardware may decide whether to retain or invalidate the line. From what I saw, even with a very simple test program, the cache line appears to be invalidated after CLWB. Is there a more detailed description that defines this behavior precisely?
This is a typical case -- the Intel documentation makes it clear that an implementation might invalidate the line and it might not. The behavior may vary across platforms and may vary from one execution to the next on the same platform. All that is guaranteed is that any dirty data is written back to memory.
CLFLUSH can be used to guarantee that the line is invalidated after being written back, although (as noted in the instruction description in Volume 2 of the Intel 64 and IA-32 Architectures Software Developer's Manual) it is not possible to prevent the hardware from fetching the line back into the cache at any time.
The CLWB instruction was developed to ensure that dirty data was written back to memory. This is critical for persistent memory, as described at https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction.
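The write-back guarantee is typically used as "store, CLWB, SFENCE". A minimal sketch of that pattern follows, assuming GCC or Clang; write_back_line and the helper names are hypothetical, and the fallback to CLFLUSH on processors without CLWB is a choice made for this example. The code relies only on the guaranteed part of the semantics (dirty data is written back) and makes no assumption about whether the line stays cached afterwards.

```cpp
#include <cpuid.h>      // __get_cpuid_count (GCC and Clang only)
#include <immintrin.h>  // _mm_clwb, _mm_clflush, _mm_sfence

// The target attribute lets this one function use CLWB without
// compiling the whole file with -mclwb.
__attribute__((target("clwb")))
static void clwb_line(void* p) { _mm_clwb(p); }

// CLWB is enumerated in CPUID.(EAX=07H,ECX=0):EBX[24].
static bool cpu_has_clwb() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    return __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) &&
           ((ebx >> 24) & 1u);
}

// Hypothetical helper: request write-back of the line holding *p.
// Returns true if CLWB was used, false if it fell back to CLFLUSH.
bool write_back_line(void* p) {
    static const bool has_clwb = cpu_has_clwb();
    if (has_clwb) {
        clwb_line(p);    // line may or may not remain cached afterwards
    } else {
        _mm_clflush(p);  // always invalidates the line
    }
    _mm_sfence();        // order the write-back ahead of later stores
    return has_clwb;
}
```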
More fine-grained control over cache states is problematic. Even though CLWB without invalidation seems simple, at the lowest level of the protocol, it may not map easily to existing command types. For example, an L1 or L2 cache can initiate the writeback of a dirty line (with either a clean or invalid end state), but the only transaction that is guaranteed to exist is a writeback to the next level of cache. (It is also possible that there are no coherence protocol commands that have exactly the same semantics as an autonomous victim writeback.)
Every vendor of proprietary processors has a (largely undocumented) low-level protocol internal to the chip. The transactions that are available via this protocol may not include the specific desired behavior (write back all the way to memory without invalidation), and may differ from one processor model to the next. The set of available low-level transactions may differ based on mode settings that are only minimally documented to end users. Examples include "home snoop" vs "early snoop", 1-socket vs 2-socket vs more-than-2-socket, "sub-NUMA cluster" mode, enable/disable of the HitME cache, etc.
The vendor is disinclined to document these in detail because they need the flexibility to change them. (Users are also highly unlikely to understand them well enough to exploit them properly.)
On some platforms, you might be able to convince yourself that a particular set of transactions provides what you want, but since cache operation is supposed to be invisible, the burden is on you to verify this behavior in each new processor generation (and in each processor configuration).
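One common way to do that verification is to time a load immediately after the instruction under test and compare it with the time for a load that is known to hit in the cache. The following is a rough sketch of such an experiment, assuming GCC or Clang on x86-64; time_load, probe_line, and LoadTimes are hypothetical names. It uses CLFLUSH as the instruction under test because that is available everywhere; the marked line is where CLWB or another instruction would be substituted. TSC-based timings are noisy, so the sketch keeps the minimum over many trials, and the results are only indicative.

```cpp
#include <algorithm>
#include <cstdint>
#include <emmintrin.h>  // _mm_clflush, _mm_mfence, _mm_lfence
#include <x86intrin.h>  // __rdtsc (GCC and Clang)

// Times a single load from p in TSC ticks, with fences to keep the
// load between the two timestamp reads.
static std::uint64_t time_load(const volatile int* p) {
    _mm_mfence();
    _mm_lfence();
    const std::uint64_t start = __rdtsc();
    _mm_lfence();
    const int value = *p;  // the load being timed
    (void)value;
    _mm_lfence();
    const std::uint64_t stop = __rdtsc();
    return stop - start;
}

struct LoadTimes {
    std::uint64_t cached;       // best time with the line warm
    std::uint64_t after_flush;  // best time right after the flush
};

// Hypothetical experiment: compare a warm load with a load that
// follows the instruction under test.
LoadTimes probe_line(int* p, int trials = 1000) {
    LoadTimes best{UINT64_MAX, UINT64_MAX};
    for (int t = 0; t < trials; ++t) {
        *p = t;  // dirty the line and bring it into the cache
        best.cached = std::min(best.cached, time_load(p));
        _mm_clflush(p);  // instruction under test; substitute here
        _mm_mfence();
        best.after_flush = std::min(best.after_flush, time_load(p));
    }
    return best;
}
```

If after_flush comes out close to cached for some instruction, the line was probably still in a cache after it executed; if it is much larger, the line was probably invalidated. As the post above says, a result like this holds only for the processor and configuration it was measured on.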