Software Archive

question about caches in mic

Wei_W_2
Beginner

From the online resources, each core has a 512 KB L2 cache. Can the cores share their L2 caches? My program is cache sensitive, so I have to deal with the cache very carefully; more cache reuse is better.

My case is:

I have 2 tasks, and each task stays on its own core. Let's say T1 (task 1) stays on core 1, and T2 stays on core 2. T1 loads data into its L2 cache, and then T2 requires the same data as T1. So will T2 get the data from core 1's L2, or will it get the data from memory?

Another issue, following my last question: if T2 can get data from core 1's L2, is accessing a neighbor's L2 faster than accessing a faraway core's? For example, T1 stays on core 1, T2 stays on core 2, and T3 stays on core 20. T2 and T3 both require the same data in T1's L2. Then it is faster for T2 than for T3 to get the data, since T2 is closer to T1 than T3 is, RIGHT?

robert-reed
Valued Contributor II

Different cores do not share their L2 caches.  However, the nature of the instruction decoder in the Intel Xeon Phi coprocessor code name Knights Corner is such that a single thread per core cannot use every execution cycle--only every other one--so why not put T1 and T2 both on C1, a common core, where they WILL share an L2?  Having two threads on the core will maximize the resources available on the core and give the desired trailblazing, with a fetch into L2 by one thread benefiting the other thread.
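For what it's worth, a minimal sketch of that placement using pthread_setaffinity_np is below. The logical-CPU numbering is an assumption (the usual Knights Corner convention is that core c owns CPUs 4c+1 through 4c+4, with CPU 0 on the last core); verify it against /proc/cpuinfo on your card, and build for the coprocessor (e.g. icc -mmic).

/* Pin two worker threads onto two hardware contexts of the same physical
 * core so that they share that core's 512 KB L2. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* CPUs 1 and 2 are assumed to be two contexts of the same core. */
static void *t1(void *arg) { (void)arg; pin_to_cpu(1); /* produce data */ return NULL; }
static void *t2(void *arg) { (void)arg; pin_to_cpu(2); /* consume data */ return NULL; }

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}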

Wei_W_2
Beginner

robert-reed (Intel) wrote:

Different cores do not share their L2 caches.  However, the nature of the instruction decoder in the Intel Xeon Phi coprocessor code name Knights Corner is such that a single thread per core cannot use every execution cycle--only every other one--so why not put T1 and T2 both on C1, a common core, where they WILL share an L2?  Having two threads on the core will maximize the resources available on the core and give the desired trailblazing, with a fetch into L2 by one thread benefiting the other thread.

I am sorry, but I cannot put T1 and T2 on the same core, since I have a lot of tasks in my program, and some tasks share some data (for example, I have 240 tasks; a portion of the data required by T1-T10 is the same, and likewise for T11-T20, and so on). When I implement it on a CPU, I need to take care with cache reuse.

So if core 2 cannot get data from core 1's L2 but instead gets it from memory, then it would be meaningless for me to consider cache reuse.

Sumedh_N_Intel
Employee

The caches in the Intel Xeon Phi coprocessor are distributed caches, not shared caches. What this implies is that each core creates its own copy of data instead of maintaining a single shared copy. For example, consider thread T1 running on core 1 and T2 running on core 2. If T2 wants to fetch data that is already present in core 1's cache, then it will create a copy of this data in its own L2, fetching it directly from core 1's cache instead of from main memory. This way your accesses are faster than fetches from memory, and in a sense you can reuse the cached data. I hope this clears things up.
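As a rough way to see this, consider the following OpenMP sketch (an illustration, not a tuned benchmark: it assumes the two threads are pinned to different cores, e.g. with KMP_AFFINITY=proclist=[1,5],explicit under the usual KNC numbering, and an array small enough to fit in one 512 KB L2):

#include <omp.h>
#include <stdio.h>

#define N (32 * 1024)            /* 32K doubles = 256 KB, fits in one L2 */
static double data[N];

int main(void)
{
    double sum = 0.0;
    #pragma omp parallel num_threads(2) reduction(+:sum)
    {
        int i;
        if (omp_get_thread_num() == 0)       /* T1 loads data into its L2 */
            for (i = 0; i < N; i++)
                data[i] = (double)i;
        #pragma omp barrier
        if (omp_get_thread_num() == 1) {     /* T2 re-reads the same data */
            double t = omp_get_wtime();
            for (i = 0; i < N; i++)
                sum += data[i];
            printf("T2 read time: %g s\n", omp_get_wtime() - t);
        }
    }
    printf("checksum = %g\n", sum);
    return 0;
}

Comparing T2's read time against a cold, uncached first read gives a rough indication of whether the data came cache-to-cache rather than from memory.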

robert-reed
Valued Contributor II

Well, OK, if T2 must be on C2, it will need to acquire the cache line from C1's L2.  And while cache coherence is supported by a basic MESI protocol, it is also bolstered by a GOLS (Globally Owned, Locally Shared) protocol implemented in the Distributed Tag Directory, in effect implementing an Ownership state among the L2 caches.  While I don't have any performance numbers to quantify the effect of this structure, it should make an access by T2 to a line cached in C1 faster than having to go all the way to memory.  There is a description of this in the System Software Developers Guide (under the Tools & Downloads tab at http://software.intel.com/mic-developer) if you want more details.

Wei_W_2
Beginner

robert-reed (Intel) wrote:

Well, OK, if T2 must be on C2, it will need to acquire the cache line from C1's L2.  And while cache coherence is supported by a basic MESI protocol, it is also bolstered by a GOLS (Globally Owned, Locally Shared) protocol implemented in the Distributed Tag Directory, in effect implementing an Ownership state among the L2 caches.  While I don't have any performance numbers to quantify the effect of this structure, it should make an access by T2 to a line cached in C1 faster than having to go all the way to memory.  There is a description of this in the System Software Developers Guide (under the Tools & Downloads tab at http://software.intel.com/mic-developer) if you want more details.

Thanks very much. So you mean T2 should retrieve the data from C1's L2 cache instead of going all the way to memory?

robert-reed
Valued Contributor II

That's the essence of Ownership (MOESI, not just MESI).  As I said, you can find details and state diagrams in the document I referred to above.

TimP
Honored Contributor III

The VTune KNC general analysis category displays statistics about L2 misses satisfied by reads from memory or from (presumably other cores') L2 caches.  As of the Update 6 version, there wasn't much accessible advice about interpreting these data.

I guess that a high rate of L2 write cache misses satisfied from another core's cache could indicate false sharing.
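For illustration, here is a toy false-sharing example (hypothetical names; 64 bytes is the KNC cache-line size). The two counters in shared_line sit on the same line, so the line ping-pongs between the two cores' caches; the padded pair does not.

#include <omp.h>
#include <stdio.h>

#define ITERS 10000000L

struct padded { volatile long v; char pad[64 - sizeof(long)]; };

static volatile long  shared_line[2];  /* both counters in one cache line */
static struct padded  separate[2];     /* one cache line per counter */

int main(void)
{
    double t0, t1, t2;
    t0 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        long i;
        for (i = 0; i < ITERS; i++)
            shared_line[id]++;         /* false sharing: same line */
    }
    t1 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        long i;
        for (i = 0; i < ITERS; i++)
            separate[id].v++;          /* padded: no line contention */
    }
    t2 = omp_get_wtime();
    printf("false sharing: %g s   padded: %g s\n", t1 - t0, t2 - t1);
    return 0;
}

Run with the two threads pinned to different cores; the first timing should be noticeably worse, and the corresponding write misses satisfied from the other cache should show up in VTune.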

If you have an application that effectively uses just 2 cores, running different tasks, it's hard to believe it will run efficiently on MIC.

Wei_W_2
Beginner

TimP (Intel) wrote:

The VTune KNC general analysis category displays statistics about L2 misses satisfied by reads from memory or from (presumably other cores') L2 caches.  As of the Update 6 version, there wasn't much accessible advice about interpreting these data.

I guess that a high rate of L2 write cache misses satisfied from another core's cache could indicate false sharing.

If you have an application that effectively uses just 2 cores, running different tasks, it's hard to believe it will run efficiently on MIC.

Thanks, I haven't used VTune before; I will have a look at it.

I am not using only 2 cores; I will distribute a bunch of tasks across all 60 cores. The old program runs on the CPU; I plan to port it to MIC and optimize the cache reuse according to MIC's architecture.
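To make that concrete, the kind of task-to-CPU mapping I have in mind is sketched below (the cpu_for_task formula is hypothetical and assumes the usual KNC numbering, where core c owns CPUs 4c+1..4c+4 and CPU 0 sits on the last core; with my 10-task sharing groups, each group would span three adjacent cores):

#include <stdio.h>

/* Map task t (0..239 on a 60-core card) to a logical CPU so that each
 * group of four consecutive tasks lands on one core's four hardware
 * contexts and therefore shares that core's 512 KB L2. */
static int cpu_for_task(int t)
{
    int core = t / 4;            /* tasks 4k..4k+3 share core k */
    int ctx  = t % 4;            /* hardware context within the core */
    return 4 * core + ctx + 1;   /* +1: CPU 0 assumed on the last core */
}

int main(void)
{
    int t;
    for (t = 0; t < 8; t++)
        printf("task %d -> cpu %d\n", t, cpu_for_task(t));
    return 0;
}

Each task's thread could then be pinned with pthread_setaffinity_np as in the sketch earlier in the thread.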

McCalpinJohn
Honored Contributor III

The latency for transferring data between caches is not very intuitive because of the use of distributed duplicate tags.

For example, if a thread running on physical core 1 loads a data item cached by a thread on physical core 0, the two cores are "close", but the cache line can be mapped to any of the 64 distributed duplicate tags.  Some of these are "close" and some are all the way on the other side of the ring.

When averaged over many different addresses, the *average* cache-to-cache latency is almost independent of the location of the caches on the ring, so it is almost independent of the logical processors involved.   The *best case* is quite a bit faster if the cores are close to each other on the ring, but this only applies to the small fraction of the addresses that are mapped to a distributed duplicate tag directory that is *also* close to the cores.

If it is a matter of sharing data, there is probably no point in worrying about locality on the ring.   If it is a matter of placing synchronization variables, it might be worth looking into finding addresses that map to distributed duplicate tag directories that are close to the cores that you are trying to synchronize.
