Hi All,
I am writing an application for the MIC architecture, and I want to know the theoretical bandwidth between each level of the memory hierarchy:
bandwidth between the core and the L1, between the L1 and the L2, and between the L2 and memory. I want this information to evaluate my application.
So I want to know: how many loads can be issued each clock cycle?
How many cycles are needed to transfer a 64-byte cache line from the L2 to the L1?
I want to know the theoretical values, regardless of the application.
Thank you ~
I think that VTune for Linux can be helpful in your case; you can simply use it to profile the MIC.
Guangming:
Most of the information that you are looking for is included in Chapter 2 of the "Xeon Phi Coprocessor System Software Developers Guide", which you can currently find at: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-system-software-developers-guide
There are a number of subtleties that you have to watch out for when interpreting the bandwidths, but these are pretty well covered in sections 2.1.1, 2.1.2, 2.1.3, 2.1.8.2, 2.1.9, and 2.1.10.
Assuming that we are concerned with vector instructions, each *core* can issue at most one per cycle. A vector load will transfer an entire cache line from the L1 Dcache to the core (though all the bits are stored into the target register only when the load is cacheline-aligned). This gives a peak L1 Data Cache bandwidth of 64 Bytes/Cycle/core. But a *thread* can only issue instructions (at most) every other cycle, so you have to be running at least 2 threads on a core to be able to issue a vector instruction every cycle.
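As a concrete check of the "at least 2 threads per core" point, here is a trivial OpenMP program in C that reports how many threads the runtime actually started. The environment-variable values in the comment are only examples (the right numbers depend on which coprocessor SKU you have):

#include <omp.h>
#include <stdio.h>

/* Minimal sketch: to issue a vector instruction every cycle, each core must
 * run at least two hardware threads, since a single thread can issue at most
 * every other cycle.  With the Intel OpenMP runtime this is normally set up
 * with environment variables rather than in code, for example (illustrative
 * values for a 61-core part):
 *   export OMP_NUM_THREADS=122     # 2 threads per core
 *   export KMP_AFFINITY=compact    # pack consecutive threads onto one core
 */
int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}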
Table 2.4 and the preceding paragraph make interesting reading. It is especially fun to find all the inconsistencies between the text and the table, and then try to figure out which (if either) is correct.... It looks like the peak L2 bandwidth is 1/2 of the peak L1 bandwidth (and my measurements are consistent with this interpretation), but for cases with memory loads going into the L2 & L1 caches and castouts coming from the L1 and L2 caches, the last line of table 2.4 is a reminder that things may be more complex than they initially appear.
Other interesting complications:
(1) Vector loads that are not aligned typically require two instructions -- one to load the "upper" elements from the "lower" cache line into the "lower" part of the target register, and a second to load the "lower" elements from the "upper" cache line into the "upper" part of the target register. It helps to draw pictures.... :-)
(2) There is an L2 hardware prefetcher, so L2 cache misses will generate automatic prefetches to bring data from memory to the L2. But there is no L1 hardware prefetcher, so a sequence of loads that miss the L1 but hit in the L2 will not generate automatic prefetches to move the data from the L2 to the L1. Since the L2 hit latency is relatively high, you typically need to use software prefetches to fill the pipeline for L2-to-L1 transfers. Unfortunately these software prefetches compete for the same one-instruction-per-cycle issue slot, so it is not clear that you could load more than one cache line every other cycle even if the bandwidth were available. (A rough sketch illustrating both of these complications is given just below.)
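Here is a rough sketch of both complications in C with the KNC vector intrinsics (built with the Intel compiler and -mmic). The function name, loop structure, and prefetch distances are my own illustrative choices, not tuned or recommended values:

#include <immintrin.h>   /* KNC (Xeon Phi) vector intrinsics; build with icc -mmic */

/* Sum n doubles starting at a pointer that may NOT be 64-byte aligned.
 * Each 512-bit load of unaligned data takes the loadunpacklo/loadunpackhi
 * pair from point (1), touching two cache lines.  The _mm_prefetch calls
 * are the explicit software prefetches from point (2): the T1 hint pulls a
 * line toward the L2 well ahead of use, the T0 hint pulls it from the L2
 * into the L1 a shorter distance ahead.  Distances are illustrative only. */
double sum_unaligned(const double *a, long n)
{
    __m512d acc = _mm512_setzero_pd();
    long i;
    for (i = 0; i + 8 <= n; i += 8) {
        _mm_prefetch((const char *)&a[i + 64], _MM_HINT_T1);  /* memory -> L2 */
        _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);  /* L2 -> L1     */

        /* "unaligned" 512-bit load = two instructions, two cache lines */
        __m512d v = _mm512_loadunpacklo_pd(_mm512_setzero_pd(), &a[i]);
        v = _mm512_loadunpackhi_pd(v, &a[i + 8]);
        acc = _mm512_add_pd(acc, v);
    }
    double s = _mm512_reduce_add_pd(acc);
    for (; i < n; i++)
        s += a[i];                 /* scalar remainder */
    return s;
}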
Peak memory bandwidth for the whole chip is 5.5 billion transfers per second times 16 channels times 32 bits/channel = 352 GB/s. This is not attainable for a bunch of reasons that Intel has not disclosed in detail. I have generated STREAM benchmark values in the 175 GB/s range using large pages and the magic compiler options that Intel recommends. Based on my understanding of the implementation, kernels with more reads and fewer writes should be able to do a bit better, but I have not demonstrated this yet.
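To put numbers like these in perspective, here is a bare-bones Triad-style bandwidth test in C with OpenMP. This is not the tuned STREAM configuration I ran; the array size and build options are assumptions, and it leaves out the large-page setup and the special compiler options entirely:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* Rough sketch of a STREAM-Triad-style bandwidth measurement.
 * Build (for example) with: icc -mmic -openmp -std=c99 -O3 triad.c
 * Arrays are sized to be far larger than the aggregate L2 so that the
 * timed loop runs out of GDDR5 memory. */
#define N (16L * 1024 * 1024)      /* 16M doubles = 128 MB per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    /* initialize in parallel so pages are touched by the threads that use them */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        c[i] = a[i] + 3.0 * b[i];          /* 2 reads + 1 write per element */
    double t1 = omp_get_wtime();

    /* counts 24 bytes per element; ignores any write-allocate traffic */
    printf("triad bandwidth: %.1f GB/s\n", 24.0 * N / (t1 - t0) / 1e9);
    free(a); free(b); free(c);
    return 0;
}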
For a single core, performance is limited by the available concurrency. According to the Xeon Phi System SW Developer's Guide, a core can have up to "about 38 outstanding requests per core (combined read and write)". I have measured the average memory latency at about 275 ns, so a single core can move 38 cache lines * 64 Bytes/cache line every 275 ns, which is about 8.8 GB/s. This is only 8 Bytes/cycle, so it is easily handled by the 64-Byte wide data ring. My latency measurement probably does not actually exploit open page mode (because the accesses are so infrequent, the memory controller probably closes the page before I get back to it), so the best possible latency might be a bit lower, leading to a slightly higher maximum possible concurrency-limited bandwidth. On the other hand, it ain't easy to generate all those concurrent memory transactions and the latency under load will certainly go up, so the best memory bandwidth seen by a single core is typically about 1/2 of the concurrency-limited value.
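For anyone who wants to reproduce a latency number of this kind, the usual approach is a pointer-chasing loop like the sketch below (this is not the exact program I used; the chain size and iteration count are arbitrary). A random cyclic permutation keeps the L2 hardware prefetcher from helping, and each load depends on the previous one, so the time per iteration approximates the average memory latency once the chain is much larger than the caches:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define CHAIN (32L * 1024 * 1024)   /* 32M 8-byte links = 256 MB, far bigger than all caches */

int main(void)
{
    long *next = malloc(CHAIN * sizeof *next);
    long i, j, t;
    if (!next) return 1;

    /* Build a single random cycle over all elements (Sattolo's algorithm),
     * so the chase visits every link in a prefetcher-unfriendly order. */
    for (i = 0; i < CHAIN; i++) next[i] = i;
    for (i = CHAIN - 1; i > 0; i--) {
        j = random() % i;
        t = next[i]; next[i] = next[j]; next[j] = t;
    }

    long p = 0, iters = 20L * 1000 * 1000;
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (i = 0; i < iters; i++)
        p = next[p];                /* each load depends on the previous one */
    gettimeofday(&t1, NULL);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9
               + (t1.tv_usec - t0.tv_usec) * 1e3) / iters;
    /* TLB misses will inflate this unless large pages are used */
    printf("average load-to-use latency: %.1f ns (p=%ld)\n", ns, p);
    return 0;
}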
Jim, thanks for posting this information.
John D. McCalpin wrote:
(full reply quoted above)
I am quite interested in your latency-testing program; could you send it to me?
My email address is zhangxiuxia@ict.ac.cn. Thank you!