Software Archive
Read-only legacy content

measure local and remote L2 cache latency

Mian_L_
Beginner

Hi All,

I wonder if anyone knows how to measure the Xeon Phi's local and remote L2 cache latency in VTune (or any other available tools)?

Thanks very much!

18 Replies
McCalpinJohn
Honored Contributor III

VTune uses a sampling methodology for performance characterization, so it is not really appropriate to measure cache latencies.  (It might be possible to estimate cache hit and miss latency by combining results from sampling on time and on various cache-related performance counters, but this is indirect at best.)

Hardware load latencies are traditionally measured using a pointer-chasing benchmark, such as the "lat_mem_rd.c" program contained in the "lmbench" suite (either version 2 or version 3).
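For illustration, the core of such a pointer-chasing benchmark is just a dependent-load loop like the sketch below (a minimal illustration of the idea, not lat_mem_rd itself): each load returns the address of the next load, so the loads cannot overlap and the average time per iteration approximates the load latency.

void *chase(void **p, long n)       /* p points into a circular chain of pointers */
{
    for (long i = 0; i < n; i++)
        p = (void **)*p;            /* the next address depends on the previous load */
    return (void *)p;               /* returned so the loop cannot be optimized away */
}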

The original pointer-chasing benchmarks set up a circular pointer chain with a fixed stride between pointers.  Although this was acceptable when it was first introduced, almost all recent systems will recognize sequences of loads and issue hardware prefetches to bring the data into the cache in advance.   There are two approaches to dealing with this problem: (1) disable the hardware prefetch mechanisms; or (2) permute the pointer chain so that the hardware prefetch engine will not be activated.

For Xeon Phi there are no L1 hardware prefetchers, so the original approach can be used for local L2 cache accesses.  I measure average L2 hit latencies of about 25 cycles. These are scalar loads (since the data is going to be immediately used as an address), but vector loads are reported to pay only one extra cycle of latency for L1 hits, so the difference in L2 hit latency should be very small compared to 25 cycles.

For L2 misses you do need to pay attention to the hardware prefetchers.  For Xeon Phi the mechanism to disable the hardware prefetchers does not appear to be well documented.  On the other hand, Intel has provided enough information about the L2 hardware prefetchers to enable a user to create a permutation of load addresses that will not trigger any prefetches.  The Xeon Phi System Software Developer's Guide (document 328207) says that the L2 hardware prefetcher on each core can monitor and prefetch 16 address streams (on different 4KiB pages).   Using this information, I set up a circular pointer chain that permutes the addresses across 32 4KiB pages, so the L2 hardware prefetcher "forgets" that it has seen a page by the time I go back to it (with a different cache line address inside the page).  Various L2 performance counters can be used to verify that the desired behavior has been achieved (e.g., L2_DATA_READ_CACHE_FILL counts L2 load misses that are satisfied from a different L2 cache).
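A rough sketch of that kind of layout (details assumed rather than taken from the original code) is shown below: the chain visits all 32 pages at one cache-line offset before moving to the next offset, so 31 other pages are touched between consecutive visits to any given page.

#include <stddef.h>

#define NPAGES          32
#define PAGESZ          4096
#define LINESZ          64
#define LINES_PER_PAGE  (PAGESZ / LINESZ)

/* buf must be NPAGES*PAGESZ bytes, ideally page-aligned */
void build_chain(char *buf)
{
    void **first = NULL, **prev = NULL;
    for (int line = 0; line < LINES_PER_PAGE; line++) {
        for (int page = 0; page < NPAGES; page++) {
            /* visit every page once before moving to the next line offset, so a
               prefetcher that tracks 16 pages has "forgotten" this page before
               it is revisited at a different cache line */
            void **cur = (void **)(buf + page * PAGESZ + line * LINESZ);
            if (prev) *prev = (void *)cur; else first = cur;
            prev = cur;
        }
    }
    *prev = (void *)first;   /* close the circular chain */
}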

One aspect of the Xeon Phi that is quite unusual is that the RDTSC instruction executes much faster than a load that misses the L1 cache.  This can be used to directly measure the latency of individual memory references.  If you are careful to save the results in an array that is L1-contained, you should be able to get measurement overhead down to about 8-10 cycles per load.
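A sketch of that per-load timing (illustrative only, not the actual measurement code; each sample still includes the RDTSC overhead mentioned above):

#include <stdint.h>
#include <x86intrin.h>                        /* __rdtsc() */

#define NSAMP 1024

static uint32_t lat[NSAMP];                   /* 4 KiB of results: small enough to stay L1-resident */

void *time_loads(void **p)                    /* p points into the pointer chain */
{
    for (int i = 0; i < NSAMP; i++) {
        uint64_t t0 = __rdtsc();
        p = (void **)*p;                      /* one dependent load */
        lat[i] = (uint32_t)(__rdtsc() - t0);  /* per-load latency, including RDTSC overhead */
    }
    return (void *)p;
}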

For L2 misses, the distributed tag directories are used to determine where to obtain the data (either from another L2 or from memory).  Physical addresses are hashed to the 64 Distributed Tag Directories in an undocumented fashion, and then mapped to the 16 DRAM channels.  The latter distribution is discussed in my forum posting at https://software.intel.com/en-us/forums/topic/517462.  Latency is a strong function of the "distance" between the parties involved.  When the data is in another L1 or L2 cache, the parties involved are the requesting core, the distributed tag directory, and the core with the data in its L2 cache.  When the data is in memory, the parties involved are the requesting core, the distributed tag directory, and the memory controller.   *Average* L2 miss latencies are in the range of 300 cycles for either cache-to-cache transfers or loads from memory, but the specific values vary by a large amount depending on the relative locations of the participants.  I have measured values anywhere between ~120 cycles and ~400 cycles.  These are repeatable, but because the mapping of physical addresses to DTDs is not published, they are difficult to predict in advance.

jimdempseyatthecove
Honored Contributor III

John,

What are your thoughts on the following testing strategy:

Configure an application to have one HT thread within each core performing a ++someVolatileVariable at 4KiB intervals (assuming 4KiB page size).

Have an additional thread running on one of the cores (core 0?) timing each read with the RDTSC, accumulating latencies.

Note, the sampling thread will experience an L1 hit on the cell ++'d by the local core, but the cells belonging to all the other cores should have to be fetched from those cores' L2 caches. That remote-L2 latency would (should) also vary by ring distance.

The intra-L2 hits could be measured by a different test.

McCalpinJohn
Honored Contributor III

I don't see any advantage in trying to get another thread to collect the measurements -- the overhead of synchronization would almost certainly be much larger than the cost of doing inline measurements.   If I understand what you are suggesting here, the in-order execution within a thread would ensure that loading the incremented memory location before executing the RDTSC would hold the RDTSC until after the data has arrived for the thread under observation, but the converse would not be true --- the memory access could be started by the measurement thread (rather than by the thread under observation) unless two-sided synchronization is employed.  

The shared L1 Data Cache should make it possible to do fast two-sided synchronization between threads on the same physical core, but the dependence on store to load forwarding will probably make it slower than you might expect.    This synchronization would have to be hand-coded -- the OpenMP barrier latency between two threads sharing the same logical processor is reported to be over 1300 cycles (using version 2 of the OpenMP microbenchmarks from EPCC -- https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openmp-micro-benchmark-suite).
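A hand-coded two-sided handshake of the kind being discussed could be as simple as a pair of flags that both live in the shared L1 (a minimal sketch, assuming C11 atomics; the store-to-load-forwarding caveat above still applies, and the flags would need to be reset or replaced by sequence numbers for repeated use):

#include <stdatomic.h>

static atomic_int ping = 0, pong = 0;

/* thread A: signal, then wait for the reply */
void side_a(void)
{
    atomic_store_explicit(&ping, 1, memory_order_release);
    while (atomic_load_explicit(&pong, memory_order_acquire) == 0)
        ;   /* spin; both flags sit in the core's shared L1D */
}

/* thread B: wait for the signal, then reply */
void side_b(void)
{
    while (atomic_load_explicit(&ping, memory_order_acquire) == 0)
        ;
    atomic_store_explicit(&pong, 1, memory_order_release);
}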

 

jimdempseyatthecove
Honored Contributor III

John,

This is not a synchronization situation. Pseudo code

// shared state: one int per producer core, spaced one 4KiB page apart,
// plus one accumulated timer per producer
volatile int volatileIntArray[nProducerThreads * 4096 / sizeof(int)];
__int64 arrayOf__int64Timers[nProducerThreads] = {0};

// one thread per core here (producer threads)
for(int i=0; true; ++i)
  volatileIntArray[4096/sizeof(int)*iThreadNumberDoingSets] = i; // or ++

// sampling thread
__int64 junk = 0;
for(int i=0; i < nSamples; ++i) {
  for(int j = 0; j < nProducerThreads; ++j) {
    __int64 t0 = __rdtsc();
    junk |= volatileIntArray[4096/sizeof(int)*j];
    arrayOf__int64Timers[j] += __rdtsc() - t0; } }

if(junk == 0) printf("%lld", junk); // innocuous nop to prevent loop optimization removing cell test.

// report
for(int j = 0; j < nProducerThreads; ++j) {
  printf("Core %d , time %lld\n", j, arrayOf__int64Timers[j] / nSamples); }

The test is intended to read a location that is held in a remote L2, not to read any particular value. Each producer thread's repeated writes should keep its cell of the array resident in that core's L1/L2.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

You would also insert code to ensure the sampling thread does not begin sampling until after all the producer threads have written at least once.
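For example (continuing the pseudo code above, and assuming a producer's stored value becomes nonzero quickly -- e.g. by starting i at 1):

// sampling thread: spin until every producer's cell has been written at least once
for(int j = 0; j < nProducerThreads; ++j)
  while(volatileIntArray[4096/sizeof(int)*j] == 0)
    ; // wait for producer j's first write to become visible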

Jim Dempsey

TaylorIoTKidd
New Contributor I

Mian,

Do you have the answer you need?

Regards
--
Taylor
 

Mian_L_
Beginner

Hi All,

Thanks very much for your replies. So it seems there are no official numbers for these latencies. However, from what John has presented above, I got a few numbers that are very useful to me.

1. The local L2 cache hit latency is about 25 cycles on average.

2. If the local L2 misses, the average latency is about 300 cycles (ranging from ~120 to ~400 cycles) for either a cache-to-cache transfer or a load from memory.

Then it seems the ring-based interconnect does not help performance a lot, since the remote cache access latency can be similar to the main memory access latency?

TimP
Honored Contributor III

Where adjacent cores need to share cache lines, not only is the access significantly faster over the ring bus, but such transfers presumably don't contribute to saturating memory bandwidth.

If you have in mind using only the Cilk(tm) Plus model (with no affinity facilities), perhaps you're right about the ring bus not being usable to its full potential.

McCalpinJohn
Honored Contributor III

The average latency for the cache-to-cache transfers is only a little bit smaller than the average latency for DRAM accesses because both are fundamentally limited by the cache coherence transactions, not the physical latency of getting the data from one place to another.  

The cache-to-cache transfers on the rings enable the system to avoid DRAM accesses on data that is available in other L2 caches.  All systems support cache-to-cache transfers of modified data (otherwise they would get the wrong answers!), but Xeon Phi also supports cache-to-cache transfers of "clean" data.   Supporting "clean" cache-to-cache interventions reduces the load on the memory system.  This will reduce the average memory latency and therefore increase the typical memory throughput.  Cache-to-cache interventions should reduce the power consumption as well, since no off-chip signalling is required.

TaylorIoTKidd
New Contributor I

John,

"Cache-to-cache interventions should reduce the power consumption as well, since no off-chip signalling is required."

How does this reduce power consumption? I figure that cache to cache transfers across the ring will increase package power consumption but reduce latency; and reducing the need to access memory off chip will reduce global power consumption but not specifically on the package. Are you talking about the need for driving more current through the package's pins?

Regards
--
Taylor

McCalpinJohn
Honored Contributor III

Jim -- sorry I did not understand your code the first time.

For testing cache-to-cache intervention latency I sometimes use a multi-threaded approach that looks somewhat like yours and sometimes use a single-threaded approach with "sched_setaffinity()" calls to move the thread after writing the data and before reading it.   I have run into about the same amount of trouble with either approach. :-(
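The affinity-migration variant can be sketched roughly as follows (illustrative only, not the actual test code): pin the thread to the writing core, touch the data, re-pin it to the reading core, then run the timed loads.

#define _GNU_SOURCE
#include <sched.h>

/* pin the calling thread to a single logical CPU */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}

/* usage: pin_to_cpu(writer_cpu); write the buffer; pin_to_cpu(reader_cpu); timed reads */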

An alternate approach that does not depend on high-resolution timing is a "ping-pong" algorithm that passes a data item back and forth repeatedly between two cores/caches (typically incrementing it at each step, then waiting for the other thread to increment before repeating). While this may work out fine, there are some systems that can experience slowdowns when a cache line is fetched back immediately after it is intervened/invalidated by another core, so I try to avoid this version when I can.
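A minimal version of the ping-pong idea (a sketch only, with the details assumed): two threads on different cores take turns incrementing one shared cache line, so each completed round trip costs two cache-to-cache transfers, and wall-clock time divided by the iteration count gives the average transfer cost without any per-access RDTSC.

#include <stdatomic.h>

static _Alignas(64) atomic_long counter = 0;   /* one cache line bounced between the two cores */

/* each of the two threads calls this with my_parity = 0 or 1 */
void ping_pong(long my_parity, long iterations)
{
    for (long done = 0; done < iterations; ) {
        long v = atomic_load_explicit(&counter, memory_order_acquire);
        if ((v & 1) == my_parity) {            /* my turn: the other thread incremented last */
            atomic_store_explicit(&counter, v + 1, memory_order_release);
            ++done;
        }
    }
}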

One thing that I have not looked at is how the DTDs handle the cases in which a cache line is contained in multiple L2's.   The data will be obtained from only one of the L2's, but it is not clear how the DTDs decide which one should source the cache line.   There are cases for which optimizing for latency and optimizing for ring traffic give different answers, and it looks like it is easy to run into "hot spots" if many DTDs pick the same cache to source cache lines for many different consumers.

McCalpinJohn
Honored Contributor III

Taylor,

My claim about power consumption refers primarily to the reduction in off-chip signalling.   It is my opinion and not based on any direct Intel disclosures.  (It could be investigated using the power monitoring facilities, but I have not tried this yet.)

(1) Both the cache-to-cache accesses and the DRAM accesses will use the rings to transfer commands/responses/data, with the number of "hops" depending on the specific core number, DTD, and memory controller being accessed.    Since cores, DTDs, and memory controllers are distributed all around the chip, I don't expect the average number of "hops" to be significantly different for cache-to-cache transfers and DRAM accesses, though individual values will be all over the place.

(2) Off-chip signalling to GDDR5 memory will take anywhere between O(20) pJ/bit and >>100 pJ/bit (depending on page hits/misses/conflicts) while on-chip signalling is typically O(1) pJ/bit for an 8mm global "wire" on the die.   Any complete transaction will require transferring lots of bits different distances, but if these transfers are approximately the same for cache-to-cache and DRAM accesses, then the DRAM power is always an adder, so the average power for DRAM accesses should always be higher.

There is a fair amount of power used in the ring, but the use of the DTDs is necessary to enable bandwidth scaling (since the caches can't snoop fast enough to keep up with all the cache misses on the chip), and the pseudo-random hash of addresses to DTDs is necessary to avoid "hot spots" in the DTDs.

TaylorIoTKidd
New Contributor I

John,

Thanks for the explanation. I assume the energy-per-bit figures are from the GDDR5 specs plus some reasonable assumptions concerning bus and support circuitry design.

Regards
--
Taylor

 

McCalpinJohn
Honored Contributor III

It is not easy to find GDDR5 energy consumption specifications, but the numbers I used are based on commercially available parts (not the same vendor as the parts in our Xeon Phi SE10P coprocessors).   The low end of the energy consumption range is based on high utilization with very high open page hit rates, while the high end of the energy consumption range is based on high utilization with very low open page hit rates.  In the former case most of the energy is spent reading the data from the sense amps and sending it out the drivers, while in the latter case most of the energy is spent moving rows of data back and forth between the DRAM arrays and the sense amps (inside each DRAM chip) and (when ECC is enabled) performing the additional memory references to obtain the ECC data.

For the on-chip signalling, my estimates are based on a survey of papers from the International Solid State Circuits Conference (ISSCC) and related conferences for the last 3-4 years.  The numbers are necessarily fuzzy:  for any "global" wires on a chip there is a tradeoff between power consumption and latency -- more repeaters can be used to reduce the impact of RC delays and decrease the latency, but at the cost of increased power.  An example discussing some of these issues is http://dx.doi.org/10.1109/VLSIC.2012.6243846.  It appears that the lead author now works for Intel research.

TaylorIoTKidd
New Contributor I

Re: energy consumption figures. I was thinking more about power estimates taken from GDDR voltage and current specs combined with some assumptions concerning design (e.g. for determining power supply requirements).

Re: survey of papers. Are you referring to survey articles? Or to a personal survey you performed? If to a personal survey, I appreciate the pointer to the conference. If you are referring to a specific survey article(s), what is the reference? I'd like to read it.

As always, we appreciate your knowledgeable and excellent contributions.

Regards
--
Taylor
 

McCalpinJohn
Honored Contributor III

Stupid browser just crashed and lost my response....

Short answer -- my methodology for GDDR5 power consumption is based on Micron's white paper and spreadsheet for DDR3 power estimation.  This adds up standby power, background power, activate/precharge power, refresh power, read/write power, driver power, and active termination power to get an estimate for each scenario.

For DDR3 the coefficients for the spreadsheet are readily available from most vendors.  For GDDR5 the data available with sufficient detail is for older process technologies.  So I had to estimate improvements in GDDR5 values based on improvements in DDR3 values.

For STREAM on Xeon Phi, I measured about 65 Watts on the 1.5V supply at an average DRAM bandwidth of about 160 GB/s.  This corresponds to about 51 mW/Gbs (== 51 pJ/bit).    This is higher than the GDDR5 minimum of ~20 pJ/bit for at least three reasons:

  1. I was only running at 45% DRAM utilization, so the background power plays a larger role than in a busier case.
  2. The combination of many memory access streams at the memory controllers almost certainly prevents optimum exploitation of open page mode.   Power consumption rises rapidly as the open page hit rate decreases below the "perfect" rate of 63/64 = 98.4%
  3. The 65W was measured at the 1.5V voltage regulator, so it almost certainly includes power consumption in the Xeon Phi memory controllers as well as power consumption in the DRAMs.   (Since the DRAMs run at 1.5V, the memory controller drivers have to use a 1.5V supply for at least the output drivers.)
    1. Minimum power at the DRAMs is obtained for DRAM reads (i.e., the DRAM is writing data to the DRAM bus and the processor is receiving the data), but this means that the memory controller needs to provide active termination, which can increase the power consumption by 25%-33%.
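(For reference, the arithmetic behind the 51 pJ/bit figure above: 160 GB/s is 1280 Gb/s, and 65 W / 1280 Gb/s ≈ 51 mW per Gb/s; since 1 mW/(Gb/s) is 10^-3 J per 10^9 bits, i.e. 1 pJ/bit, the two numbers are equivalent.)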

Uncore power consumption in that same experiment was less than 2/3 of the DRAM power.

McCalpinJohn
Honored Contributor III

My review of the ISSCC results was informal -- I looked through the sessions on high-frequency signalling and high-performance processors for the 2012, 2013, and 2014 conferences and collected all the energy efficiency numbers that were published.

Most were based on just a portion of the full transmit/receive path, but a few provided numbers for the full chip-to-chip interconnect operating in a realistic context.  An example is http://dx.doi.org/10.1109/ISSCC.2013.6487636, which reports 11 mW/Gbs (== 11 pJ/bit) for short-range links (i.e., a few inches across the circuit board between a memory buffer and a processor) running at 12.8 Gbs and 16.7 mW/Gbs (== 16.7 pJ/bit) for long-range links (i.e., up to 10's of inches between processors on a large circuit board).

These values are improving at a fairly slow rate, since they are based on simple P=V^2/R physics.   Signalling Voltage is decreasing quite slowly as we approach the threshold voltage for silicon FETs, and the effective Resistance is set by matching the impedance of transmission lines -- capacitance and inductance in circuit boards don't change as the transistors on the chips get smaller. :-(

jimdempseyatthecove
Honored Contributor III

It will be interesting to see when we get to the point of having photons interconnecting the processor and external memory. Some progress is being made in this area.

Jim Dempsey
