
Prefetching in Xeon Phi

Saami_R_
Beginner

Hi,

I would like to disable both software and hardware prefetching on the Xeon Phi. Software prefetching can be disabled via the compiler, but I don't know how to disable the hardware prefetcher. Any help will be much appreciated.

Thanks!

8 Replies
Frances_R_Intel
Employee

The short answer is there isn't a way to disable hardware prefetch at this time. The long answer is, you might want to look over the forum post https://software.intel.com/en-us/forums/topic/520185. It has an interesting discussion on disabling the hardware prefetcher and why it is generally not what you want to do anyway. If you are trying to do some comparison tests of different prefetching schemes, you might want to try accessing a large number of data streams at one time, as Dr. McCalpin suggests, or try randomizing memory access using linked lists.

McCalpinJohn
Honored Contributor III

Section 2.1.3 of the Xeon Phi System Software Developer's Guide (document 328207, revision 003, March 2014) says that the L2 cache has a streaming hardware prefetcher that can prefetch up to 16 streams into the L2 cache.    By interleaving accesses across 32 different 4KiB pages, you can avoid activating the hardware prefetcher, even if the accesses within each of those pages are contiguous.
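That interleaved access pattern can be sketched as follows. This is a hedged illustration (the function name and buffer layout are mine, not from the Developer's Guide): rotating across 32 pages between consecutive lines of any one page means each of the 32 streams advances too slowly for a 16-entry stream tracker to establish any of them.

```c
#include <stdint.h>
#include <stdlib.h>   /* for the caller's allocation of buf */

#define NPAGES 32
#define PAGE   4096
#define LINE   64

/* Touch one cache line per page, round-robin across all 32 pages,
 * before advancing to the next line within each page. */
long sweep(const uint8_t *buf) {
    long sum = 0;
    for (int line = 0; line < PAGE / LINE; line++)      /* 64 lines/page */
        for (int pg = 0; pg < NPAGES; pg++)             /* 31 other streams
                                                           between hits to
                                                           any one page */
            sum += buf[(size_t)pg * PAGE + (size_t)line * LINE];
    return sum;
}
```

Within each page the accesses are still contiguous at a 64-byte stride, which is exactly the point: the pattern defeats the prefetcher by stream count, not by randomness.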

In my testing, it appears that the Xeon Phi L2 hardware prefetcher is much less aggressive than the L2 hardware prefetchers in the mainstream (e.g., Sandy Bridge) processors.  

Using a standard pointer-chasing code with a fixed stride and a circular buffer of 1 MiB or larger, I found that strides of 256 Bytes or more provided the same latency as cases with no hardware prefetching.   With a stride of 128 Bytes the average latency per load was reduced by a factor of 4.2 to 4.3 and with a stride of 64 Bytes the average latency per load was reduced by a factor of 8.5 to 8.7.   

For a stride of 64 Bytes and no software prefetches, I expect two full memory latencies at the beginning of each page before the HW prefetcher starts bringing in data.  If the prefetcher is "perfect", then every subsequent load will be an L2 hit.   Using nominal latencies of 277 ns for the average memory latency and 24 cycles (21.82 ns) for the L2 hit latency, this gives an estimate of 1907 ns/page -- about 6% lower than the observed value of  2034 ns/page implied by the 31.787ns average latency of the stride-64 case on 2MiB pages.   If the L2 hit latency is assumed to be 26 cycles instead of 24, then the model is within 1% of observations.

The case with 128 Byte strides can be modeled as 5 memory latencies plus L2 hits for the rest of the loads in each 4KiB page.   It makes sense for there to be at least three memory latencies --- if the first two accesses to the page are non-contiguous, it is common to wait for a third access to see if there is a constant stride.   It is not clear where the remainder of the latency comes from, but it is certainly possible that in the 128-byte-stride case the prefetches are not ramped up fast enough to fully cover the memory latency for the 4th, 5th, 6th loads.  I am not aware of any detailed documentation on the ramping sequence used by the hardware prefetchers on any Intel systems, but I have not looked very hard for it.

Saami_R_
Beginner

Thanks for the responses, Frances and John. The numbers and explanations in John's post are interesting and have given me a better understanding of the memory system on the Xeon Phi.

I am trying to characterize the performance and energy consumption of the Xeon Phi using several benchmarks. The first step is to observe the performance/energy gains under various prefetching configurations. My intent is to use the configuration with no prefetching at all as the baseline for all other configurations. Since I will be using several benchmarks, it may not be feasible to modify the source of each benchmark program to deactivate the hardware prefetcher. If anyone has suggestions on this, I'd welcome them.

I have talked with two researchers working on Xeon Phi, and apparently Intel has a way of turning off the prefetcher explicitly, but the tool is not available publicly.

jimdempseyatthecove
Honored Contributor III

The problem with deactivating the hardware prefetcher is that, other than the test program, virtually no other application would happen to fall into this access pattern. That makes any report you produce virtually useless as to the benefit (or lack thereof) of prefetching, whether hardware, software, or both.

Add to this the fact that the Knights Landing (KNL) architecture is imminent, so your study will cover old technology (the current MIC is Knights Corner, KNC).

Jim Dempsey

McCalpinJohn
Honored Contributor III

The main advantage of deactivating the hardware prefetcher is that the behavior becomes a bit more controllable and therefore somewhat easier to understand.  

I find it particularly valuable when trying to tease apart the roles of temporal locality and spatial locality in cache hit rates for applications that are too complex to model analytically.

yuyinyang
Beginner

John D. McCalpin wrote:

The main advantage of deactivating the hardware prefetcher is that the behavior becomes a bit more controllable and therefore somewhat easier to understand.  

I find it particularly valuable when trying to tease apart the roles of temporal locality and spatial locality in cache hit rates for applications that are too complex to model analytically.

Hi Dr. Bandwidth,

Could you tell me the latency of the software prefetch instructions on the Xeon Phi, in cycles, for both the prefetch from memory into L2 and the prefetch from L2 into L1? I would also like to know the latencies of many of the vector instructions on the Xeon Phi, but I couldn't find them in any of Intel's data sheets.

Thanks a lot.

McCalpinJohn
Honored Contributor III

Cache and DRAM latencies on Xeon Phi have been discussed many times in this forum.  Latencies for SW prefetches should be a few cycles less than latencies for loads (since the prefetches don't bring the data all the way to the registers), but this is difficult to measure in isolation.  L2 latencies are about 24-26 cycles, and (idle) memory latencies average about 300 cycles with a range of ~140 to ~400 cycles.  The idle memory latency depends on the location of the core making the request, the location of the distributed tag directory (DTD) that handles coherence for the physical address, and the location of the memory controller that owns the physical address.  Remote L2 latencies average a bit lower than memory latencies, but since both are governed primarily by cache coherence, the difference is small.  An average value of 275 ns works pretty well.
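To illustrate the two target levels the question asks about, the standard `_mm_prefetch` intrinsic takes a locality hint that selects how close to the core the line should be brought; on KNC I believe these map to the vprefetch instructions, but that mapping is an assumption here, not something confirmed in this thread.

```c
#include <xmmintrin.h>   /* _mm_prefetch */

/* Hint-level sketch: T1 asks for the line to be brought toward the L2,
 * T0 asks for it in the L1. Prefetches are hints only -- program
 * correctness never depends on whether they are performed. */
void warm_line(const char *p) {
    _mm_prefetch(p, _MM_HINT_T1);   /* memory -> L2 */
    _mm_prefetch(p, _MM_HINT_T0);   /* L2 -> L1 */
}
```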

McCalpinJohn
Honored Contributor III

I have not seen a complete list of Xeon Phi instruction latencies, but a fair amount of information is available from several public sources.

Careful review of these sources will show a number of inconsistencies, but there is broad agreement -- most computational instructions have a four-cycle latency (i.e., the result can be used in the 4th cycle after execution of the instruction that generates the result), while conversion, permutation, and Extended Math Unit (EMU) instructions have higher latencies.

The inconsistencies are not surprising -- part of the confusion is related to the definition of latency, part is related to the "not in consecutive cycles" limit on instruction issue for a single thread on the Xeon Phi, and part is related to lower-level complexities in the processor such as the AGU interlock and the various bypasses (or lack thereof) between different pipelines.

Given the relatively high memory latency and the difficulty in generating enough concurrency to reach asymptotic levels of memory bandwidth, getting a code tuned well enough to even notice the vector instruction latencies is already a significant achievement.
