Software Archive

Disable HW prefetch in mic

Surya_Narayanan_N_

Does -no-opt-prefetch disable hardware prefetch too? If not, how do I disable hardware prefetch?

9 Replies
TimP
Honored Contributor III
That option removes only compiler-generated software prefetch. The hardware prefetcher is not very aggressive and populates only the L2, and I'm not aware of cases where it is undesirable. If you explain what you are trying to do, someone will comment.
Surya_Narayanan_N_

I am not sure how much prefetch helps on Xeon Phi when 4 SMT threads are run on a core. For benchmarks like PARSEC and SPLASH I observe that increasing the number of threads spawned on a physical core from 1 to 4 increases the L2 cache misses per thread. This behavior is expected, since the effective L2 capacity per thread shrinks by roughly a factor of 4. If this increase in L2 misses per thread is due to lost cache-line reuse (i.e., eviction of needed data from the cache), I would be interested in knowing how much software or hardware prefetch contributes to it.

I ran the experiment with a very large input set, varying the number of threads from 1 to 236, with all threads pinned and spawned across the SMT contexts.

no_vec_pre = vectorization and s/w prefetch disabled.

no_vec = vectorization disabled, s/w prefetch enabled.

full = vectorization and s/w prefetch enabled.

From the table we can see that full is better only up to 8 threads; beyond that, the no_vec_pre configuration is better. To my understanding, increasing the thread count does increase the required bandwidth, but Xeon Phi has more sustainable bandwidth than the application needs. On the other hand, every L2 miss that has to be filled from memory is very expensive. I suspect no_vec_pre performs better because it incurs fewer L2 misses than the full configuration. Can someone tell me whether my hypothesis is right?

 

threads   no_vec_pre/lu_cb   no_vec/lu_cb   full/lu_cb
      1   6m 46.54s          6m 58.40s      5m 35.67s
      2   4m 18.09s          4m 18.51s      3m 41.31s
      4   2m 50.25s          3m 27.67s      2m 57.30s
      8   1m 32.42s          1m 47.79s      1m 38.81s
     16   0m 57.85s          1m 04.83s      1m 00.13s
     32   0m 38.71s          0m 43.25s      0m 40.84s
     64   0m 31.29s          0m 33.75s      0m 32.43s
    128   0m 28.91s          0m 30.32s      0m 29.49s
    236   0m 29.66s          0m 29.98s      0m 29.74s
Surya_Narayanan_N_

I have attached the table as a text file; please refer to that.

TimP
Honored Contributor III

It's not unusual to see performance peak at 3 threads per core, even when there is some effective cache sharing among the 3 local threads.  Hence the provision of options such as KMP_PLACE_THREADS=59c,3t and OMP_PROC_BIND=close (on the assumption that neighboring threads are the ones which need to share a core and L2 cache).  Note that KMP_PLACE_THREADS also sets the number of threads, unless you specify it separately.
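As a rough illustration, a trivial OpenMP loop run under those settings might look like the sketch below (the array size, loop body, and build line are placeholders; the environment variable values are the ones mentioned above):

/* placement.c -- a minimal sketch; build for the coprocessor, e.g.
 *   icc -mmic -openmp placement.c -o placement.mic
 * and run it with the affinity settings discussed above:
 *   export KMP_PLACE_THREADS=59c,3t    (59 cores, 3 threads per core)
 *   export OMP_PROC_BIND=close         (neighboring threads share a core/L2)
 */
#include <stdio.h>
#include <omp.h>

#define N (1 << 24)
static double a[N], b[N];

int main(void)
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}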

If you don't apply affinity to retain cache locality in the usual cases where that can be done easily, I suppose it's possible that too much prefetching will hurt by wasting bandwidth.   We have seen cases where even optimized applications (but with fairly short loop counts) prefetch 30% more data than is effectively consumed.  Without affinity, the application may not be consuming data from the cache into which it was first prefetched.

If you have a cache capacity problem because even adjacent threads use many distinct cache lines, it's possible to see performance peak at 1 or 2 threads per core, with 1 or 2 cores left open for system activity.  I don't see that cutting back on prefetch should be a solution to cache capacity.  Fiddling with prefetch distances does imply that prefetching too early may aggravate capacity problems, while in other cases with long data runs it is better not to be stingy about prefetch distances.

On the host CPU, the most common case for disabling one of the hardware prefetchers is in applications where the "alternate" or "2nd" sector prefetch, which always initiates a companion cache line read, will be ineffective, either due to the adjacent sector seldom being used in temporal proximity, or to false sharing issues where a thread is writing data just one cache line away from where another thread is working.

Performance peaking at 3 threads per core is likely associated with that number of threads being sufficient to keep the vector processing unit pipelines full.  I suspect that the MKL library functions which can use all thread slots on all cores effectively do so by parallelizing data shuffles with floating-point operations, where the shuffles are used to improve cache data locality by tiling or transposing for unit stride.

I have seen examples in which increased use of streaming stores, for example by adding the vector nontemporal pragmas, could open up space for prefetch to work more effectively, until the application was reorganized for better cache locality, after which it was not desirable to force-evict data which might be used again later.
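As a minimal sketch of that idea (the function and array names are placeholders; the pragma is the Intel compiler's loop-level nontemporal hint):

#include <stddef.h>

/* Copy with a streaming-store hint: "#pragma vector nontemporal" asks the
 * Intel compiler to use nontemporal (streaming) stores for the written data,
 * so the output lines do not displace cache lines that will be re-read.
 */
void copy_streaming(double *restrict dst, const double *restrict src, size_t n)
{
#pragma vector nontemporal
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}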

McCalpinJohn
Honored Contributor III

Intel has not publicly documented the mechanisms to disable the hardware prefetchers in any of their processors since somewhere back in the Pentium era.   For most Intel processors the information has clearly been made available to BIOS writers, since many BIOSes support options to disable some or all of the hardware prefetchers.  Xeon Phi does not have a third-party BIOS, so this route to obtaining the information does not apply.

It is possible to defeat the L2 hardware prefetcher on Xeon Phi by accessing more than 16 data streams (on different 4 KiB pages) concurrently.  This does not help with user benchmarks, but it can help with characterizing the behavior of the prefetchers. 
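A rough sketch of that characterization trick (sizes and names are illustrative): interleave accesses across more streams than the prefetcher can track, with each stream on its own 4 KiB page:

#include <stddef.h>

#define NSTREAMS 24            /* more than the ~16 streams the L2 prefetcher tracks */
#define PAGE     4096
#define LINE     64
#define LINES    (PAGE / LINE) /* cache lines per 4 KiB page */

/* Read one cache line from each of NSTREAMS pages in turn, so that more than
 * 16 independent page streams are always active concurrently; buf must span
 * at least NSTREAMS * PAGE bytes.
 */
long touch_many_streams(const char *buf)
{
    long sum = 0;
    for (int line = 0; line < LINES; line++)
        for (int s = 0; s < NSTREAMS; s++)
            sum += buf[(size_t)s * PAGE + (size_t)line * LINE];
    return sum;
}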

In my experiments I found that hardware prefetch and software prefetch provide close to the same performance for simple test cases (like STREAM), when using 4 KiB pages.  

BUT, it is important to note that on Xeon Phi software prefetches are dropped if they miss in the TLBs (i.e., if they cause a page table walk):

  • Since the L1 Data TLB only maps 64 4KiB pages, and since software prefetches are often generated several KiB ahead of the current load pointer, it does not take very large data sets to get into the situation where 1/2 or more of the SW prefetches are dropped.
  • Large pages improve this in two ways:
    • Large pages cover 2 MiB, so prefetching a few KiB ahead of the current pointer will only cause a very small fraction of software prefetches to be dropped.
    • The Level 2 TLB maps 64 large pages, so you can repeatedly access ranges up to 128 MiB on each core before incurring TLB misses.

There is actually a good reason for this behavior. Unlike other Intel processors, for 4KiB pages the Xeon Phi level 2 TLB holds "Page Directory Entries" (PDEs) instead of "Page Table Entries" (PTEs).  Each PDE points to a page of physical memory holding 512 PTEs, so by using the L2 TLB entries to hold PDEs instead of PTEs a much larger range of addresses can be accessed using the information in the L2 TLB.  Holding PDEs in the L2 TLB provides the same mapping range as using 2 MiB pages -- 128 MiB.    The direct cost of this approach is that the hardware page table walker must be activated, but the table walk is very fast since the PDE information is usually in the L2 TLB.  The indirect cost is that if the memory access causing the table walk is a software prefetch, that instruction is dropped.
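For anyone who wants to try 2 MiB pages explicitly, one common approach on Linux is an anonymous mmap with MAP_HUGETLB; this is only a sketch, and it assumes huge pages have already been reserved (e.g. via /proc/sys/vm/nr_hugepages):

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Allocate 'bytes' backed by huge pages; fall back to normal 4 KiB pages if
 * the huge-page mapping fails (for example, if none have been reserved).
 */
static void *alloc_hugepages(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p;
}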

As far as I can tell, both the use of the L2 TLB to hold PDEs and the decision to drop software prefetches that cause table walks are unique to Xeon Phi (among recent processors).  It would be interesting to experiment with a processor that could allow different combinations of these features.

TimP
Honored Contributor III

John made some interesting points. According to this, the situation with 4 KB pages is opposite on the host and on MIC (KNC): on the host it is hardware prefetch that stops at a page boundary, while on MIC it is software prefetch that is interrupted there.

Transparent Huge Pages can give the automatic benefit of prefetch crossing 4 KB boundaries on both host and MIC.
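(A process can also ask for transparent huge pages on a specific region it has already mapped; the sketch below uses madvise with MADV_HUGEPAGE and assumes THP is enabled in the kernel in "madvise" or "always" mode.)

#include <stddef.h>
#include <sys/mman.h>

/* Hint the kernel to back an existing, suitably aligned region with
 * transparent 2 MiB pages; if THP is unavailable the call fails harmlessly. */
static void request_thp(void *addr, size_t len)
{
    (void)madvise(addr, len, MADV_HUGEPAGE);
}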

A web search will turn up MSR settings for changing the various prefetchers on the host (some of which may also be available in BIOS setup), but not for MIC, along with recommendations Intel has made about settings for various categories of host workload.  Several of those categories wouldn't be appropriate for running on MIC KNC.  The host setting changes have to be made with root privilege and apply to everyone running on the platform, so they aren't suitable for mixed usage.

Surya_Narayanan_N_

Thank you for the replies, Tim and John. I also have another question.

 

If I use the compiler flag -opt-prefetch-distance=4/8/16/32/64, will the compiler implement the requested prefetch distance in every case? For example, if the loop count is not very big, will the compiler always place the prefetch instructions at a distance of 8 regardless of the command-line option? In other words, is there any default behavior that overrides the flag? In some benchmarks I see literally no difference when varying the prefetch distance.
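For reference, the distance can also be pinned for an individual loop with the Intel compiler's source-level prefetch pragmas; the sketch below is illustrative only (array names, the hint value, and the distance are placeholders; see the compiler documentation for the exact hint encoding):

#include <stddef.h>

/* "#pragma prefetch a:1:16" requests software prefetch of array a with
 * hint 1 at a distance of 16 iterations, while "#pragma noprefetch b"
 * suppresses software prefetch for b.  Compiling with -S shows what the
 * compiler actually emitted.
 */
void scale(double *restrict a, const double *restrict b, size_t n)
{
#pragma prefetch a:1:16
#pragma noprefetch b
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}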

 

jimdempseyatthecove
Honored Contributor III

John,

Put your hypothetical thinking cap on for the moment and comment on the following:

Suppose there were a prefetch option that would cause the compiler to (attempt to) determine the page interval of the loop and the number of pages that will be entered. Then, at the appropriate loop interval, it would issue something like:

compute number of pages NP
if NP .gt. 1
  loopNPless1:                        ; outer loop, executed NP-1 times
    mov scratchReg, [onePageAhead]    ; load from one page ahead to preload its TLB entry
                                      ; (scratchReg not used, or used much further down)
    loopInPage:
      doWorkInPage
      prefetch distance ahead
    endLoopInPage
  end loopNPless1
endif
loopRemainder:                        ; last page: nothing further ahead to touch
  doWorkInPage
end loopRemainder

Essentially, this forces the TLB entry that would otherwise be missed to be loaded. Of course, it places pressure on the number of TLB entries that can be mapped.

The above can be done by hand, but I think it would be best done with a #pragma or !DEC$.

The reason for the loop nesting is that, unlike a prefetch, the touch load must not reach beyond the extent of the array.

An alternate way to avoid the nesting is to assure that the virtual memory page representing the highest page currently mapped is not made available to the heap. Thus any address generated in the loop, while potentially causing a page fault, would not be an invalid address.
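In C, a hand-rolled version of that idea might look like the sketch below (the function, array, and size names are made up for illustration; a volatile temporary keeps the touch load from being optimized away):

#include <stddef.h>

#define PAGE 4096

/* Walk array 'a' of n doubles one 4 KiB page at a time; before working on a
 * page, issue a plain (demand) load one page ahead so its TLB entry is
 * resolved early, then do the work on the current page.  The bounds check
 * avoids touching memory beyond the extent of the array.
 */
double sum_with_tlb_pretouch(const double *a, size_t n)
{
    const size_t per_page = PAGE / sizeof(double);
    double sum = 0.0;

    for (size_t base = 0; base < n; base += per_page) {
        size_t end = (base + per_page < n) ? base + per_page : n;

        if (end < n) {                      /* there is a next page: pre-touch it */
            volatile double touch = a[end]; /* demand load forces the TLB walk now */
            (void)touch;
        }

        for (size_t i = base; i < end; i++)
            sum += a[i];
    }
    return sum;
}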

Jim Dempsey

McCalpinJohn
Honored Contributor III

You will need to look at the assembly code to see exactly what the compiler generates for prefetch distances.  When the compiler generates multiple versions of a loop, a VTune sampling run should be able to map back to the assembly instructions so that you can see which version(s) of the code were actually executed at run time.

As I mentioned previously, hardware prefetch and software prefetch do pretty much the same thing for contiguous accesses, so it is not surprising to run across cases where changing the prefetch-ahead distance has little impact.  It is also important to note that on small pages, a prefetch-ahead distance that corresponds to 64 cache lines (4 KiB) will result in *all* of the software prefetches being dropped.

On the application side, don't forget that homogeneous threaded code shares the associativity of the cache, as well as its capacity.  It is not unusual to find that increased miss rates are primarily conflict misses rather than capacity misses -- the 8-way associative L2 on Xeon Phi can act like a 2-way associative cache for each of the four thread contexts.  (Whether the cache misses are primarily due to conflicts or capacity depends on both the access patterns of the threads and the synchronization frequency of the code.  Although cache models are relatively easy for a single thread, the ability of multiple threads to execute asynchronously makes multi-thread cache modeling much trickier.)

I have not tried "pre-loading" TLB entries on Xeon Phi -- the in-order core means that you stall at that initial load, so it does not look particularly helpful.  On the other hand, I did not consider the side benefit of taking the stall early so that later software prefetches would not be dropped.  This might be an approach that could reduce the performance differential between large and small pages -- though using large pages is almost certainly easier.   Another set of optimizations that could be considered would be to unroll the loop to cover an entire page and pull more of the software prefetches up to the beginning of the page.  Software prefetches use issue slots, but don't stall the thread, so they hurt less than pre-loads.  Once the L2 hardware prefetchers have gotten ramped up, the VPREFETCH1 software prefetches may no longer be necessary.  Some experimentation would clearly be necessary, since Intel has not disclosed the details of the L2 prefetcher ramping function or the details of the interactions between demand loads (or RFOs), software prefetches, and the hardware prefetch engines.
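A sketch of that unroll-to-a-page idea, complementing the pre-touch sketch earlier in the thread (names are illustrative; _mm_prefetch with _MM_HINT_T1 is used on the assumption that it maps to the L2 prefetch instruction):

#include <stddef.h>
#include <immintrin.h>

#define PAGE 4096
#define LINE 64

/* Process one 4 KiB page at a time: at the top of each page, issue software
 * prefetches for all of its cache lines (prefetches take issue slots but do
 * not stall the thread), then do the arithmetic on data already in flight.
 */
double sum_page_at_a_time(const double *a, size_t n)
{
    const size_t per_page = PAGE / sizeof(double);
    const size_t per_line = LINE / sizeof(double);
    double sum = 0.0;

    for (size_t base = 0; base < n; base += per_page) {
        size_t end = (base + per_page < n) ? base + per_page : n;

        for (size_t off = base; off < end; off += per_line)   /* front-load prefetches */
            _mm_prefetch((const char *)&a[off], _MM_HINT_T1);

        for (size_t i = base; i < end; i++)
            sum += a[i];
    }
    return sum;
}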
