Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Understanding the behavior of the L2 stream prefetcher

grayxu

As far as I know, the documentation describes the L2 streamer as follows:

> Detects and maintains up to 32 streams of data accesses. For each 4K byte page, one forward and one backward stream can be maintained.

 

Can these 32 streams be located in 32 different 4K pages? I ask because in my experiments I observed a significant performance drop once the number of pages being streamed exceeded 32, and `l2_lines_out.useless_hwpf` also decreased rapidly.
 
My platform is an Intel(R) Xeon(R) Gold 6240 processor.
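
For reference, here is a minimal sketch in C of the kind of experiment described above. It is not the actual benchmark: the helper names `make_buffer` and `sum_streams`, the 64-bit element type, and the lockstep access pattern are all assumptions for illustration. Each stream starts in its own page-aligned region, and every iteration reads one element from every stream, so `nstreams` forward streams are live at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BYTES 4096u

/* Hypothetical helper: allocate a page-aligned buffer that holds
   nstreams regions of stride_words 64-bit words each, filled with 1. */
uint64_t *make_buffer(size_t nstreams, size_t stride_words)
{
    size_t words = nstreams * stride_words;
    size_t bytes = words * sizeof(uint64_t);

    /* aligned_alloc requires the size to be a multiple of the alignment. */
    assert(bytes > 0 && bytes % PAGE_BYTES == 0);

    uint64_t *buf = aligned_alloc(PAGE_BYTES, bytes);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < words; i++)
        buf[i] = 1;
    return buf;
}

/* Hypothetical kernel: walk nstreams forward streams in lockstep.
   Stream s starts at base + s * stride_words, so with stride_words
   a multiple of 512 (4096 / 8) every stream starts in a different
   4 KiB page. Requires words_per_stream <= stride_words. */
uint64_t sum_streams(const uint64_t *base, size_t nstreams,
                     size_t stride_words, size_t words_per_stream)
{
    uint64_t sum = 0;

    assert(words_per_stream <= stride_words);
    for (size_t i = 0; i < words_per_stream; i++)
        for (size_t s = 0; s < nstreams; s++)
            sum += base[s * stride_words + i];
    return sum;
}
```

To reproduce the measurement, one would call `sum_streams` with a per-stream size much larger than the L2 cache, sweep `nstreams` across 32, time each run, and collect the counter with something like `perf stat -e l2_lines_out.useless_hwpf`.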
 
Thanks in advance.
McCalpinJohn

The wording of the documentation is irritatingly ambiguous, but it is clear that these streams are primarily concerned with different 4KiB pages.

In the most common case one has only forward streams, and this description can be interpreted as allowing one forward stream in each of 32 separate 4KiB pages. We have definitely seen significant performance improvements for a small number of codes that accessed ~50 streams at once by fissioning the loops so that fewer than 32 streams are active at a time.
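
As an illustration of the loop fission mentioned above, here is a minimal sketch in C. It is not taken from the codes referred to: the summation kernel, the function names, and the `batch` parameter are assumptions chosen for the example. The fused version reads `nin` input streams and writes one output stream in a single pass; the fissioned version makes several passes, each touching at most `batch` inputs plus the output.

```c
#include <assert.h>
#include <stddef.h>

/* Fused loop: nin input streams plus one output stream are active
   at once (about 51 streams when nin is 50). */
void sum_fused(double *out, const double *const *in, size_t nin, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;
        for (size_t k = 0; k < nin; k++)
            acc += in[k][i];
        out[i] = acc;
    }
}

/* Fissioned loop: each pass over i reads at most `batch` inputs and
   updates out, so roughly batch + 1 streams are active per pass.
   The additions happen in the same order as in sum_fused. */
void sum_fissioned(double *out, const double *const *in, size_t nin,
                   size_t n, size_t batch)
{
    assert(batch > 0);

    for (size_t i = 0; i < n; i++)
        out[i] = 0.0;

    for (size_t k0 = 0; k0 < nin; k0 += batch) {
        size_t k1 = (k0 + batch < nin) ? k0 + batch : nin;
        for (size_t i = 0; i < n; i++) {
            double acc = out[i];
            for (size_t k = k0; k < k1; k++)
                acc += in[k][i];
            out[i] = acc;
        }
    }
}
```

The cost of fission here is that `out` is read and written once per pass instead of once overall, so whether it pays off depends on how much the prefetcher was being oversubscribed in the fused loop; it needs to be measured on the target machine.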

If your code is performing both forward and backward streams into the same 4KiB region, then you might be limited to sixteen 4KiB regions before performance drops due to oversubscription.
