Intel MLC v3.6 Single-core "ALL Reads" bandwidth lower than "Stream-triad like" BW

Li__Yilong

Hi,

When I used Intel Memory Latency Checker (MLC) v3.6 to measure the maximum single-core bandwidth, I found that the "ALL Reads" traffic pattern gives much lower throughput than the other traffic patterns.

================================

yilongl@work:~/mlc_v3.6$ sudo ./mlc_avx512 --max_bandwidth -Y -m80
Intel(R) Memory Latency Checker - v3.6
Command line parameters: --max_bandwidth -Y -m80

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes

Measuring Maximum Memory Bandwidths for the system
Will take several minutes to complete as multiple injection rates will be tried to get the best bandwidth
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      12708.07
3:1 Reads-Writes :      14660.70
2:1 Reads-Writes :      15042.31
1:1 Reads-Writes :      16720.35
Stream-triad like:      16624.46
===================================

 

When I used all cores (hyperthreading off), the "ALL Reads" pattern gives the highest throughput, which is what I expected.

====================================

yilongl@work:~/mlc_v3.6$ sudo ./mlc_avx512 --max_bandwidth -X -Y
Intel(R) Memory Latency Checker - v3.6
Command line parameters: --max_bandwidth -X -Y

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes

Measuring Maximum Memory Bandwidths for the system
Will take several minutes to complete as multiple injection rates will be tried to get the best bandwidth
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using only one thread from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      53959.55
3:1 Reads-Writes :      48381.93
2:1 Reads-Writes :      47685.10
1:1 Reads-Writes :      44037.28
Stream-triad like:      48496.81
=============================================

 

Can anyone help me understand why the single-core all-read bandwidth is lower than the other traffic patterns? Thanks!

========System configuration============

HPE ProLiant XL170r Gen9

CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz

RAM: 4 * HP 809082-091 16GB (1 x 16GB) Single Rank x4 DDR4-2400 CAS-17-17-17 Registered Memory Kit

McCalpinJohn

The all-core bandwidth is higher for the "all reads" case because the memory controller is able to perform much better scheduling of the DRAM bus.  The bus must stall whenever switching directions, and this is not needed when there are only reads.

The single-core read bandwidth is limited by the concurrency that can be produced by a single core and its associated L2 HW prefetcher.
An overview of the issues is given in a series of blog posts starting at http://sites.utexas.edu/jdm4372/2010/11/03/optimizing-amd-opteron-memory-bandwidth-part-1-single-thread-read-only/

In round numbers, your Xeon E5 v4 should have a peak memory bandwidth of 76.8 GB/s (4 channels * 8 Bytes/channel/transfer * 2.4 Gtransfers/sec) and a memory latency of about 80 ns.   The number of concurrent cache misses required to fully tolerate this latency is 80 ns * 76.8 GB/s / 64 Bytes/line = 96 cache lines.  Each Xeon E5 v4 core supports only 10 outstanding L1 Data Cache misses.  If you disabled the L2 HW prefetchers, you would probably get an "all reads" performance of about 10 cache lines * 64 Bytes/line / 80 ns = 8 GB/s.   Your Intel MLC result is almost 60% higher than this value, which means that the L2 HW prefetchers are generating additional concurrency -- averaging about 16 cache lines "in flight" at all times.   As I discuss in my series of blog posts, with careful code generation it is possible to get an effective concurrency of better than 16 cache lines -- I have seen an average of about 20 cache lines in flight with very carefully tuned code in the first-generation Xeon E5 (Sandy Bridge EP).
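
The estimates in the paragraph above all follow from Little's Law (concurrency = latency * bandwidth). A minimal sketch of that arithmetic, using only the round numbers quoted in the post; the function names are made up for illustration, and the 80 ns latency is the post's estimate, not a measurement:

```python
# Little's Law sketch: bytes in flight = latency * bandwidth.
# Inputs are the round numbers from the post; helper names are illustrative.

LINE_BYTES = 64      # cache line size in bytes
LATENCY_NS = 80.0    # approximate memory latency assumed in the post


def peak_bandwidth_gbs(channels, bytes_per_transfer, gtransfers_per_s):
    """Peak DRAM bandwidth in GB/s."""
    return channels * bytes_per_transfer * gtransfers_per_s


def lines_in_flight(bandwidth_gbs, latency_ns=LATENCY_NS):
    """Cache lines that must be outstanding to sustain bandwidth_gbs."""
    # GB/s * ns gives bytes, because the 1e9 and 1e-9 factors cancel.
    return bandwidth_gbs * latency_ns / LINE_BYTES


def bandwidth_gbs(lines, latency_ns=LATENCY_NS):
    """Bandwidth in GB/s sustained by a fixed number of outstanding lines."""
    return lines * LINE_BYTES / latency_ns


if __name__ == "__main__":
    peak = peak_bandwidth_gbs(4, 8, 2.4)
    print(f"peak bandwidth:              {peak:.1f} GB/s")
    print(f"lines needed at peak:        {lines_in_flight(peak):.0f}")
    print(f"bandwidth with 10 misses:    {bandwidth_gbs(10):.1f} GB/s")
    # 12708.07 MB/s is the single-core "ALL Reads" result from the first run.
    print(f"lines implied by MLC result: {lines_in_flight(12.70807):.1f}")
```

The last line works backwards from the measured 12708.07 MB/s to the average number of lines in flight, which is how the "about 16" figure is obtained.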

The L2 HW prefetchers operate within 4KiB address ranges -- starting operation near the beginning of the page and stopping at the end of the page.   Test kernels with both read and write streams are typically accessing more pages, which makes it easier for the L2 HW prefetcher to generate a larger average number of prefetches. 

The details get very complex when streaming stores are considered, but mostly I think what you are seeing is that Intel has not fully optimized the code for the single-thread case.  There are other issues when using all cores -- especially if the number of threads exceeds the number of DRAM banks -- but your use of the "-X" flag should give the best results for your system.

Li__Yilong

Hi John,

Thanks for pointing out that there will be fewer switching stalls in the "all reads" pattern, even though I didn't ask explicitly. I read your posts some time ago and have learned a lot!

I forgot to mention that the Xeon E5-2640 v4 processor only supports up to DDR4-2133, so the peak bandwidth is about 68.3 GB/s (or ~85 concurrent cache lines to fill the pipeline). Based on your suggestion, I disabled all 4 prefetchers using `sudo wrmsr -a 0x1a4 15` and re-ran the single-core benchmark (with option -e so that MLC does not modify the prefetcher settings). Here is the result:

yilongl@work:~/mlc_v3.6$ ./mlc_avx512 -e --max_bandwidth -Y -m80
Intel(R) Memory Latency Checker - v3.6
Command line parameters: -e --max_bandwidth -Y -m80 

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes

Measuring Maximum Memory Bandwidths for the system
Will take several minutes to complete as multiple injection rates will be tried to get the best bandwidth
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      8183.73
3:1 Reads-Writes :      9792.53
2:1 Reads-Writes :      10719.11
1:1 Reads-Writes :      14067.94
Stream-triad like:      9210.01
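
For reference, the value 15 written to MSR 0x1a4 before this run sets the four low bits, one per prefetcher. The sketch below shows how that mask is composed. The bit assignment follows Intel's public note on hardware prefetcher control and is assumed here to apply to this Xeon E5 v4; the dictionary keys and the helper name are illustrative labels, not official identifiers.

```python
# Prefetcher-disable bits in MSR 0x1A4 (a set bit disables the prefetcher).
# Bit positions are taken from Intel's public prefetcher-control note and
# are assumed to apply to this processor; the key names are illustrative.
PREFETCHER_DISABLE_BITS = {
    "l2_hw_prefetcher": 0,
    "l2_adjacent_line_prefetcher": 1,
    "dcu_streamer_prefetcher": 2,
    "dcu_ip_prefetcher": 3,
}


def disable_mask(names):
    """Return the MSR value that disables the named prefetchers."""
    mask = 0
    for name in names:
        mask |= 1 << PREFETCHER_DISABLE_BITS[name]
    return mask


if __name__ == "__main__":
    all_off = disable_mask(PREFETCHER_DISABLE_BITS)
    print(f"disable all four prefetchers: {all_off} ({all_off:#x})")
    l2_only = disable_mask(["l2_hw_prefetcher", "l2_adjacent_line_prefetcher"])
    print(f"disable only the L2 prefetchers: {l2_only} ({l2_only:#x})")
```

Writing 0 back to the same register re-enables all four prefetchers.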

 

The single-core "all reads" bandwidth is indeed very close to 8 GB/s. Thanks again! Looking at the other numbers, I realized that it does make sense for the other traffic patterns to generate higher bandwidth, because Intel MLC does not use streaming stores by default and the memory bandwidth is measured at the memory controller. So for each L1 data cache miss, we can generate 64B of read traffic and 64B of write traffic. However, the ~14000 MB/s figure given by the "1:1 Reads-Writes" pattern seems suspiciously high to me, because that amounts to 9378 MB/s (= 14067 MB/s * 2 / 3) of read bandwidth and 4689 MB/s of write bandwidth. Since I have disabled all 4 HW prefetchers, I don't see how we can get more than 8 GB/s of single-core read bandwidth. Perhaps the average memory latency drops from ~80 ns to ~68 ns somehow? What do you think?
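
The arithmetic behind that estimate, as a small sketch. It assumes, as the paragraph above does, that two thirds of the measured traffic is reads and that at most 10 L1D misses can be outstanding; the helper names are made up for illustration.

```python
# Split a measured total into reads and writes, then ask what latency
# would be needed for 10 outstanding misses to sustain the read share.

LINE_BYTES = 64
MAX_L1_MISSES = 10   # outstanding L1D misses per core, per the earlier reply


def split_traffic(total_mbs, read_parts, write_parts):
    """Divide total_mbs into (read, write) according to the given ratio."""
    parts = read_parts + write_parts
    return total_mbs * read_parts / parts, total_mbs * write_parts / parts


def implied_latency_ns(read_mbs, lines=MAX_L1_MISSES):
    """Latency at which `lines` outstanding misses sustain read_mbs."""
    bytes_per_ns = read_mbs * 1e-3   # 1 MB/s = 1e6 B/s = 1e-3 B/ns
    return lines * LINE_BYTES / bytes_per_ns


if __name__ == "__main__":
    read, write = split_traffic(14067.94, 2, 1)
    print(f"read  {read:.0f} MB/s, write {write:.0f} MB/s")
    print(f"implied latency {implied_latency_ns(read):.1f} ns")
```

With a 2:1 split the implied latency comes out near 68 ns, which is why the result looks suspicious under that assumption.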

 

Best,

Yilong

McCalpinJohn

I have not looked at the implementation of the Intel kernels, but I would assume that the [1:1, 2:1, 3:1] Reads-Writes are configured to update an array, so there will be no write allocate traffic.  For example:

    /* In-place update: each cache line of a[] is read once and written back once. */
    for (i = 0; i < N; i++) {
        a[i] = 2.0 * a[i];
    }

This will generate one read and one writeback.  If this is correct, the 14067.94 MB/s would consist of 7034 MB/s of read traffic plus 7034 MB/s of write traffic, which is quite reasonable....
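
A quick check of why this split is "quite reasonable", as a sketch. It assumes the kernel is the in-place update guessed at above (not a confirmed detail of MLC), the 80 ns latency estimate, and a limit of 10 outstanding L1D misses; the helper names are illustrative.

```python
# For an in-place update, every line is read once and written back once,
# so the measured total splits 1:1. Only the reads need to be covered by
# outstanding L1D misses.

LINE_BYTES = 64
LATENCY_NS = 80.0    # latency estimate from the earlier reply
MAX_L1_MISSES = 10   # outstanding L1D misses per core


def update_kernel_split(total_mbs):
    """Return (read, writeback) traffic for an in-place update kernel."""
    return total_mbs / 2, total_mbs / 2


def read_lines_in_flight(read_mbs, latency_ns=LATENCY_NS):
    """Average outstanding read misses needed to sustain read_mbs."""
    bytes_per_ns = read_mbs * 1e-3   # 1 MB/s = 1e-3 B/ns
    return bytes_per_ns * latency_ns / LINE_BYTES


if __name__ == "__main__":
    read, writeback = update_kernel_split(14067.94)
    needed = read_lines_in_flight(read)
    print(f"read {read:.0f} MB/s, writeback {writeback:.0f} MB/s")
    print(f"read misses in flight: {needed:.1f} of {MAX_L1_MISSES}")
```

Under these assumptions the read stream needs fewer than 10 misses in flight, so no reduction in memory latency is required to explain the measurement.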

Travis_D_

McCalpin, John (Blackbelt) wrote:

This will generate one read and one writeback.  If this is correct, the 14067.94 MB/s would consist of 7034 MB/s of read traffic plus 7034 MB/s of write traffic, which is quite reasonable....

So does that mean that the write traffic takes a different path to memory than reads? Otherwise, if the occupancy time were the same, one would not expect total bandwidth to increase with higher write ratios.

Do writes use buffers other than the LFBs on the way to the L2? Beyond that?

McCalpinJohn

None of this is documented in detail, but here is my current interpretation:

  • Loads that miss the L1D cache allocate an LFB (Line Fill Buffer), which is occupied for approximately the same duration as the read latency (to whatever level of the cache hierarchy the target line comes from).
    • If the L1D victim is dirty, the L1D to L2 Writeback takes much less time than the L1D Read miss, so it should have close to zero cost (in time).
  • "Ordinary" (allocating) Stores that miss the L1D cache allocate an LFB in exactly the same way as loads, and for approximately the same duration.
    • Dirty L1D victim Writebacks should also be very close to "free" in this case.
    • Stores that *hit* in the L1D cache just update the line in the L1 and do not require an LFB.
  • L1D to L2 Writebacks probably do not use the LFBs.   Even if they do use the LFBs, the duration of occupancy will be close to an L2 latency (rather than close to a memory latency), since the address typically has a location reserved in the L2 cache already from the initial read.
  • In the example "a[i] = 2*a[i]", the occupancy of the LFBs will be dominated by reading the data.
    • If the LFBs are not used at all during the L1D to L2 Writebacks, then I would expect the bandwidth to be just under 8.2 GB/s Reads plus just under 8.2 GB/s of Writebacks, for a total of just under 16.4 GB/s.
      • This case will experience some reductions due to read/write turnarounds and memory controller mode switching, but it is hard to estimate how much that will be without much more detailed measurements.
    • If the LFBs are used during the L1D to L2 Writebacks, there will be some reduction in the availability of LFBs for reads, which will reduce the read bandwidth slightly.
      • L2 latency on this platform is 12 cycles at 3.4 GHz, or about 3.5 ns.
      • Memory latency on this platform is probably about 80 ns.
      • This suggests a very small reduction in read bandwidth of about 4%.
      • The observed reduction of 15% is probably dominated by DRAM turnarounds and memory controller mode switches, but might include some reduction due to LFB occupancy.
  • There are indications that streaming (non-temporal) stores use the LFBs. 
    • It is difficult to tell what the duration of LFB occupancy should be in this case, for several reasons:
      • we don't know if all of the LFBs can be used for streaming stores (some may be reserved for reads or TLB misses to prevent deadlocks),
      • we don't know the details of the path for the data,
      • we don't know the details of where the transaction needs to be held pending snoop responses -- streaming stores are not strongly ordered with respect to ordinary stores, but there are still coherency and ordering requirements and there is still the need to be able to implement the various FENCE instructions.
    • In some processors, single-thread bandwidth is higher with allocating stores than with streaming stores.
      • This makes sense if streaming stores occupy an LFB for about the same duration as reads -- then the total bandwidth will be similar to the read bandwidth.
      • L2 HW prefetch can reduce the duration of LFB occupancy for reads and for allocating stores (by moving the line into the L3 or L2 in advance), but cannot reduce the duration of LFB occupancy for streaming stores.
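
The two percentages in the list above can be reproduced with a short sketch. It uses only numbers quoted in this thread (12 cycles at 3.4 GHz, about 80 ns to memory, and the two prefetcher-off MLC results); approximating the LFB cost as the ratio of L2 latency to memory latency is an assumption of this sketch, and the helper names are illustrative.

```python
# Rough check of the "about 4%" and "15%" figures from the list above.

L2_LATENCY_CYCLES = 12
CORE_GHZ = 3.4
MEM_LATENCY_NS = 80.0


def l2_latency_ns(cycles=L2_LATENCY_CYCLES, ghz=CORE_GHZ):
    """L2 latency in nanoseconds (cycles divided by cycles per ns)."""
    return cycles / ghz


def lfb_read_loss_fraction(mem_latency_ns=MEM_LATENCY_NS):
    """Read-bandwidth loss if each writeback also holds an LFB for one
    L2 latency, approximated as L2 latency / memory latency."""
    return l2_latency_ns() / mem_latency_ns


def observed_reduction(read_only_mbs, update_mbs):
    """Shortfall of the update kernel against 2x the read-only bandwidth."""
    return 1.0 - update_mbs / (2.0 * read_only_mbs)


if __name__ == "__main__":
    print(f"L2 latency:         {l2_latency_ns():.2f} ns")
    print(f"LFB-related loss:   {lfb_read_loss_fraction():.1%}")
    # "ALL Reads" and "1:1 Reads-Writes" results with prefetchers disabled.
    print(f"observed reduction: {observed_reduction(8183.73, 14067.94):.1%}")
```

The gap between the two figures is the part attributed above to DRAM turnarounds and memory controller mode switches.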
Li__Yilong

Thank you, John. It makes a lot more sense now. Really appreciate your informative reply!
