Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Memory bandwidth utilization lower than expected.

Jaeyoung__Choi
Beginner

Hi,

I have written a program that conducts 10 independent pointer chases, and I have verified that 99% of the pointer-chase steps go to main memory.
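For reference, the kernel looks roughly like the sketch below (a simplified illustration, not the exact code; the chain length, step count, and node layout are placeholder assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NCHAINS 10                       /* independent pointer-chase chains    */
#define NNODES  (4UL * 1024 * 1024)      /* nodes per chain (256 MiB per chain) */
#define NSTEPS  (10UL * 1000 * 1000)     /* chase steps per chain               */

/* One node per cache line, so every step touches a new 64-byte line. */
typedef struct node { struct node *next; char pad[64 - sizeof(struct node *)]; } node_t;

/* Link the nodes of one chain into a random cycle so hardware prefetchers
 * cannot predict the next address. */
static void build_chain(node_t *a, size_t n)
{
    size_t *perm = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {           /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        a[perm[i]].next = &a[perm[(i + 1) % n]];
    free(perm);
}

int main(void)
{
    node_t *chain[NCHAINS], *p[NCHAINS];

    srand((unsigned)time(NULL));
    for (int c = 0; c < NCHAINS; c++) {
        chain[c] = malloc(NNODES * sizeof(node_t));
        if (!chain[c]) { perror("malloc"); return 1; }
        build_chain(chain[c], NNODES);
        p[c] = chain[c];
    }

    /* The chains are independent, so up to NCHAINS cache misses are in flight. */
    for (size_t s = 0; s < NSTEPS; s++)
        for (int c = 0; c < NCHAINS; c++)
            p[c] = p[c]->next;

    /* Print the final pointers so the compiler cannot drop the loop. */
    for (int c = 0; c < NCHAINS; c++)
        printf("%p\n", (void *)p[c]);
    return 0;
}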

My system has a maximum memory bandwidth of 59.61 GB/s.

When I use a memory bandwidth monitoring program, one thread can generate about 10 GB/s of memory bandwidth.

When I increase the number of processes, the memory bandwidth that each process generates is reduced; adding up the bandwidth of all processes, I can only obtain about 32 GB/s in total.

I can't get more than 32 GB/s of memory bandwidth even when I increase the number of threads.

So I think this result comes from bank conflicts, which lead to low utilization of memory bandwidth.

Is this a reasonable explanation, or should I consider another factor?

Thank you.

4 Replies
McCalpinJohn
Honored Contributor III

What is the processor and memory configuration?

Did you disable hardware prefetching?  https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
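One quick way to check the current setting on Linux is to read MSR 0x1A4, the per-core prefetcher control register described in that article.  A minimal read-only sketch (assuming the 'msr' kernel module is loaded and the program runs as root):

/* Report the four prefetcher-disable bits of MSR 0x1A4 on core 0. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    uint64_t val;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* For the msr device, the MSR address is the pread() file offset. */
    if (pread(fd, &val, sizeof(val), 0x1a4) != (ssize_t)sizeof(val)) {
        perror("pread MSR 0x1a4");
        return 1;
    }
    close(fd);

    /* A set bit means that prefetcher is DISABLED. */
    printf("L2 HW prefetcher            : %s\n", (val & 0x1) ? "disabled" : "enabled");
    printf("L2 adjacent-line prefetcher : %s\n", (val & 0x2) ? "disabled" : "enabled");
    printf("DCU streamer prefetcher     : %s\n", (val & 0x4) ? "disabled" : "enabled");
    printf("DCU IP prefetcher           : %s\n", (val & 0x8) ? "disabled" : "enabled");
    return 0;
}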

For a single thread, how does the bandwidth scale with the number of independent pointer chasing chains?  (Is the 10 GB/s number from your pointer-chasing program, or from another measure?)

What are the latency and bandwidth results from the Intel MemoryLatencyChecker program?  https://software.intel.com/en-us/articles/intelr-memory-latency-checker

Jaeyoung__Choi
Beginner

Dear John

First, I really appreciate your attention to my question.

1) What is the processor and memory configuration?

   I used an Intel Xeon Gold 6142 processor with three 8 GB single-rank DDR4-2666 DIMMs.

   When I removed all of the single-rank DIMMs and used one 32 GB dual-rank DDR4-2666 DIMM on a single channel, I could get 90% of the theoretical maximum bandwidth (this is why I think the major problem is bank conflicts).

2) Did you disable hardware prefetching? 

  When I got 10 GB/s of memory bandwidth, the hardware prefetchers were enabled; when I disabled them, a single thread could only get about 7 GB/s.

3) For a single thread, how does the bandwidth scale with the number of independent pointer chasing chains?  (Is the 10 GB/s number from your pointer-chasing program, or from another measure?)

The bandwidth scaled linearly with the number of independent pointer-chase chains (hardware prefetchers disabled). For example, with one independent pointer chase the memory bandwidth a single thread generated was about 750 MB/s (reasonable, given that the memory access latency is 81 ns), and with two independent pointer chases it was about 1300 MB/s.
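(As a sanity check on the single-chain number: one 64-byte cache line per ~81 ns of latency gives

$$
\frac{64\ \text{B}}{81\ \text{ns}} \approx 0.79\ \text{GB/s} \approx 790\ \text{MB/s},
$$

which is close to the ~750 MB/s measured.)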

The memory bandwidth that I measure is reported by the intel-cmt-cat program.

4) What are the latency and bandwidth results from the Intel MemoryLatencyChecker program?  https://software.intel.com/en-us/articles/intelr-memory-latency-checker

The Memory Latency Checker reports (read-only traffic type, bandwidth in MB/s):

                   Numa node 0    Numa node 1
    Numa node 0      57046.7        33258.2
    Numa node 1      33222.8        57296.2

Thank you again.

McCalpinJohn
Honored Contributor III

That is an unusual configuration.   I assume that you populated the 3 DIMMs in the three channels of one of the two memory controllers, which should allow 3-way cacheline interleaving.  The peak BW should be 2.666*3*8 = 64 GB/s (decimal).

The Intel Memory Latency Checker results suggest that you have a 2-socket system with the same DRAM configuration in each socket.  The read-only bandwidth values are about 89% of peak, which is a bit low, but may be due to using too many memory access streams.   I recommend trying the Intel Memory Latency Checker with the options "-X" and "-Z" to see if that helps a little bit.

Depending on the details of your DRAMs and the pattern(s) of your pointer-chasing code, you may be running into a DRAM performance limitation that few people are familiar with, called the "four-active-window" limit.

First, a brief review:
DDR4 (like DDR, DDR2, and DDR3) uses a multi-step procedure to access memory.  
1. Each DRAM chip is divided into a number of "banks" (16 for DDR4), with the memory divided evenly among the banks.
2. Each bank consists of a two-dimensional array of memory cells, indexed by "row" and "column".
3. The memory cells cannot be read directly -- instead the data must be transferred (destructively) from one row of the array to the "sense amps" near the edge of the DRAM chip.  Transferring a row from the array to the sense amps is performed using the ACTIVATE command.  
4. Once a row is in the sense amps, a set of columns can be read out using a "Column Address Strobe" (CAS) command.  (Writing is similar.)  A sequence of read CAS commands (or write CAS commands) to the same bank can be serviced back-to-back at full DRAM speed (and at the lowest possible power consumption).
5. If you need to get data from a different row in the same bank, the data in the sense amps must first be copied back to its home location in the array.  Because the initial transfer was destructive, this must be done even if the data was only read and not written.  Writing the contents of the sense amps back to their home in the array is called a PRECHARGE.
6. In DDR4, the 16 banks are divided into 4 bank groups with 4 banks each.  Accesses that are interleaved across bank groups go at full speed, while repeated accesses to different banks in the same bank group can incur a delay.  (I don't think this is an issue here, but it depends on the sequences of addresses accessed by your pointer-chasing code.)
7. Each bank can only execute a full "cycle" (PRECHARGE, ACTIVATE, CAS) every T_RC ns.  This value has been decreasing very slowly, with recent 8 Gbit DDR4/2666 DRAMs having a T_RC of about 46 ns.  Each CAS read generates a 4-cycle (3.0 ns) burst and delivers 8 Bytes from each DRAM chip.  Since the row size in each DRAM chip is 1 KiB, it takes 128 CAS reads (384 ns) to read an entire row, so there is no problem with satisfying the limit of 1 ACTIVATE per 46 ns.  On the other hand, if you only load one cache line from the row, this only takes 3.0 ns.  Interleaving across the full set of 16 banks would keep the bus busy for 48 ns -- enough (barely) to satisfy the limit of one ACTIVATE command per bank per 46 ns (this arithmetic is collected right after this list).  Note that this requires nearly perfect interleaving of accesses across the banks.  Unfortunately, this is not quite enough and another limiter then shows up....
8. In order to limit the current drawn by DRAM chips, the chip has an additional sliding window limit called the "four-active-window".  This limits the number of banks that can receive an ACTIVATE command to four in a specified time window.  
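Collecting the arithmetic from item 7 in one place (using the nominal DDR4/2666 timings quoted there; actual parts may differ slightly):

$$
\begin{aligned}
t_{\text{burst}} &= \frac{8\ \text{transfers}}{2666\ \text{MT/s}} \approx 3.0\ \text{ns} && \text{(one CAS burst = 64 bytes per rank)} \\
t_{\text{full row}} &= 128 \times t_{\text{burst}} \approx 384\ \text{ns} \gg t_{RC} \approx 46\ \text{ns} && \text{(streaming an entire 1 KiB row)} \\
t_{\text{16 banks}} &= 16 \times t_{\text{burst}} \approx 48\ \text{ns} \gtrsim t_{RC} && \text{(one cache line per bank, perfectly interleaved)}
\end{aligned}
$$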

As an example that may be relevant to your results, an 8 GiB single-rank registered ECC DIMM will be composed of 9 DRAM chips, each with a capacity of 8 Gbits, and each contributing 8 bits to the 72-bit output.  For a typical Micron part (MT40A1G8), the speed grade that is rated to run at DDR4/2666 rates has a T_FAW of "the greater of 20 clocks or 21 ns".  At DDR4/2666 speeds, 20 clocks is 15 ns, so T_FAW is 21 ns.   Four ACT commands in 21 ns is 5.2 ns per ACT.  3 channels * 64 bytes/ACT / 5.2 ns/ACT = 36.9 GB/s -- less than 58% of peak.   Your DRAMs may have a different T_FAW value, leading to a higher or lower peak throughput for independent cache lines.  When DDR4 was new, it was common that the four-active-window limited throughput for random accesses to 50% of peak -- very close to what you are seeing.  Over time as the vendors improve their process technology, the four-active-window becomes less of a constraint (only on new parts, of course).
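Written as a single formula (using the T_FAW = 21 ns from the Micron example above; your DIMMs may have a different value), the four-active-window bound on all-random cache-line reads across the three channels is roughly

$$
BW_{\max} \approx N_{\text{channels}} \times \frac{4 \times 64\ \text{B}}{t_{FAW}} = 3 \times \frac{256\ \text{B}}{21\ \text{ns}} \approx 37\ \text{GB/s}.
$$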

There are other possibilities, but if your pointer-chasing pattern is random, this is a likely cause of the throughput limitation....

Jaeyoung__Choi
Beginner

Dear John

I really appreciate you teaching me this valuable knowledge.

I understood your explanation.

So, if there are too many row conflicts, we can't read more than 256 bytes (four cache lines) in 21 ns from each DRAM rank due to the four-active-window timing parameter.

I think this limitation explains my problem very well, because my pointer-chasing code has no locality and there are many threads doing the same thing, so they interfere with each other. These accesses eventually generate a lot of ACTIVATE commands.

Thank you again!!
