Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Memory Bandwidth per core of Ivy Bridge

Benjamin_S_
Beginner
928 Views

Hi all,

For an academic project we have to make a roofline plot of our machine. We are considering single-threaded programs that run on a single core. We have a Core i7-3720QM Ivy Bridge CPU. The product page for this CPU lists the maximum memory bandwidth as 25.6 GB/s. We were wondering whether the whole bandwidth is available if we run a memory-intensive program on a single core. We couldn't find any hints in the manual...

Thanks a lot for the help!

6 Replies
McCalpinJohn
Honored Contributor III

I have not tested this particular system, but my Xeon E3-1270 (with a very similar uncore) got its best bandwidth using a single core -- about 84% of the peak of 21.33 GB/s using the STREAM benchmark compiled with streaming stores.

The key parameter is the latency-bandwidth product, which determines the number of cache line transfers that have to be "in flight" concurrently to fill the memory pipeline.  For my Xeon E3-1270, the latency was 55ns and the BW was 21.33 GB/s, giving a latency-bandwidth product of 1173 Bytes, or about 18.3 cache lines.  A single core can have 10 L1 cache misses in flight, but the L2 hardware prefetchers usually provide additional concurrency.   My measurement of 17.9 GB/s corresponds to about 15.4 lines in flight, but at 84% utilization the bandwidth is probably limited by DRAM bus stalls (both read/write turnarounds and rank-to-read stalls) rather than by the concurrency that the processor could generate. 

Assuming your latency is about the same, 55 ns * 25.6 GB/s = 1408 Bytes = 22 cache lines.  Assuming a similar ~85% DRAM utilization would reduce this to 18.7 cache lines in flight, which should be attainable using a single core.
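The same arithmetic in a few lines of C, as a quick sanity check -- the 55 ns latency is carried over from the Xeon E3-1270 as an assumption, not a measured value for the i7-3720QM:

#include <stdio.h>

/* Latency-bandwidth product for the i7-3720QM numbers quoted above.      */
/* The 55 ns latency is an assumed value, not a measurement on this part. */
int main(void)
{
    const double latency_ns  = 55.0;   /* assumed memory latency              */
    const double peak_GBs    = 25.6;   /* peak DRAM bandwidth from the spec   */
    const double line_bytes  = 64.0;   /* cache line size                     */
    const double utilization = 0.85;   /* achievable fraction of peak DRAM BW */

    double bytes_in_flight = latency_ns * peak_GBs;   /* ns * GB/s = bytes */
    double lines_in_flight = bytes_in_flight / line_bytes;

    printf("latency-BW product: %.0f bytes = %.1f cache lines\n",
           bytes_in_flight, lines_in_flight);
    printf("at %.0f%% DRAM utilization: %.1f cache lines in flight\n",
           100.0 * utilization, lines_in_flight * utilization);
    return 0;
}

Compiled and run, this prints 1408 bytes = 22.0 cache lines, and 18.7 cache lines at 85% utilization, matching the numbers above.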

Of course, it is easy to *not* reach these levels of concurrency -- having too few address streams or too many address streams can cause problems with the L2 prefetchers and/or with DRAM bank conflicts -- but for a "ballpark" estimate like a roofline model, assuming a single core can drive 80%-85% of the peak DRAM bandwidth is reasonable.

 

 

Animesh_J_
Beginner

Hi,

Is it possible to monitor the memory bandwidth using performance counters on a Haswell machine? My machine's model name is

model name    : Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz

I tried using Intel PCM (https://software.intel.com/en-us/articles/intel-performance-counter-monitor), but it seems that it is not supported on Haswell machines. I would like to use the perf stat utility to obtain the memory bandwidth.

 

My pursuit led me here - https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel

But I am not sure how to read these registers in a program and get the numbers. Can you please help me in this regard?

Regards,

Animesh Jain

 

McCalpinJohn
Honored Contributor III

Accessing these counters is definitely an advanced topic.... 

The short answer is that you have to work through the following steps (a minimal code sketch follows the list):

  1. Look in PCI configuration space to find the Base Address Register (BAR) for the memory controller device.  You can do this with "setpci -s 0:0.0 0x48.l" (run as root).  This will give you a 32-bit physical address in Memory-Mapped IO space that maps to the memory controller.
  2. Double-check the address against the mappings given by "cat /proc/iomem".
  3. In a program run by the root user, call mmap() on /dev/mem at the offset obtained above.  This returns a pointer to a virtual address range that maps the memory-mapped IO region.  I recommend treating this pointer as an array of "uint32_t".
  4. Read the initial value of the DRAM_DATA_READS counter by accessing array element 0x5050/4 = 0x1414 = 5140 (decimal).  If you simply assign that array value to a scalar the compiler should generate a 32-bit load operation, which is the only memory access type that will give you the correct answer for 32-bit fields in Memory-Mapped IO regions.
  5. Read the initial value of the DRAM_DATA_WRITES counter by accessing array element 0x5054/4 = 0x1415 = 5141 (decimal).
  6. Optionally read the DRAM_GT_REQUESTS, DRAM_IA_REQUESTS, and DRAM_IO_REQUESTS using the same approach.   Read the documentation on the monitoring-integrated-memory-controller web page very carefully so that you understand what these mean.
  7. At intervals of 10 seconds or less, read the counters again, compute the differences, correct for overflow (if applicable), and add the increments into 64-bit integers.  A background process that executes a ten-second "sleep()" call should be safe -- under realistic conditions the counters won't wrap in under 11-12 seconds, so the small amount of uncertainty in the wakeup time from the "sleep()" call is OK.
  8. Repeat step 7 until your workload is completed.    My code includes a signal handler to accept the "SIGCONT" signal from user space that I send when the workload is completed (using "kill -SIGCONT pid_of_monitoring_program"), since SIGCONT is the only signal that a process owned by root will accept from a non-root user process.  (Be careful about using raw signal numbers; SIGCONT is assigned different numerical values in different Linux versions.)
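A minimal C sketch of steps 3 through 7, assuming the program is run as root and is given the BAR value printed by setpci as its only argument.  The masking of the BAR's low-order flag bits and the 64-bytes-per-counter-increment scaling are assumptions that should be checked against /proc/iomem and the linked article:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define MAP_SIZE          0x6000        /* large enough to cover the 0x50xx counter block */
#define DRAM_DATA_READS   (0x5050/4)    /* 32-bit array index 0x1414 = 5140 (step 4)      */
#define DRAM_DATA_WRITES  (0x5054/4)    /* 32-bit array index 0x1415 = 5141 (step 5)      */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <BAR value from \"setpci -s 0:0.0 0x48.l\">\n", argv[0]);
        return 1;
    }

    /* The low-order bits of the BAR are flag bits, not address bits; masking to a    */
    /* page boundary here is an assumption -- compare the result against /proc/iomem. */
    off_t bar = (off_t)(strtoull(argv[1], NULL, 16) & ~0xFFFULL);

    int fd = open("/dev/mem", O_RDONLY);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, MAP_SIZE, PROT_READ, MAP_SHARED, fd, bar);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *mmio = (volatile uint32_t *)map;

    /* Sample the 32-bit counters well inside their wrap time and accumulate   */
    /* the deltas into 64-bit integers; unsigned subtraction handles one wrap. */
    uint64_t reads = 0, writes = 0;
    uint32_t r_prev = mmio[DRAM_DATA_READS];
    uint32_t w_prev = mmio[DRAM_DATA_WRITES];

    for (;;) {
        sleep(10);
        uint32_t r_now = mmio[DRAM_DATA_READS];
        uint32_t w_now = mmio[DRAM_DATA_WRITES];
        reads  += (uint32_t)(r_now - r_prev);
        writes += (uint32_t)(w_now - w_prev);
        r_prev = r_now;
        w_prev = w_now;

        /* Assumes each counter increment corresponds to one 64-byte cache line. */
        printf("DRAM traffic so far: %.3f GB read, %.3f GB written\n",
               reads * 64.0 / 1e9, writes * 64.0 / 1e9);
        fflush(stdout);
    }
    return 0;   /* not reached; add a SIGCONT handler to terminate cleanly (step 8) */
}

Declaring the pointer as a volatile uint32_t array keeps each counter access a single 32-bit load, as required in step 4.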

I have tested this on a Xeon E3-1270 (Sandy Bridge) and believe that the DRAM_DATA_READS and DRAM_DATA_WRITES counters are accurate.  I have not tested this on any of the more recent Xeon E3-12xx or Core i3/5/7 products that share the same uncore as the Xeon E3-12xx.

Paul_S_
Beginner

John McCalpin wrote:
The key parameter is the latency-bandwidth product, which determines the number of cache line transfers that have to be "in flight" concurrently to fill the memory pipeline.  For my Xeon E3-1270, the latency was 55ns and the BW was 21.33 GB/s, giving a latency-bandwidth product of 1173 Bytes, or about 18.3 cache lines.  A single core can have 10 L1 cache misses in flight, but the L2 hardware prefetchers usually provide additional concurrency.   My measurement of 17.9 GB/s corresponds to about 15.4 lines in flight, but at 84% utilization the bandwidth is probably limited by DRAM bus stalls (both read/write turnarounds and rank-to-read stalls) rather than by the concurrency that the processor could generate.

Does Intel specify the number of additional in-flight cache lines that are provided by the HW prefetchers operating at the L2 level?

McCalpinJohn
Honored Contributor III

I have not seen much useful detail about the L2 HW prefetcher capabilities.   Some information is implicit in the performance counters that count "occupancy" at various places in the memory hierarchy.  For example, the Xeon E5 v4 Uncore Performance Monitoring Guide says that the CBo occupancy counters can increment by a maximum of 20 per cycle, so the maximum concurrency available at that level of the hierarchy is 20 transactions times the number of CBo slices.  There is also a core counter for OFFCORE_REQUESTS_OUTSTANDING, but I can't remember if I have seen a limit on how many times it can increment per cycle on the various processor generations.

The L2 HW prefetchers are limited in how many pages they can track, but again the public descriptions are vague.   The absolute limits are probably not particularly interesting because the behavior is so dynamic -- the number of L2 HW prefetches that are generated depends on the number of pages being accessed, the types of the accesses, the "busyness" of the L2 cache, and other factors.   If the OFFCORE_REQUESTS_OUTSTANDING counter works correctly, then running a sequence of tests with increasing values of the CMASK field should quickly show the maximum number of concurrent requests outstanding for any particular test code.
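For example, here is a rough sketch of such a CMASK sweep using perf_event_open() around a streaming-read kernel.  The raw encoding (event 0x60, umask 0x08, i.e. OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD) is an assumption that should be checked against the event tables for the specific processor generation:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#define ARRAY_BYTES (256UL * 1024 * 1024)   /* much larger than the caches */

/* Open one raw core event on the calling process (any CPU). */
static long perf_open(uint64_t raw_config)
{
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_RAW;
    pe.size = sizeof(pe);
    pe.config = raw_config;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;
    return syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);
}

int main(void)
{
    size_t n = ARRAY_BYTES / sizeof(double);
    double *a = malloc(ARRAY_BYTES);
    if (a == NULL) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1.0;   /* instantiate the pages */

    for (unsigned cmask = 1; cmask <= 32; cmask++) {
        /* Event 0x60, umask 0x08, CMASK in bits 24-31 of the raw config:      */
        /* counts cycles with at least 'cmask' offcore data reads outstanding. */
        /* ASSUMPTION: verify this encoding for your processor model.          */
        uint64_t config = 0x60 | (0x08 << 8) | ((uint64_t)cmask << 24);
        int fd = (int)perf_open(config);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        double sum = 0.0;
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (size_t i = 0; i < n; i++) sum += a[i];   /* streaming read kernel */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles_at_or_above = 0;
        if (read(fd, &cycles_at_or_above, sizeof(cycles_at_or_above)) != sizeof(cycles_at_or_above))
            perror("read counter");
        close(fd);

        printf("cmask=%2u  cycles with >= %2u data reads outstanding: %12llu  (sum=%g)\n",
               cmask, cmask, (unsigned long long)cycles_at_or_above, sum);
    }
    free(a);
    return 0;
}

The largest CMASK value that still accumulates a non-negligible cycle count gives an estimate of the maximum number of data reads that this particular test code keeps outstanding.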

Paul_S_
Beginner

Thank you; I was hoping you could point me to an official Intel document so that one could justify a theoretical upper bound on the single-threaded main-memory BW (without measuring).

I also found the following paper: https://tu-dresden.de/zih/forschung/ressourcen/dateien/abgeschlossene-projekte/benchit/2009_PACT_authors_version.pdf

It shows --among other things-- the bandwidth gain due to HW prefetching for a single thread.
