Solved: Accuracy of Memory Latency Benchmark?

MatteoOlivi · 02-02-2024

Hello,

I'm trying different tools and benchmarks to measure memory latency.

I came across https://github.com/caps-tum/memlat.

Three major things leave me perplexed. I'm not (at all) an expert on the topic, so I'm asking here:

1. The `rdtsc` instruction is wrapped by `mfence` instructions. But as far as I understand, `mfence` doesn't provide ordering guarantees relative to `rdtsc`. Could `rdtsc` be reordered, invalidating the measurements? Should `rdtsc` be wrapped by `cpuid` rather than `mfence`?

2. When the measurement overhead is computed, the minimum value out of multiple iterations is chosen. Wouldn't it be better to take an average?
3. The README claims that the benchmark measures load latency. But the measured instruction not only reads the content of a variable: it also writes it to a variable. So isn't the benchmark measuring both load and store latencies? (Looking at the assembly, the `mov` instruction is used.)

Are my concerns well-founded, or is the benchmark accurate?

Info about the system where I plan to run the benchmark:
OS: Ubuntu 22.04
HW: dual-socket Intel® Xeon® Silver 4114 Processor; each socket is connected to three 32 GiB DDR4 DIMMs. The sockets are connected via UPI.

Thanks,
Matteo.

McCalpinJohn · 02-06-2024

"Memory latency" is one of those concepts that seems straightforward, but which actually contains ridiculous levels of complexity.

The complexity shows up in many areas:

How is "memory latency" defined?
- Does it include non-minimal address translation overheads (L1 TLB miss, STLB miss, Page Walker miss in each level of cache, etc.)?
- Does it include non-minimal DRAM page control overheads? (Open page hit vs empty page vs page conflict)
- Is it being defined for a specific core accessing a specific physical address (mapped to a specific L3 slice and a specific memory controller and channel)?
- If it is an average, what are the independent variables being varied? Core, L3 slice, memory controller and channel.
- If it is an average over multiple addresses, what are the spatial (bit pattern) and temporal relations of the addresses?
  - These will impact L3 slice and memory controller mappings and DRAM open page hits/misses.
  - Are there fixed strides, variable strides, random strides, user-generated/filtered pseudo-random strides?
  - What bits change between addresses in a sequence? (This is related to strides, but is more focused on the mapping of addresses into caches, non-fully-associative buffers, memory controller/channel/rank/bank/row/column.)
- Does "memory latency" assume a specific combination of core and uncore frequency?
  - and in multi-socket systems, a specific uncore frequency in the other socket(s)?
  - Do averages include any core frequency changes or core throttling events?
- Does it assume that no caches in the system have any mappings of the address(es) being loaded?
- Does "memory latency" assume that the address is not mapped in any caches in the socket?
  - or in any other sockets in the system? or in the memory directory of multi-socket systems?
- Should the average memory latency include measurements that overlap with the DRAM Refresh cycles?

How is the latency measured?
- You are correct to note that RDTSC is not ordered with respect to MFENCE. CPUID is not usually the right answer (overhead is too high and depends on input arguments).
- Some notes at Comments-on-timing-short-code-sections-on-intel-processors
- Measurements of the overhead of different timers, and of different access methods and coding styles, are contained in the LowOverheadTimerTests of https://github.com/jdmccalpin/low-overhead-timers
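
As a rough illustration of the ordering point above, here is a minimal sketch in C that brackets RDTSC with LFENCE instead of MFENCE or CPUID, and then estimates the timer overhead as both a minimum and a mean. This is an assumption about one reasonable approach, not code taken from memlat or from low-overhead-timers; the names `fenced_timestamp` and `timer_overhead` are hypothetical, and the non-x86 branch exists only so the sketch builds elsewhere.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime in the fallback path */
#endif
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>
/* LFENCE waits for all earlier instructions to complete locally, and no
   later instruction starts executing until LFENCE itself completes. Placing
   one on each side of RDTSC keeps the counter read from drifting into the
   code being timed. */
static inline uint64_t fenced_timestamp(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
#else
#include <time.h>
/* Fallback for non-x86 builds (assumption): monotonic clock in nanoseconds. */
static inline uint64_t fenced_timestamp(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Times an empty region n times and reports the minimum and the mean
   elapsed ticks, so the two overhead estimates can be compared. */
static void timer_overhead(int n, uint64_t *min_out, double *mean_out) {
    uint64_t min = UINT64_MAX, sum = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t0 = fenced_timestamp();
        uint64_t t1 = fenced_timestamp();
        uint64_t d = t1 - t0;
        if (d < min) min = d;
        sum += d;
    }
    *min_out = min;
    *mean_out = (double)sum / n;
}
```

Calling `timer_overhead(100000, &min, &mean)` and printing both values shows how far the mean sits above the minimum on a given machine; the gap comes from iterations that were disturbed by interrupts or other noise.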

MatteoOlivi · 02-26-2024

Thanks for the thorough reply!

I must confess I don't have an answer for many of these questions/points (indeed, things are more complex than I thought).

The few answers I have:

Does "memory latency" assume a specific combination of core and uncore frequency?
Do averages include any core frequency changes or core throttling events?

Yes, I locked both core and uncore frequency to specific values for every core and socket.

Regarding caching: I want to avoid cache hits as much as possible. Each address may be accessed more than once, but I assume that if the accesses to other addresses between two accesses to the same address cover more data than the total capacity of the caches, the loads will never hit in any cache.
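
To make that assumption concrete, here is a minimal sketch in C of a fixed-stride pointer chain over a buffer whose size the caller picks to exceed the total cache capacity. The names `build_chain` and `chase`, and the sizes passed to them, are hypothetical and not taken from memlat or Intel MLC. Note that a fixed stride is easy for hardware prefetchers to predict, so a large buffer alone may not guarantee that every load misses.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocates `bytes` of memory and links the element at offset i*stride to
   the one at offset (i+1)*stride, wrapping at the end, so each load's
   address depends on the result of the previous load.
   `stride` must be a non-zero multiple of sizeof(void *).
   Returns NULL on bad arguments or allocation failure. */
static void **build_chain(size_t bytes, size_t stride) {
    if (stride < sizeof(void *) || stride % sizeof(void *) != 0 || bytes < stride)
        return NULL;
    size_t slots = bytes / stride;          /* number of chain elements */
    char *buf = malloc(slots * stride);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < slots; i++) {
        void **cur = (void **)(buf + i * stride);
        *cur = buf + ((i + 1) % slots) * stride;
    }
    return (void **)buf;
}

/* Follows the chain for `steps` dependent loads. Returning the final
   pointer keeps the loop from being discarded as dead code. */
static void **chase(void **p, size_t steps) {
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;
    return p;
}
```

With a buffer several times larger than the last-level cache, one full pass over the chain touches more data than the caches can hold before any address is revisited, which is the condition described above.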

I'm interested in average latency and percentiles. The stride is fixed, but the addresses change.
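
For the average and percentiles, a minimal sketch in C of how per-access samples could be summarized. It assumes the samples are already collected in an array of tick counts and uses the nearest-rank percentile definition; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison for qsort on uint64_t values. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in ascending order, in place. */
static void sort_samples(uint64_t *samples, size_t n) {
    qsort(samples, n, sizeof samples[0], cmp_u64);
}

/* Nearest-rank percentile of an already sorted array, for 1 <= pct <= 100:
   the smallest sample such that at least pct percent of all samples are
   less than or equal to it. Requires n > 0. */
static uint64_t percentile_sorted(const uint64_t *sorted, size_t n, unsigned pct) {
    size_t rank = (n * pct + 99) / 100;     /* ceil(n * pct / 100) */
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    return sorted[rank - 1];
}

/* Arithmetic mean of the samples. Requires n > 0. */
static double mean_samples(const uint64_t *samples, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}
```

Sorting once and then querying several percentiles (for example 50, 90 and 99) avoids repeating the sort for each query.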

Since asking the question, I discovered Intel MLC (Memory Latency Checker) and have been using that. So far it's good enough for my use case.