Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Spontaneous Writeback eviction

Salame__David1
Beginner

Hi,

Assume I have an app with simple buffers allocated by calls to malloc (hence cached, write-back memory).
The app writes to the buffers once.
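For concreteness, here is roughly what the test app does (a minimal sketch; the buffer size and the idle loop are just placeholders):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE (256 * 1024)   /* illustrative: smaller than a typical L2 */

    int main(void)
    {
        /* malloc'd memory is ordinary cacheable write-back memory */
        char *buf = malloc(BUF_SIZE);
        if (!buf)
            return 1;

        /* write each cache line exactly once, leaving the lines dirty */
        memset(buf, 0xA5, BUF_SIZE);

        /* then go idle: nothing touches buf again */
        for (;;)
            pause();
    }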

1) Assume the CPU is idle, the buffers are no larger than the L2 cache, no other apps or cores touch the same memory, and there is no DMA traffic that would snoop those lines.

Will the CPU ever write back these dirty cache lines "over time", or will it leave them in place if there is no reason to evict them?
The question applies both to writing dirty lines from L2 to L3 and to writing dirty lines from L3 to RAM.

2) Is it realistic to expect that when the CPU is "idle" (no other task consumes it, the reported CPU load is near zero, and my app has sat in an idle loop ever since the write), the caches (even the L3) will never reach a point where they need to reuse those cache lines? One thought in this regard: even the kernel's code, kernel threads, and other maintenance work thrash the cache to some extent, don't they?

3) A related assumption: if the answer to 1 & 2 is "yes", that also means that multiple writes to the same cache line (same address) will simply overwrite the already-dirty line, and will not force a write-back of that dirty line's contents to its parent level (L3/RAM) before overwriting it, right? Since it is the same address.

The main target is Xeon processors, but the question applies to Intel CPUs in general.

Many thanks in advance

David

McCalpinJohn
Honored Contributor III

The architecture makes no guarantees, and the implementations have many different behaviors (some documented and some not documented).

In general, a dirty cache line can stay in a cache for a "long" time.  It can also be evicted from the cache at any time, for reasons that are documented or for reasons that are not documented. 

Salame__David1
Beginner

Thanks John,

The reason I'm asking is that I'm researching a performance issue in a product I'm developing; the app above is just a test app that simulates the bottleneck I've identified in the product.

Here is what I'm trying to estimate.
The app does some processing on a source buffer and then moves the result forward to a dest buffer.
The copy itself is obviously very fast, since both dest & source are cached. But consider a server doing lots of work in a given second, with many tasks in parallel: although the source->dest movement finishes quickly from a code-execution point of view, I still have to account for the overall penalty of that source->dest data transfer, and that has to include the time (and hence the contribution to the bottleneck) it takes for the hardware to move the data from L2->L3 and perhaps, at some point, from L3->RAM, because from a bottleneck point of view this could be considered a sort of "pipeline"...
Even though the L2->L3 & L3->RAM transfers are "asynchronous" to code execution (they are handled by autonomous parts of the chip), they will stall code execution at some point, when the L2 is overloaded and lines need to get evicted.

And so, on an actual loaded client server, I need to somehow estimate the "real overall time cost" of that source->dest data movement, and simply timing the execution of the "copy" code is not enough (see the sketch below).

Because if that cost turns out not to be negligible, a redesign of the code might be needed to place the resulting data directly in the source buffer and avoid the data movement entirely. That would still generate dirty lines, but it would at least put less pressure on the cache.
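To illustrate what I mean by "timing the copy code", the naive measurement looks roughly like this (a simplified sketch; memcpy stands in for the real processing, and the buffer size is just an example):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE (128 * 1024)    /* example size; the real buffers differ */

    int main(void)
    {
        char *src = malloc(BUF_SIZE);
        char *dst = malloc(BUF_SIZE);
        if (!src || !dst)
            return 1;
        memset(src, 1, BUF_SIZE);    /* warm and dirty the source lines */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, BUF_SIZE);  /* the "copy" under test */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("copy took %.0f ns\n", ns);

        /* This only measures the core-side cost while everything is cached.
         * The later, asynchronous L2->L3 and L3->RAM writebacks of the dirty
         * destination lines are not captured here, and that is exactly the
         * cost I'm trying to estimate. */
        free(src);
        free(dst);
        return 0;
    }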

McCalpinJohn
Honored Contributor III

Migration of dirty data "outward" toward memory does not directly involve the core, so there is no direct "cost" for these operations.

It sounds like you are modeling a processing pipeline including one or more "producers" and one or more "consumers".

Assuming that the producer and consumer of the data are on different cores, there is typically little difference in performance if the data has moved from the producer's L1 and/or L2 to the shared L3.

If the data has been evicted all the way to memory, the major cost will be the higher latency experienced by the consumer.

Another (usually smaller) cost can occur when the producer goes to re-write the buffer (since the data must be read before being overwritten).  This will usually be quite small if the buffer is still in the cache from the consumer's read, but there are lots of possible sequences of cache transactions that may occur, and the behaviors are not always well-documented.  If the buffer has migrated all the way out to DRAM (and/or the consumer's clean read of the buffer has been evicted), then the producer may stall on the stores.
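As a minimal sketch of the kind of two-core handoff being discussed (the flag-based synchronization, buffer size, and iteration count here are purely illustrative, not a tuned implementation):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <string.h>

    #define BUF_SIZE   (64 * 1024)   /* illustrative buffer size */
    #define ITERATIONS 1000

    static char buf[BUF_SIZE];
    static atomic_int full;          /* 0 = producer owns buf, 1 = consumer owns it */
    static volatile long sink;       /* keeps the consumer's reads alive */

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            while (atomic_load_explicit(&full, memory_order_acquire))
                ;  /* wait until the consumer releases the buffer */
            /* Re-writing the buffer: each store needs the line back in this
             * core's cache; if the lines migrated to L3 or DRAM, the stores
             * may stall here. */
            memset(buf, i & 0xFF, BUF_SIZE);
            atomic_store_explicit(&full, 1, memory_order_release);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        long sum = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            while (!atomic_load_explicit(&full, memory_order_acquire))
                ;  /* wait for the producer */
            /* Read latency depends on where the dirty lines currently live:
             * the producer's L1/L2, the shared L3, or DRAM. */
            for (size_t j = 0; j < BUF_SIZE; j++)
                sum += buf[j];
            atomic_store_explicit(&full, 0, memory_order_release);
        }
        sink = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

With a buffer of this size the handoff should typically stay within the shared L3; making BUF_SIZE much larger than the L3 pushes the transfers out to DRAM, which is where the consumer-latency and store-stall costs described above show up.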

Another (usually much smaller) cost is the slight increase in average memory latency due to writebacks of data from L3 to DRAM.   Intel memory controllers typically give priority to reads while buffering writes in the memory controller.  When the memory controller write buffer reaches a high water mark, the memory controller switches modes and gives higher priority to writes until the write buffer reaches a low water mark.   Although this gives excellent *average* read latency, the worst-case read latency can be much higher than expected if the read arrives immediately after the memory controller has switched modes to prioritize writes.

Travis_D_
New Contributor II

McCalpin, John wrote:

Another (usually much smaller) cost is the slight increase in average memory latency due to writebacks of data from L3 to DRAM.   Intel memory controllers typically give priority to reads while buffering writes in the memory controller.  When the memory controller write buffer reaches a high water mark, the memory controller switches modes and gives higher priority to writes until the write buffer reaches a low water mark.   Although this gives excellent *average* read latency, the worst-case read latency can be much higher than expected if the read arrives immediately after the memory controller has switched modes to prioritize writes.

 

This last part is very interesting. I hadn't heard of it before. Do you have any reference, or is this based on tests you have performed yourself?

McCalpinJohn
Honored Contributor III

The "read major mode" and "write major" mode are mentioned in the uncore performance monitoring guides for the last several generations of processors (Xeon E5 v1/v2/v3/v4 and Xeon Scalable Processors).   There is not a lot of information, but the descriptions of the performance counter events make the overall intent fairly clear.   For memory access patterns composed of mixed reads and writes the measured open page hit rate is consistent with the memory controller doing almost all reads for a while, then almost all writes for a while.  My interpretation of these numbers is that they are not consistent with interleaving the reads and writes at a fine (sub-DRAM page) granularity, but I don't have the logic analyzer I would need to prove it.

Travis_D_
New Contributor II

Thanks, Dr. McCalpin - great info as usual!
