L1D flush overhead

arngnr · 10-16-2022

Hello!

I have a rather niche question, I hope someone here can help me out!

I have been looking for resources on the overhead of flushing the L1D cache using the IA32_FLUSH_CMD MSR. I have found a couple of benchmarks that measure the performance difference with L1D flushing enabled, but what I'm interested in is the actual execution time, or the number of cycles, it takes to perform the flush. I haven't been able to find any information on that online so far. Would really appreciate the help!

McCalpinJohn · 10-18-2022

If the implementation of the state machine for invalidating the lines is good, then performance should depend primarily on the amount of dirty data in the L1D cache. The L1D_FLUSH operation is defined to be limited to the L1D cache, so dirty data has to be written back to the L2. The bandwidth between the L1D and L2 caches depends on the processor generation, but is 64 bytes (one cache line) per cycle in Skylake Xeon and later cores. For Skylake processors, the L1D is 32 KiB, or 512 cache lines, so the minimum time for writebacks is 512 core cycles if all data in the L1D is dirty. This increases to 768 cycles for Ice Lake and Golden Cove cores (48 KiB L1D, 768 lines).

Of course the writebacks could be slower than one per cycle, but they should not be slower than one every 2 cycles.

Processing time for clean lines (invalidate only) is probably limited to 2 lines per cycle by the L1D tag access. (This could be accelerated with magic hardware, but I would be surprised if it was worth it.)
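To make the arithmetic above concrete, here is a small C sketch that tabulates those estimates for a given L1D size. The rates (one writeback per cycle at best, one per 2 cycles at worst, two clean-line invalidates per cycle) are the guesses from this post, not measured or documented figures, and the names `flush_bounds` and `l1d_flush_bounds` are made up for illustration.

```c
#include <assert.h>

#define LINE_BYTES 64u  /* cache line size assumed throughout this thread */

/* Estimated cycle counts for one L1D flush; all rates are assumptions. */
struct flush_bounds {
    unsigned lines;            /* number of cache lines in the L1D */
    unsigned dirty_min_cycles; /* all lines dirty, 1 writeback per cycle */
    unsigned dirty_max_cycles; /* all lines dirty, 1 writeback per 2 cycles */
    unsigned clean_cycles;     /* all lines clean, 2 invalidates per cycle */
};

static struct flush_bounds l1d_flush_bounds(unsigned l1d_bytes)
{
    struct flush_bounds b;

    /* The estimate only makes sense for a whole number of lines. */
    assert(l1d_bytes > 0 && l1d_bytes % LINE_BYTES == 0);

    b.lines = l1d_bytes / LINE_BYTES;
    b.dirty_min_cycles = b.lines;        /* one line per cycle */
    b.dirty_max_cycles = 2u * b.lines;   /* one line every 2 cycles */
    b.clean_cycles = (b.lines + 1u) / 2u; /* two lines per cycle, rounded up */
    return b;
}
```

Calling `l1d_flush_bounds(32 * 1024)` gives 512 lines, 512 to 1024 cycles for a fully dirty cache, and 256 cycles for a fully clean one; `l1d_flush_bounds(48 * 1024)` gives 768 lines, 768 to 1536 cycles, and 384 cycles.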

From my user-land perspective, the time will be dominated by crossing into the kernel to write the MSR.
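For anyone who wants to see that user-land cost directly, a rough sketch follows. It assumes Linux with the `msr` driver loaded, root privileges, a kernel that permits MSR writes, and a CPU that supports the flush command. It uses the architectural encoding of IA32_FLUSH_CMD (MSR 0x10B, bit 0 = L1D_FLUSH) and the `/dev/cpu/N/msr` interface, where the file offset selects the MSR. The function name `time_l1d_flush_ns` is hypothetical, and the result is wall-clock nanoseconds for the whole system call, not the cycle count of the flush itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define MSR_IA32_FLUSH_CMD 0x10B /* architectural MSR number */
#define L1D_FLUSH_BIT      1ULL  /* bit 0 requests an L1D flush */

/*
 * Time one write of L1D_FLUSH to IA32_FLUSH_CMD on the given CPU.
 * Returns elapsed nanoseconds (syscall overhead included), or -1 if the
 * device cannot be opened or the write is rejected.
 * Assumption: the caller is already pinned to `cpu`; otherwise the kernel
 * has to forward the write to that CPU, which adds further overhead.
 */
static int64_t time_l1d_flush_ns(int cpu)
{
    char path[64];
    struct timespec t0, t1;
    uint64_t cmd = L1D_FLUSH_BIT;
    ssize_t n;
    int fd;

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1; /* no msr driver, or insufficient privileges */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    n = pwrite(fd, &cmd, sizeof cmd, MSR_IA32_FLUSH_CMD);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (n != (ssize_t)sizeof cmd)
        return -1; /* write refused, e.g. unsupported MSR or blocked write */

    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (int64_t)(t1.tv_nsec - t0.tv_nsec);
}
```

Isolating the flush itself from the kernel entry and exit would most likely need timing from inside the kernel, which this sketch does not attempt.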

Of course I have not measured any of this, so I could be completely off base.