Guidance on writing our own cache memory BW and latency micro-benchmarks

drMikeT · ‎02-15-2021

It would be great if we could get some guidance on writing our own microbenchmarks for L1, L2, and L3 cache bandwidth (BW) and latency.

Is it advisable to use Intel C intrinsics or straight assembly? Any suggestions would be great.

Sharing the theoretical and measured cache memory BWs and latencies would be even better.

Thank you!

Michael

RaeesaM_Intel · ‎02-16-2021

Hi,

Thanks for reaching out to us. We are discussing your query with the Analyzer experts and will get back to you on the updates.

Regards, Raeesa

Kevin_O_Intel1 · ‎02-22-2021

Let me check what information I can share.

We have the STREAM benchmark in our oneAPI samples GitHub repository.

It should cover much of what you need:

https://github.com/omartiny/oneAPI-samples/tree/master/Tools/Benchmarks/STREAM

drMikeT · ‎02-23-2021

Thanks Kevin!

My question is about accurately measuring L1, L2, and L3 read/write BWs and, if possible, latencies.

Would using Intel intrinsics allow user code to measure these accurately or should one go down to assembly? It would be great of course if C compilers are smart enough to let one measure these with straight C code.

Best!

Michael

Kevin_O_Intel1 · ‎02-23-2021

Intrinsics do give you more control, but compilers have gotten very good.

I would recommend you look at the Intel memory latency checker tool. It can be extremely helpful.

https://software.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html

drMikeT · ‎02-28-2021

Thanks, I am familiar with Intel MLC. Could it reproduce the cache BW numbers that Intel Advisor shows on the roofline curves? MLC does not seem to have a pure read-only test for caches, as it allocates two buffers: one to read from and the other to write to, as in the "--peak_injection_bandwidth" option.

My objective is to be able to generate the roofline curves myself so I can automate the analysis and postprocessing of performance numbers for a large number of code variations using scripting and batch jobs. Doing so interactively is prohibitively time-consuming!
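For scripted post-processing, the roofline bound itself is simple to compute once the ceilings are known. The following is a minimal sketch of the textbook roofline formula, not a description of how Intel Advisor derives its numbers. The function names roofline_gflops and ridge_point are made up for this illustration, and the peak and bandwidth values must come from your own measurements or from published figures.

```c
#include <assert.h>

/*
 * Textbook roofline bound for one bandwidth ceiling (L1, L2, L3 or DRAM):
 *   attainable GFLOP/s = min(peak GFLOP/s, bandwidth GB/s * FLOPs per byte)
 * All inputs are supplied by the caller; nothing is measured here.
 */
double roofline_gflops(double peak_gflops, double bw_gbytes_per_s,
                       double flops_per_byte)
{
    assert(peak_gflops > 0.0 && bw_gbytes_per_s > 0.0 && flops_per_byte > 0.0);
    double memory_bound = bw_gbytes_per_s * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/*
 * Ridge point: the arithmetic intensity (FLOPs per byte) at which the
 * bandwidth ceiling meets the compute ceiling.
 */
double ridge_point(double peak_gflops, double bw_gbytes_per_s)
{
    assert(peak_gflops > 0.0 && bw_gbytes_per_s > 0.0);
    return peak_gflops / bw_gbytes_per_s;
}
```

Calling roofline_gflops once per cache level over a range of arithmetic intensities gives the points of each ceiling, which a batch script can write out for plotting.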

Is there any document to understand a little better how the cache BW numbers are collected?

thanks

McCalpinJohn · ‎02-28-2021

It is possible to do cache-contained tests with Intel MLC by choosing a sufficiently small array size, but I don't think that makes a lot of sense -- there is too much uncertainty about what MLC is actually doing in these cases, so it is impossible to know whether it is doing what you want...

For performance modeling, you might want to look at the ECM model ("Execute, Cache, Memory"). There is an introduction and a bit of bibliography at https://hpc.fau.de/research/ecm/

The ECM model can be thought of as a more sophisticated variation of the additive performance models that I discuss in "The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models". Both the ECM results and my results show that roofline models are almost always overly optimistic -- computation and memory access don't overlap as well as one might hope (and as well as the roofline model assumes).
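The difference between the two assumptions can be shown with a deliberately simplified sketch. This is neither the ECM model nor the sensitivity-based model itself: it only contrasts a perfect-overlap estimate (the roofline assumption) with a no-overlap estimate (the simplest additive form). The function names are hypothetical, and the compute and memory times are inputs the caller must supply.

```c
#include <assert.h>

/* Roofline-style estimate: compute and memory time overlap perfectly,
 * so the runtime is the larger of the two components. */
double time_overlapped(double t_compute, double t_memory)
{
    assert(t_compute >= 0.0 && t_memory >= 0.0);
    return t_compute > t_memory ? t_compute : t_memory;
}

/* Simplest additive estimate: no overlap at all, so the runtime is
 * the sum of the two components. */
double time_additive(double t_compute, double t_memory)
{
    assert(t_compute >= 0.0 && t_memory >= 0.0);
    return t_compute + t_memory;
}
```

For any inputs the overlapped estimate is never larger than the additive one, which is one way to see why a roofline prediction is the optimistic end of the range.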

McCalpinJohn · ‎02-23-2021

This is not an easy topic, especially given the complexity and lack of documentation in modern processors.

The seemingly simple concepts of "latency" and "bandwidth" diverge into a remarkably large set of special cases once you start getting involved in the details. For L1-resident benchmarks, timing becomes challenging, and idiosyncrasies of branch prediction contribute non-negligible variations.
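As one concrete special case, load-to-use latency is commonly estimated with a pointer chase, where each load address depends on the previous load. The following is a minimal sketch under several assumptions: the names build_chain, chase and ns_per_load are made up for this illustration, the timer is POSIX clock_gettime rather than a cycle counter, and the elements are 8-byte indices, so several of them share a cache line. The cache level being measured is selected only by how large an array the caller allocates.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict C modes */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with a random permutation that forms a single cycle
 * (Sattolo's algorithm), so a chase visits every slot before repeating. */
void build_chain(size_t *next, size_t n, unsigned seed)
{
    assert(n >= 2);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final index. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    return p;
}

/* Average wall-clock nanoseconds per dependent load. The final index is
 * stored through `sink` so the compiler cannot discard the chase. */
double ns_per_load(const size_t *next, size_t steps, size_t *sink)
{
    struct timespec t0, t1;
    assert(steps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *sink = chase(next, 0, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

In practice one would run a warm-up pass first, repeat the measurement several times and keep the minimum, and pin the thread to a core. None of that is shown here.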

There are a lot of write-ups relating to latency and bandwidth measurements at my blog: https://sites.utexas.edu/jdm4372/

Timing can be tricky: see "Comments on timing short code sections on Intel processors".

For code generation, I find it easiest to start with simple C code, compile to assembly, then tweak the assembly code as needed.
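A starting point for that workflow might look like the sketch below. It is an illustrative read-only kernel, not code from the projects mentioned in this thread: read_sum and gbytes_per_s are hypothetical names, and the choice of four accumulators is an assumption made so that a single floating-point add dependency chain does not limit the loop. Compiling with something like "gcc -O2 -S" writes the generated assembly to a .s file, which can then be inspected and hand-tuned.

```c
#include <assert.h>
#include <stddef.h>

/* Read-only kernel: sum n doubles using four independent accumulators.
 * n must be a multiple of 4 to keep the sketch free of remainder code. */
double read_sum(const double *a, size_t n)
{
    assert(n % 4 == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

/* Convert a timed run into GB/s (1 GB = 1e9 bytes): the kernel read
 * n_elems doubles on each of `repeats` passes in `seconds` of wall time. */
double gbytes_per_s(size_t n_elems, size_t repeats, double seconds)
{
    assert(seconds > 0.0);
    double bytes = (double)n_elems * (double)repeats * (double)sizeof(double);
    return (bytes / seconds) / 1e9;
}
```

Sizing the array to fit within the cache level of interest and repeating the pass many times keeps the data resident. The returned sum should be used (printed or stored) so the compiler does not remove the loop.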

This approach was used extensively in this project: "A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)".

Intel publishes peak and sustainable cache BW numbers for some of their processors in Chapter 2 of their Optimization Reference Manual. The results are presented without any detail at all, so it is likely that the read:write ratios required to obtain the stated sustainable BW values at different levels of the cache hierarchy are not the same. For the L1 DCache, for example, maximum sustainable BW appears to correspond to 2R:1W, while for the L2 and L3 caches, maximum sustainable BW appears to correspond to all reads.

drMikeT · ‎02-28-2021

Dr. McCalpin, thanks for the informative response!

Yes, nowadays the cores are too complex, and their performance varies as a function of their power management, the ISA used, and the turbo boost state. Intel Advisor is a great tool, but it is mostly interactive, which is too time-consuming or prohibitive for analyzing a large number of code variations in an automated fashion.

I have also been involved in investigating low-level raw methods to get performance measurements, but this is a full-time occupation for a group of people who have access to all the non-public details. Processor generations and different families have different power, voltage, and clock management logic, and it is just too time-consuming to learn by poking the black box from different angles...

Nice blog