drMikeT
Novice
126 Views

Guidance on writing our own cache memory BW and latency microbenchmarks

It would be great if we could get some guidance on writing our own microbenchmarks to measure memory BW and latency for the L1, L2, and L3 caches.

Is it advisable to use Intel C intrinsics or straight assembly? Any suggestions would be great.

Sharing the theoretical and measured cache memory BWs and latencies would be even better.

Thank you!

Michael

0 Kudos
5 Replies
RaeesaM_Intel
Moderator
97 Views

Hi,


Thanks for reaching out to us.

We are discussing your query with the VTune experts and will get back to you on the updates.


Regards,

Raeesa


Kevin_O_Intel1
Employee
60 Views

Let me check what information I can share.

We have the STREAM benchmark in our oneAPI samples GitHub repository.

It should cover much of what you need:

https://github.com/omartiny/oneAPI-samples/tree/master/Tools/Benchmarks/STREAM

drMikeT
Novice
43 Views

Thanks Kevin!

My question is about accurately measuring L1, L2, and L3 read/write BWs and, if possible, latencies.

Would using Intel intrinsics allow user code to measure these accurately, or should one go down to assembly? It would be great, of course, if C compilers were smart enough to let one measure these with straight C code.

Best!

Michael

Kevin_O_Intel1
Employee
40 Views

Intrinsics do give you more control, but compilers have gotten very good.

I would recommend you look at the Intel memory latency checker tool. It can be extremely helpful.

https://software.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html

McCalpinJohn
Black Belt
25 Views

This is not an easy topic, especially given the complexity and lack of documentation in modern processors.

The seemingly simple concepts of "latency" and "bandwidth" diverge into a remarkably large set of special cases once you start getting involved in the details. For L1-resident benchmarks, timing becomes challenging, and idiosyncrasies of branch prediction contribute non-negligible variations.
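As one concrete starting point for the latency side, a common approach is a pointer chase: every load depends on the result of the previous one, so the time per load approximates load-to-use latency for whatever level of the hierarchy the working set fits in. The sketch below is only an illustration under stated assumptions: the names `node_t`, `build_chain`, and `chase_ns_per_load` are hypothetical, a 64-byte cache line is assumed, and the POSIX `clock_gettime` timer is used for portability even though its overhead and resolution make it a poor fit for very short L1-resident runs.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* One node per cache line (ASSUMPTION: 64-byte lines). */
typedef struct {
    size_t next;
    char pad[64 - sizeof(size_t)];
} node_t;

/* Link the nodes into a single random cycle (Sattolo's algorithm), so a
   chase starting anywhere visits all n nodes before repeating. */
static void build_chain(node_t *nodes, size_t n, unsigned seed)
{
    assert(n > 1);
    for (size_t i = 0; i < n; i++)
        nodes[i].next = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* j in [0, i-1] keeps one cycle */
        size_t tmp = nodes[i].next;
        nodes[i].next = nodes[j].next;
        nodes[j].next = tmp;
    }
}

/* Follow the chain for n_loads dependent loads and return the average
   nanoseconds per load. The final index is written to *sink so the
   compiler cannot discard the loop. */
static double chase_ns_per_load(const node_t *nodes, size_t n_loads,
                                size_t *sink)
{
    struct timespec t0, t1;
    size_t idx = 0;

    assert(n_loads > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t k = 0; k < n_loads; k++)
        idx = nodes[idx].next;              /* depends on the previous load */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sink = idx;
    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
            (double)(t1.tv_nsec - t0.tv_nsec)) / (double)n_loads;
}
```

To use it, one would allocate `n` nodes sized to the cache level of interest (for example with `aligned_alloc(64, n * sizeof(node_t))`), call `build_chain`, run `chase_ns_per_load` once to warm the cache, and then time a second run with `n_loads` large enough to swamp the timer overhead. Hardware prefetchers, TLB misses, and frequency changes can all still distort the result, so treat the numbers as a first approximation.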

There are a lot of write-ups relating to latency and bandwidth measurements at my blog: https://sites.utexas.edu/jdm4372/

Timing can be tricky: see "Comments on timing short code sections on Intel processors".

For code generation, I find it easiest to start with simple C code, compile to assembly, then tweak the assembly code as needed.
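To illustrate that workflow, here is a minimal sketch of the kind of simple C kernel one might start from for a read-only bandwidth test. The function name `read_sum` and the choice of four accumulators are assumptions made for illustration, not something prescribed in this thread.

```c
#include <stddef.h>

/* Read-only kernel: sum an array of doubles.
   Four independent accumulators keep the loop from being limited by the
   latency of a single chain of floating-point adds. */
double read_sum(const double *restrict a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* remainder when n is not a multiple of 4 */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Compiling with something like `gcc -O2 -S` writes the generated assembly to a `.s` file, which can then be inspected (for example, to see whether the loads were vectorized and how far the loop was unrolled) and edited by hand before assembling the final benchmark.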

Intel publishes peak and sustainable cache BW numbers for some of their processors in Chapter 2 of their Optimization Reference Manual. The results are presented without any detail at all, so it is likely that the read:write ratios required to obtain the stated sustainable BW values at different levels of the cache hierarchy are not the same. For the L1 DCache, for example, maximum sustainable BW appears to correspond to 2R:1W, while for the L2 and L3 caches, maximum sustainable BW appears to correspond to all reads.
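For reference, a 2R:1W access mix is what a simple vector add produces: two loads and one store per element. The sketch below shows such a kernel together with the byte accounting needed to turn a measured time into GB/s. The names `add_2r1w`, `add_2r1w_bytes`, and `gb_per_s` are hypothetical, and whether a kernel like this actually reaches the published figures will depend on how the compiler vectorizes it and on the specific processor.

```c
#include <stddef.h>

/* Two reads and one write per element: c[i] = a[i] + b[i]. */
void add_2r1w(double *restrict c, const double *restrict a,
              const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Bytes explicitly loaded and stored by add_2r1w for n elements.
   This counts only the program's own loads and stores, not any extra
   traffic the hardware generates (such as write-allocate fills). */
size_t add_2r1w_bytes(size_t n)
{
    return 3 * n * sizeof(double);
}

/* Convert a byte count and an elapsed time in seconds to GB/s,
   using 1 GB = 1e9 bytes. */
double gb_per_s(size_t bytes, double seconds)
{
    return (double)bytes / seconds / 1e9;
}
```

In practice one would call `add_2r1w` many times over arrays sized to fit the cache level under test, time the whole batch, and pass the total byte count and elapsed time to `gb_per_s`. Swapping in an all-reads kernel gives the comparison point for the L2 and L3 cases.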