Can we get some details on the micro-benchmarks that Intel Advisor (2020.02 or 2021.x) deploys to measure L1, L2, L3, and DRAM-level bandwidth and the FP/INT capabilities of the underlying cores?
Are these available on their own? Or do they derive from open-source benchmarks (and, if so, which)?
Thanks for posting in Intel Community Forums.
Based on our understanding, no such standalone benchmarks are available with Intel Advisor. We will discuss this with the Subject Matter Experts and let you know of any updates. As a workaround, we suggest trying the Memory-Level Roofline analysis in Intel® Advisor. Memory-Level Roofline analysis shows interactions between the different memory levels (L1, L2, L3, and DRAM) available on your system and provides automatic guidance to improve your application's performance.
Please refer to the below documentation for more details:
Intel Advisor uses microbenchmarks to measure the maximum L1/L2/L3 and DRAM bandwidth and the maximum FP/INT ops/sec per core on the system it runs on while profiling user code. It uses these figures to generate the roofline models/plots that guide the optimization of user code.
I would like to obtain these microbenchmarks, if possible, and run them at the command line so that I can automate the profiling and optimization process via scripts.
Are these microbenchmarks available or Open Source? That would be great for my work.
Happy New Year !
Intel Advisor reports per-core performance limits (L1, L2, L3, and DRAM vector/scalar bandwidth, plus FP/INT vector/scalar add/FMA peaks) as the roofs settings in the Roofline tab.
One important question I have is how do I validate that the reported numbers are what the underlying h/w supports?
Conversely, I would like to generate these figures manually and use them to do my own roofline analysis of codes in an automated fashion, as opposed to using the GUI for larger numbers of test cases. We often have to analyze runs with a large number of different inputs drawn from parameter ranges, and using the GUI for this is prohibitive.
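For anyone attempting the same "own roofline analysis" from measured roofs, the arithmetic is the classic roofline model: attainable performance is the minimum of peak compute and (memory bandwidth × arithmetic intensity). A minimal sketch, with placeholder roof values that are hypothetical and not Advisor's measured numbers:

```python
# Classic roofline model: attainable GFLOP/s at a given arithmetic
# intensity (FLOP/byte) is capped by either peak compute or a memory roof.
# All roof values below are placeholders, NOT Advisor-measured figures.

def attainable_gflops(ai_flops_per_byte, peak_gflops, bw_gbytes_per_s):
    """Attainable performance = min(peak compute, bandwidth * intensity)."""
    return min(peak_gflops, bw_gbytes_per_s * ai_flops_per_byte)

# Hypothetical single-core roofs in GB/s, one per memory level:
roofs = {"L1": 400.0, "L2": 150.0, "L3": 60.0, "DRAM": 20.0}
peak = 100.0  # hypothetical peak GFLOP/s

for level, bw in roofs.items():
    # At AI = 0.25 FLOP/byte every level here is bandwidth-bound.
    print(level, attainable_gflops(0.25, peak, bw))
```

Substituting the roofs that `-r roofs` prints for the placeholders gives the same per-level ceilings the GUI plots.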
Using the command-line advixe-cl is not much more productive, since we cannot easily automate the profiling and evaluation of a large number of application runs as we vary the input arguments.
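The per-input sweep can be scripted around advixe-cl itself. A sketch of a driver that builds one roofline collection per parameter combination; the application name, parameter names, and project-directory scheme are hypothetical, and the actual run is left commented out for machines without Advisor:

```python
import itertools

def roofline_cmd(project_dir, app, app_args):
    """Build an 'advixe-cl -c roofline' command line for one run."""
    return ["advixe-cl", "-c", "roofline",
            "--project-dir", project_dir, "--", app, *app_args]

# Hypothetical parameter sweep: one Advisor project per input combination.
sizes = ["1024", "4096"]
solvers = ["cg", "gmres"]

cmds = [roofline_cmd(f"./adv_{n}_{s}", "./my_app", ["-n", n, "-solver", s])
        for n, s in itertools.product(sizes, solvers)]

for cmd in cmds:
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment where Advisor is installed
```

Keeping one project directory per input combination lets the results be reported separately afterwards.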
Can we get the codes in Intel Advisor that produce these per core performance limits ? It would assist us to automate performance profiling and tuning.
If you really want to play with a benchmark tool, you can use 'mem_bench' from the bin subdirectory of Advisor. Alternatively, you could use the Python API to access the Advisor project. I'm not sure you will get all the raw data you want (such as unprocessed benchmark results), but you can find more information about the Advisor Python API here
Thanks Ruslan_M !
I have actually noticed mem_bench and gave it a few runs, guessing at the arguments. Does Intel Advisor use mem_bench to obtain the performance statistics for L1, L2, L3, and DRAM bandwidth?
Is there any documentation, or can you provide command lines, for running mem_bench for L1, L2, and L3?
Are the mem_bench algorithms related to those in MLC (the Intel Memory Latency Checker, https://software.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html?wap...)?
Any suggestions on how to run either tool (or any other) at the command line for L1, L2, and L3 bandwidth measurements? That would help me a lot in automating things.
mem_bench is an old utility for running the benchmarks, and it requires specific parameters. Intel Advisor no longer uses it.
The Intel Advisor benchmarks are not open source. I'm sorry, but there is no way to run the benchmarks without profiling an application. We can suggest the following.
To run the benchmarks:
1) Each time, run a roofline collection from the command line on a light application:
advixe-cl -c roofline --project-dir <project_path> -- <application> <application parameters>
2) Run the advixe-runtrc tool on a project-dir with a pre-collected survey result (to avoid repeating the survey collection):
advixe-cl -c survey --project-dir <project_path> -- <application> <application parameters>
advixe-runtrc -r <project_path>/e000/trc000 --benchmarks=dram,mcdram,cpu_cache,flop_peak,int -- <application> <application parameters>
You can stop the application after benchmarking by pressing Ctrl+C once the message "advixe: Peak bandwidth measurement finished" has been printed.
After that, add the "e000/trc000" subdirectory to the <project_path> specified for the survey.
To print the results:
1) advixe-cl -r roofs --project-dir <project_path>
Please let us know what exactly is not printed by -r roofs; it should print all the measured benchmarks.
2) Use Python API script roofs.py, which is available in install_dir/pythonapi/examples
$ advixe-python install_dir/pythonapi/examples/roofs.py <project_path>
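For the batch use case above, the roofs.py invocation can be looped over many project directories. A sketch (the install directory and project paths are hypothetical; the run itself is commented out):

```python
def roofs_report_cmd(install_dir, project_dir):
    # Invoke the bundled roofs.py example through advixe-python,
    # as shown in the step above.
    return ["advixe-python",
            f"{install_dir}/pythonapi/examples/roofs.py", project_dir]

# Hypothetical project directories from a parameter sweep:
projects = ["./adv_run_a", "./adv_run_b"]
for proj in projects:
    cmd = roofs_report_cmd("/opt/intel/advisor", proj)
    print(" ".join(cmd))
    # out = subprocess.run(cmd, capture_output=True, text=True).stdout
    # ...parse the measured roofs out of `out` here...
```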
For more information about the Python API, see https://software.seek.intel.com/LP=18275
$ advixe-runtrc --help
-cache-config=<string> Cache Configuration
-enable-cache-simulation Model CPU cache behavior on your application
If I specify the -cache-config= parameters while running the benchmarks, will it measure cache BWs as if the cache had that specification (cache size, associativity, etc.)?
No, these parameters are for profiling your application, not for the benchmarks. The benchmarks cannot be parameterized through advixe-runtrc, as they are adjusted automatically based on the current hardware. You may be able to do this with mem_bench, but that is an old version of the benchmarks.
Regarding the cache-simulation parameters: Intel Advisor uses this functionality to obtain an application's per-loop/per-function traffic for all levels of the memory subsystem. The Intel Advisor Memory-Level Roofline can show you this breakdown. https://software.intel.com/content/www/us/en/develop/articles/memory-level-roofline-model-with-advis...
BTW, Intel Advisor is a great and quite useful tool! I have been involved with computer architectures and performance analysis and I understand the challenges and all the hard work that goes on behind the scenes.
I assume that the reported per-core numbers for INT/FP scalar/vector arithmetic and L1, L2, L3 bandwidth have been verified? I am mostly interested in the L1, L2, and L3 figures. I am using low-level assembly benchmarks (AVX-512 and AVX2 vector instructions), but I have never managed to reach L1, L2, L3, or DRAM bandwidth as high as what is reported.
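As a sanity check against the DRAM roof (not the cache roofs), a crude copy-bandwidth estimate is easy to script. This is a rough lower bound only: it is single-threaded, unvectorized by hand, counts read+write traffic, and is in no way Advisor's methodology; measuring L1/L2/L3 roofs meaningfully requires compiled vector code with working sets sized to each cache:

```python
import time

def copy_bandwidth_gbs(size_bytes=256 * 1024 * 1024, reps=5):
    """Rough DRAM-level copy bandwidth estimate: time a large buffer copy.
    Counts read + write traffic (2x the buffer size per copy) and keeps
    the best of several repetitions to reduce timing noise."""
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        dst = bytes(src)  # memcpy-style copy in CPython
        best = min(best, time.perf_counter() - t0)
    assert len(dst) == size_bytes
    return 2 * size_bytes / best / 1e9

print(f"~{copy_bandwidth_gbs():.1f} GB/s copy bandwidth (crude lower bound)")
```

If even this crude figure sits well below the reported DRAM roof, the gap is expected; the Advisor benchmarks use tuned vector streams across all cores.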
Can you offer some insight into how the cache benchmarks work? Are the tests related to those in the Intel Memory Latency Checker (https://software.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html)?
Or can you point me to articles explaining the methodology behind the cache benchmarks?