How to collect PMU events, such as L1D cache misses, using Intel VTune's ITT API?
Hi,
Thanks for posting in Intel Communities.
Intel® VTune™ Profiler provides a set of hardware event-based analysis types that help you estimate how effectively your application uses hardware resources. These analysis types monitor hardware events supported by your system's Performance Monitoring Unit (PMU). The PMU is hardware built inside a processor to measure its performance parameters such as instruction cycles, cache hits, cache misses, branch misses and many others.
The Instrumentation and Tracing Technology API (ITT API) provided by the Intel® VTune™ Profiler enables your application to generate and control the collection of trace data during its execution.
To use the APIs, add API calls in your code to designate logical tasks. These markers will help you visualize the relationship between tasks in your code relative to other CPU and GPU tasks. To see user tasks in your performance analysis results, enable the Analyze user tasks checkbox in analysis settings.
User task and API data can be visualized in Intel® VTune™ Profiler performance analysis results.
After you have added basic annotations to your application to control performance data collection, you can view these annotations in the Intel VTune Profiler timeline. All supported instrumentation and tracing technology (ITT) API tasks can be visualized in VTune Profiler.
https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/task-analysis.html
A task instance represents a piece of work performed by a particular thread for a period of time. The task is defined by the bracketing of __itt_task_begin() and __itt_task_end() on the same thread.
https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/task-api.html
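Since a task is defined by a begin/end pair on the same thread, one way to keep the pair together is a small scope guard. The following is only a sketch: `ScopedBracket` is a hypothetical helper, not part of the ITT API, and it takes plain callables so that it compiles without `ittnotify.h`. In a real build, the callables would wrap `__itt_task_begin()` and `__itt_task_end()`.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper (not part of the ITT API): runs `begin` when the guard
// is constructed and `end` when the guard goes out of scope, so both calls
// happen in the same scope and therefore on the same thread, even if the
// guarded code returns early or throws.
class ScopedBracket {
public:
    ScopedBracket(const std::function<void()>& begin, std::function<void()> end)
        : end_(std::move(end)) {
        if (begin) begin();
    }

    ~ScopedBracket() {
        if (end_) end_();
    }

    // Non-copyable: copying would run `end` more than once.
    ScopedBracket(const ScopedBracket&) = delete;
    ScopedBracket& operator=(const ScopedBracket&) = delete;

private:
    std::function<void()> end_;
};

// With ITT available, usage could look like this (assuming `domain` and
// `handle` were created beforehand):
//
//   ScopedBracket task(
//       [&] { __itt_task_begin(domain, __itt_null, __itt_null, handle); },
//       [&] { __itt_task_end(domain); });
```

`std::function` adds some call overhead, so for very short tasks calling the ITT functions directly may be preferable.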
To display a list of events available on the target PMU, enter: vtune -collect-with runsa -knob event-config=? <target>.
https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/collect-with.html
To collect data on memory access analysis:
-------------------------------------------------------
vtune -collect memory-access -r <path of result directory> -- ./<name of the application>
For reference: https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/run-memory-access-analysis-command-line.html
To get hardware events in CSV format:
-----------------------------------------------
vtune -report hw-events -result-dir <dir> -report-output <path/filename.csv> -format csv -csv-delimiter comma
<dir> is the location of the result directory.
<path/filename> is the PATH and filename of the report file to be created.
For reference: https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/saving-and-formatting-reports.html
If this resolves your issue, make sure to accept this as a solution. This would help others with a similar issue.
Thanks
Can you get PMU event observations when you want to profile only a specific piece of code?
Hi,
If you wish to get PMU event observations for a specific piece of code, it might be more effective to instrument your application with the Task or Frame API.
As a result, you'll see exact function boundaries in the timeline view and will be able to group performance data by tasks or frames with or without applying filters.
The following steps should be performed to enable ITT-based code annotation for the target application:
- Get the ITT source code from GitHub.
- Call ITT functions to annotate the regions-of-interest inside the source code:
    #include <cassert>      // for assert()
    #include <ittnotify.h>

    int main() {
        // A domain groups related frames and tasks in the results
        __itt_domain* domain = __itt_domain_create("Domain.Global");
        assert(domain != nullptr);

        // Place a new frame
        __itt_frame_begin_v3(domain, nullptr);
        {
            // Annotate the first task
            __itt_string_handle* first_task_handle =
                __itt_string_handle_create("FirstTask");
            __itt_task_begin(domain, __itt_null, __itt_null, first_task_handle);
            {
                /* First Task Body */
            }
            __itt_task_end(domain);

            // Annotate the second task
            __itt_string_handle* second_task_handle =
                __itt_string_handle_create("SecondTask");
            __itt_task_begin(domain, __itt_null, __itt_null, second_task_handle);
            {
                /* Second Task Body */
            }
            __itt_task_end(domain);
        }
        __itt_frame_end_v3(domain, nullptr);

        return 0;
    }
- Build the application and link it with the ITT library implementation. You can build the ITT static library first and then link the application with it. Another way is to add the ITT sources (in particular, the ittnotify_static.c file) to the application directly.
- Run the application under Intel® VTune™ Profiler to see the result.
If you are profiling a piece of code that executes within a few microseconds, Frame APIs are suitable for this case. One way to work around the short duration is to increase the number of loop iterations, which increases the workload between the API calls. You can then divide the measured time by the iteration count to calculate how long one iteration took. However, this approach is not suitable if the execution time varies heavily between iterations of the loop.
Frame API usage example: https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/frame-api.html
ITT API usage example: https://github.com/intel/pti-gpu/blob/master/chapters/code_annotation/ITT.md
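The "increase the iterations, then do the math" idea can be sketched in plain C++. This is an illustration only: `average_ns_per_iteration` is a hypothetical helper, and it uses `std::chrono` wall-clock time as a stand-in for whatever VTune reports for the region. The comments mark where the frame calls from the example above would go in an instrumented build.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `body` `iterations` times inside one measured
// region and returns the average time per iteration in nanoseconds.
template <typename Body>
double average_ns_per_iteration(std::size_t iterations, Body&& body) {
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    // In an instrumented build: __itt_frame_begin_v3(domain, nullptr);
    for (std::size_t i = 0; i < iterations; ++i) {
        body();
    }
    // In an instrumented build: __itt_frame_end_v3(domain, nullptr);
    const auto stop = clock::now();

    if (iterations == 0) {
        return 0.0;  // nothing ran, so there is nothing to average
    }
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

The result is an average, so it hides any iteration-to-iteration variation, which is the limitation noted above.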
After collection completes, the analysis results appear in a viewpoint specific to the analysis type selected. The API data collected is available in the following locations:
Timeline view: Each API type appears differently on the timeline view. In the example below, the code was instrumented with the task API, frame API, event API, and collection control API. Tasks appear as yellow bars on the task thread. Frames appear at the top of the timeline in pink. Events appear on the appropriate thread as a triangle at the event time. Collection control events span the entire timeline. Hover over a task, frame, or event to view the type of API task.
Grid view: Set the Grouping to Task Domain / Task Type / Function / Call Stack or Task Type / Function / Call Stack to view task data in the grid pane.
Platform tab: Individual tasks are available in a larger view on the Platform tab. Hover over a task to get more information.
Hope this clarifies your query.
Thanks
Where can I see the PMU events, say for the L1D cache, for a given frame?
Hi,
The Performance Monitoring Unit (PMU) provides a list of events to measure micro-architectural events such as the number of cycles, instructions retired, L1 cache misses and so on.
Those events are called PMU hardware events, or hardware events for short. They vary with each processor type and model.
If you want to see the PMU events for, say, the L1D cache, you can use the Memory Subsystem PMU events.
There are MEM_LOAD_RETIRED.L1_MISS and MEM_LOAD_RETIRED.L1_HIT. Alternatively, you can use L1-dcache-loads and L1-dcache-load-misses.
Please refer to the link below for more information:
https://www.cs.utexas.edu/~pingali/CS377P/2018sp/lectures/vtune-cache-jackson.pdf
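Once you have counts for the two events, a miss ratio is simple arithmetic. The sketch below is an approximation under one assumption: it treats hits plus misses as the total number of retired loads, which may not cover every load category on a given microarchitecture. `l1d_load_miss_ratio` is a hypothetical helper name, and the counts would come from your own collected results.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: approximate fraction of retired loads that missed the
// L1 data cache, given counts for MEM_LOAD_RETIRED.L1_HIT and
// MEM_LOAD_RETIRED.L1_MISS.
// Assumption: hits + misses is used as the total number of loads.
double l1d_load_miss_ratio(std::uint64_t l1_hit, std::uint64_t l1_miss) {
    const std::uint64_t total = l1_hit + l1_miss;
    if (total == 0) {
        return 0.0;  // no loads counted, so report no misses
    }
    return static_cast<double>(l1_miss) / static_cast<double>(total);
}
```

For example, 75 hits and 25 misses give a ratio of 0.25.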
The Memory Access analysis type uses hardware event-based sampling to collect data for metrics such as:
- Memory Bound, which shows the fraction of cycles spent waiting due to demand load or store instructions.
- L1 Bound, which shows how often the machine was stalled without missing the L1 data cache.
The Summary window gives the percentage of pipeline slots in each category for the whole application. You can explore results in multiple ways.
The most common way to explore results is to view metrics at the function level.
Most of the metrics under the Memory Bound category identify which level of the memory hierarchy from the L1 cache through the memory is the bottleneck.
Grayed out metric values indicate that the data collected for this metric is unreliable. This may happen, for example, if the number of samples collected for PMU events is too low. You may either ignore this data, or rerun the collection with the data collection time, sampling interval, or workload increased.
Reference : https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/memory-access-analysis.html
If this resolves your issue, make sure to accept this as a solution. This would help others with a similar issue.
Thanks
I think my question was not clear.
I meant to ask how I can get, say, L1D cache misses for a specific piece of code using the ITT API.
Memory Access analysis in VTune gives the hardware counters for the end-to-end run of the program, whereas what I want is the counters only for a specific piece of code.
Hi,
To focus your performance analysis on a task - program functionality performed by a particular code section - you can use the Intel® VTune™ Profiler ITT API tasks.
• ITT API tasks: Analyze the performance of particular code regions (tasks) if your target uses the Task API to mark task regions and you enabled the Analyze user tasks, events, and counters option during the analysis type configuration.
If you only want to focus on a piece of code in your program, mark this section by defining it as a Code Region of Interest. Use the ITT API for this purpose.
1. Register the name of the code region you plan to profile:
    __itt_pt_region region = __itt_pt_region_create("region");
2. Mark the target loop in your application with this name:
    for(…;…;…)
    {
        __itt_mark_pt_region_begin(region);
        <code processing your task>
        __itt_mark_pt_region_end(region);
    }
Run the Analysis
For more details, please check the link :
For Task Analysis:
Prerequisites:
• Use the ITT Task API to insert calls in your code and define the tasks.
• Configure your analysis target.
1. Click the Configure Analysis button on the VTune Profiler toolbar (standalone GUI or Visual Studio IDE).
2. Choose the analysis type from the HOW pane.
3. Select the Analyze user tasks, events, and counters option.
4. Click the Start button to run the analysis.
VTune Profiler collects data detecting the marked tasks.
Once the analysis is complete, VTune Profiler displays results in the Summary window, where you can find details such as counters.
References: "Task Analysis" in https://www.intel.com/content/dam/develop/external/us/en/documents/vtune-profiler-user-guide.pdf
https://github.com/intel/ittapi
Thanks
Hi,
We have not heard back from you. Could you please give an update?
Can we go ahead and close the thread?
Thanks
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks