Re: Collecting PMU events when using Intel VTune's ITT API

rbachkaniwala3 · ‎09-04-2023

How can I collect PMU events, such as L1D cache misses, using Intel VTune's ITT API?

Rahila_T_Intel · ‎09-07-2023

Hi,

Thanks for posting in Intel Communities.

Intel® VTune™ Profiler provides a set of hardware event-based analysis types that help you estimate how effectively your application uses hardware resources. These analysis types monitor hardware events supported by your system's Performance Monitoring Unit (PMU). The PMU is hardware built inside a processor to measure its performance parameters such as instruction cycles, cache hits, cache misses, branch misses and many others.

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/intel-processor-events-reference.html

The Instrumentation and Tracing Technology API (ITT API) provided by the Intel® VTune™ Profiler enables your application to generate and control the collection of trace data during its execution.

To use the APIs, add API calls in your code to designate logical tasks. These markers will help you visualize the relationship between tasks in your code relative to other CPU and GPU tasks. To see user tasks in your performance analysis results, enable the Analyze user tasks checkbox in analysis settings.

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/instrumentation-and-tracing-technology-apis.html

User task and API data can be visualized in Intel® VTune™ Profiler performance analysis results.

After you have added basic annotations to your application to control performance data collection, you can view these annotations in the Intel VTune Profiler timeline. All supported instrumentation and tracing technology (ITT) API tasks can be visualized in VTune Profiler.

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/viewing-itt-api-task-data.html

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/task-analysis.html

A task instance represents a piece of work performed by a particular thread for a period of time. The task is defined by the bracketing of __itt_task_begin() and __itt_task_end() on the same thread.

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/task-api.html

To display a list of events available on the target PMU, enter:

vtune -collect-with runsa -knob event-config=? <target>

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/collect-with.html

To collect data with Memory Access analysis:

-------------------------------------------------------

vtune -collect memory-access -r <path of result directory> -- ./<name of the application>

For reference: https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/run-memory-access-analysis-command-line.html

To get hardware events in CSV format:

-----------------------------------------------

vtune -report hw-events -result-dir <dir> -report-output <path/filename.csv> -format csv -csv-delimiter comma

<dir> is the location of the result directory.

<path/filename.csv> is the path and file name of the report file to be created.

For reference: https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/saving-and-formatting-reports.html
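Once the report is saved as comma-delimited text, the counts for one event can be totalled with a few lines of code. The following is only a minimal sketch and not part of VTune: `split_csv_line` and `sum_column` are hypothetical helper names, and the column header to pass in is an assumption that depends on your report, so check the first line of the generated file. The parser is deliberately simple and does not handle quoted fields or commas inside a field.

```cpp
#include <cstddef>
#include <exception>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits one line on commas. Simplification: quoted fields are not supported.
inline std::vector<std::string> split_csv_line(std::string line) {
  // Drop a trailing carriage return left over from CRLF line endings.
  if (!line.empty() && line.back() == '\r') line.pop_back();
  std::vector<std::string> fields;
  std::string field;
  std::istringstream in(line);
  while (std::getline(in, field, ',')) fields.push_back(field);
  return fields;
}

// Sums the numeric cells of the column whose header equals `column`.
// Returns -1.0 if the stream is empty or the header is not found.
// Rows with a missing or non-numeric cell in that column are skipped.
inline double sum_column(std::istream& csv, const std::string& column) {
  std::string line;
  if (!std::getline(csv, line)) return -1.0;

  const std::vector<std::string> header = split_csv_line(line);
  std::size_t index = header.size();
  for (std::size_t i = 0; i < header.size(); ++i) {
    if (header[i] == column) {
      index = i;
      break;
    }
  }
  if (index == header.size()) return -1.0;

  double total = 0.0;
  while (std::getline(csv, line)) {
    const std::vector<std::string> row = split_csv_line(line);
    if (index >= row.size()) continue;
    try {
      total += std::stod(row[index]);
    } catch (const std::exception&) {
      // Non-numeric cell: ignore this row.
    }
  }
  return total;
}
```

To use it, open the report with std::ifstream and call sum_column(file, "<header of the event column>").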

If this resolves your issue, please accept this as a solution. This would help others with a similar issue.

Thanks

rbachkaniwala3 · ‎09-07-2023

This doesn't answer my question.

Can you get PMU event counts when you want to profile only a specific piece of code?

Rahila_T_Intel · ‎09-11-2023

Hi,

If you want PMU event data for a specific piece of code, it might be more effective to instrument your application with the Task or Frame API.

As a result, you'll see exact function boundaries in the over-time (timeline) view and will be able to group performance data by tasks or frames, with or without applying filters.

Follow these steps to enable ITT-based code annotation for the target application:

Get the ITT source code from GitHub.
Call ITT functions to annotate the regions of interest inside the source code:

#include <cassert>      // assert
#include <ittnotify.h>  // ITT API declarations

int main() {
  __itt_domain* domain = __itt_domain_create("Domain.Global");
  assert(domain != nullptr);

  // Place a new frame
  __itt_frame_begin_v3(domain, nullptr);
  {
    // Annotate the first task
    __itt_string_handle* first_task_handle =
      __itt_string_handle_create("FirstTask");
    __itt_task_begin(domain, __itt_null, __itt_null, first_task_handle);
    {
      /* First Task Body */
    }
    __itt_task_end(domain);

    // Annotate the second task
    __itt_string_handle* second_task_handle =
      __itt_string_handle_create("SecondTask");
    __itt_task_begin(domain, __itt_null, __itt_null, second_task_handle);
    {
      /* Second Task Body */
    }
    __itt_task_end(domain);
  }
  __itt_frame_end_v3(domain, nullptr);

  return 0;
}

Build the application and link it with the ITT library implementation. You can build the ITT static library first and then link the application with it. Another way is to add the ITT sources (in particular, the ittnotify_static.c file) to the application directly.
Run the application under Intel® VTune™ Profiler to see the result.
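Because each task must be closed on the same thread that opened it, it can help to tie the begin and end calls to a scope. The following is a minimal sketch of that idea, not an ITT feature: `ScopedTask`, `open_count`, and the `SKETCH_HAVE_ITT` macro are hypothetical names introduced here. It assumes that ittnotify.h is on the include path and the ITT library is linked. When the header is not found, the annotations compile away so the sketch still builds.

```cpp
// Use the real ITT header when it is available; otherwise compile the
// annotations away so this sketch still builds without ITT installed.
#if defined(__has_include)
#  if __has_include(<ittnotify.h>)
#    include <ittnotify.h>
#    define SKETCH_HAVE_ITT 1
#  endif
#endif

// Hypothetical helper (not part of ITT): begins a task in the constructor and
// ends it in the destructor, so begin/end always pair up on one thread.
class ScopedTask {
 public:
  explicit ScopedTask(const char* name) {
#ifdef SKETCH_HAVE_ITT
    __itt_task_begin(domain(), __itt_null, __itt_null,
                     __itt_string_handle_create(name));
#else
    (void)name;  // no ITT available: nothing to annotate
#endif
    ++open_count();
  }

  ~ScopedTask() {
#ifdef SKETCH_HAVE_ITT
    __itt_task_end(domain());
#endif
    --open_count();
  }

  ScopedTask(const ScopedTask&) = delete;
  ScopedTask& operator=(const ScopedTask&) = delete;

  // Number of tasks currently open on the calling thread.
  // This is bookkeeping for the sketch only; ITT does not need it.
  static int& open_count() {
    thread_local int count = 0;
    return count;
  }

 private:
#ifdef SKETCH_HAVE_ITT
  // One shared domain, created on first use.
  static __itt_domain* domain() {
    static __itt_domain* d = __itt_domain_create("Domain.Global");
    return d;
  }
#endif
};
```

With this helper, the first task in the example above could be written as a block containing `ScopedTask task("FirstTask");` followed by the task body. The task ends when the block exits, including on an early return.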

Frame APIs can be used for a piece of code that executes within a few microseconds, but a single pass between the API calls may be too short to measure reliably. One way to work around this problem is to increase the number of loop iterations, and with it the workload, between the API calls. You can then divide the measured total by the number of iterations to estimate the cost of one iteration. However, this approach is not suitable if the execution time varies heavily between iterations of the loop.
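The per-iteration arithmetic can be sketched as follows. This is a hypothetical illustration, not VTune or ITT code: `total_ns` and `per_iteration` are names made up for this example. Wall-clock time from std::chrono stands in for whatever total (elapsed time or an event count) you read from the result for the frame. In a real run, the frame begin and end calls would bracket the loop.

```cpp
#include <chrono>
#include <cstddef>

// Runs `body` `iterations` times back to back and returns the total elapsed
// time in nanoseconds, so the measured interval is long enough to observe.
template <typename Body>
double total_ns(Body&& body, std::size_t iterations) {
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    body();
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count();
}

// Average cost of one iteration: the batch total divided by the batch size.
// The same division applies to an event count measured over the batch.
inline double per_iteration(double total, std::size_t iterations) {
  return iterations == 0 ? 0.0 : total / static_cast<double>(iterations);
}
```

The result is an average, so it hides any variation between iterations. That is why the approach breaks down when iteration times differ widely.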

Frame API usage example: https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/frame-api.html

ITT API usage example: https://github.com/intel/pti-gpu/blob/master/chapters/code_annotation/ITT.md

After collection completes, the analysis results appear in a viewpoint specific to the analysis type selected. The API data collected is available in the following locations:

Timeline view: Each API type appears differently on the timeline view. For code instrumented with the task API, frame API, event API, and collection control API, tasks appear as yellow bars on the task thread, frames appear at the top of the timeline in pink, and events appear on the appropriate thread as a triangle at the event time. Collection control events span the entire timeline. Hover over a task, frame, or event to view the type of API task.

Grid view: Set the Grouping to Task Domain / Task Type / Function / Call Stack or Task Type / Function / Call Stack to view task data in the grid pane.

Platform tab: Individual tasks are available in a larger view on the Platform tab. Hover over a task to get more information.

Hope this clarifies your query.

Thanks

rbachkaniwala3 · ‎09-12-2023

Where can I see the PMU events for, say, the L1D cache for a given frame?

Rahila_T_Intel · ‎09-13-2023

Hi,

The Performance Monitoring Unit (PMU) provides a list of events that measure micro-architectural activity such as the number of cycles, instructions retired, L1 cache misses, and so on.

Those events are called PMU hardware events or hardware events for short. They vary with each processor type and model.

If you want to see PMU data for, say, the L1D cache, you can use the Memory Subsystem PMU events.

These include MEM_LOAD_RETIRED.L1_MISS and MEM_LOAD_RETIRED.L1_HIT. Alternatively, you can use L1-dcache-loads and L1-dcache-load-misses.
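Given counts for the two MEM_LOAD_RETIRED events, a rough L1D miss ratio is the miss count divided by the sum of both counts. The sketch below is only an illustration: `l1d_miss_ratio` is a hypothetical helper, not a VTune function. It assumes the two events are the only outcomes for a retired load, which is a simplification, and the counts themselves come from sampling, so treat the result as an estimate.

```cpp
#include <cstdint>

// Fraction of retired loads that missed the L1 data cache, computed from
// the MEM_LOAD_RETIRED.L1_HIT and MEM_LOAD_RETIRED.L1_MISS event counts.
// Returns 0.0 when both counts are zero to avoid dividing by zero.
inline double l1d_miss_ratio(std::uint64_t l1_hit, std::uint64_t l1_miss) {
  const std::uint64_t total = l1_hit + l1_miss;
  if (total == 0) return 0.0;
  return static_cast<double>(l1_miss) / static_cast<double>(total);
}
```

For example, 750 hits and 250 misses give a ratio of 0.25.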

Please refer to the links below for more information:

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/cpu-metrics-reference.html#L1-BOUND

https://www.cs.utexas.edu/~pingali/CS377P/2018sp/lectures/vtune-cache-jackson.pdf

The Memory Access analysis type uses hardware event-based sampling to collect data for metrics such as:

• Memory Bound, which shows the fraction of cycles spent waiting due to demand load or store instructions.

• L1 Bound, which shows how often the machine was stalled without missing the L1 data cache.

The Summary window gives the percentage of pipeline slots in each category for the whole application. You can explore results in multiple ways.

The most common way to explore results is to view metrics at the function level.

Most of the metrics under the Memory Bound category identify which level of the memory hierarchy, from the L1 cache through to main memory, is the bottleneck.

Grayed out metric values indicate that the data collected for this metric is unreliable. This may happen, for example, if the number of samples collected for PMU events is too low. You may either ignore this data, or rerun the collection with the data collection time, sampling interval, or workload increased.

Reference : https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/memory-access-analysis.html

If this resolves your issue, please accept this as a solution. This would help others with a similar issue.

Thanks

rbachkaniwala3 · ‎09-14-2023

I think my question was not clear.

I meant to ask how I can get, say, L1D cache misses for a specific piece of code using the ITT API.

Memory Access analysis in VTune gives the hardware counters for the end-to-end run of the program, whereas I want the counters only for a specific piece of code.

Rahila_T_Intel · ‎09-18-2023

Hi,

To focus your performance analysis on a task - program functionality performed by a particular code section - you can use the Intel® VTune™ Profiler ITT API tasks.

• ITT API tasks: Analyze performance of particular code regions (tasks) if your target uses the Task API to mark task regions and you enabled the Analyze user tasks, events, and counters option during the analysis type configuration.

If you only want to focus on a piece of code in your program, mark this section by defining it as a Code Region of Interest. Use the ITT API for this purpose.

1. Register the name of the code region you plan to profile:

__itt_pt_region region = __itt_pt_region_create("region");

2. Mark the target loop in your application with this name:

for(…;…;…)
{
    __itt_mark_pt_region_begin(region);
    /* code to analyze */
    __itt_mark_pt_region_end(region);
}

3. Run the Anomaly Detection analysis.

For more details, please check the link :

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/anomaly-detection-analysis.html

For Task Analysis:

Prerequisites:

• Use the ITT Task API to insert calls in your code and define the tasks.

• Configure your analysis target.

1. Click the Configure Analysis button on the VTune Profiler toolbar (standalone GUI or Visual Studio IDE).

2. Choose the analysis type from the HOW pane.

3. Select the Analyze user tasks, events, and counters option.

4. Click the Start button to run the analysis.

VTune Profiler collects data detecting the marked tasks.

Once the analysis is complete, VTune Profiler displays results in the Summary window, where you can find details such as counters.

References: "Task Analysis" in https://www.intel.com/content/dam/develop/external/us/en/documents/vtune-profiler-user-guide.pdf

https://github.com/intel/ittapi

Thanks

Rahila_T_Intel · ‎09-25-2023

Hi,

We have not heard back from you. Could you please give an update?

Can we go ahead and close the thread?

Thanks

Rahila_T_Intel · ‎10-02-2023

Hi,

We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.

Thanks
