
The APS report is incomplete in Granite Rapids - SP (GNR-SP)

Rajeev_A_Intel
Employee

I installed the latest Intel oneAPI Base Toolkit on a GNR-SP machine with 2 sockets and 86 cores per socket. I ran APS with the syntax below, which I have used on older generations such as SPR. The output shows only memory details and nothing else.

 

aps -r=./vtune python inference.py

 

aps --version gives the following output. It looks like the latest version, which should support GNR-SP.

 

Intel(R) VTune(TM) Profiler 2025.3.0 (build 630104) Command Line Tool
Copyright (C) 2009 Intel Corporation. All rights reserved.

yuzhang3_intel
Moderator

Did you try another sample binary, such as the matrix sample? This is what I get on my system:

sdp@d404e64b801a:~/workspace$ aps ~/intel/vtune/samples/matrix/matrix
vtune: Warning: Memory bandwidth collection requires the sampling driver to be enabled on the system. Disable the "Analyze memory bandwidth" knob to proceed with the analysis or install the sampling driver on the system. See the Sampling Drivers help topic for more details. Note that memory bandwidth collection is not possible if you are profiling inside a virtualized environment.
vtune: Warning: The following events cannot be collected: FRONTEND_RETIRED.LATENCY_GE_4
Addr of buf1 = 0x7faff6dff010
Offs of buf1 = 0x7faff6dff180
Addr of buf2 = 0x7fadf6dfe010
Offs of buf2 = 0x7fadf6dfe1c0
Addr of buf3 = 0x7fabf6dfd010
Offs of buf3 = 0x7fabf6dfd100
Addr of buf4 = 0x7fa9f6dfc010
Offs of buf4 = 0x7fa9f6dfc140
Threads #: 16 requested OpenMP threads
Matrix size: 32768
Using multiply kernel: multiply5
Freq = 3.399998 GHz
Execution time = 12.604 seconds
| Summary information
|--------------------------------------------------------------------
Application : matrix
Report creation date : 2025-05-07 23:07:09
OpenMP threads number per Process: 16
HW Platform : Intel(R) Xeon(R) Processor code named Graniterapids
Frequency : 2.50 GHz
Logical core count per node : 192
Collector type : Driverless Perf system-wide counting
Used statistics : /home/sdp/workspace/aps_result_20250507/d404e64b801a.jf.intel.com
|
| Your application might underutilize the available logical CPU cores
| because of insufficient parallel work, blocking on synchronization, or too much I/O. Perform function or source line-level profiling with tools like Intel(R) VTune(TM) Profiler to discover why the CPU is underutilized.
|
Elapsed Time: 21.69 s
SP GFLOPS: 0.11
DP GFLOPS: 0.00
Average CPU Frequency: 2.69 GHz
IPC Rate: 1.43
Serial Time: 4.65 s 21.43% of Elapsed Time
| The Serial Time of your application is significant. It directly impacts
| application Elapsed Time and scalability.Explore options for parallelization
| with Intel(R) Advisor or algorithm or microarchitecture tuning of the serial
| code with Intel(R) VTune(TM) Profiler.
OpenMP Imbalance: 0.00 s 0.00% of Elapsed Time
Physical Core Utilization: 15.90%
| The metric is below 80% threshold, which may signal a poor physical CPU cores
| utilization caused by: load imbalance, threading runtime overhead, contended
| synchronization, insufficient parallelism, incorrect affinity that utilizes
| logical cores instead of physical cores. Perform threading analysis with tools
| like Intel(R) VTune(TM) Profiler to discover why physical cores are
| underutilized.
Average Physical Core Utilization: 15.29 out of 96 Physical Cores
Memory Stalls: 28.50% of Pipeline Slots
| The metric value can indicate that a significant fraction of execution
| pipeline slots could be stalled due to demand memory load and stores. See the
| second level metrics to define if the application is cache- or DRAM-bound and
| the NUMA efficiency. Use Intel(R) VTune(TM) Profiler Memory Access analysis to
| review a detailed metric breakdown by memory hierarchy, memory bandwidth
| information, and correlation by memory objects.
Cache Stalls: 28.00% of Cycles
| A significant proportion of cycles are spent on data fetches from cache. Use
| Intel(R) VTune(TM) Profiler Memory Access analysis to see if accesses to L2 or
| L3 cache are problematic and consider applying the same performance tuning as
| you would for a cache-missing workload. This may include reducing the data
| working set size, improving data access locality, blocking or partitioning the
| working set to fit in the lower cache levels, or exploiting hardware
| prefetchers.
DRAM Stalls: 1.60% of Cycles
Average DRAM Bandwidth: N/A
| Data for this metric is not collected since it requires system-wide
| performance monitoring. Make sure the sampling driver is properly installed on
| your system:
| https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/current/sep-driver.html.
| Otherwise, enable a driverless Perf-based sampling collection by setting the
| /proc/sys/kernel/perf_even_paranoid value to 0 or less.
Vectorization: 100.00%
Instruction Mix:
Memory Footprint:
Resident: 10525.00 MB
Virtual: 34949.00 MB
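
The "Average DRAM Bandwidth: N/A" note in the report above points at the kernel's perf paranoid setting. A minimal Python sketch for checking it is shown here. It assumes a Linux system, and the helper names are hypothetical. The kernel file is named perf_event_paranoid (the message above spells it perf_even_paranoid), and the "0 or less" threshold is taken from the APS message itself.

```python
from pathlib import Path

# Kernel setting referenced by the APS hint (Linux only).
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def allows_system_wide(level):
    """Threshold from the APS message: a value of 0 or less is required."""
    return level <= 0


def read_paranoid_level(path=PARANOID_PATH):
    """Return the current level as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("Could not read", PARANOID_PATH)
    elif allows_system_wide(level):
        print(f"perf_event_paranoid={level}: driverless system-wide collection should be allowed")
    else:
        print(f"perf_event_paranoid={level}: APS asks for 0 or less")
```

If the value is too high, it can be lowered by an administrator; inside a container the setting is inherited from the host.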

 

 


 

Rajeev_A_Intel
Employee

Initially I was getting only memory details. Then I started getting processor details as well. The only thing still missing is the vectorization breakdown: the report gives 0.10% but no further details. Here is the output I get when I run your command:

devcloud@e8ebd332f5de:~$ aps /opt/intel/oneapi/vtune/latest/samples/en/C++/matrix/matrix
Addr of buf1 = 0x7c796f5ff010
Offs of buf1 = 0x7c796f5ff180
Addr of buf2 = 0x7c796d5fe010
Offs of buf2 = 0x7c796d5fe1c0
Addr of buf3 = 0x7c796b5fd010
Offs of buf3 = 0x7c796b5fd100
Addr of buf4 = 0x7c79695fc010
Offs of buf4 = 0x7c79695fc140
Threads #: 16 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 2.867 seconds
| Summary information
|--------------------------------------------------------------------
Application : matrix
Report creation date : 2025-05-08 06:14:24
HW Platform : Intel(R) Xeon(R) Processor code named Graniterapids
Frequency : 2.00 GHz
Logical core count per node : 344
Collector type : Driverless Perf system-wide counting
Used statistics : /home/devcloud/aps_result_20250508/e8ebd332f5de
|
| Your application might underutilize the available logical CPU cores
| because of insufficient parallel work, blocking on synchronization, or too much I/O. Perform function or source line-level profiling with tools like Intel(R) VTune(TM) Profiler to discover why the CPU is underutilized.
|
Elapsed Time: 2.91 s
SP GFLOPS: 0.00
DP GFLOPS: 5.93
Average CPU Frequency: 2.83 GHz
IPC Rate: 0.64
| The IPC value may be too low.
| This could be caused by issues such as memory stalls, instruction starvation,
| branch misprediction or long latency instructions.
| Use Intel(R) VTune(TM) Profiler Microarchitecture Exploration analysis to
| specify particular reasons of low IPC.
Physical Core Utilization: 10.00%
| The metric is below 80% threshold, which may signal a poor physical CPU cores
| utilization caused by: load imbalance, threading runtime overhead, contended
| synchronization, insufficient parallelism, incorrect affinity that utilizes
| logical cores instead of physical cores. Perform threading analysis with tools
| like Intel(R) VTune(TM) Profiler to discover why physical cores are
| underutilized.
Average Physical Core Utilization: 17.20 out of 172 Physical Cores
Memory Stalls: 64.90% of Pipeline Slots
| The metric value can indicate that a significant fraction of execution
| pipeline slots could be stalled due to demand memory load and stores. See the
| second level metrics to define if the application is cache- or DRAM-bound and
| the NUMA efficiency. Use Intel(R) VTune(TM) Profiler Memory Access analysis to
| review a detailed metric breakdown by memory hierarchy, memory bandwidth
| information, and correlation by memory objects.
Cache Stalls: 64.60% of Cycles
| A significant proportion of cycles are spent on data fetches from cache. Use
| Intel(R) VTune(TM) Profiler Memory Access analysis to see if accesses to L2 or
| L3 cache are problematic and consider applying the same performance tuning as
| you would for a cache-missing workload. This may include reducing the data
| working set size, improving data access locality, blocking or partitioning the
| working set to fit in the lower cache levels, or exploiting hardware
| prefetchers.
DRAM Stalls: 0.70% of Cycles
DRAM Bandwidth
Peak: 1.20 GB/s
Average: 2.00 GB/s
Bound: 0.00%
Vectorization: 0.10%
Instruction Mix:
Memory Footprint:
Resident: 68.00 MB
Virtual: 412.00 MB

Graphical representation of this data is available in the HTML report: /home/devcloud/aps_report_20250508_061430.html
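
As a sanity check on the two reports in this thread, the core counts and the Physical Core Utilization percentage can be reproduced with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the formula (average cores used divided by physical cores) is inferred from the printed numbers rather than from APS documentation, and 2 hardware threads per core is an assumption.

```python
def logical_cores(sockets, cores_per_socket, threads_per_core=2):
    """Logical core count; 2 threads per core is an assumption (Hyper-Threading on)."""
    return sockets * cores_per_socket * threads_per_core


def core_utilization_percent(avg_cores_used, physical_cores):
    """Assumed relation: utilization = average cores used / physical cores."""
    return 100.0 * avg_cores_used / physical_cores


if __name__ == "__main__":
    # Machine from the original post: 2 sockets x 86 cores.
    print("physical cores:", 2 * 86)
    print("logical cores:", logical_cores(2, 86))

    # (average physical cores used, physical cores) as printed in each report.
    reports = {
        "first report": (15.29, 96),
        "second report": (17.20, 172),
    }
    for label, (avg_used, total) in reports.items():
        print(f"{label}: {core_utilization_percent(avg_used, total):.2f}%")
```

The second report's values reproduce the printed 10.00% and the 344 logical cores. The first report's values give about 15.93% against the printed 15.90%, which is presumably a rounding difference inside the tool.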

yuzhang3_intel
Moderator

I rebuilt my matrix sample with vectorization enabled (I added -xHost to the Makefile), so check how your matrix sample is built.
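
With the Intel compilers, -xHost targets the highest instruction set available on the build machine, so it can help to see which SIMD extensions that machine advertises. The following is a rough Python sketch, assuming a Linux x86 system that exposes /proc/cpuinfo. The function name and the short list of flags are illustrative choices, not part of VTune or APS.

```python
# A few SIMD-related CPU flags to look for; the list is illustrative, not exhaustive.
SIMD_FLAGS = ("sse4_2", "avx", "avx2", "avx512f", "amx_tile")


def simd_flags_from_cpuinfo(cpuinfo_text):
    """Return the SIMD_FLAGS entries found on the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            present = set(line.split(":", 1)[1].split())
            return [flag for flag in SIMD_FLAGS if flag in present]
    return []


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print("SIMD flags on this host:", simd_flags_from_cpuinfo(fh.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

This only shows what the host supports. Whether the matrix binary actually uses those instructions depends on how it was compiled, which is the point of rebuilding it with -xHost and rerunning APS.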
