
GNR-AP MEM BW result in VTune 2025.0 is inconsistent with other tools

CasperYoon
Beginner

Hi.


I'm doing workload performance analysis on a GNR-AP system, and the memory bandwidth (MEM BW) results from VTune 2025.0 are not consistent with other tools (Intel MLC, PCM).


(Please understand that I can't attach any pictures or logs due to my company's security policy)

 

The specifications of my system are as follows:
CPU - Intel GNR-AP, 72 cores (1 socket)
Mem - DDR5 5600 MHz, 1.5 TB (128 GB x 12 memory channels)

This is the MEM BW of the system (in HEX clustering mode) measured by Intel MLC.

 


```
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --max_bandwidth

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes

Measuring Maximum Memory Bandwidths for the system
Will take several minutes to complete as multiple injection rates will be tried to get the best bandwidth
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 486045.92
3:1 Reads-Writes : 422865.38
2:1 Reads-Writes : 410721.92
1:1 Reads-Writes : 400311.51
Stream-triad like: 399477.00
```

In SNC3 mode, the overall BW was measured at 489 GB/s, intra-node BW at 164 GB/s, and inter-node BW at 145 GB/s.
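
For reference, the intra-/inter-node figures are from MLC's node-to-node matrix measurement; something like the following (a standard MLC option, run as root) should reproduce them:

```
# Prints a NUMA-node x NUMA-node read bandwidth matrix in MB/s
mlc --bandwidth_matrix
```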

 

1)

When measuring the peak memory bandwidth with 'vtune -collect memory-access -d 20', VTune reports the maximum bandwidth of the system as 478 GB/s in HEX clustering mode.


```
Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -------  ---------------------------------------------
DRAM, GB/sec                   478             4.600    0.000  0.0%
Collection and Platform Info
User Name: root
Operating System: 6.8.0-47-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
Computer Name: -
Result Size: 51.1 MB
Collection start time: -
Collection stop time: -
Collector Type: Event-based sampling driver
CPU
Name: Intel(R) Xeon(R) Processor code named Graniterapids
Frequency: 2.400 GHz
Logical CPU Count: 144
Max DRAM Single-Package Bandwidth: 478.000 GB/s
LLC size: 453.0 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available

```

 

However, in SNC3 mode, VTune reports the peak memory bandwidth as 167 GB/s, which is roughly the maximum bandwidth of a single sub-NUMA node (~164 GB/s per MLC).

 

```
Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -------  ---------------------------------------------
DRAM, GB/sec                   167             0.700    0.000  0.0%
Collection and Platform Info
User Name: root
Operating System: 6.8.0-47-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
Computer Name: -
Result Size: 78.0 MB
Collection start time: -
Collection stop time: -
Collector Type: Event-based sampling driver
CPU
Name: Intel(R) Xeon(R) Processor code named Graniterapids
Frequency: 2.400 GHz
Logical CPU Count: 144
Max DRAM Single-Package Bandwidth: 167.000 GB/s
LLC size: 453.0 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available

```

 

In SNC3 mode, this happens consistently, even when I explicitly pass '-knob dram-bandwidth-limits=true'.

Is there any way to get VTune to report the full-system peak MEM BW in SNC3 mode?
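
For completeness, this is roughly what I run in SNC3 mode; the numactl/lscpu checks are only to confirm the OS exposes all three sub-NUMA nodes, and the knob should already be the default for this analysis type:

```
# Confirm all three SNC3 sub-NUMA nodes are visible to the OS
numactl --hardware
lscpu | grep -i numa

# Memory Access analysis with the peak-bandwidth pre-measurement explicitly enabled
vtune -collect memory-access -knob dram-bandwidth-limits=true -d 20
```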

 

2)

When analyzing the workload results with VTune, the timeline in the Platform window shows only 8 of the 12 memory channels (please understand that I am not able to upload images).
VTune also calculates the overall memory bandwidth by summing only those 8 channels. This happens in both SNC3 mode and HEX mode.

 

```
|---------------------------------------|
|-- Socket 0 --|
|---------------------------------------|
|-- Memory Channel Monitoring --|
|---------------------------------------|
|-- Mem Ch 0: Reads (MB/s): 144.74 --|
|-- Writes(MB/s): 143.65 --|
|-- Mem Ch 1: Reads (MB/s): 144.56 --|
|-- Writes(MB/s): 143.47 --|
|-- Mem Ch 2: Reads (MB/s): 128.43 --|
|-- Writes(MB/s): 127.38 --|
|-- Mem Ch 3: Reads (MB/s): 128.68 --|
|-- Writes(MB/s): 127.63 --|
|-- Mem Ch 4: Reads (MB/s): 138.10 --|
|-- Writes(MB/s): 137.03 --|
|-- Mem Ch 5: Reads (MB/s): 137.49 --|
|-- Writes(MB/s): 136.43 --|
|-- Mem Ch 6: Reads (MB/s): 144.95 --|
|-- Writes(MB/s): 143.89 --|
|-- Mem Ch 7: Reads (MB/s): 145.13 --|
|-- Writes(MB/s): 144.10 --|
|-- Mem Ch 8: Reads (MB/s): 128.89 --|
|-- Writes(MB/s): 127.82 --|
|-- Mem Ch 9: Reads (MB/s): 129.06 --|
|-- Writes(MB/s): 128.02 --|
|-- Mem Ch 10: Reads (MB/s): 137.68 --|
|-- Writes(MB/s): 136.61 --|
|-- Mem Ch 11: Reads (MB/s): 137.57 --|
|-- Writes(MB/s): 136.42 --|
|-- SKT 0 Mem Read (MB/s) : 1645.30 --|
|-- SKT 0 Mem Write(MB/s) : 1632.47 --|
|-- SKT 0 NM hit rate: 0.00 --|
|-- SKT 0 NM hits (M/s): 0.00 --|
|-- SKT 0 NM misses (M/s): 0.00 --|
|-- SKT 0 NM miss Bw(MB/s): 0.00 --|
|-- SKT 0 Memory (MB/s): 3277.77 --|
|---------------------------------------|
|---------------------------------------||---------------------------------------|
|-- System Read Throughput(MB/s): 1645.30 --|
|-- System Write Throughput(MB/s): 1632.47 --|
|-- System Memory Throughput(MB/s): 3277.77 --|
|---------------------------------------||---------------------------------------|

```

 

When analyzing the same workload with Intel PCM (pcm-memory), traffic appears on all 12 channels, as shown above; VTune seems to be missing the results for SKT0 Ch 4, 5, 10, and 11.

 


Is there any way to see the results of all 12 memory channels, or is this not supported yet?
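
For reference, the PCM cross-check is simply something like the following (needs root / MSR access; the 1-second refresh interval is arbitrary):

```
# Per-channel DRAM read/write traffic, refreshed every second
sudo pcm-memory 1
```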

 

Thank you in advance!

yuzhang3_intel
Moderator

It looks like there are issues with the memory-access analysis. I want to double-check a couple of things here:

1. In SNC3 mode, VTune shows a maximum DDR bandwidth of 478 GB/s and 167 GB/s in two measurements using the same command line:

vtune -collect memory-access -d 20

 

Is there also a consistency issue in HEX mode? I ran it multiple times and it was always consistent.

 

2. I can also reproduce the second issue: only the results of 8 channels are summed up.

CasperYoon
Beginner

First, I'm attaching the logs for MLC, 'vtune -collect memory-access -d 20', and the actual workload under VTune, in both HEX clustering mode and SNC3 mode.

 

HEX - mlc
```
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 487698.1
3:1 Reads-Writes : 422044.3
2:1 Reads-Writes : 410408.2
1:1 Reads-Writes : 398730.3
Stream-triad like: 398700.7

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        487400.8
```

HEX - vtune -collect memory-access -d 20
```
Elapsed Time: 20.002s
CPU Time: 12.915s
Memory Bound: 14.0% of Pipeline Slots
L1 Bound: 0.0% of Clockticks
L2 Bound: 3.2% of Clockticks
L3 Bound: 9.6% of Clockticks
DRAM Bound: 0.0% of Clockticks
DRAM Bandwidth Bound: 0.0% of Elapsed Time
Store Bound: 0.0% of Clockticks
NUMA: % of Remote Accesses: 0.0%
UPI Utilization Bound: 0.0% of Elapsed Time
Loads: 5,054,651,635
Stores: 3,168,095,040
LLC Miss Count: 0
Local Memory Access Count: 0
Remote Memory Access Count: 0
Remote Cache Access Count: 0
Average Latency (cycles): 9
Total Thread Count: 843
Paused Time: 0s

Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -------  ---------------------------------------------
DRAM, GB/sec                   479             5.000    0.000  0.0%
Collection and Platform Info
User Name: -
Operating System: 6.8.0-47-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
Computer Name: -
Result Size: 51.7 MB
Collection start time: 23:08:37 22/11/2023 UTC
Collection stop time: 23:08:57 22/11/2023 UTC
Collector Type: Event-based sampling driver
CPU
Name: Intel(R) Xeon(R) Processor code named Graniterapids
Frequency: 2.400 GHz
Logical CPU Count: 144
Max DRAM Single-Package Bandwidth: 479.000 GB/s
LLC size: 453.0 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available

```

HEX - actual workload
```
Elapsed Time: 9.959s
CPU Time: 1038.055s
Memory Bound: 81.8% of Pipeline Slots
| The metric value is high. This may indicate that a significant fraction
| of execution pipeline slots could be stalled due to demand memory load
| and stores. Explore the metric breakdown by memory hierarchy, memory
| bandwidth information, and correlation by memory objects.
|
L1 Bound: 10.2% of Clockticks
| This metric shows how often machine was stalled without missing the
| L1 data cache. The L1 cache typically has the shortest latency.
| However, in certain cases like loads blocked on older stores, a load
| might suffer a high latency even though it is being satisfied by the
| L1.
|
L2 Bound: 0.7% of Clockticks
L3 Bound: 36.5% of Clockticks
| This metric shows how often CPU was stalled on L3 cache, or contended
| with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
| improves the latency and increases performance.
|
DRAM Bound: 39.4% of Clockticks
| This metric shows how often CPU was stalled on the main memory
| (DRAM). Caching typically improves the latency and increases
| performance.
|
DRAM Bandwidth Bound: 0.0% of Elapsed Time
Store Bound: 0.8% of Clockticks
NUMA: % of Remote Accesses: 0.0%
UPI Utilization Bound: 0.0% of Elapsed Time
Loads: 187,671,129,965
Stores: 50,469,514,040
LLC Miss Count: 5,755,416,260
Local Memory Access Count: 5,412,378,840
Remote Memory Access Count: 0
Remote Cache Access Count: 0
Average Latency (cycles): 158
Total Thread Count: 271
Paused Time: 2.225s

Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -------  ---------------------------------------------
DRAM, GB/sec                   479           320.600  206.601  0.0%

Collection and Platform Info
Application Command Line: -
User Name: -
Operating System: -
Computer Name: -
Result Size: 1.3 GB
Collection start time: 23:12:09 22/11/2023 UTC
Collection stop time: 23:12:19 22/11/2023 UTC
Collector Type: Event-based sampling driver
CPU
Name: Intel(R) Xeon(R) Processor code named Graniterapids
Frequency: 2.400 GHz
Logical CPU Count: 144
Max DRAM Single-Package Bandwidth: 479.000 GB/s
LLC size: 453.0 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available

```

SNC3 - mlc
```
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 505254.0
3:1 Reads-Writes : 426340.0
2:1 Reads-Writes : 411557.5
1:1 Reads-Writes : 397258.7
Stream-triad like: 411275.3

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0           1           2
       0        168410.2    140504.9    137382.8
       1        141612.5    168870.4    142064.9
       2        141380.8    141597.9    168472.0
```

SNC3 - vtune -collect memory-access -d 20
```
Elapsed Time: 20.002s
CPU Time: 1.545s
Memory Bound: 30.1% of Pipeline Slots
| The metric value is high. This may indicate that a significant fraction
| of execution pipeline slots could be stalled due to demand memory load
| and stores. Explore the metric breakdown by memory hierarchy, memory
| bandwidth information, and correlation by memory objects.
|
L1 Bound: 0.0% of Clockticks
L2 Bound: 19.4% of Clockticks
| This metric shows how often machine was stalled on L2 cache. Avoiding
| cache misses (L1 misses/L2 hits) will improve the latency and
| increase performance.
|
L3 Bound: 19.4% of Clockticks
| This metric shows how often CPU was stalled on L3 cache, or contended
| with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
| improves the latency and increases performance.
|
DRAM Bound: 0.0% of Clockticks
DRAM Bandwidth Bound: 0.0% of Elapsed Time
Store Bound: 0.0% of Clockticks
NUMA: % of Remote Accesses: 0.0%
UPI Utilization Bound: 0.0% of Elapsed Time
Loads: 220,006,600
Stores: 99,002,970
LLC Miss Count: 0
Local Memory Access Count: 0
Remote Memory Access Count: 0
Remote Cache Access Count: 0
Total Thread Count: 93
Paused Time: 0s

Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -------  ---------------------------------------------
DRAM, GB/sec                   167             4.200    2.916  0.0%
Collection and Platform Info
User Name: -
Operating System: 6.8.0-47-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
Computer Name: gnr-ap-calab-rag
Result Size: 36.0 MB
Collection start time: 21:09:36 21/11/2023 UTC
Collection stop time: 21:09:56 21/11/2023 UTC
Collector Type: Event-based sampling driver
CPU
Name: Intel(R) Xeon(R) Processor code named Graniterapids
Frequency: 2.400 GHz
Logical CPU Count: 144
Max DRAM Single-Package Bandwidth: 167.000 GB/s
LLC size: 453.0 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available

```

SNC3 - actual workload
```
Elapsed Time: 13.601s
CPU Time: 1281.700s
Memory Bound: 78.5% of Pipeline Slots
| The metric value is high. This may indicate that a significant fraction
| of execution pipeline slots could be stalled due to demand memory load
| and stores. Explore the metric breakdown by memory hierarchy, memory
| bandwidth information, and correlation by memory objects.
|
L1 Bound: 16.2% of Clockticks
| This metric shows how often machine was stalled without missing the
| L1 data cache. The L1 cache typically has the shortest latency.
| However, in certain cases like loads blocked on older stores, a load
| might suffer a high latency even though it is being satisfied by the
| L1.
|
L2 Bound: 2.0% of Clockticks
L3 Bound: 44.2% of Clockticks
| This metric shows how often CPU was stalled on L3 cache, or contended
| with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
| improves the latency and increases performance.
|
DRAM Bound: 22.2% of Clockticks
| This metric shows how often CPU was stalled on the main memory
| (DRAM). Caching typically improves the latency and increases
| performance.
|
DRAM Bandwidth Bound: 23.5% of Elapsed Time
| The system spent much time heavily utilizing DRAM bandwidth.
| Improve data accesses to reduce cacheline transfers from/to
| memory using these possible techniques: 1) consume all bytes of
| each cacheline before it is evicted (for example, reorder
| structure elements and split non-hot ones); 2) merge compute-
| limited and bandwidth-limited loops; 3) use NUMA optimizations on
| a multi-socket system. Note: software prefetches do not help a
| bandwidth-limited application. Run Memory Access analysis to
| identify data structures to be allocated in High Bandwidth Memory
| (HBM), if available.
|
Store Bound: 1.8% of Clockticks
NUMA: % of Remote Accesses: 0.0%
UPI Utilization Bound: 0.0% of Elapsed Time
Loads: 188,215,646,300
Stores: 48,043,941,275
LLC Miss Count: 5,362,001,095
Local Memory Access Count: 1,193,583,545
Remote Memory Access Count: 0
Remote Cache Access Count: 0
Average Latency (cycles): 117
Total Thread Count: 271
Paused Time: 2.803s

Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -------  ---------------------------------------------
DRAM, GB/sec                   316           315.100  149.858  23.5%

Collection and Platform Info
Application Command Line: -
User Name: -
Operating System: 6.8.0-47-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
Computer Name: -
Result Size: 1.5 GB
Collection start time: 21:20:31 21/11/2023 UTC
Collection stop time: 21:20:45 21/11/2023 UTC
Collector Type: Event-based sampling driver
CPU
Name: Intel(R) Xeon(R) Processor code named Graniterapids
Frequency: 2.400 GHz
Logical CPU Count: 144
Max DRAM Single-Package Bandwidth: 316.000 GB/s
LLC size: 453.0 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available

```

 

1)
The term "inconsistency" I mentioned refers to the difference between HEX and SNC3 modes. I want to measure the same MEM BW in VTune, regardless of the clustering mode.

 

In HEX mode I could see the platform peak BW of 479 GB/s, but in SNC3 mode I could not.


Also, in SNC3 mode the platform peak BW differs depending on the workload: the platform peak BW for 'vtune -collect memory-access -d 20' was 167 GB/s, while the platform peak BW for the actual workload was 316 GB/s. This suggests that VTune was measuring the BW of only one or two sub-NUMA nodes.
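
As a rough sanity check (my own arithmetic, not from the logs): ~489 GB/s split across three sub-NUMA nodes is ~163 GB/s per node, so 167 GB/s matches one node, and 316 GB/s is close to two nodes (2 x 163 ≈ 326 GB/s).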

 

2)
Yes. The actual-workload result in HEX mode shows that the observed max BW (320 GB/s) is a sum of only 8 of the 12 channels.

With Intel PCM, I measured around 480 GB/s for the same workload.


Should I change some setting so that all 12 channel results are summed, or is this a bug?

yuzhang3_intel
Moderator

Thanks for reporting the issues. We need to do a deeper analysis and will give an update later. Thanks.
