Advisor Roofline for HBM

Julius_
Beginner

I am trying to get a roofline model for an application running on HBM memory, using the advixe-gui.

I launch the binary through numactl and bind its memory to the HBM NUMA nodes.

However, the Roofline chart shows only the DRAM bandwidth; the HBM bandwidth is missing. Is there a way to add the HBM bandwidth to the plot as well?

 

 

JaideepK_Intel
Employee

Hi,

 

Thank you for posting in Intel communities.

 

We see the same behavior on our end. One possible workaround is to use Intel VTune Profiler, which can display HBM metrics; screenshots are attached for your reference. For more information, see the link below.

 

We ran the Intel VTune Memory Access analysis from the command line with the following command:

vtune -c memory-access -- <path to the binary/exe>

Link to download VTune: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html

 

Link: https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-2/profiling-hbm-performance-on-intel-max.html

[Screenshots: JaideepK_Intel_0-1697017330155.png, JaideepK_Intel_1-1697017361100.png]

 

If this resolves your issue, please accept this as a solution. This would help others with a similar issue. Thank you!

 

Regards,

Jaideep

 

Julius_
Beginner

Hello Jaideep,

 

Thank you for your help!

However, no matter what I try, I am not getting any details about HBM usage from Intel VTune either, neither in the CLI output nor in the platform diagram in the GUI. I closely followed the second article you posted.

I am using numactl to specify the NUMA nodes, and the performance of the STREAM benchmark shows that the right nodes are being used. Sadly, this is not reflected anywhere in VTune.

 

I also double-checked the NUMA configuration (Flat mode) and everything seems fine. I am using the most recent version, 2023.2.0 (build 626047).

 

Thanks for any further help!

 

JaideepK_Intel
Employee

Hi,


I hope you are doing well.


To understand your issue better, could you please provide the following details?

  1. Operating system and processor details (on Linux, please include the kernel version as well)
  2. A sample reproducer, i.e. an exact replica of the sample you are running.
  3. The result directories of Intel Advisor and Intel VTune, with screenshots.


Regards,

Jaideep


Julius_
Beginner

Thanks for getting back to me.

For demo purposes I am trying to analyze STREAM provided by Intel: https://github.com/intel/memory-bandwidth-benchmarks

I compile the binary using `make cpu=avx512`

System information provided by `run.sh`:

CPU Model =  Intel (R) Xeon (R) CPU Max 9468

Sockets/Cores/Threads:
	num_sockets          = 2
	num_cores_total      = 96
	num_cores_per_socket = 48
	num_threads_per_core = 2
	Hyper-Threading      = true

NUMA:
	num_numa_domains            = 16
	num_numa_domains_per_socket = 8
	num_cores_per_numa_domain   = 12

Memory = 397.43 GB

CPU Caches:
	L1_cache = 48K (12-way)
	L2_cache = 2048K (16-way)
	L3_cache = 107520K (15-way)
	L3_cache_per_sock = 107520 KB
	L3_cache_per_core = 2240 KB

OS:
Operating System       = AlmaLinux 9.2 (Turquoise Kodkod)
Kernel version         = 5.14.0-284.11.1.el9_2.x86_64
CPU Turbo Boost        = enabled
CPU Scaling Governor   = performance
CPU Scaling Driver     = intel_pstate
Transparent Huge Pages = enabled

ICC version = icc (ICC) 2021.10.0 20230609
Target ISA  = avx512f

Output of `numactl -H`:

$ numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 96 97 98 99 100 101 102 103 104 105 106 107
node 0 size: 31343 MB
node 0 free: 6817 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 108 109 110 111 112 113 114 115 116 117 118 119
node 1 size: 32250 MB
node 1 free: 17129 MB
node 2 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 120 121 122 123 124 125 126 127 128 129 130 131
node 2 size: 32211 MB
node 2 free: 31477 MB
node 3 cpus: 36 37 38 39 40 41 42 43 44 45 46 47 132 133 134 135 136 137 138 139 140 141 142 143
node 3 size: 32250 MB
node 3 free: 30986 MB
node 4 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 144 145 146 147 148 149 150 151 152 153 154 155
node 4 size: 32250 MB
node 4 free: 29574 MB
node 5 cpus: 60 61 62 63 64 65 66 67 68 69 70 71 156 157 158 159 160 161 162 163 164 165 166 167
node 5 size: 32250 MB
node 5 free: 30761 MB
node 6 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 168 169 170 171 172 173 174 175 176 177 178 179
node 6 size: 32250 MB
node 6 free: 30963 MB
node 7 cpus: 84 85 86 87 88 89 90 91 92 93 94 95 180 181 182 183 184 185 186 187 188 189 190 191
node 7 size: 32235 MB
node 7 free: 30947 MB
node 8 cpus:
node 8 size: 16384 MB
node 8 free: 3063 MB
node 9 cpus:
node 9 size: 16384 MB
node 9 free: 12374 MB
node 10 cpus:
node 10 size: 16384 MB
node 10 free: 16373 MB
node 11 cpus:
node 11 size: 16384 MB
node 11 free: 16376 MB
node 12 cpus:
node 12 size: 16384 MB
node 12 free: 16143 MB
node 13 cpus:
node 13 size: 16384 MB
node 13 free: 16364 MB
node 14 cpus:
node 14 size: 16384 MB
node 14 free: 15932 MB
node 15 cpus:
node 15 size: 16384 MB
node 15 free: 16356 MB
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
  0:  10  12  12  12  21  21  21  21  13  14  14  14  23  23  23  23 
  1:  12  10  12  12  21  21  21  21  14  13  14  14  23  23  23  23 
  2:  12  12  10  12  21  21  21  21  14  14  13  14  23  23  23  23 
  3:  12  12  12  10  21  21  21  21  14  14  14  13  23  23  23  23 
  4:  21  21  21  21  10  12  12  12  23  23  23  23  13  14  14  14 
  5:  21  21  21  21  12  10  12  12  23  23  23  23  14  13  14  14 
  6:  21  21  21  21  12  12  10  12  23  23  23  23  14  14  13  14 
  7:  21  21  21  21  12  12  12  10  23  23  23  23  14  14  14  13 
  8:  13  14  14  14  23  23  23  23  10  14  14  14  23  23  23  23 
  9:  14  13  14  14  23  23  23  23  14  10  14  14  23  23  23  23 
 10:  14  14  13  14  23  23  23  23  14  14  10  14  23  23  23  23 
 11:  14  14  14  13  23  23  23  23  14  14  14  10  23  23  23  23 
 12:  23  23  23  23  13  14  14  14  23  23  23  23  10  14  14  14 
 13:  23  23  23  23  14  13  14  14  23  23  23  23  14  10  14  14 
 14:  23  23  23  23  14  14  13  14  23  23  23  23  14  14  10  14 
 15:  23  23  23  23  14  14  14  13  23  23  23  23  14  14  14  10
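
In the listing above, nodes 8-15 have no CPUs and 16384 MB each. The following is a minimal sketch, assuming (the thread does not confirm this) that the CPU-less nodes are the HBM nodes in flat mode, of how such nodes could be extracted from `numactl -H` output to build a memory-binding argument. The function name `cpuless_nodes` and the shortened `sample` text are illustrative only.

```python
import re

def cpuless_nodes(numactl_output):
    """Return the IDs of NUMA nodes whose 'cpus:' line lists no CPUs."""
    nodes = []
    for line in numactl_output.splitlines():
        # Matches lines such as "node 8 cpus:" or "node 0 cpus: 0 1 2 3".
        m = re.match(r"node (\d+) cpus:(.*)$", line.strip())
        if m and not m.group(2).strip():
            nodes.append(int(m.group(1)))
    return nodes

# Shortened, made-up sample in the same format as `numactl -H`.
sample = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 31343 MB
node 1 cpus: 4 5 6 7
node 1 size: 32250 MB
node 2 cpus:
node 2 size: 16384 MB
node 3 cpus:
node 3 size: 16384 MB
"""

hbm_nodes = cpuless_nodes(sample)                # assumption: CPU-less == HBM
membind = ",".join(str(n) for n in hbm_nodes)    # value for `numactl -m`
print(membind)
```

Applied to the full output above, the same function would return nodes 8 through 15.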

VTune configuration:

[Screenshot: Bildschirmfoto 2023-10-18 um 14.54.25.png]

Thanks for the help!

JaideepK_Intel
Employee

Hi,

 

Thank you for sharing the details. We tried the same from our end.

 

Our environment: Intel (R) Xeon (R) CPU Max 9480

Output from run.sh:

 

 

CPU Model = Intel (R) Xeon (R) CPU Max 9480
Sockets/Cores/Threads:
    num_sockets     = 2
    num_cores_total   = 112
    num_cores_per_socket = 56
    num_threads_per_core = 2
    Hyper-Threading   = true
NUMA:
    num_numa_domains      = 8
    num_numa_domains_per_socket = 4
    num_cores_per_numa_domain  = 14
Memory = 1056.38 GB
CPU Caches:
    L1_cache = 48K (12-way)
    L2_cache = 2048K (16-way)
    L3_cache = 115200K (15-way)
    L3_cache_per_sock = 115192 KB
    L3_cache_per_core = 2057 KB
OS:
Operating System    = Ubuntu 20.04.6 LTS
Kernel version     = 5.15.0-52-generic
CPU Turbo Boost    = enabled
CPU Scaling Governor  = powersave
CPU Scaling Driver   = intel_pstate
Transparent Huge Pages = disabled
ICC version = icc (ICC) 2021.10.0 20230609
Target ISA = avx512f

 

 

Output of numactl -H:

 

 

available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 112 113 114 115 116 117 118 119 120 121 122 123 124 125
node 0 size: 128547 MB
node 0 free: 123812 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 126 127 128 129 130 131 132 133 134 135 136 137 138 139
node 1 size: 129017 MB
node 1 free: 128408 MB
node 2 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 140 141 142 143 144 145 146 147 148 149 150 151 152 153
node 2 size: 128981 MB
node 2 free: 128382 MB
node 3 cpus: 42 43 44 45 46 47 48 49 50 51 52 53 54 55 154 155 156 157 158 159 160 161 162 163 164 165 166 167
node 3 size: 129017 MB
node 3 free: 128392 MB
node 4 cpus: 56 57 58 59 60 61 62 63 64 65 66 67 68 69 168 169 170 171 172 173 174 175 176 177 178 179 180 181
node 4 size: 129017 MB
node 4 free: 128396 MB
node 5 cpus: 70 71 72 73 74 75 76 77 78 79 80 81 82 83 182 183 184 185 186 187 188 189 190 191 192 193 194 195
node 5 size: 129017 MB
node 5 free: 128428 MB
node 6 cpus: 84 85 86 87 88 89 90 91 92 93 94 95 96 97 196 197 198 199 200 201 202 203 204 205 206 207 208 209
node 6 size: 129017 MB
node 6 free: 128243 MB
node 7 cpus: 98 99 100 101 102 103 104 105 106 107 108 109 110 111 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 7 size: 129006 MB
node 7 free: 128017 MB
node distances:
node  0  1  2  3  4  5  6  7
 0: 10 12 12 12 21 21 21 21
 1: 12 10 12 12 21 21 21 21
 2: 12 12 10 12 21 21 21 21
 3: 12 12 12 10 21 21 21 21
 4: 21 21 21 21 10 12 12 12
 5: 21 21 21 21 12 10 12 12
 6: 21 21 21 21 12 12 10 12
 7: 21 21 21 21 12 12 12 10

 

 

Everything appears to run fine on our end; screenshots are attached for reference.

The command we used:

 

 

vtune -c memory-access -- /usr/bin/numactl -m 0-5 ./stream_avx512.bin
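
As a minimal sketch, the command above can be assembled for a different memory binding as follows. The helper name `vtune_memory_access_cmd` is hypothetical; only the options already shown in this thread (`-c memory-access`, `numactl -m`) are used. The `"8-15"` binding in the second call is an assumption based on the CPU-less nodes in the earlier `numactl -H` listing, not a confirmed HBM node range.

```python
import shlex

def vtune_memory_access_cmd(binary, mem_nodes, numactl="/usr/bin/numactl"):
    """Build a VTune Memory Access command that runs `binary` under numactl -m."""
    return ["vtune", "-c", "memory-access", "--",
            numactl, "-m", mem_nodes, binary]

# Reproduces the command quoted above.
cmd = vtune_memory_access_cmd("./stream_avx512.bin", "0-5")
print(shlex.join(cmd))

# Assumed binding for a flat-mode system whose CPU-less nodes are 8-15.
cmd_flat = vtune_memory_access_cmd("./stream_avx512.bin", "8-15")
print(shlex.join(cmd_flat))
```

The list form can be passed directly to `subprocess.run` if the command should be launched from a script.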

 

 

JaideepK_Intel_4-1697726212123.png

JaideepK_Intel_3-1697726145466.png

JaideepK_Intel_2-1697725781645.png

To help us understand the issue better: are you referring to HBM Bandwidth Bound and NUMA % Remote Accesses?

 

If this resolves your issue, please accept this as a solution. This would help others with a similar issue. Thank you!

 

Thanks,

Jaideep

 

 

 

 

 

JaideepK_Intel
Employee

Hi,


If this resolves your issue, please accept this as a solution. This would help others with a similar issue. Thank you!


Regards,

Jaideep


JaideepK_Intel
Employee

Hi,


We haven't heard back from you. Is there any update on the issue from your end?


Thanks,

Jaideep


Julius_
Beginner

Thank you for looking into that!

So far I have not had any success gathering information about the HBM bandwidth. Could there be an issue with the kernel I am using, 5.14.0-284.11.1.el9_2.x86_64?

 

It seems that the HBM PMUs in /sys/bus/event_source/devices/ have generic names such as "uncore_type_14" rather than names such as "uncore_hbm_ID".
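
A minimal sketch of how the PMU device names could be inspected: it separates uncore PMUs with generic `uncore_type_N` style names from those with descriptive names. The helper names and the sample list are illustrative assumptions; which names actually appear depends on the kernel, and this check by itself does not prove that HBM counters are usable by VTune.

```python
import os
import re

SYSFS_PMU_DIR = "/sys/bus/event_source/devices"

def classify_uncore_pmus(names):
    """Split PMU names into (descriptively named, generic 'uncore_type_N[_M]')."""
    generic = sorted(n for n in names
                     if re.fullmatch(r"uncore_type_\d+(_\d+)?", n))
    named = sorted(n for n in names
                   if n.startswith("uncore_") and n not in generic)
    return named, generic

def list_pmus(path=SYSFS_PMU_DIR):
    """Return PMU device names from sysfs, or an empty list if unavailable."""
    return os.listdir(path) if os.path.isdir(path) else []

# Made-up sample names for illustration; on a real system use list_pmus().
sample_names = ["cpu", "uncore_imc_0", "uncore_cha_3",
                "uncore_type_14_0", "uncore_type_14_1"]
named, generic = classify_uncore_pmus(sample_names)
print("named:  ", named)
print("generic:", generic)
```

If the generic list is non-empty, the kernel exposes those units without descriptive names, which matches the observation above.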

Support for Sapphire Rapids seems to have been introduced in this patch: https://lore.kernel.org/all/162547162722.395.795111830712921025.tip-bot2@tip-bot2/T/

 

So my question is: which minimum kernel version do you recommend for use with Sapphire Rapids?

 

Thank you so much for your help so far!

JaideepK_Intel
Employee

Hi,


The results we posted above were collected on the Intel DevCloud platform. We are working on this and will get back to you with an update.


Thanks,

Jaideep


JaideepK_Intel
Employee

Hi,


I hope you are doing well.


Can you please share the following information?

  • Is the system in HBM Cache Mode or in HBM/DRAM Flat Mode?
  • What is the output of `/opt/intel/oneapi/vtune/2024.0/sepdk/src/insmod-sep -q`?


Thanks,

Jaideep





JaideepK_Intel
Employee

Hi,


Could you please share the above details?


Thanks,

Jaideep


JaideepK_Intel
Employee

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks,

Jaideep

