I am trying to get a roofline model for an application running on HBM memory, using the advixe-gui.
I launch the binary through numactl and bind its memory to the HBM NUMA nodes.
However, I am missing the HBM Bandwidth in the Roofline model and I am only seeing the DRAM Bandwidth. Is there a way to also add the HBM Bandwidth to the plot?
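For context, a roofline plot caps attainable performance at the smaller of the compute peak and arithmetic intensity times memory bandwidth, so an HBM roof is effectively a second bandwidth line next to the DRAM one. A minimal sketch of that calculation follows. The function name and every numeric value are illustrative placeholders (assumptions), not measurements from this system and not output from Advisor.

```python
def attainable_gflops(arith_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute roof, arithmetic intensity * bandwidth roof)."""
    return min(peak_gflops, arith_intensity * bandwidth_gbs)


# Placeholder roofs (NOT measured values); replace with numbers for your system.
PEAK_GFLOPS = 1000.0
ROOFS_GBS = {"DRAM": 100.0, "HBM": 400.0}

if __name__ == "__main__":
    for name, bandwidth in ROOFS_GBS.items():
        for ai in (0.1, 1.0, 10.0):
            bound = attainable_gflops(ai, PEAK_GFLOPS, bandwidth)
            print(f"{name}: AI={ai} FLOP/byte -> {bound:.1f} GFLOP/s")
```

With two bandwidth roofs, a memory-bound kernel at the same arithmetic intensity gets a different ceiling depending on which memory it runs from, which is why the missing HBM line matters for reading the plot.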
Hi,
Thank you for posting in Intel communities.
We see the same behavior on our end. One possible workaround is to use Intel VTune Profiler, which lets us view the HBM metrics; screenshots are attached for your reference. For more information, you can follow the link below.
We ran the Intel VTune memory access analysis with the following command (in the CLI):
`vtune -c memory-access -- <path to the binary/exe>`
Link to download Vtune: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html
If this resolves your issue, make sure to accept this as a solution. This would help others with a similar issue. Thank you!
Regards,
Jaideep
Hello Jaideep,
Thank you for your help!
However, no matter what I do, even with Intel VTune I am not getting any details about HBM usage, neither as output in the CLI nor in the platform diagram in the GUI. I closely followed the second article you posted.
I am using numactl to specify the NUMA nodes, and I can see from the performance of the STREAM benchmark that the right nodes are being used. Sadly, this is not reflected anywhere in VTune.
I also double-checked the NUMA configuration (Flat mode) and all seems fine. I am also using the most recent version, 2023.2.0 (build 626047).
Thanks for any further help!
Hi,
I hope you are doing well.
To understand your issue better, could you please provide the following details?
- Operating system and processor details (if Linux, please include the kernel version as well)
- A sample reproducer, i.e. an exact replica of the sample you are running
- The result directories of Intel Advisor and Intel VTune, with screenshots
Regards,
Jaideep
Thanks for getting back to me.
For demo purposes I am trying to analyze STREAM provided by Intel: https://github.com/intel/memory-bandwidth-benchmarks
I compile the binary using `make cpu=avx512`
System information provided by `run.sh`:
CPU Model = Intel (R) Xeon (R) CPU Max 9468
Sockets/Cores/Threads:
num_sockets = 2
num_cores_total = 96
num_cores_per_socket = 48
num_threads_per_core = 2
Hyper-Threading = true
NUMA:
num_numa_domains = 16
num_numa_domains_per_socket = 8
num_cores_per_numa_domain = 12
Memory = 397.43 GB
CPU Caches:
L1_cache = 48K (12-way)
L2_cache = 2048K (16-way)
L3_cache = 107520K (15-way)
L3_cache_per_sock = 107520 KB
L3_cache_per_core = 2240 KB
OS:
Operating System = AlmaLinux 9.2 (Turquoise Kodkod)
Kernel version = 5.14.0-284.11.1.el9_2.x86_64
CPU Turbo Boost = enabled
CPU Scaling Governor = performance
CPU Scaling Driver = intel_pstate
Transparent Huge Pages = enabled
ICC version = icc (ICC) 2021.10.0 20230609
Target ISA = avx512f
Output of `numactl -H`:
$ numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 96 97 98 99 100 101 102 103 104 105 106 107
node 0 size: 31343 MB
node 0 free: 6817 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 108 109 110 111 112 113 114 115 116 117 118 119
node 1 size: 32250 MB
node 1 free: 17129 MB
node 2 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 120 121 122 123 124 125 126 127 128 129 130 131
node 2 size: 32211 MB
node 2 free: 31477 MB
node 3 cpus: 36 37 38 39 40 41 42 43 44 45 46 47 132 133 134 135 136 137 138 139 140 141 142 143
node 3 size: 32250 MB
node 3 free: 30986 MB
node 4 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 144 145 146 147 148 149 150 151 152 153 154 155
node 4 size: 32250 MB
node 4 free: 29574 MB
node 5 cpus: 60 61 62 63 64 65 66 67 68 69 70 71 156 157 158 159 160 161 162 163 164 165 166 167
node 5 size: 32250 MB
node 5 free: 30761 MB
node 6 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 168 169 170 171 172 173 174 175 176 177 178 179
node 6 size: 32250 MB
node 6 free: 30963 MB
node 7 cpus: 84 85 86 87 88 89 90 91 92 93 94 95 180 181 182 183 184 185 186 187 188 189 190 191
node 7 size: 32235 MB
node 7 free: 30947 MB
node 8 cpus:
node 8 size: 16384 MB
node 8 free: 3063 MB
node 9 cpus:
node 9 size: 16384 MB
node 9 free: 12374 MB
node 10 cpus:
node 10 size: 16384 MB
node 10 free: 16373 MB
node 11 cpus:
node 11 size: 16384 MB
node 11 free: 16376 MB
node 12 cpus:
node 12 size: 16384 MB
node 12 free: 16143 MB
node 13 cpus:
node 13 size: 16384 MB
node 13 free: 16364 MB
node 14 cpus:
node 14 size: 16384 MB
node 14 free: 15932 MB
node 15 cpus:
node 15 size: 16384 MB
node 15 free: 16356 MB
node distances:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0: 10 12 12 12 21 21 21 21 13 14 14 14 23 23 23 23
1: 12 10 12 12 21 21 21 21 14 13 14 14 23 23 23 23
2: 12 12 10 12 21 21 21 21 14 14 13 14 23 23 23 23
3: 12 12 12 10 21 21 21 21 14 14 14 13 23 23 23 23
4: 21 21 21 21 10 12 12 12 23 23 23 23 13 14 14 14
5: 21 21 21 21 12 10 12 12 23 23 23 23 14 13 14 14
6: 21 21 21 21 12 12 10 12 23 23 23 23 14 14 13 14
7: 21 21 21 21 12 12 12 10 23 23 23 23 14 14 14 13
8: 13 14 14 14 23 23 23 23 10 14 14 14 23 23 23 23
9: 14 13 14 14 23 23 23 23 14 10 14 14 23 23 23 23
10: 14 14 13 14 23 23 23 23 14 14 10 14 23 23 23 23
11: 14 14 14 13 23 23 23 23 14 14 14 10 23 23 23 23
12: 23 23 23 23 13 14 14 14 23 23 23 23 10 14 14 14
13: 23 23 23 23 14 13 14 14 23 23 23 23 14 10 14 14
14: 23 23 23 23 14 14 13 14 23 23 23 23 14 14 10 14
15: 23 23 23 23 14 14 14 13 23 23 23 23 14 14 14 10
VTune configuration:
Thanks for the help!
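In the `numactl -H` listing above, nodes 8-15 have no CPUs and 16384 MB each, which is the pattern one would expect for HBM exposed as separate NUMA nodes in Flat mode. A minimal sketch for picking out such CPU-less nodes from the text follows. The helper name `cpuless_nodes` is hypothetical, and `SAMPLE` is an abbreviated excerpt of the listing (the CPU list of node 0 is shortened); treating CPU-less nodes as HBM is an assumption to verify against the platform configuration.

```python
import re


def cpuless_nodes(numactl_text):
    """Return sorted IDs of NUMA nodes whose 'cpus:' line lists no CPUs."""
    nodes = []
    for line in numactl_text.splitlines():
        match = re.match(r"\s*node (\d+) cpus:(.*)$", line)
        if match and not match.group(2).strip():
            nodes.append(int(match.group(1)))
    return sorted(nodes)


# Abbreviated excerpt of the `numactl -H` output shown above.
SAMPLE = """\
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3
node 0 size: 31343 MB
node 8 cpus:
node 8 size: 16384 MB
node 9 cpus:
node 9 size: 16384 MB
"""

if __name__ == "__main__":
    print(cpuless_nodes(SAMPLE))
```

On a live system the input text could come from `subprocess.run(["numactl", "-H"], capture_output=True, text=True).stdout` instead of the embedded sample.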
Hi,
Thank you for sharing the details. We tried the same from our end.
Our Environment:
>>Intel (R) Xeon (R) CPU Max 9480
The output from run.sh is attached below:
CPU Model = Intel (R) Xeon (R) CPU Max 9480
Sockets/Cores/Threads:
num_sockets = 2
num_cores_total = 112
num_cores_per_socket = 56
num_threads_per_core = 2
Hyper-Threading = true
NUMA:
num_numa_domains = 8
num_numa_domains_per_socket = 4
num_cores_per_numa_domain = 14
Memory = 1056.38 GB
CPU Caches:
L1_cache = 48K (12-way)
L2_cache = 2048K (16-way)
L3_cache = 115200K (15-way)
L3_cache_per_sock = 115192 KB
L3_cache_per_core = 2057 KB
OS:
Operating System = Ubuntu 20.04.6 LTS
Kernel version = 5.15.0-52-generic
CPU Turbo Boost = enabled
CPU Scaling Governor = powersave
CPU Scaling Driver = intel_pstate
Transparent Huge Pages = disabled
ICC version = icc (ICC) 2021.10.0 20230609
Target ISA = avx512f
>>numactl -H output
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 112 113 114 115 116 117 118 119 120 121 122 123 124 125
node 0 size: 128547 MB
node 0 free: 123812 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 126 127 128 129 130 131 132 133 134 135 136 137 138 139
node 1 size: 129017 MB
node 1 free: 128408 MB
node 2 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 140 141 142 143 144 145 146 147 148 149 150 151 152 153
node 2 size: 128981 MB
node 2 free: 128382 MB
node 3 cpus: 42 43 44 45 46 47 48 49 50 51 52 53 54 55 154 155 156 157 158 159 160 161 162 163 164 165 166 167
node 3 size: 129017 MB
node 3 free: 128392 MB
node 4 cpus: 56 57 58 59 60 61 62 63 64 65 66 67 68 69 168 169 170 171 172 173 174 175 176 177 178 179 180 181
node 4 size: 129017 MB
node 4 free: 128396 MB
node 5 cpus: 70 71 72 73 74 75 76 77 78 79 80 81 82 83 182 183 184 185 186 187 188 189 190 191 192 193 194 195
node 5 size: 129017 MB
node 5 free: 128428 MB
node 6 cpus: 84 85 86 87 88 89 90 91 92 93 94 95 96 97 196 197 198 199 200 201 202 203 204 205 206 207 208 209
node 6 size: 129017 MB
node 6 free: 128243 MB
node 7 cpus: 98 99 100 101 102 103 104 105 106 107 108 109 110 111 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 7 size: 129006 MB
node 7 free: 128017 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 12 12 21 21 21 21
1: 12 10 12 12 21 21 21 21
2: 12 12 10 12 21 21 21 21
3: 12 12 12 10 21 21 21 21
4: 21 21 21 21 10 12 12 12
5: 21 21 21 21 12 10 12 12
6: 21 21 21 21 12 12 10 12
7: 21 21 21 21 12 12 12 10
Everything appears to run fine on our end; screenshots are attached for reference.
The command we used:
`vtune -c memory-access -- /usr/bin/numactl -m 0-5 ./stream_avx512.bin`
To help us understand a little better: are you referring to HBM Bandwidth Bound and NUMA: % of Remote Accesses?
If this resolves your issue, make sure to accept this as a solution. This would help others with a similar issue. Thank you!
Thanks,
Jaideep
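The command above binds memory to nodes 0-5; which nodes to pass to `numactl -m` depends on the NUMA layout of the machine. A minimal sketch for assembling the same command line for an arbitrary node set follows. The helper names `node_range` and `vtune_memory_access_cmd` are hypothetical; only the `vtune -c memory-access --` and `numactl -m` forms shown in this thread are used.

```python
def node_range(nodes):
    """Format node IDs as a numactl node list, e.g. [8, 9, 10] -> '8-10'."""
    nodes = sorted(set(nodes))
    if not nodes:
        raise ValueError("need at least one NUMA node")
    parts = []
    start = prev = nodes[0]
    for node in nodes[1:]:
        if node != prev + 1:
            # Close the current run of consecutive IDs and start a new one.
            parts.append(f"{start}-{prev}" if start != prev else str(start))
            start = node
        prev = node
    parts.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(parts)


def vtune_memory_access_cmd(binary, nodes, numactl="/usr/bin/numactl"):
    """Build the argv list for a memory-access collection bound to `nodes`."""
    return ["vtune", "-c", "memory-access", "--",
            numactl, "-m", node_range(nodes), binary]


if __name__ == "__main__":
    print(" ".join(vtune_memory_access_cmd("./stream_avx512.bin", range(0, 6))))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with the binary path.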
Hi,
If this resolves your issue, make sure to accept this as a solution. This would help others with a similar issue. Thank you!
Regards,
Jaideep
Hi,
We haven't heard back from you. Is there any update from your end on the issue?
Thanks,
Jaideep
Thank you for looking into that!
So far I have had no success gathering information about the HBM bandwidth. Could the kernel I am using, 5.14.0-284.11.1.el9_2.x86_64, be the problem?
It seems that the HBM PMUs, e.g. in /sys/bus/event_source/devices/, are named like "uncore_type_14" instead of e.g. "uncore_hbm_ID".
Support for Sapphire Rapids seems to have been introduced in this patch: https://lore.kernel.org/all/162547162722.395.795111830712921025.tip-bot2@tip-bot2/T/
So my question is: what minimum kernel version do you recommend for use with Sapphire Rapids?
Thank you so much for your help so far!
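The naming check described above can be scripted. A minimal sketch follows that splits the uncore PMU device names into generically numbered ones ("uncore_type_N", optionally with a unit suffix) and descriptively named ones. The helper names are hypothetical, the sample names in the usage are illustrative, and the assumption that generic names indicate missing platform-specific kernel support is the poster's hypothesis, not something confirmed in this thread.

```python
import os
import re

# Matches generically numbered uncore PMUs such as "uncore_type_14" or "uncore_type_14_3".
GENERIC_UNCORE = re.compile(r"^uncore_type_\d+(_\d+)?$")


def split_uncore_names(names):
    """Split PMU device names into (generic, named) uncore lists, both sorted."""
    uncore = sorted(n for n in names if n.startswith("uncore_"))
    generic = [n for n in uncore if GENERIC_UNCORE.match(n)]
    named = [n for n in uncore if not GENERIC_UNCORE.match(n)]
    return generic, named


def list_pmu_devices(path="/sys/bus/event_source/devices"):
    """Return the PMU device names under `path`, or [] if it does not exist."""
    return sorted(os.listdir(path)) if os.path.isdir(path) else []


if __name__ == "__main__":
    generic, named = split_uncore_names(list_pmu_devices())
    print("generic uncore PMUs:", generic)
    print("named uncore PMUs:", named)
```

For example, `split_uncore_names(["cpu", "uncore_type_14", "uncore_imc_0"])` puts "uncore_type_14" in the generic list and "uncore_imc_0" in the named list, while "cpu" is ignored.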
Hi,
The results we posted above were collected on the Intel DevCloud platform. We are working on this and will get back to you with an update.
Thanks,
Jaideep
Hi,
I hope you are doing well.
Could you please share the following information?
- Is the system in HBM Cache Mode or HBM/DRAM Flat Mode?
- Please provide the output of `/opt/intel/oneapi/vtune/2024.0/sepdk/src/insmod-sep -q`
Thanks,
Jaideep
Hi,
Could you please share the above details?
Thanks,
Jaideep
Hi,
Could you please share the above details?
Thanks,
Jaideep
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks,
Jaideep