I ran a code at different CPU frequencies and collected VTune data (on an 8280 processor, RHEL 7) using microarchitecture analysis. I understand that VTune (v2020) can be used to identify the portions of code that underutilize the hardware resources of a processor. I did this experiment to see how the application responds when a particular hardware component is varied, or in other words, which hardware component (for example memory frequency or CPU frequency) limits the scaling of this application.
So, I gathered the data at various frequencies (acpi-cpufreq) and followed the metric breakdown trail of the numbers shown in red in the VTune GUI as -
1: Back End Bound --> 2: (Memory Bound, Core Bound) --> 3: DRAM Bound --> 4: (Memory Bandwidth, Memory Latency) --> 5: Local DRAM.
I noticed that -
a) Back-End Bound = Memory Bound + Core Bound, example (62% of clockticks = 42% + 20%)
b) Memory Bound ~= L1 Bound + L2 Bound + L3 Bound + DRAM Bound + Store Bound (42% ~= 8% + 3% + 2% + 20% + 6%)
c) DRAM Bound < Memory Bandwidth Bound + Memory Latency (20% < 28% + 10%)
d) Memory Latency << Local DRAM + Remote DRAM + Remote Cache (10% << 97% + 2% + 1%)
Q1: What could be the reason behind the subcategory total exceeding the category value for c and d?
For c and d I was expecting something like DRAM Bound = Memory Bandwidth Bound + Memory Latency.
Q2: On increasing the CPU frequency, I got the following from VTune for DRAM Memory Bandwidth:
1GHz - 28 % of Clockticks
1.4GHz - 37 %
1.8GHz - 42 %
2GHz - 42.5 %
2.6GHz - 42.8 %
2.7GHz - 42.9 %
2.7+boost enabled - 41.7 %
- The share of clockticks stalled on DRAM does not increase once the frequency exceeds 1.8 GHz. I am now looking for the reason behind this behaviour.
I expected that with higher frequency, stalls would grow, as more CPU cycles/pipeline slots would be wasted due to data unavailability.
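To make the flattening concrete, the measurements listed above can be tabulated and the change per GHz computed between adjacent points. This is only a minimal sketch over the numbers quoted in this post. The names `measurements` and `slopes` are made up for illustration, and the boost-enabled run is left out because its effective frequency is not known.

```python
# (core frequency in GHz, DRAM Memory Bandwidth as % of clockticks), as quoted above.
# The "2.7 + boost" run is omitted: its effective frequency is not stated.
measurements = [
    (1.0, 28.0),
    (1.4, 37.0),
    (1.8, 42.0),
    (2.0, 42.5),
    (2.6, 42.8),
    (2.7, 42.9),
]


def slopes(points):
    """Return (f_low, f_high, percentage points gained per GHz) for each adjacent pair."""
    out = []
    for (f0, p0), (f1, p1) in zip(points, points[1:]):
        out.append((f0, f1, (p1 - p0) / (f1 - f0)))
    return out


if __name__ == "__main__":
    for f0, f1, s in slopes(measurements):
        print(f"{f0:.1f} -> {f1:.1f} GHz: {s:5.1f} percentage points per GHz")
```

The gain per GHz is large between 1.0 and 1.8 GHz and drops to a few percentage points or less per GHz beyond that, which is the plateau described above.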
I am focusing on the metrics highlighted in red. As the cache-bound clock cycles were almost constant (0.2/0.4% increase in each of L1, L2, L3, Store) for all the frequencies mentioned above, could I say that a larger cache will not help here, contrary to what is mentioned here?
Q3: I noted that on varying the frequency, the Vector Capacity Usage (FPU) stays constant at around 70%. From the explanation here, that means 70% of my floating-point computations executed on VPU units (the rest were scalar).
Also, here I can see that there are different types of execution units which can process 256-bit data. Is it possible to see the breakdown of the floating-point operations, e.g. how many used 256-FP MUL, how many used 256-FP ADD, etc.?
Q4: Are 256-FP ADD/256-FP MUL and FMA different? If yes, on which port are the uOPs for FMA dispatched? I can't see the FMA unit in the block diagram.
Please let me know if more information is required from my end, or if any of the questions above are vague or unclear.
I understand that Q1 and Q2 are related to the code,
but it would be great if I could get some information for Q3 and Q4.
Also, for Q4, I am aware that FMA units are 512 bits wide on Cascade Lake. But in case the FMA and FP ADD/MUL execution units are separate, I am not sure whether FP ADD and FP MUL are also 512 bits wide.
Q1. Only level-1 and level-2 metrics, which are measured in slots, can reliably add up to their parent. Metrics at lower levels of the hierarchy can't: they can overlap, they can rely on heuristics and estimations, etc. You should interpret metric values at level 3 and lower as weights: the higher the value, the bigger the bottleneck it represents.
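As a quick illustration, the numbers from the question can be checked level by level. This is just a sketch over the quoted values. The `tree` layout and the `child_sum_gap` helper are hypothetical names for this example, not anything VTune exports.

```python
# Values quoted in the question: parent value, then its children.
tree = {
    "Back-End Bound": (62, {"Memory Bound": 42, "Core Bound": 20}),
    "Memory Bound": (42, {"L1 Bound": 8, "L2 Bound": 3, "L3 Bound": 2,
                          "DRAM Bound": 20, "Store Bound": 6}),
    "DRAM Bound": (20, {"Memory Bandwidth": 28, "Memory Latency": 10}),
    "Memory Latency": (10, {"Local DRAM": 97, "Remote DRAM": 2, "Remote Cache": 1}),
}


def child_sum_gap(tree):
    """Return {parent name: sum of children minus parent value}."""
    return {name: sum(children.values()) - parent
            for name, (parent, children) in tree.items()}


if __name__ == "__main__":
    for name, gap in child_sum_gap(tree).items():
        print(f"{name:15s} children - parent = {gap:+d}")
```

The gap is zero at the top level, small for Memory Bound, and large for DRAM Bound and Memory Latency, so only the upper levels behave additively. The Memory Latency children add up to 100, which suggests they are reported against their own total rather than the parent's clockticks, but that is only an inference from the numbers.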
Q2. So in your case the percentage of cycles stalled waiting for data from DRAM is roughly constant once the frequency exceeds 1.8 GHz. It is hard to say what exactly is happening without knowing the code and seeing all the bottlenecks (the whole metrics tree). But since accessing DRAM is the biggest bottleneck, a larger cache may actually help, since it may reduce the need to go to DRAM.
Q3. The Vector Capacity Usage (FPU) metric takes into account not just whether operations are vectorized but also how many vector 'lanes' you are using. E.g. if all your instructions are 128-bit vector ones but your CPU supports 256-bit, this metric will still not exceed 50%. You can use the HPC Performance Characterization analysis in VTune for a more detailed breakdown of FP operations.
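A simplified way to picture the lane effect is a width-weighted average over the FP instruction mix. This sketch is an assumption made for illustration and is not VTune's actual formula. The function `vector_capacity_usage` and the example mixes are hypothetical.

```python
def vector_capacity_usage(mix, max_width_bits):
    """Width-weighted usage in percent.

    mix maps operand width in bits -> share of FP instructions at that width.
    Shares must sum to 1. This is a simplified model, not VTune's definition.
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction shares must sum to 1")
    return 100.0 * sum(share * width / max_width_bits
                       for width, share in mix.items())


if __name__ == "__main__":
    # All instructions 128-bit on a CPU that supports 256-bit: half the lanes are used.
    print(vector_capacity_usage({128: 1.0}, 256))
    # Hypothetical mix: 60% full-width 256-bit, 40% scalar double (64-bit).
    print(vector_capacity_usage({256: 0.6, 64: 0.4}, 256))
```

Under this model the first mix gives 50% and the second gives about 70%, even though only 60% of its instructions are vector ones. So a value around 70% does not by itself say that 70% of the FP work was vectorized.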
Q4. This should be useful: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition