Intel(R) VTune(TM) Profiler Self Check Utility
Copyright (C) 2009-2020 Intel Corporation. All rights reserved.
Build Number: 621966

Ignored warnings:
 ['To profile kernel modules during the session, make sure they are available in the /lib/modules/kernel_version/ location.',
  'To enable hardware event-based sampling, PRODUCT_LEGAL_SHORT_NAME has disabled the NMI watchdog timer. The watchdog timer will be re-enabled after collection completes.']

Check of files: Ok
================================================================================
Context values:
Command line: C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\amplxe-runss.exe --context-value-list
Stdout:
targetOS: Windows
OS: Windows
OSBuildNumber: 22000
OSBitness: 64
AdministratorPrivileges: true
isPtraceScopeLimited: false
isCATSupportedByCPU: true
isL3CATAvailable: false
isL2CATAvailable: true
L2CATDetails: COS=8;ways=10
isL3MonitoringSupportedByCPU: false
isTSXAvailable: false
isPTAvailable: true
isHTEnabled: true
fpgaOnBoard: None
omniPathOnBoard: None
genArchOnBoard: 0
pciClassParts:
tidValuesForIO:
populatedIoParts:
populatedIoUnits:
populatedTidValuesForIO:
isSGXAvailable: false
Hypervisor: None
PerfmonVersion: 5
isMaxDRAMBandwidthMeasurementSupported: true
preferedGpuAdapter: 0:1:0.0
isEHFIAvailable: true
areGpuHardwareMetricsAvailableList:
gpuPlatformIndexList:
ETW: OK
isGpuBusynessAvailable: yes
isGpuBusynessDetailsAvailable: notAccessible
isGpuWaitAvailable: no
isEtwCLRSupported: yes
isFtraceAvailable:
isMdfEtwAvailable: false
isCSwitchAvailable: yes
isFunctionTracingAvailable: no
isIowaitTracingAvailable: no
isVSyncAvailable: na
HypervisorType: None
isDeviceOrCredentialGuardEnabled: false
isSEPDriverAvailable: true
SEPDriverVersion: 5.31
isPAXDriverLoaded: true
PAXDriverVersion: 1.0
platformType: 145
CPU_NAME: Intel(R) microarchitecture code named Alderlake-S
PMU: alderlake
availablePmuTypes: bigcore,smallcore,cbo,ncu,imc,power
referenceFrequency: 3700000000
isPStateAvailable: true
isVTSSPPDriverAvailable: true
isNMIWatchDogTimerRunning: false
isAOCLAvailable: false
isTPSSAvailable: true
isPytraceAvailable: true
isGENDebugInfoAvailableList:
isGTPinCollectionAvailableList:
forceShowInlines: false
isEnergyCollectionSupported: true
isSocwatchDriverLoaded: true
isCPUSupportedBySocwatch: false
isCpuThrottlingAvailable: false
isIPMWatchReady: true
isNvdimmAvailable: false
isOsCountersCollectorAvailable: true
l0LoaderStatus: LibNotFound
l0DevicesAvailable: false
l0VPUDevicesAvailable: false
l0GPUDevicesAvailable: false

Getting context values: OK
================================================================================
Check driver:
isSEPDriverAvailable: true
isPAXDriverLoaded: true
Ok
================================================================================
SEP version:
Command line: C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\sep.exe -version
Stdout:
Sampling Enabling Product Version: 5.31 built on Dec 4 2021 09:23:59
SEP Driver Version: 5.31 (public)
PAX Driver Version: 1.0
Platform type: 145
CPU name: Intel(R) microarchitecture code named Alderlake-S
PMU: alderlake
Driver configs: Maskable Interrupt, MULTI PEBS OFF (tracepoints not accessible), REGISTER CHECK ON
Copyright(C) 2007-2020 Intel Corporation. All rights reserved.

Check driver with sep -version: Ok
================================================================================
HW event-based analysis (counting mode)...
Command line: C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\vtune.exe -collect performance-snapshot -r C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_ps -data-limit 0 -finalization-mode none -source-search-dir C:\Program Files (x86)\Intel\oneAPI\vtune\latest\samples\en\C++\matrix\src -- C:\Program Files (x86)\Intel\oneAPI\vtune\latest\samples\en\C++\matrix\matrix.exe
Stdout:
Addr of buf1 = 00000000009CE040
Offs of buf1 = 00000000009CE180
Addr of buf2 = 00000000029D7040
Offs of buf2 = 00000000029D71C0
Addr of buf3 = 00000000049E2040
Offs of buf3 = 00000000049E2100
Addr of buf4 = 00000000069F2040
Offs of buf4 = 00000000069F2140
Threads #: 16 Win threads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 11.366 seconds
Stderr:
vtune: Peak bandwidth measurement started.
vtune: Peak bandwidth measurement finished.
vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_ps -command stop.
vtune: Collection stopped.
vtune: Using result path `C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_ps'
vtune: Executing actions 0 %
vtune: Executing actions 100 %
vtune: Executing actions 100 % done

HW event-based analysis (counting mode) (Intel driver)
Example of analysis types: Performance Snapshot
Collection: Ok
--------------------------------------------------------------------------------
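The workload here is the matrix sample's "multiply1" kernel: a naive triple-loop multiply of two 2048x2048 double matrices, split across 16 Windows threads by row bands. The shipped sample source may differ in details, but a representative sketch of such a kernel looks like this:

#include <cstddef>

constexpr std::size_t N = 2048;  // "Matrix size: 2048" above

// Naive row-banded multiply, representative of a kernel like multiply1:
// each of the 16 worker threads computes a contiguous band of rows of c.
void multiply_band(const double* a, const double* b, double* c,
                   std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}

Note the k-innermost loop strides through b a whole row (16 KB) at a time, so b is continually re-fetched from memory; the memory-bound numbers in the summary below reflect exactly that access pattern.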
Running finalization...
Command line: C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\vtune.exe -finalize -r C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_ps
Stderr (repeated per-percent progress ticks condensed to one line per stage):
vtune: Using result path `C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_ps'
vtune: Executing actions 0 % Finalizing results
vtune: Executing actions 0 % Finalizing the result
vtune: Executing actions 0 % Clearing the database
vtune: Executing actions 14 % Loading raw data to the database
vtune: Executing actions 14 % Loading 'systemcollector-13696-boxer.sc' file
vtune: Executing actions 25 % Loading 'emon.0.bwhist' file
vtune: Executing actions 25 % Loading 'C:\Users\kimel\AppData\Local\Temp\vtune-
vtune: Executing actions 25 % Updating precomputed scalar metrics
vtune: Executing actions 28 % Processing profile metrics and debug information
vtune: Executing actions 39 % Setting data model parameters
vtune: Executing actions 39 % Resolving module symbols
vtune: Executing actions 39 % Resolving thread name information
vtune: Executing actions 43 % Resolving call target names for dynamic code
vtune: Executing actions 48 % Resolving interrupt name information
vtune: Executing actions 53 % Processing profile metrics and debug information
vtune: Executing actions 63 % Preparing output tree
vtune: Executing actions 63 % Parsing columns in input tree
vtune: Executing actions 64 % Creating top-level columns
vtune: Executing actions 65 % Creating top-level rows
vtune: Executing actions 67 % Setting data model parameters
vtune: Executing actions 68 % Precomputing frequently used data
vtune: Executing actions 79 % Updating precomputed scalar metrics
vtune: Executing actions 82 % Discarding redundant overtime data
vtune: Executing actions 85 % Saving the result
vtune: Executing actions 100 % Saving the result
vtune: Executing actions 100 % done

Finalization: Ok
--------------------------------------------------------------------------------
Command line: C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\vtune.exe -R summary -r C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_ps
Stdout:
Elapsed Time: 11.444s
    IPC: 0.418
     | The IPC may be too low. This could be caused by issues such as memory
     | stalls, instruction starvation, branch misprediction, or long-latency
     | instructions. Explore the other hardware-related metrics to identify
     | what is causing the low IPC.
        P-Core: 0.402
         | See the IPC guidance above.
        E-Core: 0.476
         | See the IPC guidance above.
    SP GFLOPS: 0.000
    DP GFLOPS: 1.089
    x87 GFLOPS: 0.000
    Average CPU Frequency: 4.273 GHz
Logical Core Utilization: 94.7% (15.148 out of 16)
Physical Core Utilization: 83.0% (8.302 out of 10)
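Two of these numbers can be sanity-checked directly; the utilization figures are simply the ratio of observed parallel activity to capacity:

    Logical Core Utilization  = 15.148 / 16 = 94.7%
    Physical Core Utilization =  8.302 / 10 = 83.0%

The 16-logical/10-physical split is consistent with a hybrid Alder Lake part built from hyper-threaded P-cores plus single-threaded E-cores (for example, 6 P + 4 E). IPC is instructions retired per unhalted core cycle, so 0.418 means the machine retires roughly one instruction every 2.4 cycles; the back-end breakdown below shows where the other cycles go.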
Microarchitecture Usage: 12.0% of Pipeline Slots
 | Your code efficiency on this platform is too low.
 |
 | Possible cause: memory stalls, instruction starvation, branch misprediction,
 | or long-latency instructions.
 |
 | Next steps: Run Microarchitecture Exploration analysis to identify the cause
 | of the low microarchitecture usage efficiency.
    P-Core
        Retiring: 13.0% of Pipeline Slots
        Front-End Bound: 1.7% of Pipeline Slots
        Bad Speculation: 0.4% of Pipeline Slots
        Back-End Bound: 85.0% of Pipeline Slots
         | A significant portion of pipeline slots remains empty. When
         | operations take too long in the back-end, they introduce bubbles in
         | the pipeline that ultimately cause fewer pipeline slots containing
         | useful work to be retired per cycle than the machine is capable of
         | supporting. This opportunity cost results in slower execution.
         | Long-latency operations like divides and memory operations can
         | cause this, as can too many operations being directed to a single
         | execution port (for example, more multiply operations arriving in
         | the back-end per cycle than the execution unit can support).
            Memory Bound: 72.7% of Pipeline Slots
             | The metric value is high. This can indicate that a significant
             | fraction of execution pipeline slots could be stalled due to
             | demand memory loads and stores. Use Memory Access analysis to
             | break the metric down by memory hierarchy and memory bandwidth,
             | and to correlate it with memory objects.
                L1 Bound: 3.2% of Clockticks
                L2 Bound: 0.2% of Clockticks
                L3 Bound: 34.9% of Clockticks
                 | This metric shows how often the CPU was stalled on the L3
                 | cache, or contended with a sibling core. Avoiding cache
                 | misses (L2 misses/L3 hits) improves latency and increases
                 | performance.
                DRAM Bound: 43.0% of Clockticks
                 | This metric shows how often the CPU was stalled on the main
                 | memory (DRAM). Caching typically improves latency and
                 | increases performance.
                    Memory Bandwidth: 76.6% of Clockticks
                     | Issue: A significant fraction of cycles was stalled
                     | while approaching the bandwidth limits of the main
                     | memory (DRAM).
                     |
                     | Tips: Improve data accesses to reduce cacheline
                     | transfers from/to memory using these possible
                     | techniques:
                     |     - Consume all bytes of each cacheline before it is
                     |       evicted (for example, reorder structure elements
                     |       and split out non-hot ones; see the sketch after
                     |       this section).
                     |     - Merge compute-limited and bandwidth-limited loops.
                     |     - Use NUMA optimizations on a multi-socket system.
                     |
                     | Note: software prefetches do not help a bandwidth-
                     | limited application.
                    Memory Latency: 21.8% of Clockticks
                     | Issue: A significant fraction of cycles was stalled due
                     | to the latency of the main memory (DRAM).
                     |
                     | Tips: Improve data accesses, or interleave them with
                     | compute, using techniques such as data layout
                     | restructuring or software prefetches (through the
                     | compiler).
            Core Bound: 12.3% of Pipeline Slots
             | This metric represents how much of a bottleneck non-memory core
             | issues were. A shortage of hardware compute resources and
             | dependencies between the software's instructions are both
             | categorized under Core Bound. It may indicate that the machine
             | ran out of out-of-order resources, that certain execution units
             | are overloaded, or that dependencies in the program's data or
             | instruction flow are limiting performance (e.g., FP-chained
             | long-latency arithmetic operations).
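The first bandwidth tip is easiest to see in code. Below is a minimal sketch of hot/cold structure splitting with hypothetical type and field names (not from the matrix sample): once the rarely-used fields are moved out, every byte of each fetched 64-byte cacheline is useful work.

#include <cstddef>
#include <vector>

// Before: each record is exactly one 64-byte cacheline, but a position
// update touches only 24 of those bytes; the other 40 are dragged through
// the cache for nothing.
struct ParticleAoS {
    float x, y, z;     // hot: touched every update
    float vx, vy, vz;  // hot
    char  name[40];    // cold: touched only when rendering labels
};

// After: hot fields packed densely; cold fields live in a parallel array.
struct ParticleHot  { float x, y, z, vx, vy, vz; };  // 24 bytes
struct ParticleCold { char name[40]; };

void integrate(std::vector<ParticleHot>& hot, float dt) {
    // Now ~2.7 particles fit per cacheline, so the loop streams roughly
    // a third of the memory traffic of the combined layout.
    for (std::size_t i = 0; i < hot.size(); ++i) {
        hot[i].x += hot[i].vx * dt;
        hot[i].y += hot[i].vy * dt;
        hot[i].z += hot[i].vz * dt;
    }
}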
    E-Core
        Retiring: 9.8% of Pipeline Slots
        Front-End Bound: 4.1% of Pipeline Slots
        Bad Speculation: 4.9% of Pipeline Slots
        Back-End Bound: 81.6% of Pipeline Slots
         | See the Back-End Bound guidance above.
            Resource Bound: 81.6% of Pipeline Slots
        Alternative Back-End Bound: 81.6% of Pipeline Slots
         | See the Back-End Bound guidance above.
            Core Bound: 24.9%
             | See the Core Bound guidance above.
            Memory Bound: 56.7%
             | See the Memory Bound guidance above.
                L2 Bound: 7.1%
                 | This metric shows how often the machine was stalled on the
                 | L2 cache. Avoiding cache misses (L1 misses/L2 hits) improves
                 | latency and increases performance.
                L3 Bound: 16.1%
                 | See the L3 Bound guidance above.
                DRAM Bound: 33.5%
                 | See the DRAM Bound guidance above.
Memory Bound: 72.7% of Pipeline Slots
 | See the Memory Bound guidance above.
    P-Core Memory Bound: 72.7% of Pipeline Slots
     | See the Memory Bound guidance above.
        Cache Bound: 38.4% of Clockticks
         | A significant proportion of cycles is being spent on data fetches
         | from caches. Check Memory Access analysis to see if accesses to the
         | L2 or L3 caches are problematic, and consider applying the same
         | performance tuning you would for a cache-missing workload. This may
         | include reducing the data working set size, improving data access
         | locality, blocking or partitioning the working set to fit in the
         | lower cache levels, or exploiting hardware prefetchers. Consider
         | using software prefetchers, but note that they can interfere with
         | normal loads, increase latency, and increase pressure on the memory
         | system. This metric includes coherence penalties for shared data.
         | Check Microarchitecture Exploration analysis to see if contested
         | accesses or data sharing are indicated as likely issues.
        DRAM Bound: 43.0% of Clockticks
         | See the DRAM Bound guidance above.
    E-Core Memory Bound: 56.7%
     | See the Memory Bound guidance above.
        Cache Bound: 23.2% of Clockticks
         | See the Cache Bound guidance above.
        DRAM Bound: 33.5%
         | See the DRAM Bound guidance above.
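Both core types spend most of their stalls on DRAM, which matches the naive kernel's access pattern. The standard remedy these tips point toward is cache blocking (loop tiling), so each tile of b is reused from cache instead of re-fetched from main memory. A minimal sketch follows, assuming a tile of 64 (a 64x64 tile of doubles is 32 KB, so the working tiles fit comfortably in L2); this is not the sample's own optimized kernel:

#include <cstddef>

constexpr std::size_t N = 2048, TILE = 64;  // N is divisible by TILE

// Cache-blocked multiply: each TILE x TILE block of b is reused from cache
// across TILE rows of a, cutting cacheline traffic to DRAM and therefore
// the Memory Bandwidth stalls reported above.
void multiply_tiled(const double* a, const double* b, double* c) {
    for (std::size_t ii = 0; ii < N; ii += TILE)
        for (std::size_t kk = 0; kk < N; kk += TILE)
            for (std::size_t jj = 0; jj < N; jj += TILE)
                for (std::size_t i = ii; i < ii + TILE; ++i)
                    for (std::size_t k = kk; k < kk + TILE; ++k) {
                        const double aik = a[i * N + k];  // reused across j
                        for (std::size_t j = jj; j < jj + TILE; ++j)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}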
Vectorization: 0.0% of Packed FP Operations
 | A significant fraction of floating-point arithmetic instructions is scalar.
 | This indicates that the code was not fully vectorized. Use Intel Advisor to
 | see possible reasons why the code was not vectorized.
    Instruction Mix
        SP FLOPs: 0.0% of uOps
            Packed: 3.3% from SP FP
                128-bit: 3.3% from SP FP
                256-bit: 0.0% from SP FP
            Scalar: 96.7% from SP FP
             | See the vectorization guidance above.
        DP FLOPs: 5.3% of uOps
            Packed: 0.0% from DP FP
                128-bit: 0.0% from DP FP
                256-bit: 0.0% from DP FP
            Scalar: 100.0% from DP FP
             | See the vectorization guidance above.
        x87 FLOPs: 0.0% of uOps
        Non-FP: 94.7% of uOps
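Every double-precision uOp retires as a scalar operation, so each one performs a single multiply or add. The inner loop of the tiled sketch above is already in a vectorizable shape; what a compiler typically needs is unit-stride access plus a promise that the output does not alias the inputs. A minimal, hypothetical illustration (function name is not from the sample), assuming an AVX2 build (/arch:AVX2 with MSVC, -O2 -mavx2 with GCC/Clang):

#include <cstddef>

// With non-aliasing pointers and a unit-stride loop, compilers commonly
// auto-vectorize this into 256-bit packed operations, moving the DP
// Instruction Mix above from 100% Scalar toward Packed 256-bit.
void axpy_row(double* __restrict c, const double* __restrict b,
              double aik, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j)
        c[j] += aik * b[j];
}

Compiler optimization reports (or Intel Advisor, as the summary suggests) will confirm whether the loop actually vectorized and, if not, why.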
Collection and Platform Info
    Application Command Line: C:\Program Files (x86)\Intel\oneAPI\vtune\latest\samples\en\C++\matrix\matrix.exe
    Operating System: Microsoft Windows 10
    Computer Name: boxer
    Result Size: 3.7 MB
    Collection start time: 23:24:37 02/03/2022 UTC
    Collection stop time: 23:24:48 02/03/2022 UTC
    Collector Type: Event-based counting driver
    CPU
        Name: Intel(R) microarchitecture code named Alderlake-S
        Frequency: 3.686 GHz
        Logical CPU Count: 16
        Max DRAM Single-Package Bandwidth: 43.000 GB/s
        Cache Allocation Technology
            Level 2 capability: available
            Level 3 capability: not detected

Recommendations:
    Hotspots: Start with Hotspots analysis to understand the efficiency of your algorithm.
     | Use Hotspots analysis to identify the most time-consuming functions.
     | Drill down to see the time spent on every line of code.
    Memory Access: The Memory Bound metric is high (72.7%). A significant fraction of execution pipeline slots could be stalled due to demand memory loads and stores.
     | Use Memory Access analysis to measure metrics that can identify memory
     | access issues.
    HPC Performance Characterization: Vectorization (0.0%) is low. A significant fraction of floating-point arithmetic instructions is scalar. This indicates that the code was not fully vectorized. Use Intel Advisor to see possible reasons why the code was not vectorized.
     | Use HPC Performance Characterization analysis to examine the performance
     | of compute-intensive applications. Understand CPU/GPU utilization and get
     | information about OpenMP efficiency, memory access, and vectorization.

If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <result_dir>.
Alternatively, you may view the report in the csv format:
vtune -report <report_name> -format=csv.

Stderr:
vtune: Using result path `C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_ps'
vtune: Executing actions 0 %
vtune: Executing actions 0 % Finalizing results
vtune: Executing actions 50 % Finalizing results
vtune: Executing actions 50 % Generating a report
vtune: Executing actions 50 % Setting data model parameters
vtune: Executing actions 75 % Setting data model parameters
vtune: Executing actions 75 % Generating a report
vtune: Executing actions 100 % Generating a report
vtune: Executing actions 100 % done

Report: Ok
================================================================================
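For reference, the two report variants the summary mentions would be invoked like the command lines shown throughout this log, with <result_dir> standing in for a result path such as the result_ps directory above:

Command line: vtune -report summary -report-knob show-issues=false -r <result_dir>
Command line: vtune -report summary -format=csv -r <result_dir>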
Instrumentation based analysis check...
Command line: C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\vtune.exe -collect hotspots -r C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_tpss -data-limit 0 -finalization-mode none -source-search-dir C:\Program Files (x86)\Intel\oneAPI\vtune\latest\samples\en\C++\matrix\src -- C:\Program Files (x86)\Intel\oneAPI\vtune\latest\samples\en\C++\matrix\matrix.exe
Stdout:
Addr of buf1 = 00000000060B5040
Offs of buf1 = 00000000060B5180
Addr of buf2 = 00000000080CC040
Offs of buf2 = 00000000080CC1C0
Addr of buf3 = 000000000A0DB040
Offs of buf3 = 000000000A0DB100
Addr of buf4 = 000000000C0E0040
Offs of buf4 = 000000000C0E0140
Threads #: 16 Win threads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 12.324 seconds
Stderr:
vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_tpss -command stop.
vtune: Collection stopped.
vtune: Using result path `C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_tpss'
vtune: Executing actions 0 %
vtune: Executing actions 100 %
vtune: Executing actions 100 % done

Instrumentation based analysis check
Example of analysis types: Hotspots and Threading with user-mode sampling
Collection: Ok
--------------------------------------------------------------------------------
Running finalization...
Command line: C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\vtune.exe -finalize -r C:\Users\kimel\AppData\Local\Temp\vtune-tmp-kimel\self-checker-2022.03.03_00.24.27\result_tpss
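This hotspots run profiles the unmodified binary; user-mode sampling needs no source changes. When function-level hotspots are not enough, the ITT API that ships with VTune (ittnotify.h) can mark logical regions so they appear by name in the result. A minimal sketch; the domain and task names here are illustrative:

#include <ittnotify.h>  // link against ittnotify.lib (Windows) / libittnotify.a

static __itt_domain*        domain = __itt_domain_create("MatrixSample");
static __itt_string_handle* task   = __itt_string_handle_create("multiply1");

void profiled_multiply(/* ... kernel arguments ... */) {
    __itt_task_begin(domain, __itt_null, __itt_null, task);  // region start
    // multiply kernel runs here
    __itt_task_end(domain);                                  // region end
}

When no collector is attached, the ITT calls are inexpensive no-ops, so the annotations can stay in production builds.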