Huge amount of memory used while processing VTune traces

Etienne
Beginner
11,167 Views

We are planning to use VTune on Chrome to fix performance bottlenecks on Intel hardware. Our first experiments with the tool went well: we were able to use the VTune API to annotate internal Chrome tasks and pinpoint performance issues.

 

Unfortunately, we are now hitting a limit of the tool: it uses too much memory while processing a ~10-second trace (a Chrome startup trace). It needs about 2 to 4 hours to process the trace and reaches ~200 GB to 400 GB of memory usage (see attachment).

 

I ran ETW on VTune itself to try to find the bottleneck (see attachment). It seems to me there are three phases:

  1) Tool initialisation (CPU bound)

  2) Followed by reading the traces (Disk bound)

  3) Processing? [or maybe symbolisation] (Memory bound)

 

Would it be possible to get access to the public symbols, to help investigate this large memory consumption? Otherwise, could someone from the VTune team try to reproduce it?

 

Part of me suspects the memory usage is related to PDB loading and symbolisation within VTune.

27 Replies
SreedeviK_Intel
Moderator
9,049 Views

Hi,


Thanks for posting in Intel Communities.


Could you provide the following information so that we can try to reproduce the issue on our side:

1. CPU, Processor and OS details

2. Sample reproducer code along with the steps/commands 

3. VTune version along with type of analysis performed


Regards,

Sreedevi


Etienne
Beginner
9,040 Views

1. CPU, Processor and OS details

==================================================
Name: Intel(R) Xeon(R) Processor code named Skylake
Frequency: 3.0 GHz
Logical CPU Count: 72

2x Processor Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 2993 Mhz, 18 Core(s), 18 Logical Processor(s)

Installed Physical Memory (RAM) 192 GB

OS Name Microsoft Windows 10 Enterprise
Version 10.0.19045 Build 19045

 

2. Sample reproducer code along with the steps/commands 

==================================================

The issue applies to all analysis types, but here are two examples:

  1) Hotspot

  "C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\vtune" -collect hotspots -no-follow-child "--app-working-dir=C:\Users\etienneb\AppData\Local\Google\Chrome SxS\Application" -- "C:\Users\etienneb\AppData\Local\Google\Chrome SxS\Application\chrome.exe" --user-data-dir=c:\src\dummy --no-sandbox

 

  2) System Overview

"C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\vtune" -collect system-overview -knob analyze-power-usage=true -knob analyze-throttling-reasons=true -no-follow-child "--app-working-dir=C:\Users\etienneb\AppData\Local\Google\Chrome SxS\Application" -- "C:\Users\etienneb\AppData\Local\Google\Chrome SxS\Application\chrome.exe" --user-data-dir=c:\src\dummy --no-sandbox

 

STEPS:

  We launched the collection through the UI and, once Chrome finished loading the first page, stopped the collection through the UI.

 

3. VTune version along with type of analysis performed

==================================================

VTune Profiler 2023.2.0

Product build:

626047

Installation directory:

C:\Program Files (x86)\Intel\oneAPI\vtune\latest

 

SreedeviK_Intel
Moderator
8,963 Views

Hi,


We are checking on this internally and will get back to you with an update shortly.


Regards,

Sreedevi


SreedeviK_Intel
Moderator
8,903 Views

Hi,

 

Can you please confirm whether you are using the hardware event-based sampling drivers? If not, could you try the steps below and share your result directory?

In the GUI, check the hardware event-based sampling checkbox, for example:

[Screenshot: SreedeviK_Intel_0-1697007546796.png]

In the CLI, try adding "-knob sampling-mode=hw" to the command.
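
For anyone scripting the collection rather than using the GUI, here is a minimal Python sketch that assembles the hotspots command line with the hardware sampling knob added. The paths and flags are copied from the commands posted earlier in this thread; the helper name build_hotspots_command is made up for illustration, and the script only prints the command rather than launching it.

```python
import subprocess

# Paths copied from the commands earlier in this thread; adjust for your machine.
VTUNE = r"C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\vtune"
CHROME_DIR = r"C:\Users\etienneb\AppData\Local\Google\Chrome SxS\Application"


def build_hotspots_command(hw_sampling: bool = True) -> list[str]:
    """Return the argument list for a hotspots collection (hypothetical helper)."""
    cmd = [VTUNE, "-collect", "hotspots"]
    if hw_sampling:
        # The knob suggested above: hardware event-based sampling.
        cmd += ["-knob", "sampling-mode=hw"]
    cmd += [
        "-no-follow-child",
        f"--app-working-dir={CHROME_DIR}",
        "--",
        CHROME_DIR + r"\chrome.exe",
        r"--user-data-dir=c:\src\dummy",
        "--no-sandbox",
    ]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable Windows command line instead of running it.
    print(subprocess.list2cmdline(build_hotspots_command()))
```

Passing hw_sampling=False leaves the knob out, which gives the command originally posted.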

 

Regards,

Sreedevi

 

Etienne
Beginner
8,875 Views

We tried both options (hardware and user-mode sampling).

We also changed the sampling interval to collect less data.

We also tried collecting a very small trace (1 second).

We had trouble loading results from other analyses too. This is why we strongly suspect the problem is related to the symbolisation phase.

Etienne
Beginner
8,867 Views

I just made another attempt. I attached the generated VTune files for a quick trace.

I hope the output file is self-contained. 

 

The trace is a quick chrome startup. All the child processes are traced.

VTune is launched with admin rights. The hotspot analysis is used with hardware sampling (5 ms interval).

Etienne
Beginner
8,841 Views

On a brand-new Intel laptop (4 cores, 16 GB RAM, freshly installed Windows), I installed VTune and took a trace of Chrome startup.

The hotspot analysis completed in less than a minute. Unfortunately, the stack frames were not visible and the symbols were not loaded.

 

I added the symbol path environment variable and re-ran the same test. After an hour, the finalization phase was still running.

 

I am using these symbol servers:

  _NT_SYMBOL_PATH=SRV*C:\src\symbols*https://msdl.microsoft.com/download/symbols;SRV*C:\src\symbols*https://chromium-browser-symsrv.commondatastorage.googleapis.com;SRV*C:\src\symbols\*https://download.amd.com/dir/bin;SRV*C:\src\symbols*https://driver-symbols.nvidia.com;SRV*C:\src\symbols\*https://software.intel.com/sites/downloads/symbols/
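
To make the symbol path above easier to read, here is a small Python sketch that splits it into (local cache, server URL) pairs. The function name parse_symbol_path is made up for illustration, and the parser only handles the SRV*cache*url form used in this thread, not every syntax that dbghelp accepts.

```python
SYMBOL_PATH = (
    r"SRV*C:\src\symbols*https://msdl.microsoft.com/download/symbols;"
    r"SRV*C:\src\symbols*https://chromium-browser-symsrv.commondatastorage.googleapis.com;"
    r"SRV*C:\src\symbols\*https://download.amd.com/dir/bin;"
    r"SRV*C:\src\symbols*https://driver-symbols.nvidia.com;"
    r"SRV*C:\src\symbols\*https://software.intel.com/sites/downloads/symbols/"
)


def parse_symbol_path(value: str) -> list[tuple[str, str]]:
    """Split an _NT_SYMBOL_PATH value into (local_cache, server_url) pairs."""
    entries = []
    for element in value.split(";"):
        parts = element.split("*")
        # Keep only well-formed SRV*<cache>*<url> elements.
        if len(parts) == 3 and parts[0].upper() == "SRV":
            entries.append((parts[1], parts[2]))
    return entries


if __name__ == "__main__":
    for cache, url in parse_symbol_path(SYMBOL_PATH):
        print(f"{cache:<18} {url}")
```

Listing the entries this way shows five symbol servers sharing one cache directory, with two of the entries spelling that directory with a trailing backslash.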

 

I strongly suspect the performance issues are related to symbol loading. I wouldn't be surprised if chrome.dll.pdb is simply too big to be easily processed by VTune.

SreedeviK_Intel
Moderator
8,665 Views

Hi,


We are checking on this internally and will get back to you with an update shortly.


Regards,

Sreedevi


SreedeviK_Intel
Moderator
8,614 Views

Hi,


Sorry for the delay in getting back to you.

We checked with our development team, and they informed us that they are working on a fix, which is targeted for a future release.

Also, I can see that you don't have priority support. However, our dev team would like to know whether any of your team members do. With priority support, you can access builds earlier than the targeted release.


Regards,

Sreedevi


SreedeviK_Intel
Moderator
8,510 Views

Hi,


We have not heard back from you. Could you please provide us an update?


Regards,

Sreedevi


Etienne
Beginner
8,465 Views

I don't know what you expect as an update.

 

We were trying to use VTune to see what potential optimisations can be detected with such a low-level tool.

Unfortunately, it doesn't work as-is on our code base. We spent time investigating the source of the issue and concluded that the limitations are in the tool, so we cannot move forward with our tooling analysis. Until fixes are available, we can't evaluate how useful VTune is on our code base.

 

"Our dev team would like to know whether any of your team members have priority support."

 

I don't know what the 'priority support' service is. This is our first use of the tool, so I doubt we have it.

SreedeviK_Intel
Moderator
8,330 Views

Hi,


Could you please try running your sample on the updated VTune version (2024.0) and let us know whether the issue persists?


Regards,

Sreedevi


Etienne
Beginner
8,306 Views

I tried on one of my dev computers and it went smoothly. The Hotspots analysis was able to symbolize the trace quickly.

 

I also tried on my laptop and got the following error in my log. I strongly suspect it is related to one of the security tools installed on my laptop (corporate policy). I don't know why it was working fine before; is it a VTune regression?

'''

11/24/2023 11:37:16:705 : 14368 : ERROR : Installation of component has failed.
Component id: intel.oneapi.win.oneapi-common.licensing, name: oneAPI Common, version: 2024.0.0+49430.
During the execution of the application 'C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe' with arguments '-NoLogo -NoProfile -NonInteractive -ExecutionPolicy AllSigned -File C:\ProgramData\Intel\InstallerCache\DownloadCache\intel.oneapi.win.oneapi-common.licensing,v=2024.0.0+49430\latest_symlink_post_install.ps1 -installDir C:\Program Files (x86)\Intel\oneAPI -linkTargetVersion 2024.0 -latestLinkDir licensing' errors were received:
C:\ProgramData\Intel\InstallerCache\DownloadCache\intel.oneapi.win.oneapi-common.licensing,v=2024.0.0+49430\latest_symlink_post_install.ps1 : Cannot dot-source this command because it was defined in a different language mode. To invoke this command without importing its contents, omit the '.' operator.
+ CategoryInfo : InvalidOperation: (:) [latest_symlink_post_install.ps1], NotSupportedException
+ FullyQualifiedErrorId : DotSourceNotSupported,latest_symlink_post_install.ps

'''

 

I'll give it a try on our lab computers in a few days and see whether it is fixed on those machines too.

Etienne
Beginner
8,297 Views

I took a 30-second trace (System Overview) as a test. Memory consumption seems to be much better.

Unfortunately, the time to process the trace is still very high. What is the usual time for opening these traces?

 

 

SreedeviK_Intel
Moderator
8,250 Views

Hi,

Thank you for sharing your observations.

Can you please share the details of the machines where the hotspot analysis worked and where it did not (processor details, OS details and, if it is Linux, the kernel version)?


Regards,

Sreedevi



Etienne
Beginner
8,217 Views

Sorry for the noise, I clicked on the wrong button. I need to re-write my post.

Etienne
Beginner
8,213 Views

I posted the details for my computer in the 3rd comment above:

  Roughly 2 x 18 Core, (RAM) 192 GB

 

I still face a very long processing time for VTune traces. Memory consumption seems to have improved a lot, and this makes the whole process work. Unfortunately, it still takes a few hours to open a trace of a few seconds.

 

I took a look at how VTune is put together, and my high-level understanding is:

  1) Frontend: Chromium-based UI

  2) Backend: Node.js based, running server.js

  3) Worker: I'm not sure what is being used. But this is the process that performs symbolisation.

  4) Communication: gRPC (or protobuf based)

  5) Database: sqlite

 

The bottleneck seems to be in the workers (see vtune_worker.png). The worker is responsible for symbolisation (see vtune_debug3). Since symbolisation uses dbghelp, I enabled dbghelp debugging through its environment variable, attached a debugger to the worker process, and looked at the dbghelp output in WinDbg (see vtune_debug1).

 

This makes it clear that symbolisation is the bottleneck by far. It takes about ~2 to ~30 seconds per file, and it goes on for hours. The output in the debugger matches the output in the VTune UI (see screenshot).

 

As a short-term solution, is it possible to increase the number of workers? I have the CPU power and memory to handle more workers.
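
As a back-of-envelope illustration of why more workers might help, here is a small Python sketch. It assumes that symbol files are processed independently and split evenly across workers, neither of which is confirmed in this thread, and the function name estimated_hours is made up for illustration.

```python
import math


def estimated_hours(num_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Wall-clock estimate if files are split evenly across independent workers."""
    # The slowest worker handles ceil(num_files / workers) files one after another.
    files_per_worker = math.ceil(num_files / workers)
    return files_per_worker * seconds_per_file / 3600


if __name__ == "__main__":
    # Per-file times observed above: ~2 s to ~30 s. The file count is hypothetical.
    for workers in (1, 4, 16):
        low = estimated_hours(1000, 2, workers)
        high = estimated_hours(1000, 30, workers)
        print(f"{workers:>2} worker(s): {low:.2f} h to {high:.2f} h")
```

Under these assumptions, a two-hour single-worker run at the observed per-file times would correspond to somewhere between 240 files (at 30 seconds each) and 3600 files (at 2 seconds each).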

 

SreedeviK_Intel
Moderator
8,071 Views

Hi,

 

Thank you for sharing the information with us.

 

We are checking on this internally and will get back to you with an update shortly.

To assist you better, kindly perform the following steps and share the output with us:

 

1) Open a File Explorer window at "%TEMP%" (for example, C:\Users\sreede2x\AppData\Local\Temp) and delete the amplxe-log-%USER% and amplxe-tmp-%USER% directories.

2) Run the failing scenario in VTune.

3) Send a .zip of "%TEMP%"\amplxe-log-%USER% to us.
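
Steps 1 and 3 above can also be scripted. The following is a minimal Python sketch, assuming the %USER% suffix in the directory names is the Windows user name; the function names are made up for illustration, and step 2 (running the failing scenario) still has to be done by hand in between.

```python
import shutil
from pathlib import Path


def reset_vtune_temp(temp_dir: Path, user: str) -> None:
    """Step 1: delete the amplxe-log-<user> and amplxe-tmp-<user> directories."""
    for prefix in ("amplxe-log-", "amplxe-tmp-"):
        # ignore_errors=True so a missing directory is not treated as a failure.
        shutil.rmtree(temp_dir / f"{prefix}{user}", ignore_errors=True)


def zip_vtune_logs(temp_dir: Path, user: str, out_base: Path) -> Path:
    """Step 3: zip amplxe-log-<user> into <out_base>.zip and return its path."""
    log_dir = temp_dir / f"amplxe-log-{user}"
    archive = shutil.make_archive(str(out_base), "zip", root_dir=str(log_dir))
    return Path(archive)


# Example usage (not executed here):
#   import getpass, tempfile
#   temp, user = Path(tempfile.gettempdir()), getpass.getuser()
#   reset_vtune_temp(temp, user)
#   ... run the failing scenario in VTune ...
#   print(zip_vtune_logs(temp, user, Path.cwd() / "amplxe-log"))
```

Note that zip_vtune_logs raises an error if the log directory does not exist, which would suggest VTune did not write its logs where this sketch expects them.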

 

[Screenshot: SreedeviK_Intel_0-1701858457343.jpeg]

 

 

Regards,

Sreedevi

 

SreedeviK_Intel
Moderator
7,960 Views

Hi,


We have not heard back from you. Could you please confirm if your issue is resolved or not?


Regards,

Sreedevi


Etienne
Beginner
7,939 Views

The issue is still there and it is still related to symbolisation. It is easy to reproduce, as documented above in this thread.

I added plenty of details, with screenshots, in my comment dated 11-28-2023.
