Application does not respond and does not finish while profiling

MikhO
Beginner

Hi all!

I am profiling a Qt-based application on Windows 10 using Intel VTune Amplifier 2019 (Update 8).

The problem is that the application stops responding when profiling is on.

The profiler is started paused and then resumed at the particular operation to be tested.

After resuming, the application does not react, and the operation stops at some stage (say 50%) and does not finish, but Task Manager shows the application loading the CPU quite heavily (i.e. something is happening).

Without profiling, the operation goes well. The same problem happens under the same conditions on another machine.

Could you recommend something that can help?

Thank you in advance.

Mikhail

MikhO
Beginner

Hello,

Thank you very much for your prompt reply and for valuable information.

As I said, I do not use Amplifier 2019, so we should forget about it at this point and not mention it any further. For now I am using Intel oneAPI VTune Profiler 2021.1.2 Patch 1. This should be enough to detect the processor, I hope.

I am not interested in the HW event-based analysis, because it gives me an unsuitable profile in which the call stack is represented as a list of functions. What I would like to see is a normal call stack that looks like a tree of function calls. Moreover, this kind of analysis is hardly suitable, because the duration of most actions being profiled is more than a couple of seconds.

As for memory, I have 32 GB. Task Manager is always open and I keep an eye on the memory; I always have enough.

Nevertheless, I have the problem that some actions cannot be profiled, because the application stops responding and does not complete, so I have to kill it in Task Manager. Maybe it crashes, but the CPU is active and something is still being processed in the program, although it does not respond. This happens only with VTune, even if it starts paused. Attaching to the process does not help either. But the operation completes well without Intel VTune.

That is why I would like to know why this happens and how to investigate the strange interaction between Intel VTune and the application. Could you advise some methods to investigate and solve this problem?

For now I suspect that the problem could be related to the multithreading used intensively in such operations. As I have 20 cores, many of them, or maybe all, are used. But I have not studied this thoroughly, and I do not know if it is possible to control.

Your colleague Raeesa gave good advice on performing anomaly analysis, and I have tried it, but I am not sure which options are needed. The analysis was not successful with the default options and some variations, because the problem appears again: the action does not complete and the application does not respond. I asked Raeesa for details, and she started to check something and has probably been checking since April...

Thank you in advance.

Best regards,

Mikhail

MikhO
Beginner

Hi Vinutha,

Regarding memory: I just noticed that I have more than 10 GB available when Amplifier fails. I guess memory is not the issue.

BR, Mikhail

Andrey_I_Intel
Employee

Hi Mikhail,

 

VTune dev here. Software based (User-Mode) hotspots scale quite well for up to hundreds of threads, but may have trouble with applications that use unconventional threading, so that may be the cause for a deadlock on a specific stage. If you happen to use custom spinlocks, fibers, coroutines in the application then that may be the case. Also maybe you have non-native, e.g. .NET driven code in your application?

Unfortunately we cannot debug much further without any kind of reproducer, so we generally recommend switching to hardware-based hotspots whenever possible and leaving user mode only as a fallback option. The profile picture in HW vs SW should in general be very similar, so let's focus on fixing what is missing in HW.

From the description you gave, I think you are probably missing stacks in the result: unlike in User-Mode, they are not collected by default. So first make sure the configuration stage looks like this: Screenshot 2021-05-12 130150.png

On my sample application it looks like this in Top-down being profiled with the provided configuration:

Andrey_I_Intel_0-1620813950418.png
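
For readers who cannot see the screenshots: the configuration described in this post (hardware event-based Hotspots with stack collection, started paused as in the original question) can probably also be expressed on the command line. The Python sketch below only assembles the command and does not run it. The `vtune` executable name, the `sampling-mode` and `enable-stack-collection` knobs, and the `-start-paused` and `-command resume` options are assumptions based on the oneAPI-era command line; please check them against `vtune -help collect hotspots` for the installed version. `my_app.exe` and `r_hw_stacks` are hypothetical names.

```python
def build_collect_command(app_path, result_dir, start_paused=True):
    """Assemble (but do not run) a hardware-hotspots collection command."""
    cmd = [
        "vtune",
        "-collect", "hotspots",
        "-knob", "sampling-mode=hw",               # assumed knob name
        "-knob", "enable-stack-collection=true",   # assumed knob name
        "-result-dir", result_dir,
    ]
    if start_paused:
        # Collection starts paused; it is resumed later from another shell.
        cmd.append("-start-paused")
    cmd += ["--", app_path]
    return cmd


def build_resume_command(result_dir):
    """Assemble the command that resumes a paused collection."""
    return ["vtune", "-command", "resume", "-result-dir", result_dir]


if __name__ == "__main__":
    print(" ".join(build_collect_command("my_app.exe", "r_hw_stacks")))
    print(" ".join(build_resume_command("r_hw_stacks")))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without quoting problems in paths such as `C:\Program Files (x86)`.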

If you are still not getting what is expected please provide more details about what is missing vs expected.

 

BR,

Andrey

MikhO
Beginner

Hi Andrey,

Thank you very much for your reply and for the detailed information.

I have tried the HW event-based sampling, but after activating the "Collect stacks" option I got an error message: "Stack flow analysis on this platform is limited to the hardware LBR-based stack type that has a depth limitation." I attached the screenshot. I guess this is the reason why the profiles obtained by this method are difficult to analyze and differ from those obtained with user-mode sampling.

Could you tell me how this limitation can be lifted?

As for the possible reasons why the application is sometimes not profiled successfully: there should be no non-native code, and the application is mostly written in pure C++. But I can assume that spinlocks are possible. Nevertheless, there is no possibility to find or fix such parts. The code is as it is.

Thus, I mostly apply user-mode sampling for now, although it fails from time to time. I would note that in such cases profiling can still succeed after repeating it multiple times, but in many cases I give up, and I cannot say whether it would succeed after a hundred repetitions. Moreover, in such cases profiling is successful on other machines with older processors and Amplifier 2015. That is why I also assume that the processor of my machine is not well supported by VTune 2021 or that the profiler has some bug. Maybe the HW-based analysis will be a good solution, if it works.

Thank you in advance.

Best regards, Mikhail

MikhO
Beginner

This is the screenshot with the error message I get in the HW event-based analysis with the "Collect stacks" option.

Andrey_I_Intel
Employee

When we are talking about Windows, this error can only be caused by incomplete or broken profiling driver installation. Could you please try manual installation by launching "bin64/amplxe-sepreg.exe -i -v" utility from vtune installation folder and restart VTune?

If this does not fix the issue please post its output and also "amplxe-sepreg.exe -c -v" and "sc query vtss" commands output as well.

MikhO
Beginner

Hi Andrey,

Thank you very much for the commands.

I have tried to install the driver in cmd as administrator, but it seems to have failed. Here is the output:

>amplxe-sepreg.exe -i -v
Warning, socperf3 driver is already installed and will be re-used... skipping
Installing and starting sepdrv5...
OK
Installing and starting sepdal...
OK
VTSS++ driver found
Deleting system32/drivers/vtss.sys file...OK
Forming source path for vtss.sys...OK
Forming destination path for vtss.sys...OK
Copying file C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\sepdrv\vtss.sys to C:\WINDOWS\system32\drivers\vtss.sys...OK
Installing and starting VTSS++ driver...FAILED

Line 12 of the output (the final "Installing and starting VTSS++ driver...FAILED") is suspicious, and I still get the same error message in Intel VTune (although I have not restarted the PC yet).

The output of "amplxe-sepreg.exe -c -v" is as follows:

>amplxe-sepreg.exe -c -v
Checking platform...
Platform is genuine Intel: OK
Platform has SSE2: OK
Platform architecture: INTEL64
User has admin rights: OK
Drivers will be installed to C:\WINDOWS\System32\Drivers\
Checking sepdrv5 driver path...OK
Checking sepdrv5 service...
Driver status: the sepdrv5 service is running
Checking sepdal driver path...OK
Checking sepdal service...
Driver status: the sepdal service is running
Checking socperf3 driver path...OK
Checking socperf3 service...
Driver status: the socperf3 service is running
VTSS++ driver found

I am a bit confused by line 17 ("VTSS++ driver found"), but it is as it is...

Finally, the output of "sc query vtss":

>sc query vtss

SERVICE_NAME: vtss
        TYPE               : 1  KERNEL_DRIVER
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 4294967290  (0xfffffffa)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0
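
As a side note, the `sc query` output above is easy to check automatically when the driver has to be re-verified after each reinstall. The following is a minimal Python sketch that parses text in the layout shown above. The helper names are hypothetical, and the sketch only assumes the `NAME : value description` layout seen in this post. It also converts the unsigned exit code to its signed 32-bit form, without claiming what that code means.

```python
import re

SAMPLE = """\
SERVICE_NAME: vtss
        TYPE               : 1  KERNEL_DRIVER
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 4294967290  (0xfffffffa)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0
"""


def parse_sc_query(text):
    """Map each field name to a (value, description) pair."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Z0-9_]+)\s*:\s*(\S+)\s*(.*)", line)
        if match:
            name, value, rest = match.groups()
            fields[name] = (value, rest.strip())
    return fields


def to_signed32(value):
    """Reinterpret an unsigned 32-bit number as a signed one."""
    return value - 2**32 if value >= 2**31 else value


def driver_is_running(fields):
    """True only if the STATE description reads RUNNING."""
    return fields.get("STATE", ("", ""))[1] == "RUNNING"


if __name__ == "__main__":
    info = parse_sc_query(SAMPLE)
    print("state:", info["STATE"][1])
    print("running:", driver_is_running(info))
    print("exit code (signed):", to_signed32(int(info["WIN32_EXIT_CODE"][0])))
```

In a real check, the text would come from running `sc query vtss` (for example via `subprocess.run(..., capture_output=True, text=True)`) instead of the embedded sample.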

Thank you in advance.

BR, Mikhail

Andrey_I_Intel
Employee

Could you please post the detailed Windows version, as reported by "winver" utility?

MikhO
Beginner

Hi Andrey,

Windows version: 20H2 (OS Build 19042.985). It's Windows 10 or so.

If needed, the processor is Intel Core i9-10900 CPU @ 2.80GHz.

BR, Mikhail

Andrey_I_Intel
Employee

Oh, right, support for 20H2 in this part was added only in VTune 2021.2; sorry for the misleading info. Upgrading to the latest VTune version should resolve this installation issue.

 

BR,

Andrey

MikhO
Beginner

Hi Andrey,

Thank you for the valuable information. I updated the profiler to 2021.3 and tested it.

First of all, I tried to install/reinstall the driver, and the output is as follows:

>amplxe-sepreg.exe -i -v
Warning, socperf3 driver is already installed and will be re-used... skipping
Installing and starting sepdrv5...
OK
Installing and starting sepdal...
OK
VTSS++ driver found
Deleting system32/drivers/vtss.sys file...FAILED
Forming source path for vtss.sys...OK
Forming destination path for vtss.sys...OK
Installing and starting VTSS++ driver...OK

 

One can see a failure on line 8 (deleting system32/drivers/vtss.sys). Could you tell me if this is an issue?

Meanwhile, the output of the check command gives the following information:

>amplxe-sepreg.exe -c -v
Checking platform...
Platform is genuine Intel: OK
Platform has SSE2: OK
Platform architecture: INTEL64
User has admin rights: OK
Drivers will be installed to C:\WINDOWS\System32\Drivers\
Checking sepdrv5 driver path...OK
Checking sepdrv5 service...
Driver status: the sepdrv5 service is running
Checking sepdal driver path...OK
Checking sepdal service...
Driver status: the sepdal service is running
Checking socperf3 driver path...OK
Checking socperf3 service...
Driver status: the socperf3 service is running
VTSS++ driver found

Nevertheless, with the new version I no longer get the error message about the HW sampling limitation, and I obtain call stacks with the corresponding option.

Another plus: the HW event-based sampling was successful where the profiler had failed before with user-mode sampling. Thus, it works and it is applicable, but I still do not completely understand whether it can fully replace user-mode sampling. For example, it is written that HW event-based sampling is suitable for profiling actions shorter than a few seconds, but I applied it to actions longer than 30 seconds or even one minute.

Could you tell me what the negative consequences of such use are, first of all in terms of accuracy, in comparison with user-mode sampling? Could you tell me which one is preferable and when?

While testing, I got inconsistent and even contradictory results from HW event-based and user-mode sampling: the profile obtained by user-mode sampling showed the performance deterioration, while the profile obtained by HW event-based sampling did not reveal any change in performance.

Could you tell me why this can happen and which one is more accurate?

Thank you in advance.

Best regards,
Mikhail

Vinutha_SV
Employee

Hi Mikhail,

HW event-based analysis has a sampling interval of 1 ms resolution (user mode has 10 ms resolution), and hence it is preferred if you have a shorter application run. But you can always use it for longer runs as well.
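
To make the effect of the interval concrete, here is a rough worked example in Python. It counts how many samples a single thread collects for a given duration and interval, and estimates the relative statistical error on a function that takes a given share of the time. The 1 ms and 10 ms values come from the reply above; the binomial model with independent samples, the single-thread simplification, and the 5% hotspot share are assumptions made for illustration only.

```python
import math


def sample_count(duration_s, interval_ms):
    """Number of samples one thread collects over the given duration."""
    return round(duration_s * 1000 / interval_ms)


def relative_error(samples, share):
    """Relative standard error of a hotspot's measured share.

    Assumes each sample independently lands in the hotspot with
    probability `share` (binomial model), which is a simplification.
    """
    return math.sqrt((1.0 - share) / (samples * share))


if __name__ == "__main__":
    hotspot_share = 0.05  # hypothetical function taking 5% of the time
    for duration_s in (2, 30):
        for interval_ms in (1, 10):
            n = sample_count(duration_s, interval_ms)
            err = relative_error(n, hotspot_share)
            print(f"{duration_s:>2} s at {interval_ms:>2} ms: "
                  f"{n:>5} samples, relative error about {err:.1%}")
```

Under these assumptions, the finer interval matters most for short runs, where the coarser one yields only a few hundred samples. For runs of 30 seconds and more, both intervals collect thousands of samples, which is consistent with the remark that the HW mode can be used for longer runs as well.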

Performance deterioration: do you mean in terms of the total elapsed time reported by VTune? This happens because of collection overhead, as user-mode sampling has more overhead than HW-based sampling.

If you are seeing different hotspots in user mode and hw based runs, let me know.


MikhO
Beginner

Hi Vinutha,

Thank you very much for your reply and for the information.

As for the sampling interval, I am not sure that I understand you completely. As far as I understand, you mean CPU sampling interval. But I can change this quantity in both types of analysis. In HW event based sampling, it is 1 ms by default, but I can change it, and according to the manual, up to 1000 ms or so.

Could you clarify, if I am correct and what you meant in your answer?

When I mentioned the performance deterioration, I meant the execution time of a particular operation, which increased after some change in the code. I take two versions, before the code change and after, and profile them to reveal the reason for the performance change. The user-mode sampling showed me the performance difference, while the HW event-based sampling showed that the old and new versions are similar, i.e. no change in performance. This is confusing: which result is more reliable?

Could you tell me, which type of the analysis is true and is more accurate in this case?

Thank you in advance.

Best regards,

Mikhail

Bernard
Valued Contributor I

>>>When I mentioned the performance deterioration, I meant the execution time for a particulat operation which increased after some change in the code. I take two versions, before the code change and after, and profile them to reveal the reason of the performance change. And the user-mode sampling showed me the performance difference, while the HW event-based showed that the old and new versions are similar, i.e. no change in the performance. And this fact confuses, what the result is more reliable.>>>

Sampling mode has an inherent overhead due to PMI handling (circa 1000-1500 cycles at least); there is the additional cost of counter multiplexing and counter virtualization (due to context switches). IIRC VTune, while relying on the SEP driver's collector, multiplexes events every 50 samples, and this probably requires a long chain of WRMSR/RDMSR instructions to save/restore the PMC counts.

For more precise, lower-overhead measurements you may use counting mode coupled with ITT API markers. The other option is to insert calls to RDPMC from user space (since Linux kernel version 4); such calls would wrap the code at the scope of a function call, loop, or basic block.

P.S.

I have many times seen large variations between two profiling sessions of the same program (without any optimizations). These variations were as large as ~15% between runs for the higher-frequency events, e.g. CPU_CLK_UNHALTED.THREAD.
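
Given this run-to-run variation, a single pair of profiles is weak evidence for or against a regression. A minimal Python sketch of one possible check follows: it compares repeated timings of the old and new builds and reports a change only when it exceeds both a fixed noise floor and twice the observed relative spread. The 15% floor simply reuses the figure quoted above, which was observed for event counts and is only an assumption for elapsed times. The function name and the sample timings are hypothetical, and this is not a substitute for a proper statistical test.

```python
from statistics import mean, stdev


def compare_runs(old_times, new_times, noise_floor=0.15):
    """Return (relative_change, is_significant) for two sets of timings.

    Each list needs at least two measurements of the same operation.
    """
    old_mean, new_mean = mean(old_times), mean(new_times)
    change = (new_mean - old_mean) / old_mean
    # Relative spread observed within each build's repeated runs.
    spread = max(stdev(old_times) / old_mean, stdev(new_times) / new_mean)
    threshold = max(noise_floor, 2 * spread)
    return change, abs(change) > threshold


if __name__ == "__main__":
    old = [10.0, 10.4, 9.8]          # seconds, hypothetical
    new_small = [10.6, 10.2, 10.9]   # within the assumed noise
    new_large = [13.0, 13.5, 12.8]   # clearly slower
    for label, new in (("small", new_small), ("large", new_large)):
        change, significant = compare_runs(old, new)
        print(f"{label}: change {change:+.1%}, significant: {significant}")
```

Timing the operation several times without any profiler attached gives a baseline that does not depend on either collection mode.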

 

MikhO
Beginner

Hi,

Thank you very much for the valuable information.

Could you tell me which kind of hotspots analysis you meant when you described "Sampling-mode": user-mode sampling or HW event-based analysis?

As far as I understand from your other statement, you would recommend two options to increase the accuracy and reduce the overhead. This sounds promising and interesting.

Could you recommend some links where I could find more detailed information about these techniques? Note that in my case, profiling is done under Windows.

Thank you.

Best regards,

Mikhail

Bernard
Valued Contributor I

Hi @MikhO 

 

In my response, by "Sampling-mode" I meant a method of profiling in which the PMU signals a counter-overflow state, thus enabling a registered handler to collect the data, i.e. HW event-based sampling.

 

>>>And in my case, profiling is done under windows.>>>

The low-overhead method, i.e. executing RDPMC from user space, is valid for Linux, and I do not know whether it works on Windows (probably not, at least not from the user-space address range). For counting mode, i.e. manual instrumentation of your application, on Windows there would probably be a need to write a kernel driver for access to PMU resources. The other option (for Windows) is the PCM tool.

I suppose that the aforementioned tool is able to access the PMCs by communicating with a kernel driver (PMU accessor). The overhead includes a user-kernel switch plus the chain of instructions needed to read the PMC counts (be it RDMSR or RDPMC executed from kernel mode). I suppose that the overhead may be as high as hundreds of cycles.

>>>Could you recommend some links, where I could find a more detailed information about these techniques>>>

 

I have been relying extensively on the Intel "Software Developer's Manual", the VTune configuration files, and information obtained from this website: https://perfmon-events.intel.com/

Additional websites:

https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html

https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-volume-3b-system-programming-guide-part-2.html

https://download.01.org/perfmon/

 

In the case of the third link (perfmon), the most important information is contained in the CPU-specific directories (raw event encodings), the TMAM methodology, and "perfmon_server_events".

 

 

 

 

MikhO
Beginner

Hi Bernard,

Thank you very much for the valuable and thorough information. That is very interesting; I will study it and try the tool you recommended.

Best regards,

Mikhail

Vinutha_SV
Employee

Hi,

Sorry for the delayed response.

Can you let me know how big this difference is? Is it possible to share the screenshots or the results where you see the difference?

User-mode Hotspots focuses on algorithm-level optimization in user code. Analysis based on HW events (Hotspots and others) focuses on how efficiently the user code is executed by the hardware.


MikhO
Beginner

Hi,

Sorry for the delayed answer; it is summer time.

Unfortunately, I cannot provide the data now, because I cannot find the project where I observed the effect. I do profiling daily, and there is a huge number of projects. When I get an example with such an effect or find the old results, I will post it here; I guess you will be notified. I can say definitely that the difference was notable.

For now I mostly apply user-mode sampling, as it gives me more convenient results (call stacks are shown correctly by default, and the plots of thread activity are smoother); finally, it is faster than hardware event-based sampling. If the former fails, the latter is applied, but fortunately that does not happen very often.

Your statement about the difference between the two types of hotspots analysis is interesting, but I do not completely understand your point. As far as I understand, in both cases I get the time spent in different parts of the code, and this is what I need to identify the so-called "bottlenecks" and the reasons for a performance deterioration or improvement. Could you give a bit more detail about the difference between user-mode sampling and hardware event-based sampling?

As far as I understand, in both cases a sampling strategy is applied, although "event-based" in the name of the second analysis type is a bit confusing.

Thank you.

Best regards,

Mikhail

Vinutha_SV
Employee

Hi,

A gentle reminder to provide me with the result folder or a screenshot.


Vinutha_SV
Employee

Hi,

I hope you were able to resolve the issue. Since I have not received a response for a few weeks, we shall stop monitoring this thread.

