Analyzers
Support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

Application does not respond and does not finish while profiling

MikhO
Beginner
4,335 Views

Hi all!

I am profiling a Qt-based application on Windows 10 using Intel VTune Amplifier 2019 (Update 8).

The problem is that the application stops reacting and responding when profiling is on.

The profiler is started paused and then resumed at the particular operation to be tested.

After resuming, the application does not react, and the operation stops at some stage (say 50%) and does not finish; in Task Manager, however, the application still loads the CPU quite heavily (i.e. something is happening).

Without profiling, the operation goes well. The same problem happens under the same conditions on another machine.

Could you recommend something that could help?

Thank you in advance.

Mikhail

41 Replies
MikhO
Beginner
1,345 Views

Hello,

Thank you very much for your prompt reply and for valuable information.

As I said, I do not use Amplifier 2019; we should forget about it at this point and not mention it any further. For now I am using Intel oneAPI VTune Profiler 2021.1.2 Patch 1. This should be enough to detect the processor, I hope.

I am not interested in the HW event-based analysis, because it does not give me a suitable profile: the call stack is represented as a list of functions. What I would like to see is a normal call stack, which looks like a tree of function calls. Moreover, this kind of analysis is hardly suitable, because the duration of most actions being profiled is more than a couple of seconds.

As for memory, I have 32 GB. Task Manager is always open and I monitor the memory; I always have enough of it.

Nevertheless, I have the problem that some actions cannot be profiled, because the application stops responding and does not complete, so I have to kill it in Task Manager. Maybe it crashes, but the CPU is active and something is still being processed in the program, although it does not respond. This happens only with VTune, even if it starts paused. Attaching to the process does not help either. But the operation completes well without Intel VTune.

That is why I would like to know why this is so and how to investigate the strange interaction between Intel VTune and the application. Could you advise some methods to investigate and solve this problem?

For now I suspect that the problem could be related to the multithreading used intensively in such operations. As I have 20 cores, many of them, or maybe all, are used. But I have not studied this thoroughly, and I do not know if it is possible to control.

Your colleague Raeesa gave good advice on performing anomaly analysis, and I have tried it, but I am not sure which options are needed. The analysis was not successful with the default options and some variations, because the problem appears again: the action does not complete and the application does not respond. I asked Raeesa for details, and she started to check something and is probably still checking since April...

Thank you in advance.

Best regards,

Mikhail

MikhO
Beginner
1,336 Views

Hi Vinutha,

Regarding memory, I just noted that I have more than 10 GB available when Amplifier fails, so I guess memory is not the issue.

BR, Mikhail

Andrey_I_Intel
Employee
1,328 Views

Hi Mikhail,

 

VTune dev here. Software-based (User-Mode) hotspots scale quite well up to hundreds of threads, but may have trouble with applications that use unconventional threading, so that may be the cause of a deadlock at a specific stage. If you happen to use custom spinlocks, fibers, or coroutines in the application, then that may be the case. Also, maybe you have non-native (e.g. .NET-driven) code in your application?

Unfortunately, we cannot debug much further without some kind of reproducer, so we generally recommend switching to hardware-based hotspots whenever possible and leaving user mode only as a fallback option. The profile picture in HW vs. SW should in general be very similar, so let's focus on fixing what is missing in HW.

From your description, I think you are probably missing stacks in the result; unlike in User-Mode, they are not collected by default, so first make sure the configuration stage looks like this: Screenshot 2021-05-12 130150.png

On my sample application it looks like this in Top-down being profiled with the provided configuration:

Andrey_I_Intel_0-1620813950418.png

If you are still not getting what is expected please provide more details about what is missing vs expected.

 

BR,

Andrey

MikhO
Beginner
1,293 Views

Hi Andrey,

Thank you very much for your reply and for the detailed information.

I have tried the HW event-based sampling, but after activating the "Collect stacks" option I got an error message: "Stack flow analysis on this platform is limited to the hardware LBR-based stack type that has a depth limitation." I attached the screenshot. I guess this is the reason why the profiles obtained by this method are difficult to analyse and differ from those obtained with User-Mode sampling.

Could you tell me, how this limitation can be lifted?

As for the possible reasons why the application is sometimes not profiled successfully: there should be no non-native part of the code, and the application is mostly written in pure C++. But I can assume that spinlocks are possible. Nevertheless, there is no possibility to find or fix such parts. The code is as it is.

Thus, I mostly apply User-Mode sampling for now, although it fails from time to time. I would note that in such cases profiling can still succeed after repeating it multiple times, but in many cases I give up, and I cannot say whether it would succeed after a hundred repetitions. Moreover, in such cases profiling is successful on other machines with older processors and Amplifier 2015. That is why I also assume that the processor of my machine is not well supported by the 2021 profiler, or that the profiler has some bug. Maybe the HW-based analysis will be a good solution, if it works.

Thank you in advance.

Best regards, Mikhail

MikhO
Beginner
1,292 Views

This is the screenshot with the error message I get in the HW event-based analysis with the "Collect stacks" option.

Andrey_I_Intel
Employee
1,283 Views

When we are talking about Windows, this error can only be caused by incomplete or broken profiling driver installation. Could you please try manual installation by launching "bin64/amplxe-sepreg.exe -i -v" utility from vtune installation folder and restart VTune?

If this does not fix the issue, please post its output as well as the output of the "amplxe-sepreg.exe -c -v" and "sc query vtss" commands.

MikhO
Beginner
1,252 Views

Hi Andrey,

Thank you very much for the commands.

I have tried to install the driver in cmd as admin, but it seems to have failed. Here is the output:

>amplxe-sepreg.exe -i -v
Warning, socperf3 driver is already installed and will be re-used... skipping
Installing and starting sepdrv5...
OK
Installing and starting sepdal...
OK
VTSS++ driver found
Deleting system32/drivers/vtss.sys file...OK
Forming source path for vtss.sys...OK
Forming destination path for vtss.sys...OK
Copying file C:\Program Files (x86)\Intel\oneAPI\vtune\latest\bin64\sepdrv\vtss.sys to C:\WINDOWS\system32\drivers\vtss.sys...OK
Installing and starting VTSS++ driver...FAILED

Line 12 ("Installing and starting VTSS++ driver...FAILED") is suspicious, and I still get the same error message in Intel VTune (although I have not restarted the PC yet).

The output of "amplxe-sepreg.exe -c -v" is as follows:

>amplxe-sepreg.exe -c -v
Checking platform...
Platform is genuine Intel: OK
Platform has SSE2: OK
Platform architecture: INTEL64
User has admin rights: OK
Drivers will be installed to C:\WINDOWS\System32\Drivers\
Checking sepdrv5 driver path...OK
Checking sepdrv5 service...
Driver status: the sepdrv5 service is running
Checking sepdal driver path...OK
Checking sepdal service...
Driver status: the sepdal service is running
Checking socperf3 driver path...OK
Checking socperf3 service...
Driver status: the socperf3 service is running
VTSS++ driver found

I am a bit confused by line 17 ("VTSS++ driver found"), but it is as it is...

Finally, the output of "sc query vtss":

>sc query vtss

SERVICE_NAME: vtss
        TYPE               : 1  KERNEL_DRIVER
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 4294967290  (0xfffffffa)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0

Thank you in advance.

BR, Mikhail
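
For reference, the WIN32_EXIT_CODE printed by "sc query" is an unsigned 32-bit number, and it can be easier to look up once read as a signed value. The following is only a minimal Python sketch, assuming output formatted like the listing above; the helper names parse_sc_query and as_signed32 are hypothetical, and what the resulting value means for the vtss driver is not stated anywhere in this thread.

```python
import re

def parse_sc_query(text):
    """Collect 'KEY : value' pairs from `sc query` output into a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Z0-9_]+)\s*:\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields

def as_signed32(value):
    """Reinterpret an unsigned 32-bit integer as a signed one."""
    return value - (1 << 32) if value >= (1 << 31) else value

# Sample text copied from the output posted above.
sample = """\
SERVICE_NAME: vtss
        TYPE               : 1  KERNEL_DRIVER
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 4294967290  (0xfffffffa)
        SERVICE_EXIT_CODE  : 0  (0x0)
"""

fields = parse_sc_query(sample)
state = fields["STATE"].split()[-1]                    # textual state name
exit_code = int(fields["WIN32_EXIT_CODE"].split()[0])  # decimal part only
print(state, exit_code, hex(exit_code), as_signed32(exit_code))
```

With the sample above, the service state is STOPPED and the exit code 0xfffffffa corresponds to -6 as a signed 32-bit integer.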

Andrey_I_Intel
Employee
1,242 Views

Could you please post the detailed Windows version, as reported by "winver" utility?

MikhO
Beginner
1,239 Views

Hi Andrey,

Windows version: 20H2 (OS Build 19042.985). It's Windows 10 or so.

If needed, the processor is Intel Core i9-10900 CPU @ 2.80GHz.

BR, Mikhail

Andrey_I_Intel
Employee
1,231 Views

Oh, right, support for 20H2 in this part was only added in VTune 2021.2, sorry for the misleading info. Upgrading to the latest VTune version should resolve this installation issue.

 

BR,

Andrey

MikhO
Beginner
1,212 Views

Hi Andrey,

Thank you for the valuable information. I updated the profiler up to 2021.3 and tested.

First of all, I tried to install/reinstall the driver, and the output is as follows:

>amplxe-sepreg.exe -i -v
Warning, socperf3 driver is already installed and will be re-used... skipping
Installing and starting sepdrv5...
OK
Installing and starting sepdal...
OK
VTSS++ driver found
Deleting system32/drivers/vtss.sys file...FAILED
Forming source path for vtss.sys...OK
Forming destination path for vtss.sys...OK
Installing and starting VTSS++ driver...OK

 

One can note an error in line 8 ("Deleting system32/drivers/vtss.sys file...FAILED"), but this is about system32 or so. Could you tell me if it is an issue?

Meanwhile, the output of the check command gives the following information:

>amplxe-sepreg.exe -c -v
Checking platform...
Platform is genuine Intel: OK
Platform has SSE2: OK
Platform architecture: INTEL64
User has admin rights: OK
Drivers will be installed to C:\WINDOWS\System32\Drivers\
Checking sepdrv5 driver path...OK
Checking sepdrv5 service...
Driver status: the sepdrv5 service is running
Checking sepdal driver path...OK
Checking sepdal service...
Driver status: the sepdal service is running
Checking socperf3 driver path...OK
Checking socperf3 service...
Driver status: the socperf3 service is running
VTSS++ driver found

Nevertheless, with the new version I do not have that error message about the HW sampling limitation, and I obtain call-stacks with the corresponding option.

Another plus: the HW event-based sampling was successful where the profiler had failed before with User-Mode sampling. Thus, it works and it is applicable, but I still do not completely understand whether it can fully replace User-Mode sampling. For example, it is written that HW event-based sampling is suitable for profiling actions shorter than a few seconds, but I applied it even to actions longer than 30 seconds or even one minute.

Could you tell me what the negative consequences of such use are, first of all in terms of accuracy, in comparison with User-Mode sampling? Could you tell me which one is preferable and when?

While testing, I got inconsistent and even contradictory results from HW event-based and User-Mode sampling: the profile obtained by User-Mode sampling showed a performance deterioration, while the profile obtained by HW event-based sampling did not reveal any change in performance.

Could you tell me, why it can be so and which one is more accurate?

Thank you in advance.

Best regards,
Mikhail

Vinutha_SV
Moderator
1,205 Views

Hi Mikhail,

HW event-based analysis has a sampling interval of 1 ms resolution (user mode has 10 ms resolution), and hence it is preferred if you have a shorter application run. But you can always use it for longer runs as well.

Performance deterioration: do you mean in terms of the total elapsed time reported by VTune? This happens because of collection overhead, as User-Mode sampling has more overhead than HW-based sampling.
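
To make the effect of the two intervals concrete, here is a rough Python sketch that counts how many samples a single busy thread would yield for a given run length. The function names are hypothetical, and the 1/sqrt(N) scatter is only a generic counting-statistics assumption, not something VTune documents in this thread.

```python
import math

def expected_samples(duration_s, interval_ms, threads=1):
    """Approximate number of samples collected while the threads are busy."""
    return int(duration_s * 1000 / interval_ms) * threads

def relative_noise(samples):
    """Rough 1/sqrt(N) statistical scatter for a function with N samples."""
    return 1.0 / math.sqrt(samples)

# Compare the 1 ms (HW) and 10 ms (user-mode) intervals quoted above.
for duration_s in (2, 30, 60):
    hw = expected_samples(duration_s, interval_ms=1)
    sw = expected_samples(duration_s, interval_ms=10)
    print(f"{duration_s:>3} s: {hw} samples at 1 ms, {sw} samples at 10 ms")
```

Under these assumptions a 2-second action gives only about 200 samples at 10 ms, which is why the finer interval matters most for short runs; for runs of 30 seconds or more both modes collect thousands of samples.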

If you are seeing different hotspots in user mode and hw based runs, let me know.


MikhO
Beginner
1,186 Views

Hi Vinutha,

Thank you very much for your reply and for the information.

As for the sampling interval, I am not sure that I understand you completely. As far as I understand, you mean CPU sampling interval. But I can change this quantity in both types of analysis. In HW event based sampling, it is 1 ms by default, but I can change it, and according to the manual, up to 1000 ms or so.

Could you clarify, if I am correct and what you meant in your answer?

When I mentioned the performance deterioration, I meant the execution time of a particular operation, which increased after some change in the code. I take two versions, before and after the code change, and profile them to reveal the reason for the performance change. User-Mode sampling showed me the performance difference, while HW event-based sampling showed that the old and new versions are similar, i.e. no change in performance. This leaves me confused about which result is more reliable.

Could you tell me which type of analysis is correct and more accurate in this case?

Thank you in advance.

Best regards,

Mikhail

Bernard
Black Belt
1,121 Views

>>>When I mentioned the performance deterioration, I meant the execution time for a particular operation which increased after some change in the code. I take two versions, before the code change and after, and profile them to reveal the reason of the performance change. And the user-mode sampling showed me the performance difference, while the HW event-based showed that the old and new versions are similar, i.e. no change in the performance. And this fact confuses, what the result is more reliable.>>>

Sampling mode has an inherent overhead due to PMI handling (circa 1000-1500 cycles at least); there is the additional cost of counter multiplexing and counter virtualization (due to context switches). IIRC, VTune, while relying on the SEP driver collector, multiplexes events every 50 samples, and this probably requires a long chain of WRMSR/RDMSR instructions to save/restore the PMC counts.

For more precise, lower-overhead measurements you may use counting mode coupled with ITT API markers. The other option is to insert RDPMC calls from user space (available since Linux kernel version 4); such calls would wrap the code at the scope of a function call, a loop, or a basic block.
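
As a back-of-the-envelope illustration of the PMI cost alone, the Python sketch below estimates what fraction of a core's cycles goes to interrupt handling for one sampled event. It assumes the 2.80 GHz nominal clock mentioned earlier in the thread and the default 1 ms interval; it ignores turbo frequencies, multiplexing, and stack collection, and the function name is hypothetical.

```python
def pmi_overhead_fraction(cpu_hz, interval_ms, cycles_per_pmi):
    """Fraction of a core's cycles spent in interrupt handling for one event."""
    cycles_per_interval = cpu_hz * interval_ms / 1000.0
    return cycles_per_pmi / cycles_per_interval

# Assumed inputs: 2.8 GHz clock, 1 ms interval, 1000-1500 cycles per PMI.
low = pmi_overhead_fraction(2.8e9, 1.0, 1000)
high = pmi_overhead_fraction(2.8e9, 1.0, 1500)
print(f"{low:.4%} to {high:.4%} of cycles per sampled event")
```

Under these assumptions the bare interrupt cost stays well below 0.1% of the cycles per event, so any larger overhead would have to come from the other costs named above.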

P.s.

I have many times seen large variations between two profiling sessions of the same program (without any optimizations). These variations were as large as ~15% between runs for the higher-frequency events, e.g. CPU_CLK_UNHALTED.THREAD.
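
Given such run-to-run variation, one simple way to judge whether a slowdown between two builds is real is to repeat each measurement and compare the difference with the spread. This is only a sketch in Python with hypothetical helper names and made-up timings, and the "slowdown must exceed the spread" rule is a crude heuristic, not a statistical test.

```python
from statistics import mean

def relative_spread(times):
    """(max - min) / mean for repeated measurements of one build."""
    return (max(times) - min(times)) / mean(times)

def looks_like_regression(old_times, new_times):
    """Return (verdict, slowdown, noise); verdict is True if slowdown > noise."""
    slowdown = mean(new_times) / mean(old_times) - 1.0
    noise = max(relative_spread(old_times), relative_spread(new_times))
    return slowdown > noise, slowdown, noise

# Hypothetical elapsed times in seconds, not measurements from this thread.
old_runs = [30.0, 31.5, 29.5]
new_runs = [32.0, 33.0, 31.0]
print(looks_like_regression(old_runs, new_runs))
```

With these made-up numbers the mean slowdown of roughly 5% is smaller than the spread within each build, so a single pair of profiles could not tell the two versions apart.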

 

MikhO
Beginner
1,011 Views

Hi,

Thank you very much for the valuable information.

Could you tell me which kind of hotspots analysis you meant when you described "Sampling-mode": was it User-Mode sampling or HW event-based analysis?

As far as I understand from your other statement, you would recommend two options to increase accuracy and reduce overhead. This sounds promising and interesting.

Could you recommend some links where I could find more detailed information about these techniques? Note that in my case profiling is done under Windows.

Thank you.

Best regards,

Mikhail

Bernard
Black Belt
1,001 Views

Hi @MikhO 

 

In my response, by "Sampling-mode" I meant a method of profiling in which the PMU signals a counter-overflow state, thus enabling a registered handler to collect valuable data, i.e. HW event-based sampling.

 

>>>And in my case, profiling is done under windows.>>>

The low-overhead method, i.e. executing RDPMC from user space, is valid for Linux, and I do not know whether it will work on Windows (probably not, at least from the user-space address range). For counting mode, i.e. manual instrumentation of your application (on Windows), you would probably need to write a kernel driver for access to PMU resources. The other option (for Windows) is the PCM tool.

I suppose that the aforementioned tool is able to access the PMCs by communicating with a kernel driver (a PMU accessor). The overhead may be as high as a user-kernel switch plus whatever chain of instructions is needed to read the PMC counts (be it RDMSR or RDPMC executed from kernel mode). I suppose that the overhead may be as high as hundreds of cycles.

>>>Could you recommend some links, where I could find a more detailed information about these techniques>>>

 

I have been relying extensively on the Intel documentation ("Software Development Manual"), the VTune configuration files, and information obtained from this website: https://perfmon-events.intel.com/

Additional websites:

https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html

https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-v...

https://download.01.org/perfmon/

 

In the case of the third link (perfmon), the most important information is contained in the CPU-specific directories (raw event encodings), the TMAM methodology, and "perfmon_server_events".

 

 

 

 

MikhO
Beginner
985 Views

Hi Bernard,

Thank you very much for the valuable and thorough information. That's very interesting; I will study it and try the tool you recommended.

Best regards,

Mikhail

Vinutha_SV
Moderator
1,129 Views

Hi,

Sorry for the delayed response.

Can you let me know how big this difference is? Is it possible to share screenshots or the results where you see the difference?

User-mode Hotspots focuses on algorithm-level optimization in user code. Analysis based on HW events (hotspots and others) focuses on how efficiently user code is executed by the HW.


MikhO
Beginner
1,012 Views

Hi,

Sorry for the delayed answer, it's summer time.

Unfortunately, I can't provide you with the data now, because I can't find the project where I observed the effect. I do profiling daily, and there is a huge number of projects. When I get an example with such an effect or find the old results, I will post it here; I guess you will be notified. I can say definitely that the difference was notable.

For now I mostly apply User-Mode Sampling, as it gives me more convenient results (call stacks are shown correctly by default, smoother plots of thread activity); finally, it is faster than Hardware Event-Based Sampling. If the former fails, the latter is applied, but fortunately this does not happen very often.

Your statement about the difference between the two types of hotspots analysis is interesting, but I do not completely understand your point. As far as I understand, in both cases I get the time spent in different parts of the code, and this is what I need to identify so-called "bottlenecks" and the reasons for performance deterioration or improvement. Could you give a few more details about the difference between User-Mode sampling and hardware event-based sampling?

As far as I understand, in both cases a sampling strategy is applied, although "event-based" in the name of the second analysis type is a bit confusing.

Thank you.

Best regards,

Mikhail

Vinutha_SV
Moderator
1,069 Views

Hi,

Gentle reminder to provide me the result folder or a screenshot.


Vinutha_SV
Moderator
957 Views

Hi,

I hope you were able to resolve the issue. Since I have not received a response for a few weeks, we will stop monitoring this thread.

