No call stack information for some threads

Girish_M_
Beginner

I have two systems with VTune installed, and I am trying to collect hardware events and then generate a report grouped by thread. I use the following two commands:

amplxe-cl -collect general-exploration -knob enable-stack-collection=true -data-limit=0 -d='unlimited' -target-duration-type=long -r vresult -app-working-dir . --search-dir sym:p=. -- ./myapp myarg

amplxe-cl -report hw-events -group-by thread -r vresult >result.tx

The two systems are

System A - Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz

System B -  Intel(R) Xeon(R) CPU E5530  @ 2.40GHz

On system A I get information for all threads: for example, if 8 threads were created, all 8 appear in the report. On system B, however, I do not get information for all the threads, and the generated report contains fewer threads than it should.

When I do the same thing through the GUI on system B, I see that some threads have no call stack information, and the HW events for these threads are therefore NIL.

I have the dbg library packages installed as well. I would appreciate any help. Thanks.

1 Solution
Peter_W_Intel
Employee

That is expected on your "System B - Intel(R) Xeon(R) CPU E5530 @ 2.40GHz", which is a Nehalem-EP processor.

I can reproduce this on my side.

# amplxe-cl -collect general-exploration -knob enable-stack-collection=true -app-working-dir /home/peter/problem_report -- /home/peter/problem_report/primes.ia32
amplxe: Error: Cannot enable advanced capabilities for Hardware Event-based Sampling: problem with the driver (vtss/vtsspp). Check that the driver is running and the driver group is in the current user group list. See "Building and Managing the Sampling Driver" help topic for further details.

# amplxe-cl -collect general-exploration -knob enable-stack-collection=false -app-working-dir /home/peter/problem_report -- /home/peter/problem_report/primes.ia32

With stack collection disabled, the collection works properly.

Event-based sampling with stack collection works only on Sandy Bridge processors or later. You may want to try another supported processor.
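
A minimal sketch of how the two runs above could be combined: try the collection with stacks first, and rerun without them if the driver prints the error quoted above. The function name `collect_with_fallback`, the `VTUNE_CL` override variable, and the `_stacks`/`_nostacks` result-directory suffixes are illustrative assumptions, not part of VTune; only the `amplxe-cl` options and the error text come from this thread.

```shell
# Hypothetical override so the collector binary can be swapped (e.g. for testing).
VTUNE_CL=${VTUNE_CL:-amplxe-cl}

# Usage: collect_with_fallback RESULT_PREFIX APP [ARGS...]
collect_with_fallback() {
    result_prefix=$1
    shift
    log=$(mktemp)

    # First attempt: hardware event-based sampling with call stacks.
    "$VTUNE_CL" -collect general-exploration -knob enable-stack-collection=true \
        -r "${result_prefix}_stacks" -- "$@" >"$log" 2>&1
    cat "$log"

    # Assumption: the driver problem is detected by matching the error text,
    # not by the exit status, which is not described in this thread.
    if grep -q 'Cannot enable advanced capabilities' "$log"; then
        echo "Stack collection unavailable; rerunning without stacks." >&2
        "$VTUNE_CL" -collect general-exploration -knob enable-stack-collection=false \
            -r "${result_prefix}_nostacks" -- "$@"
    fi
    rm -f "$log"
}
```

It would be called as `collect_with_fallback vresult ./myapp myarg`. Separate result directories are used so the second run does not collide with whatever the first run left behind.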

 

25 Replies
Girish_M_
Beginner

Thank you for your fast reply. I checked that the driver is running and that I am a member of the group the driver runs under. With enable-stack-collection set to false I see thread data for all threads; however, as expected, I see very few HW counters. So I suppose I will have to try another system to get these counters. Thanks again!

Peter_W_Intel
Employee

You wrote: "In system A i get all thread information, for example if 8 threads were created i get all thread information however on the other i do not get the information for all the threads. The report generated has lesser number of threads than there should be."

I don't understand. You said you can get all thread information, so why does the generated report have fewer threads than it should?

Can you please describe it in more detail and post the VTune result?

Girish_M_
Beginner

I have attached the output of

amplxe-cl -collect custom-hw-0  -knob enable-stack-collection=true -data-limit=0 -d='unlimited' -target-duration-type=long -r vresult -app-working-dir . --search-dir sym:p=. -- ./myapp myarg

(Here my app runs with 8 threads.)

amplxe-cl -report hw-events -group-by thread -r vresult >vtuneresults.txt

I have attached vtuneresults.txt; it contains information on only 7 threads, not 8.

I tried unloading the drivers (./rmmod-sep3 -s) and reloading them with permissions granted to all users (./insmod-sep3 -pu -p 666), as shown in the Build and manage sampling driver thread. After that, the thread information goes missing far less often.

Dmitry_P_Intel1
Employee

It would be helpful if you could zip and attach the result directory, that is, the directory you pass with -r vresult.

Thanks & Regards, Dmitry

Girish_M_
Beginner

Please find the result attached. 

Peter_W_Intel
Employee

Thanks for your result data.

The reason is simple: you can find all eight threads in the timeline panel, but one of the eight consumed very little CPU time (was it responsible for context switching?), so the functions of that thread did not appear in the hotspots report.

Girish_M_
Beginner

All the threads here executed the same code, so when you say it consumed less CPU time, do you mean that the time it consumed was not accounted for by VTune? Also, I do not see this issue on the other system.

Peter_W_Intel
Employee

The reason could be that your app ran only briefly (0.3 s?). The last thread was still waiting for a task assignment when the program ended. Please try adding more workload.

Dmitry_P_Intel1
Employee

Hello,

VTune reports that one of the worker threads had no samples because it was preempted and stayed inactive during the whole run.

To explore further, could you please run the following collection:

 amplxe-cl -collect general-exploration -analyze-system -data-limit=0 -d='unlimited' -target-duration-type=long -r vresult -app-working-dir . --search-dir sym:p=. -- ./myapp myarg

and provide the result directory.

Also, it seems that you limit the number of worker threads to the number of physical cores. Do you use any thread affinity to pin them exclusively to physical cores?

Thanks & Regards, Dmitry

 

Girish_M_
Beginner

Please find the compressed vresult directory attached. I also measured the ticks each thread ran using clock() in C, and all the threads executed for almost the same number of ticks, yet I do not get thread data for one thread. In this case the thread with ID 11034 is missing; this thread was created first (using pthreads) and finished last.
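
One caveat about this measurement: on Linux with glibc, clock() reports CPU time consumed by the whole process, so values taken inside concurrently running threads tend to agree even if one thread received little CPU time. A hedged sketch of an external cross-check follows; it reads per-thread CPU ticks from `/proc/<pid>/task/<tid>/stat` while the app is still running. The function names are illustrative, and the field positions (utime and stime as fields 14 and 15) are taken from the proc(5) layout.

```shell
# Print utime + stime (in clock ticks) for one /proc stat line read from stdin.
# The comm field (2nd) may contain spaces or parentheses, so everything up to
# the last ") " is dropped first; utime and stime are then fields 12 and 13.
stat_line_ticks() {
    sed 's/^.*) //' | awk '{ print $12 + $13 }'
}

# Print "TID TICKS" for every thread of a running process.
thread_cpu_ticks() {
    pid=$1
    for stat_file in /proc/"$pid"/task/*/stat; do
        [ -r "$stat_file" ] || continue
        tid=${stat_file%/stat}
        tid=${tid##*/}
        printf '%s %s\n' "$tid" "$(stat_line_ticks <"$stat_file")"
    done
}
```

For example, start the app in the background with `./myapp myarg &` and call `thread_cpu_ticks $!` shortly before it exits. Ticks can be converted to seconds by dividing by `getconf CLK_TCK`.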

Girish_M_
Beginner

Also, I am limiting the number of threads to the number of cores; however, I am not pinning them to particular cores. Please also find attached the source code of the app I am running. It is an N-Queens problem solver; in this case, instead of merging the data produced by different threads, each thread generates all the data, so that in theory every thread executes for the same time.
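
For the affinity question, here is a sketch of how the whole process could be restricted to one logical CPU per physical core from the shell, without changing the source. `one_cpu_per_core` is a hypothetical helper; it assumes the `lscpu -p=CPU,CORE` output format (comment lines starting with `#`, then `cpu,core` pairs). Note that `taskset` only restricts the set of CPUs the threads may run on; it does not bind each thread to its own core.

```shell
# Read `lscpu -p=CPU,CORE` output on stdin and print the first logical CPU
# of each physical core as a comma-separated list, e.g. "0,1,2,3".
one_cpu_per_core() {
    awk -F, '
        /^#/ { next }                 # skip header comments
        !seen[$2]++ {                 # first logical CPU seen for this core
            printf "%s%s", sep, $1
            sep = ","
        }
        END { print "" }
    '
}

# Example (not run here): restrict the app to those CPUs.
# taskset -c "$(lscpu -p=CPU,CORE | one_cpu_per_core)" ./myapp myarg
```

Binding each thread to a distinct core would need a change in the source instead, for example a per-thread affinity call in the pthreads code.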

Girish_M_
Beginner

This can be reproduced on Sandy Bridge as well. I run the collection in a loop and count the threads that appear in each report; if the count is lower than expected, I stop.
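
Such a loop might look like the following sketch. The helper names and the `(TID: N)` pattern used to recognize thread rows are assumptions about the report text, not something confirmed in this thread; adjust `TID_PATTERN` to whatever the `-group-by thread` report actually prints.

```shell
# Assumed shape of a thread label in the report text; adjust as needed.
TID_PATTERN='\(TID: [0-9]+\)'

# Count distinct thread labels in a report read from stdin.
count_threads() {
    grep -oE "$TID_PATTERN" | sort -u | awk 'END { print NR }'
}

# Usage: repeat_until_fewer EXPECTED MAX_RUNS COMMAND [ARGS...]
# COMMAND must print one report on stdout per invocation.
repeat_until_fewer() {
    expected=$1
    max_runs=$2
    shift 2
    run=1
    while [ "$run" -le "$max_runs" ]; do
        found=$("$@" | count_threads)
        if [ "$found" -lt "$expected" ]; then
            echo "run $run: $found of $expected threads reported"
            return 0
        fi
        run=$((run + 1))
    done
    echo "no missing thread in $max_runs runs"
    return 1
}

# Example wiring (not run here), using the commands from this thread:
# one_run() {
#     rm -rf vresult
#     amplxe-cl -collect general-exploration -knob enable-stack-collection=true \
#         -r vresult -- ./myapp myarg >/dev/null 2>&1
#     amplxe-cl -report hw-events -group-by thread -r vresult
# }
# repeat_until_fewer 8 1000 one_run
```

EXPECTED should include the main thread if the report lists it. The result directory of the failing run is left in place so it can be zipped and attached.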

Peter_W_Intel
Employee

Thanks for your example code.

I recompiled your code and cannot reproduce this issue. (I have attached the binary so you can try again. Could it be due to the gcc version?)

gcc-4.4.6-3.el6.x86_64

I used VTune Amplifier XE 2013 Update 17. See my attached VTune result.

There were 8 working threads + 1 main thread. Can other people reproduce this problem? 

Peter_W_Intel
Employee

binary file

Girish_M_
Beginner

Hi Peter,

I used the binary you provided and ran it on the machine with the Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, which was less prone to this error.

I still see the error. Please find the result attached. Since this machine has 4 cores, I spawn 4 threads; however, I get info on only 3 threads, and the main thread info is very rare. I hardly ever see it.

Did you run it in a loop? I ask because the issue sometimes does not reproduce when the app is run only a few times.

Peter_W_Intel
Employee

It seems that you changed the binary name, judging by the command "./NDAID 4" in the result. There are 8 logical cores, so why did you run it with 4 threads?

I also tried on an Ivy Bridge processor with 8 logical CPU cores at 3.5 GHz, and the result was as expected. The difference is:

My OS: 3.11.0-19-generic

Your OS: 3.13.0-24-generic

Could this be due to OS task scheduling? Can you try an older operating system?

Dmitry_P_Intel1
Employee

I looked at the attached result and also experimented a bit on my IVT box, compiling the source provided.

First, the run seems to be quite short, and you use the target-duration-type=long knob, which sets a pretty coarse-grained sampling interval.

So I would recommend setting it to "veryshort".

When I did this I saw all 9 threads: 8 worker threads and 1 main thread, although the main thread consumes only a small portion of CPU time, mostly waiting in thread_join.

Thanks & Regards, Dmitry

Girish_M_
Beginner

Hi Dmitry,

I tried on the Nehalem-EP processor with target-duration-type set to veryshort, and I could still reproduce the issue. I also found that after a system restart it can sometimes take more than 1000 runs to reproduce the issue. Attached is the result of the run in which I hit the issue.

Peter,

I am waiting for a system; I will run on an older kernel as soon as I get it.

Regards

Girish

 

Girish_M_
Beginner

Hi Peter,

I tried it on kernel 3.11.0-15-generic and I still see the problem.

 

Girish
