Suggestions Needed for Finding a Locking/Waiting Problem using Hardware Sampling

todd-bezenek · 04-14-2011

I have an application running on a Dual Xeon i9 with a total of 8 2-way HT cores for 16 concurrent threads of execution.

My application is very complex, with on the order of 100 various types of concurrent threads being scheduled by the Linux scheduler.

When run at full speed, it runs well for 20-30 minutes and then starts to degrade. It looks as if there is a lock or a series of queued events that needs to be cleared, leaving the machine mostly quiescent for 1-2 seconds. This happens periodically, every 4-6 seconds.

I tried a minimal locks-and-waits sampling, but even with the sampling paused, the throughput of the system slowed enough that it never reached the state where it stalls for 1-2 seconds.

When I run HW-based sampling, the stall behavior shows up. However, I do not know how to identify what is stalling from the HW-based sampling, because during a stall there are no hardware events to sample. I could make some guesses based on what is not running (not showing up in samples) during the stalls, but that is essentially everything, so this is problematic.
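To make the "what is not running" idea concrete, here is a minimal sketch of how the quiet windows could be located. It assumes the samples can be exported as (timestamp in seconds, thread name) pairs. That export format, the function name find_stalls, and the 1-second threshold are illustrative assumptions, not something the sampling tool provides.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Stall(NamedTuple):
    start: float       # timestamp of the last sample before the quiet window
    end: float         # timestamp of the first sample after the quiet window
    last_before: str   # thread sampled just before the window
    first_after: str   # thread sampled just after the window


def find_stalls(samples: Iterable[Tuple[float, str]],
                min_gap: float = 1.0) -> List[Stall]:
    """Return windows of at least min_gap seconds that contain no samples."""
    ordered = sorted(samples)  # sort by timestamp (then thread name)
    stalls = []
    # Compare each sample with its successor and keep the large gaps.
    for (t0, thr0), (t1, thr1) in zip(ordered, ordered[1:]):
        if t1 - t0 >= min_gap:
            stalls.append(Stall(t0, t1, thr0, thr1))
    return stalls


if __name__ == "__main__":
    # Hypothetical data: three samples, a quiet period, then two more.
    demo = [(0.00, "worker-1"), (0.01, "worker-2"), (0.02, "io-3"),
            (1.52, "timer-7"), (1.53, "worker-1")]
    for s in find_stalls(demo):
        print(f"quiet {s.start:.2f}s -> {s.end:.2f}s, resumed by {s.first_after}")
```

The thread that appears first after each quiet window may hint at what released the system, although that is only a heuristic.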

Do you have any suggestions?

One possibility is to use spin locks, which would show up as execution at the PC of the spin code, but this is likely to distort the behavior similar to what the locks-and-waits sampling did.

Thank you for any suggestions,

Todd

Todd Bezenek
Computer Architect / Performance Analyst
bezenek@gmail.com

Peter_W_Intel · ‎04-14-2011

Hi Todd,

I assume that you are working with the latest Update 2.

Since your complex application consumes a lot of system resources, I suggest profiling from the command line. I want to know whether the stall is caused by the tool or by the application itself. You can launch the application manually and use the command line to profile the whole system (e.g. amplxe-cl -collect lightweight-hotspots -analyze-system -duration xxx). That way, all applications on the system will be profiled, not only your app.

For LocksandWaits analysis, please try the command below, which avoids the large overhead of monitoring spin locks:
amplxe-cl --collect locksandwaits -knob collect-spin-data=false -knob collect-signals=true -follow-child -- your_app your_args

Hope it helps.

Regards, Peter