Suggestions Needed for Finding a Locking/Waiting Problem using Hardware Sampling
I have an application running on a Dual Xeon i9 with a total of 8 2-way HT cores for 16 concurrent threads of execution.
The application is very complex, with on the order of 100 different types of
concurrent threads being scheduled by the Linux scheduler.
When run at full speed, it runs well for 20-30 minutes and then starts to
degrade. It looks as if there is a lock or a series of queued events
that need to be cleared and are causing the machine to be mostly
quiescent for 1-2 seconds. This happens periodically, every 4-6 seconds.
I tried a minimal locks-and-waits sampling, but even with the sampling
paused, the throughput of the system slowed down enough
that it never got into the 1-2 second stall state.
When I run HW-based sampling, the stall behavior shows up. But I do not know
how to find out what is stalling from the HW-based sampling, since while
the system is stalled there are no hardware events to sample. I could make
some guesses based on what is not running (not showing up in samples) during
the stalls, but that is essentially everything, so this is problematic.
Do you have any suggestions?
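
The guesswork about what is and is not running around a stall could be made a little more systematic by post-processing the samples. The following is only a minimal sketch, assuming the samples can be exported as (timestamp in milliseconds, thread name) pairs. That format, the function names find_stalls and last_active_before, and the toy data are all hypothetical and are not a VTune export format. It finds gaps in the sample stream and counts which threads were sampled just before each gap.

```python
from collections import Counter

def find_stalls(samples, min_gap_ms=1000):
    """Return (start, end) for each gap of at least min_gap_ms between consecutive samples."""
    times = sorted(t for t, _ in samples)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a >= min_gap_ms]

def last_active_before(samples, stall_start, window_ms=20):
    """Count samples per thread in the window that ends at stall_start."""
    return Counter(
        name for t, name in samples
        if stall_start - window_ms <= t <= stall_start
    )

# Toy data (hypothetical): integer timestamps in ms, thread names invented.
samples = [
    (0, "worker"), (10, "io"), (20, "worker"), (30, "flusher"),
    (1500, "worker"), (1510, "io"),
]

stalls = find_stalls(samples)
for start, end in stalls:
    # Threads seen right before the machine went quiet are candidates
    # for holding the lock or draining the queue.
    print(start, end, dict(last_active_before(samples, start)))
```

If the same thread keeps turning up as the last one sampled before every stall, that is a hint worth following, though not proof that it owns the contended lock.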
One possibility is to use spin locks, which would show up as execution at the PC of the spin code, but this is likely to distort the behavior in much the same way the locks-and-waits sampling did.
Since your complex application consumes a lot of system resources, I suggest profiling from the command line. I want to know whether the stall is caused by the tool or by the application itself. So you can launch the application manually and use the command line to profile the whole system (e.g. amplxe-cl -collect lightweight-hotspots -analyze-system -duration xxx). That way, all applications on the system are profiled, not only your app.
For the Locks and Waits analysis, please try the command below, which avoids the large overhead of monitoring spin locks: amplxe-cl --collect locksandwaits -knob collect-spin-data=false -knob collect-signals=true -follow-child -- your_app your_args