We have a multicore, multithreaded, media-intensive application developed with IPP, and we build it with the Intel compiler with AVX optimization enabled. We also have a JDK (1.8.0_181-b13) installed, and Java code handles the signaling and control operations of the media applications, which run in C/C++. This is done through JNI, running on other cores.
On RHEL 6 our application runs with constant CPU usage under a fixed test load. But on RHEL 7 the CPU usage is high for the first hour or so (sometimes it takes 2 hours), and after that, without any change, it comes down by 5-10%. Initially we thought this was caused by tuned, but the behavior is the same even after uninstalling tuned. We are using the default RHEL kernel (3.10.0-957.el7.x86_64).
We are using Intel Xeon CPUs [Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz] with 2 sockets and multiple CPUs [ProLiant DL380 Gen10]. Our product/application runs on a bare-metal server, not in a VM or under KVM.
The OS is vanilla RHEL 7.8.
We are using numactl, and the product is pinned to 3 cores on NUMA node 1. The cpupower.service is not running on our server. The Intel P_state and C_State drivers are not installed. Turbo Boost is disabled. nohz_full and CPU isolation are disabled in our setup. The irqbalance service is running in default mode. The /proc/sys/vm/ data was restored to the default OS values after the uninstallation of tuned. Spectre and Meltdown patches are applied. Any clue or help will be highly appreciated.
I forgot to add that you may restrict the event counting to either kernel-mode code or user-mode code and look at the resulting hotspots.
I do not recommend measuring both user-mode and kernel-mode count overflow together. From my own observation, the perf-driverless activity in kernel mode is very large and will skew the results.
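As a rough cross-check before restricting the profiler events, the user/kernel split of the process's CPU time can also be read from /proc/<pid>/stat (fields 14 and 15, utime and stime, in clock ticks). Below is a minimal Python sketch of that idea. The function names are my own and purely illustrative, and it is not a substitute for the profiler data:

```python
def parse_user_kernel_ticks(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before tokenizing.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 (utime) is rest[11]
    # and field 15 (stime) is rest[12].
    return int(rest[11]), int(rest[12])


def user_kernel_share(before, after):
    """Fractions of CPU time spent in user and kernel mode between two samples."""
    user_delta = after[0] - before[0]
    kernel_delta = after[1] - before[1]
    total = user_delta + kernel_delta
    if total == 0:
        return 0.0, 0.0
    return user_delta / total, kernel_delta / total


def sample(pid):
    """Read the current (utime, stime) pair for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_user_kernel_ticks(stat_file.read())
```

Taking one sample() during the high-usage phase and another a few minutes later, and then repeating this after the drop, would show whether the extra CPU time sits mostly in user mode or in kernel mode.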
Thanks for replying to this thread and for suggesting VTune. We are trying it, but we are facing an issue when we try to capture hotspot data with a target PID. We are trying to resolve that.
We are new to VTune and are looking into the documentation. If you can suggest how to enable/disable event counting for kernel/user space, it will be of great help.
You may create a *custom analysis and open the EVENT drop-down menu, then choose either USER or OS.
Run the same analysis at least 10 times and observe the results; there may be variations in some of the performance metrics reported by VTune.
>>>We are trying but we are facing some issue when we try to capture hotspot data with target PID.>>>
If possible, try to run your executable under VTune's control.
*Of course, this is relevant to hotspot analysis also.
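To make the "run it at least 10 times" advice concrete, here is a small Python sketch that summarizes one metric across repeated runs and checks whether two sets of runs differ by more than the run-to-run noise. The function names, the threshold k=2.0 and the numbers in the usage note are hypothetical, not taken from any real capture:

```python
import statistics


def summarize_runs(values):
    """Mean, sample standard deviation and coefficient of variation of one metric."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    cv = stdev / mean if mean else 0.0
    return {"mean": mean, "stdev": stdev, "cv": cv,
            "min": min(values), "max": max(values)}


def differs_beyond_noise(baseline, candidate, k=2.0):
    """True if the two means differ by more than k times the larger stdev.

    This is a crude rule of thumb (an assumption of this sketch),
    not a formal statistical test.
    """
    base = summarize_runs(baseline)
    cand = summarize_runs(candidate)
    noise = max(base["stdev"], cand["stdev"])
    return abs(base["mean"] - cand["mean"]) > k * noise
```

For example, with made-up CPU times in seconds, differs_beyond_noise([10.0, 10.2, 9.8], [9.0, 9.1, 8.9]) reports a difference larger than the noise, while two sets of runs with the same mean do not.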
We are able to get the Performance Snapshot. I will need to check with legal before sharing the capture in the community.
But when we try to capture the hotspot analysis, we end up with an error due to PIN_MAX_THREADS. Can you please let us know whether this can be bypassed or configured? We could not find any details on this.
Pasted the error.
>>>But when we try to capture the hotspot analysis, we end up having the error due to PIN_MAX_THREADS. Can you please let us know if this can be bypassed?>>>
You should ask for help on the Intel VTune forum.
I have never (as a VTune user) encountered the problem described in your response.
We have been able to capture hardware-based hotspots and we are trying to analyze the samples.
We found that the same section of code uses a different amount of CPU time after the process has been running for an hour or a little more. Until then, the CPU usage is higher.
We need a little help from you regarding the possible cause. I am attaching assembly analysis screenshots from when the CPU usage is high and from when it is low. They include the CPU time and the instructions retired numbers.
Can you please help us a little with the possible reason for this?
Thanks and Best Regards.
Looking at the assembly, I presume that the attached part of the code is executed by some loop (at a higher level).
There are two argument loads into %r14 and %r13 (possibly pointers?), and later there is a machine code sequence that looks like a possible "pointer chase". I suppose that the high number of retired instructions and the CPU time spent may be related to ineffective caching of the pointee data. The code at addresses 0x29795, 0x29799 and 0x2979d is probably executed serially and represents some kind of data structure manipulation. The code at 0x297b8 depends on the result at 0x2979d (hence it was marked by VTune).
It is interesting how much the issue of skid skewed the results. VTune's pre-configured "hotspot analysis" relies on "INSTR_RETIRED.PREC_DIST", and for a large number of loop iterations or other lengthy hot code the convergence is high, hence the results are more accurate.
It is hard to know exactly what has happened without seeing the whole picture, presumably at source level, and without additional samples of other performance events (cache-hierarchy related).
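Until more events are available, the two numbers already in the screenshots (CPU time and instructions retired) allow a first-order split. Since CPU time equals instructions retired multiplied by the average time per instruction, the high/low CPU time ratio factors into an instruction-count ratio and a time-per-instruction ratio. Here is a minimal Python sketch of that arithmetic. The function name and the example values are hypothetical, and the result only hints at where to look next:

```python
def compare_phases(time_high, instr_high, time_low, instr_low):
    """Split the CPU time ratio of two phases into two factors.

    time_ratio == instr_ratio * per_instr_ratio (up to rounding), where
    instr_ratio captures "more instructions executed" and
    per_instr_ratio captures "each instruction took longer on average".
    """
    time_ratio = time_high / time_low
    instr_ratio = instr_high / instr_low
    per_instr_ratio = (time_high / instr_high) / (time_low / instr_low)
    return {"time_ratio": time_ratio,
            "instr_ratio": instr_ratio,
            "per_instr_ratio": per_instr_ratio}
```

If instr_ratio is close to 1 and per_instr_ratio carries the difference, the same instructions are simply taking longer (stalls, for example on memory), which would fit the caching suspicion. If instr_ratio carries the difference, the code is retiring more instructions for the same load in the high-usage phase, which points to a different execution path or extra iterations.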
I am checking whether I can do that. I will get back to you on this.
In the meantime, if you have any other suggestions, please let me know.
Thanks & Best Regards