During the analysis of a 5G model that is partly based on the Intel FlexRAN library, I observed a strange case: the UOPS_EXECUTED.X87 event was counted and its value was attributed to the address range of a single function that does not contain any x87 instructions.
VTune ran in driverless mode, relying on the perf_events interface. The event was counted in both user mode and kernel mode (the 'uk' modifier); the sampling period was the default, probably set by VTune. The collected call stack size was small, set to 1024 bytes. For the most part the call stacks were not reconstructed, probably because RBP is used as a general-purpose register throughout the library code. There was only a single test run, with 90 microarchitectural events multiplexed.
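To illustrate why RBP being used as a general-purpose register can defeat call stack reconstruction, here is a toy model of a frame-pointer walk. It assumes frame-pointer-based unwinding, which may not be what VTune or perf actually used in this run; the `memory` dictionary and all addresses are made up for the example.

```python
def walk_frame_pointers(memory, rbp, max_depth=16):
    """Follow the chain of saved RBP values and collect return addresses.

    memory maps a stack address to the 8-byte value stored there.
    """
    frames = []
    while rbp in memory and (rbp + 8) in memory and len(frames) < max_depth:
        frames.append(memory[rbp + 8])  # return address sits just above the saved RBP
        next_rbp = memory[rbp]          # caller's saved RBP
        if next_rbp <= rbp:             # the stack grows down, so links must ascend
            break
        rbp = next_rbp
    return frames


# Hypothetical stack with three well-formed frames.
EXAMPLE_STACK = {
    0x7000: 0x7100, 0x7008: 0x401234,
    0x7100: 0x7200, 0x7108: 0x405678,
    0x7200: 0x0,    0x7208: 0x409ABC,
}

if __name__ == "__main__":
    # RBP holds a real frame pointer: the whole chain is recovered.
    print([hex(a) for a in walk_frame_pointers(EXAMPLE_STACK, 0x7000)])
    # RBP holds unrelated data (e.g. a loop counter): nothing can be recovered.
    print(walk_frame_pointers(EXAMPLE_STACK, 0x2A))
```

When the walk yields no frames, the only location left for a sample is its instruction pointer, so the whole count lands on whichever function contains that address.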
Here is the relevant screenshot
The aforementioned function matrix_inv_cholesky_16_16 is part of the FlexRAN library and is manually vectorized with AVX512 intrinsics. I analyzed the ICC-generated code line by line: it did not contain any x87 instructions, and there were no calls to Intel runtime libraries or to other functions. The matrix inversion function implements its own complex arithmetic without relying on the std::complex class, so ICC did not insert an x87-based implementation of complex division.
VTune was not able to reconstruct the call stack and, as seen in the picture above, a count of 254,915,064 x87 uops was attributed to the matrix_inv_cholesky_16_16 function.
I strongly suspect two potential reasons:
-- The aforementioned x87 uops were issued by other functions in the call chain, and VTune, not being able to rebuild the call stack, attributed that count to the function at the top of the stack (the matrix inversion function).
-- UOPS_EXECUTED.X87 was configured to count both kernel-mode and user-mode events, and during the frequent syscall communication with the kernel part of the perf interface, the FXSAVE and FXRSTOR instructions (or rather the uops they decompose into) were picked up by the event counter. The aforementioned function was the largest hotspot, so many perf samples were probably taken during its execution, thus accumulating the x87 uops.
I hope someone can help with this case.
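One way the second hypothesis could be checked is to count the event separately in user mode and in kernel mode and compare the two numbers. The sketch below only builds a `perf stat` command line for that. The raw encoding (event 0xB1, umask 0x10) is my assumption for UOPS_EXECUTED.X87 on Skylake-family cores and should be verified against the event list for the exact CPU; the PID is a placeholder.

```python
import shlex

# Assumed raw encoding of UOPS_EXECUTED.X87; verify it for the target CPU.
X87_EVENT = "cpu/event=0xb1,umask=0x10/"


def perf_stat_command(pid, seconds=10):
    """Build a perf stat command that counts the event in user and kernel mode separately."""
    events = [X87_EVENT + "u", X87_EVENT + "k"]  # u = user only, k = kernel only
    cmd = ["perf", "stat"]
    for ev in events:
        cmd += ["-e", ev]
    # Attach to the running process and count for a fixed interval.
    cmd += ["-p", str(pid), "--", "sleep", str(seconds)]
    return cmd


if __name__ == "__main__":
    # 12345 is a placeholder PID, not taken from the original run.
    print(shlex.join(perf_stat_command(12345)))
```

If the kernel-only count accounts for most of the total, the uops are not coming from the user-mode function at all.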
Thanks for reaching out to us. Could you please share the details below so that we can analyze the issue in detail?
1) Give more information about your application.
2) More details about the platform and the VTune version you are using.
3) The commands/steps used to profile the application.
Thank you very much for replying to my post.
Tomorrow I will provide more details in response to your questions.
Please find the answers below
1) We test the performance of the L1 uplink module (i.e. PUCCH). This module consists mainly of a control layer and a DSP layer. The DSP layer relies heavily on the Intel FlexRAN library for low-level vectorized DSP operations. The test application runs in a Docker container.
a) Test machine OS version
b) Test machine CPU
CPU: 24-core Skylake (Intel Scalable, 2nd gen) [I do not possess the exact CPU information because that server is not in working condition now]
3) The application runs inside the Docker container, and VTune, using driverless collection mode, is attached to a specific process ID. The application is executed 100 times in order to generate a sufficient number of samples.
An additional technical description of the anomaly was provided in my relevant forum post.
For your interest -- I uploaded three collection files: map.events, perfcmd and sep.events
Thank you for your help
My previous reply is in moderation, probably because it included .txt files.
1) Our application is a simulation of the 5G L1 uplink layer PUCCH module. The application runs inside a Docker container and consists mainly of data plane code, control plane code and DSP code. The DSP layer relies mainly on Intel's own FlexRAN library.
2) The platform is a server based on a 2nd gen Intel Scalable CPU (24-core CPU [probably Platinum 8268]); currently I do not have access to the machine. The VTune version (installed locally) is vtune_profiler_2020.0.0.605129.
3) The collector is the perf driverless collector. We run our simulation in a Docker container and VTune is attached to a specific PID. The analysis type is "runsa", with close to 90 performance events being collected. We use the VTune GUI to create the microarchitectural session and run it. I have attached three relevant files: map.events, perfcmd and sep.events. If you are interested, I can provide the data.perf file.
Although more than 10 months have passed since I asked for help -- this issue is still relevant.
Yes I would like to restart the investigation.
Could you please attach a screenshot of VTune asm view for the matrix_inv_... function so that we could see on what instructions the UOPS_EXECUTED.X87 samples fall?
If you prefer to send the data privately - feel free to contact me by mail: firstname.lastname@example.org
Unfortunately, I was not able to find the old version of our library containing
the aforementioned matrix_inv_cholesky_16_16 function.
As far as I can recall, the x87 samples were mapped to non-x87 instructions.
Recently I have experienced similar anomalous behaviour of the VTune mapping, where
samples of AVX512 packed single-precision arithmetic instructions were
mapped to seemingly random x86 scalar integer instructions. I suppose that VTune lacks a heuristic to deal
with this kind of issue and the profiler simply maps the sample IP (RIP value) to the resolved binary address space.
If you want, I can privately share screenshots of that new anomaly, i.e. the mappings of the AVX512 packed-float instructions.
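The plain IP-to-address-range mapping I suspect can be sketched as follows. This is only an illustration of the idea, not VTune's actual implementation; the symbol table, the addresses, and the names `control_layer_step` and `helper_with_x87` are hypothetical.

```python
import bisect

# Hypothetical symbol table: (start address, size in bytes, name), sorted by start.
SYMBOLS = [
    (0x1000, 0x200, "control_layer_step"),
    (0x1200, 0x800, "matrix_inv_cholesky_16_16"),
    (0x1A00, 0x100, "helper_with_x87"),
]
STARTS = [start for start, _, _ in SYMBOLS]


def resolve(ip):
    """Return the name of the function whose address range contains ip, or None."""
    i = bisect.bisect_right(STARTS, ip) - 1
    if i < 0:
        return None
    start, size, name = SYMBOLS[i]
    return name if ip < start + size else None


def attribute(sample_ips):
    """Count samples per function using only the recorded instruction pointer."""
    counts = {}
    for ip in sample_ips:
        name = resolve(ip) or "[unknown]"
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    # Two samples recorded inside the matrix function, one inside the helper.
    print(attribute([0x1250, 0x1300, 0x1A10]))
```

With a mapping like this, a sample is charged to whatever instruction sits at the recorded IP. For non-precise events the recorded IP can lie some instructions past the one that triggered the counter, so the instruction type shown in the asm view need not match the event.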
I'm so sorry that we were unable to handle your question in a timely manner. We definitely should improve the support process. I will discuss this with management. I'm closing this case for now.