Intel VTune: anomalous assignment of the event UOPS_EXECUTED.X87

Bernard · ‎07-30-2020

Hello,

While analyzing a 5G model that is partly based on the Intel FlexRAN library, I observed a strange case: the count of the UOPS_EXECUTED.X87 event was attributed to the address range of a single function that does not contain any x87 instructions.

VTune worked in driverless mode, relying on the perf_events interface. The event was counted for both user-mode and kernel-mode code (the 'uk' option); the sampling period was the default, probably set by VTune. The collected call stack size was small, set to 1024 bytes. The call stacks were mostly not reconstructed, probably because the RBP register is used as a general-purpose register throughout the library code. There was only a single test run, with 90 microarchitectural events multiplexed.
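
A side note on the multiplexing: with 90 events sharing a small number of hardware counters, each event is only counted for part of the run, and the reported value is an extrapolation. The sketch below shows the usual perf_events style scaling using the time_enabled and time_running values; the function name and the numbers are hypothetical, and how VTune post-processes the counts in driverless mode is not confirmed here.

```python
def scale_multiplexed_count(raw_count: int, time_enabled: int, time_running: int) -> float:
    """Estimate the full-run count of a multiplexed event.

    raw_count    -- value read from the counter
    time_enabled -- time the event was enabled
    time_running -- time the event was actually scheduled on a counter
    """
    if time_running == 0:
        return 0.0  # the event never got a hardware counter
    return raw_count * time_enabled / time_running


# Hypothetical numbers: the event ran for 1/9 of the enabled time.
print(scale_multiplexed_count(1_000, time_enabled=90, time_running=10))
```

The smaller the time_running share, the larger the extrapolation error, so a single run with this many events can give noisy estimates for rarely occurring events.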

Here is the relevant screenshot

The aforementioned function, matrix_inv_cholesky_16_16, is part of the FlexRAN library and is implemented by manual vectorization with AVX512 intrinsics. The ICC-generated implementation was analyzed thoroughly, line by line, and did not contain any x87 instructions; besides that, there were no calls to Intel runtime libraries or to other functions. The matrix inversion function contains its own complex arithmetic and does not rely on the std::complex class, hence ICC did not insert an x87-based implementation of complex division.
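
The line-by-line check can also be automated. Below is a minimal sketch, assuming the disassembly is available as `objdump -d` style text (tab-separated address, bytes, instruction). It relies on a heuristic that is an assumption of this sketch: x87 mnemonics start with "f" (fld, fstp, fdivp, fxsave, ...), while AVX mnemonics start with "v". The sample listing is hypothetical and is not taken from FlexRAN.

```python
import re

# Heuristic (assumption): x87 mnemonics begin with "f"; SSE/AVX ones do not.
X87_MNEMONIC = re.compile(r"^f[a-z0-9]+$")
# Instruction prefixes that objdump may print before the mnemonic.
PREFIXES = {"lock", "rep", "repz", "repnz", "data16", "addr32", "notrack", "bnd"}


def find_x87(disassembly: str):
    """Return (address, instruction) pairs whose mnemonic looks like x87."""
    hits = []
    for line in disassembly.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:  # headers, labels, byte-continuation lines
            continue
        address = fields[0].strip().rstrip(":")
        instruction = fields[2].strip()
        tokens = [t for t in instruction.split() if t not in PREFIXES]
        if tokens and X87_MNEMONIC.match(tokens[0]):
            hits.append((address, instruction))
    return hits


# Hypothetical listing used only to exercise the function.
SAMPLE = (
    "  401000:\t62 f1 7c 48 59 c1    \tvmulps %zmm1,%zmm0,%zmm0\n"
    "  401006:\td9 45 f8             \tflds   -0x8(%rbp)\n"
    "  401009:\t48 01 c8             \tadd    %rcx,%rax\n"
)
print(find_x87(SAMPLE))
```

An empty result for the real function would support the manual finding, within the limits of the mnemonic heuristic.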

VTune was not able to reconstruct the call stack and, as seen in the picture above, a value of 254,915,064 x87 uops was assigned to the matrix_inv_cholesky_16_16 function.

I strongly suspect two potential reasons:

-- The x87 uops were issued by other functions in the call chain, and VTune, not being able to rebuild the call stack, assigned that count to the function at the top of the stack (the matrix inversion function).

-- The UOPS_EXECUTED.X87 setting enabled counting of both kernel-mode and user-mode events, and during the frequent transitions into the kernel part of the perf interface the x87 FXSAVE and FXRSTOR instructions, or rather their uop decompositions, were picked up by the event counter. The function was the largest hotspot, so many perf samples were probably taken during its execution, thus accumulating the x87 uops.
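
One way to test the second hypothesis would be to count the event separately for user mode and kernel mode. The sketch below only builds a `perf stat` command line that uses the `:u` and `:k` event modifiers; the helper name is hypothetical, and the symbolic event name `uops_executed.x87` is an assumption that depends on the event tables shipped with the installed perf version.

```python
import shlex


def perf_stat_command(pid: int, event: str = "uops_executed.x87", seconds: int = 10):
    """Build a perf stat command counting `event` in user and kernel mode separately."""
    return [
        "perf", "stat",
        "-e", f"{event}:u",   # user-mode only
        "-e", f"{event}:k",   # kernel-mode only
        "-p", str(pid),       # attach to an existing process
        "--", "sleep", str(seconds),  # measure for this many seconds
    ]


# Hypothetical PID, printed as a shell-ready string.
print(shlex.join(perf_stat_command(12345)))
```

If almost all of the count shows up under `:k`, that would point to kernel-side activity rather than to the vectorized user code.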

I hope that someone can provide some help in this case.

Thank you

ChithraJ_Intel · ‎08-03-2020

Hi,

Thanks for reaching out to us. Could you please share the details below so that we can analyze the issue in detail?

1) Give more information about your application.

2) More details about the platform and the VTune version you are using.

3) The commands/steps used to profile the application.

Regards,

Chithra

Bernard · ‎08-03-2020

Hello Chithra,

Thank you very much for replying to my post.

Tomorrow I will provide more details in response to your questions.

Thank you

Bernard

Bernard · ‎08-04-2020

Hi Chithra,

Please find the answers below

1) We test the performance of the L1 uplink module (i.e. PUCCH). This module mainly contains a control layer and a DSP layer. The DSP layer relies heavily on the Intel FlexRAN library for low-level vectorized DSP operations. The test application runs in a Docker container.

2)

a) Test machine OS version

Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019

b) Test machine CPU

Skylake 24-core CPU (Intel Scalable 2nd gen) [I do not have the exact CPU information because that server is not in working condition right now]

3) The application runs inside the Docker container, and VTune, using driverless collection mode, is attached to a specific process ID. The application is executed 100 times in order to generate a sufficient number of samples.

An additional technical description of the anomaly is provided in my original forum post.

For your reference, I uploaded 3 collection files: map.events, perfcmd and sep.events.

Thank you for your help

Bernard · ‎08-04-2020

I replied to your post and somehow my reply is awaiting moderation?

Very strange

Bernard · ‎08-04-2020

Hi Chithra,

My previous reply is probably being held for moderation because it included .txt files.

1) Our application is a simulation of the 5G L1 uplink layer PUCCH module. The application runs inside a Docker container and mainly contains data plane code, control plane code and DSP code. The DSP layer relies mainly on Intel's own FlexRAN library.

2) The platform is a server based on an Intel Scalable 2nd gen CPU (24-core CPU [probably Platinum 8268]); I currently do not have access to the machine. The VTune version (installed locally) is vtune_profiler_2020.0.0.605129.

3) The collector type is the perf driverless collector. We run our simulation in a Docker container and VTune is attached to a specific PID. The analysis type is "runsa", with close to 90 performance events being collected. We use the VTune GUI to create the microarchitectural session and run it. I have attached 3 relevant files: map.events, perfcmd and sep.events. If you are interested, I can also provide the data.perf file.

Thank you

Bernard

ChithraJ_Intel · ‎08-05-2020

Hi Bernard,

Thanks for sharing the details. We are forwarding your case to a Subject Matter Expert. They will get back to you soon on this.

Regards,

Chithra

Bernard · ‎08-05-2020

Hi Chithra,

Thank you for forwarding this issue to the VTune SME.

The case of this anomalous assignment is quite perplexing.

--Bernard

Bernard · ‎08-10-2020

Hi Chithra,

Do you have any updates regarding this issue?

Thank you

--Bernard

DMITRY_T_Intel · ‎06-23-2021

Hi Bernard,

Could you please tell me: is this issue still relevant? Would you like to restart the investigation?

Thank you very much!

Bernard · ‎06-28-2021

Hello Dmitry,

Although more than 10 months have passed since I asked for help -- this issue is still relevant.

Yes, I would like to restart the investigation.

Best regards

Bernard

Dmitry_R_Intel1 · ‎06-28-2021

Hi Bernard,

Could you please attach a screenshot of the VTune assembly view for the matrix_inv_... function, so that we can see which instructions the UOPS_EXECUTED.X87 samples fall on?

If you prefer to send the data privately, feel free to contact me by mail: dmitry.ryabtsev@intel.com

Bernard · ‎06-29-2021

Hi Dmitry,

Unfortunately, I was not able to find the old version of our library containing the aforementioned matrix_inv_cholesky_16_16 function.

As far as I remember, the x87 samples were mapped to non-x87 instructions.

Recently I have experienced similar anomalous behaviour of the VTune mapping, where samples of AVX512 packed single-precision arithmetic instructions were mapped to seemingly random x86 scalar integer instructions. I suppose that VTune lacks a heuristic to deal with this kind of issue, and that the profiler simply maps the sample IP (RIP value) to the resolved binary address space.
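
To illustrate what I mean by a plain IP-to-address mapping, here is a minimal sketch, assuming the profiler attributes a sample to whichever instruction's address range contains the recorded IP. With non-precise event-based sampling the recorded IP can "skid" past the instruction that triggered the counter overflow, so the sample lands on a later, unrelated instruction. The addresses and instructions are hypothetical, and this is not VTune's actual implementation.

```python
from bisect import bisect_right

# Hypothetical instruction start addresses and their text.
INSTRUCTIONS = [
    (0x401000, "vmulps %zmm1,%zmm0,%zmm0"),
    (0x401006, "add    %rcx,%rax"),
    (0x401009, "cmp    %rdx,%rax"),
]
STARTS = [addr for addr, _ in INSTRUCTIONS]


def attribute(ip: int) -> str:
    """Map a sampled IP to the instruction whose address range contains it."""
    index = bisect_right(STARTS, ip) - 1
    if index < 0:
        raise ValueError("IP is below the first known instruction")
    return INSTRUCTIONS[index][1]


# The vector multiply at 0x401000 triggers the event, but because of skid
# the recorded IP is the start of the next instruction.
print(attribute(0x401000))  # where the event actually happened
print(attribute(0x401006))  # where the sample gets attributed
```

Without a correction step, the count for the vector instruction ends up on the scalar integer instruction that follows it.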

If you want, I can privately share screenshots of that new anomaly, i.e. the AVX512 packed float instruction mappings.

Best regards

Bernard

Bernard · ‎06-30-2021

@Dmitry_R_Intel1

I have contacted you by sending a private message.

Thank you

Bernard

Bernard · ‎07-02-2021

@DMITRY_T_Intel

This issue can be closed.

It is very sad that help from a Subject Matter Expert arrived only after 10 months of waiting.

DMITRY_T_Intel · ‎07-02-2021

Hi Bernard,

I'm so sorry that we were unable to handle your question in a timely manner. We definitely should improve the support process. I will discuss this with management. I'm closing this case for now.

Thank you!
