Black Belt

Intel VTune: anomalous assignment of the event UOPS_EXECUTED.X87

Hello,

While analyzing a 5G model that is partly based on the Intel FlexRAN library, I observed a strange case: the count of the UOPS_EXECUTED.X87 event was assigned to the address range of a single function that does not contain any x87 instructions.

VTune ran in driverless mode, relying on the perf_events interface. The event was counted in both user mode and kernel mode (the 'uk' modifier), and the sampling period was left at the default, probably set by VTune. The collected call stack size was small, set to 1024 bytes. The call stacks were mostly not reconstructed, probably because the library code uses the RBP register as a general-purpose register throughout. There was only a single test run, with 90 microarchitectural events multiplexed.
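As background for the multiplexing concern: when many events share a few hardware counters, each event is scheduled for only part of the run, and perf_events reports time_enabled and time_running so that a raw count can be extrapolated as raw * time_enabled / time_running. Below is a minimal sketch of that arithmetic combined with the usual samples-times-period estimate. Whether VTune applies exactly this scaling to its sampled data is an assumption, and the sample count, the period of 2,000,003 and the 4-of-90 scheduling ratio are hypothetical values, not taken from the collected data.

```python
def scale_multiplexed_count(raw_count, time_enabled, time_running):
    """Extrapolate a count for an event that was scheduled only part of the time.

    Uses the scaling perf_events exposes for multiplexed events:
    raw_count * time_enabled / time_running.
    """
    if time_running <= 0:
        raise ValueError("event was never scheduled; no estimate is possible")
    return raw_count * time_enabled / time_running


def sampled_event_estimate(samples, period, time_enabled, time_running):
    """Estimate an event count from sampling: each sample stands for `period` events."""
    return scale_multiplexed_count(samples * period, time_enabled, time_running)


# Hypothetical inputs (assumptions, not measured values):
# 10 samples, a period of 2,000,003, and the event scheduled 4 time units out of 90.
estimate = sampled_event_estimate(samples=10, period=2_000_003,
                                  time_enabled=90, time_running=4)
print(f"estimated count: {estimate:,.0f}")
```

Under these assumed numbers, ten samples are extrapolated to roughly 450 million events, so a reported value of this magnitude does not by itself imply a large number of x87 samples in the function.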

Here is the relevant screenshot

iliyapolak_0-1596109428399.png

The function in question, matrix_inv_cholesky_16_16, is part of the FlexRAN library and is manually vectorized with AVX-512 intrinsics. I analyzed the ICC-generated code line by line: it contains no x87 instructions and no calls to Intel runtime libraries or to any other functions. The matrix inversion function implements its own complex arithmetic without relying on the std::complex class, so ICC did not insert an x87-based implementation of complex division.

VTune was not able to reconstruct the call stack and, as seen in the screenshot above, a value of 254,915,064 x87 uops was assigned to the matrix_inv_cholesky_16_16 function.

I strongly suspect two potential reasons:

-- The x87 uops were issued by other functions in the call chain and VTune, not being able to rebuild the call stack, assigned the count to the function at the top of the stack (the matrix inversion function).

-- The UOPS_EXECUTED.X87 setting enabled counting of both kernel-mode and user-mode events, and during the frequent transitions into the kernel part of the perf interface the x87 FXSAVE and FXRSTOR instructions, or rather their uop decomposition, were picked up by the event counter. The function was the largest hotspot, so many perf samples were probably taken during its execution, thus accumulating the x87 uops.
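One way to tell the two hypotheses apart would be to count the event separately in user mode and in kernel mode, and to record call stacks with a larger stack dump. The sketch below only builds and prints the corresponding perf command lines; it does not run them. The raw encoding (event 0xB1, umask 0x10) is my assumption for UOPS_EXECUTED.X87 on this CPU family and should be checked against the event list for the exact model, and the PID 1234, the 10 second duration and the 16384-byte stack dump are placeholder values.

```python
import shlex

# Assumed raw encoding of UOPS_EXECUTED.X87; verify for the exact CPU model.
X87_EVENT = 0xB1
X87_UMASK = 0x10


def raw_event(event, umask, modifier):
    """Build a perf raw event spec with a privilege modifier ('u', 'k' or 'uk')."""
    if modifier not in ("u", "k", "uk"):
        raise ValueError(f"unsupported modifier: {modifier!r}")
    return f"cpu/event={event:#x},umask={umask:#x}/{modifier}"


def perf_stat_cmd(pid, modifier, seconds=10):
    """Count the event on a running process for a fixed time, in one privilege mode."""
    return ["perf", "stat", "-e", raw_event(X87_EVENT, X87_UMASK, modifier),
            "-p", str(pid), "--", "sleep", str(seconds)]


def perf_record_cmd(pid, stack_bytes=16384, seconds=10):
    """Sample user-mode occurrences with DWARF unwinding and a larger stack dump."""
    return ["perf", "record", "-e", raw_event(X87_EVENT, X87_UMASK, "u"),
            "--call-graph", f"dwarf,{stack_bytes}",
            "-p", str(pid), "--", "sleep", str(seconds)]


PID = 1234  # placeholder PID of the process inside the container
for cmd in (perf_stat_cmd(PID, "u"), perf_stat_cmd(PID, "k"), perf_record_cmd(PID)):
    print(shlex.join(cmd))
```

If the user-mode count is close to zero while the kernel-mode count is large, that would point to the second explanation; if user-mode x87 uops remain, the DWARF call stacks may show which caller actually issues them, since DWARF unwinding does not depend on RBP being used as a frame pointer.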

I hope someone can help with this case.

 Thank you

 

 

 

Moderator

Hi,


Thanks for reaching out to us. Could you please share the details below so that we can analyze the issue more closely?


1) More information about your application.

2) More details about the platform and the VTune version you are using.

3) The commands/steps used to profile the application.


Regards,

Chithra


Black Belt

Hello Chithra,

Thank you very much for replying to my post.

Tomorrow I will provide more details in response to your questions.

 

Thank you

Bernard

 

Black Belt

 

Hi Chithra,

Please find the answers below.

1) We test the performance of the L1 uplink module (i.e. PUCCH). This module consists mainly of a control layer and a DSP layer. The DSP layer relies heavily on the Intel FlexRAN library for low-level vectorized DSP operations. The test application runs in a Docker container.

 

2)

 a) Test machine OS version   

Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019
 

  b) Test machine CPU

 CPU: 24-core Skylake (Intel Xeon Scalable, 2nd gen) [I do not have the exact CPU information because that server is currently not in working condition]

 3) The application runs inside the Docker container, and VTune, using driverless collection mode, is attached to a specific process ID. The application is executed 100 times in order to generate a sufficient number of samples.

 An additional technical description of the anomaly is given in my original forum post.

 For your interest, I uploaded three collection files: map.events, perfcmd and sep.events

Thank you for your help

 

 

 

Black Belt

I replied to your post, and for some reason my reply is awaiting moderation.

Very strange.

Black Belt

 

Hi Chithra,

My previous reply was probably held for moderation because it included .txt files.

1) Our application is a simulation of the 5G L1 uplink PUCCH module. It runs inside a Docker container and consists mainly of data plane code, control plane code and DSP code. The DSP layer relies mainly on Intel's own FlexRAN library.

2) The platform is a server based on a 2nd gen Intel Xeon Scalable CPU (24 cores, probably a Platinum 8268); I currently do not have access to the machine. The locally installed VTune version is vtune_profiler_2020.0.0.605129.

3) The collector is the driverless perf collector. We run our simulation in a Docker container and VTune is attached to a specific PID. The analysis type is "runsa", with close to 90 performance events being collected. We use the VTune GUI to create the microarchitectural session and run it. I have attached three relevant files: map.events, perfcmd and sep.events. If you are interested, I can also provide the data.perf file.

Thank you

Bernard

 

 

Moderator

Hi Bernard,


Thanks for sharing the details. We are forwarding your case to a Subject Matter Expert, who will get back to you soon.


Regards,

Chithra



Black Belt

Hi Chithra,

Thank you for forwarding this issue to the VTune SME.

This anomalous assignment is quite perplexing.

 

--Bernard

Black Belt

Hi Chithra,

Do you have any updates regarding this issue?

Thank you

--Bernard
