
Intel VTune: anomalous assignment of the event UOPS_EXECUTED.X87

Bernard
Valued Contributor I

Hello,

During the analysis of a 5G model that is partly based on the Intel FlexRAN library, I observed a strange case: the count of the UOPS_EXECUTED.X87 event was assigned to the address range of a single function that does not contain any x87 instructions.

VTune worked in driverless mode, relying on the perf_events interface. The event was counted for both user-mode and kernel-mode code (the 'uk' modifier), and the sampling period was the default, probably set by VTune. The collected call stack size was small, set to 1024 bytes. The call stack was mostly not reconstructed, probably because the RBP register is used as a general-purpose register throughout the library code. There was only a single test run, with 90 microarchitectural events multiplexed.
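
With about 90 events multiplexed onto a small number of hardware counters, each event is only counted for a fraction of the run, and the total is extrapolated. The following is a minimal sketch of that extrapolation, assuming the usual perf_events scaling (raw count times time_enabled / time_running). Whether VTune applies exactly this formula is an assumption, and the numbers are illustrative, not taken from this collection.

```python
def scaled_count(raw_count, time_enabled_ns, time_running_ns):
    """Extrapolate a multiplexed counter value to the whole measurement window."""
    if time_running_ns == 0:
        return 0.0  # the event was never scheduled on a hardware counter
    return raw_count * time_enabled_ns / time_running_ns


def estimated_events(num_samples, sample_period, time_enabled_ns, time_running_ns):
    """Estimate the total event count from the number of samples taken."""
    raw = num_samples * sample_period  # each sample stands for one full period
    return scaled_count(raw, time_enabled_ns, time_running_ns)


# Illustrative values only: the event was on a counter for 1/20 of the run.
estimate = estimated_events(
    num_samples=6,
    sample_period=2_000_003,
    time_enabled_ns=20_000_000,
    time_running_ns=1_000_000,
)
print(f"{estimate:,.0f}")
```

Under these assumptions a handful of samples is enough to produce a count in the hundreds of millions, so a large reported value does not by itself mean that many samples hit the function.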

Here is the relevant screenshot

iliyapolak_0-1596109428399.png

The aforementioned function, matrix_inv_cholesky_16_16, is part of the FlexRAN library and is manually vectorized with AVX512 intrinsics. The ICC-generated implementation was analyzed line by line and did not contain any x87 instructions; besides that, no calls to Intel runtime libraries or functions were present. The matrix inversion function contains its own complex arithmetic without relying on the std::complex class, hence ICC did not insert an x87-based implementation of complex division.

VTune was not able to reconstruct the call stack and, as seen in the picture above, the value of 254,915,064 x87 uops was assigned to the matrix_inv_cholesky_16_16 function.

I strongly suspect two potential reasons:

-- The aforementioned x87 uops were issued by other functions in the call chain, and VTune, not being able to rebuild the call stack, assigned that count to the function at the top of the stack (the matrix inversion function).

-- The UOPS_EXECUTED.X87 setting enabled accumulation of both kernel-mode and user-mode events, and during the frequent syscall communication with the kernel part of the perf interface, the x87 FXSAVE and FXRSTOR instructions, or rather their uop decomposition, were collected by the event counter. The aforementioned function was the largest hotspot, and probably many perf samples were taken during its execution, thus accumulating the x87 uops.
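
The second hypothesis can be checked against the raw samples. On x86-64 Linux, kernel code lives in the upper canonical half of the address space, so each sample IP can be classified as user or kernel. Below is a minimal sketch, assuming the samples have already been exported as (ip, event_name) pairs; the export step and the sample values are hypothetical.

```python
from collections import Counter

# Start of the upper canonical half of the x86-64 address space (kernel side on Linux).
KERNEL_SPACE_START = 0xFFFF_8000_0000_0000


def split_by_mode(samples, event):
    """Count the samples of one event by the mode their IP belongs to."""
    counts = Counter()
    for ip, name in samples:
        if name != event:
            continue
        counts["kernel" if ip >= KERNEL_SPACE_START else "user"] += 1
    return counts


# Hypothetical samples, for illustration only.
samples = [
    (0x0000_7F3A_1C20_4A10, "UOPS_EXECUTED.X87"),
    (0xFFFF_FFFF_8102_3456, "UOPS_EXECUTED.X87"),
    (0x0000_7F3A_1C20_4A58, "UOPS_EXECUTED.X87"),
    (0x0000_7F3A_1C20_4A10, "CPU_CLK_UNHALTED.THREAD"),
]
print(dict(split_by_mode(samples, "UOPS_EXECUTED.X87")))
```

If nearly all x87 samples carry user-space IPs, kernel-side FXSAVE/FXRSTOR is an unlikely explanation. Repeating the collection with the event restricted to user mode (the 'u' modifier alone) would be another way to test it.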

I hope that someone can help with this case.

 Thank you

 

 

 

15 Replies
ChithraJ_Intel
Moderator

Hi,


Thanks for reaching out to us. Could you please share the details below so that we can analyze the issue in detail?


1) More information about your application.

2) More details about the platform and the VTune version you are using.

3) The commands/steps used to profile the application.


Regards,

Chithra


Bernard
Valued Contributor I

Hello Chithra,

Thank you very much for replying to my post.

Tomorrow I will provide more details in response to your questions.

 

Thank you

Bernard

 

Bernard
Valued Contributor I

 

Hi Chithra,

Please find the answers below

1) We test the performance of the L1 uplink module (i.e. PUCCH). This module mainly contains a control layer and a DSP layer. The DSP layer relies heavily on the Intel FlexRAN library for carrying out low-level vectorized DSP operations. The test application runs in a Docker container.

 

2)

 a) Test machine OS version   

Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019
 

  b) Test machine CPU

 24-core Skylake CPU (Intel Xeon Scalable, 2nd gen) [I do not have the exact CPU information because that server is not in working condition now]

 3) The application runs inside the Docker container, and VTune, using the driverless collection mode, is attached to a specific process ID. The application is executed 100 times in order to generate a sufficient number of samples.

 An additional technical description of the anomaly was provided in my original forum post.

 For your reference, I uploaded three collection files: map.events, perfcmd and sep.events.

Thank you for your help

 

 

 

Bernard
Valued Contributor I

I replied to your post, and somehow my reply is awaiting moderation?

Very strange

Bernard
Valued Contributor I

 

Hi Chithra,

My previous reply is probably being held for moderation because it included .txt files.

1) Our application is a simulation of the 5G L1 uplink layer PUCCH module. The application runs inside a Docker container and mainly contains data plane code, control plane code and DSP code. The DSP layer relies mainly on Intel's own FlexRAN library.

2) The platform is a server based on a 2nd gen Intel Xeon Scalable CPU (24-core CPU [probably Platinum 8268]); currently I do not have access to the machine. The VTune version (installed locally) is vtune_profiler_2020.0.0.605129.

3) The collector type is the perf driverless collector. We run our simulation in a Docker container, and VTune is attached to a specific PID. The analysis type is "runsa", with close to 90 performance events being collected. We use the VTune GUI to create the microarchitectural session and run it. I have attached three relevant files: map.events, perfcmd and sep.events. If you are interested, I can provide a data.perf file.

Thank you

Bernard

 

 

ChithraJ_Intel
Moderator

Hi Bernard,


Thanks for sharing the details. We are forwarding your case to a Subject Matter Expert, who will get back to you soon.


Regards,

Chithra



Bernard
Valued Contributor I

Hi Chithra,

Thank you for forwarding this issue to the VTune SME.

This case of anomalous assignment is quite perplexing.

 

--Bernard

Bernard
Valued Contributor I

Hi Chithra,

Do you have any updates regarding this issue?

Thank you

--Bernard

DMITRY_T_Intel
Employee

Hi Bernard,

Could you please tell me whether this issue is still relevant? Would you like to restart the investigation?

Thank you very much!


Bernard
Valued Contributor I

Hello Dmitry,

Although more than 10 months have passed since I asked for help, this issue is still relevant.

Yes, I would like to restart the investigation.

 

Best regards

Bernard

       

Dmitry_R_Intel1
Employee

Hi Bernard,

Could you please attach a screenshot of the VTune asm view for the matrix_inv_... function so that we can see which instructions the UOPS_EXECUTED.X87 samples fall on?

If you prefer to send the data privately, feel free to contact me by mail: dmitry.ryabtsev@intel.com

Bernard
Valued Contributor I

Hi Dmitry,

 

Unfortunately, I was not able to find the old version of our library containing the aforementioned matrix_inv_cholesky_16_16 function.

As far as I can vaguely recall, the x87 samples were mapped to non-x87 instructions.

Recently I have experienced similar anomalous behaviour in the VTune mapping, where samples of AVX512 packed single-precision vector arithmetic instructions were mapped to seemingly random x86 scalar integer instructions. I suppose that VTune lacks a heuristic to deal with this kind of issue, and the profiler simply maps the sample IP (RIP value) onto the resolved binary address space.
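
The mapping described here amounts to a range lookup: the sampled RIP is matched against the symbol whose address range contains it. Because the recorded RIP can trail the instruction that actually triggered the counter overflow (skid), a sample may land on a later, unrelated instruction or even in another function. Below is a minimal sketch of such a lookup; the symbol table, the neighbouring function names and all addresses are made up for illustration and do not describe how VTune is implemented.

```python
import bisect

# Hypothetical symbol table: (start_address, size, name), sorted by start address.
SYMBOLS = [
    (0x401000, 0x0200, "pucch_control"),
    (0x401200, 0x1800, "matrix_inv_cholesky_16_16"),
    (0x402A00, 0x0100, "x87_caller"),
]
STARTS = [start for start, _, _ in SYMBOLS]


def resolve(ip):
    """Return the name of the symbol whose address range contains ip, or None."""
    i = bisect.bisect_right(STARTS, ip) - 1  # last symbol starting at or before ip
    if i < 0:
        return None
    start, size, name = SYMBOLS[i]
    return name if ip < start + size else None


trigger_ip = 0x402A40   # instruction that actually executed the counted uop
reported_ip = 0x401204  # IP recorded in the sample after skid (illustrative)

print(resolve(trigger_ip))
print(resolve(reported_ip))
```

In this made-up case the uop is executed in x87_caller, but the sample is charged to matrix_inv_cholesky_16_16 because only the reported IP is looked up.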

 

If you want, I can privately share the screenshots of that new anomaly, i.e. the mappings of the AVX512 packed float vector instructions.

 

Best regards

 

Bernard

 

 

 

Bernard
Valued Contributor I

@Dmitry_R_Intel1 

 

I have contacted you by sending a private message.

 

Thank you

Bernard

Bernard
Valued Contributor I

@DMITRY_T_Intel 

 

This issue can be closed.

It is very sad that help from a Subject Matter Expert arrived only after 10 months of waiting.

DMITRY_T_Intel
Employee

Hi Bernard,

I'm so sorry that we were unable to handle your question in a timely manner. We definitely should improve the support process. I will discuss this with management. I'm closing this case for now.

Thank you!

