vectorization usage 0?

David_M_17
Novice

I collected a VTune general exploration analysis on Intel Xeon Phi (KNC). I expected to see data in the vectorization usage column, but the column is zero. My program calls MKL dgemm with aligned matrices, so I am sure vector instructions are executed. Why does VTune show zero for vectorization usage? I understood that KNC is prone to overcounting vector instructions, not undercounting. I tried to attach a screenshot, but the 653KB file seems to be too large for this web interface. If you give me an email address, I can email it to you.
1 Solution
Dmitry_P_Intel1
Employee

Hello David,

Please try adding:

-knob enable-vpu-metrics=true

to the general exploration command line that you use, so that you end up with something like:

amplxe-cl -collect general-exploration -target-system=mic-native:mic0 -knob enable-vpu-metrics=true -- /home/drm/runme

The VPU metrics are switched off by default because on KNC we can collect only 2 events simultaneously. With more events we need to multiplex, and in some cases this can hurt the statistical representativeness of the results. So we chose to collect CPI and general cache usage by default.
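To illustrate why multiplexing can distort results, here is a minimal sketch of generic time-based scaling, the approach tools such as Linux perf use for multiplexed counters. It is not VTune's actual implementation; the function name and the sample numbers are hypothetical.

```python
def scale_multiplexed_count(raw_count, time_running, time_enabled):
    """Extrapolate a count observed for only part of the run.

    raw_count    -- events counted while the event owned a hardware counter
    time_running -- time the event was actually being counted
    time_enabled -- total time the event was scheduled for collection
    """
    if time_running == 0:
        return 0.0  # the event never got a counter, so nothing can be estimated
    return raw_count * time_enabled / time_running

# Hypothetical case: 4 events share 2 counters, so each is counted
# for roughly half of a 1-second run.
estimate = scale_multiplexed_count(500_000, time_running=0.5, time_enabled=1.0)
print(estimate)
```

The estimate assumes the program behaves the same while the event is not being counted. A short vectorized phase that falls entirely into the uncounted half would be missed, which is the kind of representativeness problem described above.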

Also, please note that the VPU_INSTRUCTIONS_EXECUTED and VPU_ELEMENTS_ACTIVE events that we use to calculate the vectorization intensity metric count not only instructions that perform floating-point operations but also instructions that load vector registers from memory, store them to memory, and so on. The metric therefore gives you only an "upper bound" on vectorization efficiency: if it is low, you can say that you have an inefficiency; if it is good, that does not necessarily mean the vectorization is really good.
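As a rough sketch of how the metric can be read, the snippet below computes vectorization intensity as VPU_ELEMENTS_ACTIVE divided by VPU_INSTRUCTIONS_EXECUTED and relates it to the number of lanes in a 512-bit KNC vector register (8 for double precision, 16 for single precision). The function names and the sample counts are made up for illustration.

```python
# Lanes in one 512-bit KNC vector register.
LANES = {"double": 8, "single": 16}

def vectorization_intensity(elements_active, instructions_executed):
    """Average number of active vector elements per VPU instruction."""
    if instructions_executed == 0:
        return 0.0  # no VPU instructions were counted
    return elements_active / instructions_executed

def lane_utilization(intensity, precision="double"):
    """Fraction of the available lanes used on average (an upper bound)."""
    return intensity / LANES[precision]

# Hypothetical event counts for a double-precision kernel.
intensity = vectorization_intensity(6_400_000, 1_000_000)
print(f"intensity: {intensity:.2f}, lanes used: {lane_utilization(intensity):.0%}")
```

Because loads and stores of vector registers are included in both events, a high value here does not by itself show that the floating-point work is well vectorized.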

Starting with VTune 2017 Beta (and in the upcoming 2016 U3), we added the ability to configure and get a command line for "Arbitrary targets", that is, targets that you have no direct connection to when you configure the analysis in the GUI. In this case you cannot launch the collection from the GUI, but you can generate a command line to copy and paste to the target and run there.

Thanks & Regards, Dmitry

 

4 Replies
Dmitry_P_Intel1
Employee

Hello David,

Could you please tell us which analysis type you are using?

Thanks & Regards, Dmitry

 

David_M_17
Novice

amplxe-cl -collect general-exploration -target-system=mic-native:mic0 -- /home/drm/runme

runme is a script that contains:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/drm/.

./t1.exe

 

t1.exe is the binary to examine. All the functions of t1.exe appear in the results, as do the cache events. Vectorization usage remains zero for all functions, though.

I have to collect from the command line and then view the results on a remote system. The remote system supports the GUI but doesn't have a Xeon Phi card, so I am unable to use the GUI to create the command line to copy.

-David

 

David_M_17
Novice

Well, I needed to move forward, so I set up custom analysis types for the associated events and then exported the results to CSV to apply my own formula for vectorization usage. It seems the events for vectorization usage are not collected with general exploration on Xeon Phi. I find it odd that the general exploration display has a column for vectorization usage if general exploration doesn't collect the events to populate it. The presence of the column implies something that isn't there. Either the column should be removed from general exploration or the events should be collected. Cheers.
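A minimal sketch of this kind of CSV post-processing, assuming the formula is the ratio VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED per function. The column headers and sample rows are hypothetical; a real VTune export will use different headers, so the keys would need adjusting.

```python
import csv
import io

# Hypothetical export; real VTune CSV headers and values will differ.
SAMPLE_CSV = """Function,VPU_ELEMENTS_ACTIVE,VPU_INSTRUCTIONS_EXECUTED
dgemm_kernel,7600000,1000000
init_matrices,300000,200000
main,0,0
"""

def intensity_by_function(csv_text):
    """Map each function to active elements per executed VPU instruction."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        active = int(row["VPU_ELEMENTS_ACTIVE"])
        executed = int(row["VPU_INSTRUCTIONS_EXECUTED"])
        # No VPU instructions counted: report None instead of dividing by zero.
        result[row["Function"]] = active / executed if executed else None
    return result

for name, value in intensity_by_function(SAMPLE_CSV).items():
    print(name, "n/a" if value is None else f"{value:.2f}")
```

To read an actual export, the text of the file can be passed in place of SAMPLE_CSV.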
