Hi,
I'm starting to learn how to analyze OpenMP (OMP) projects, and I've found that Advanced Hotspots (AH) in VTune gives me CPI. I'm trying to better understand what is displayed.
For argument's sake, if I've set the number of threads to 4 on a 4-core machine and the CPI displayed is 1, is this 1 for each core or for the entire machine? That is, if the machine advanced X cycles, has each core executed X instructions or X/4?
Thanks,
Arik
Hi,
The CPI metric is calculated per item in the result grid. You can observe it, for example, per function for each thread or CPU core just by selecting the appropriate grouping at the top of the grid in the Bottom-up view. If you have the Function/Call Stack grouping (the default), then CPI is calculated per function across all CPUs.
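To illustrate what "per item in the grid" means, here is a minimal sketch that assumes each grid row sums clockticks and instructions over the samples in its group and then divides the sums. The sample rows, the function names `match` and `parse`, and the helper `cpi_by` are all made up for illustration; this is not VTune output or a VTune API.

```python
from collections import defaultdict

# Hypothetical samples: (function, thread, clockticks, instructions).
# The numbers are invented for illustration only.
ROWS = [
    ("match", "T0", 400, 400),
    ("match", "T1", 600, 300),
    ("parse", "T0", 100, 200),
    ("parse", "T1", 100, 100),
]


def cpi_by(rows, key_index):
    """Sum clockticks and instructions per group, then divide the sums."""
    ticks = defaultdict(int)
    instrs = defaultdict(int)
    for row in rows:
        key = row[key_index]
        ticks[key] += row[2]
        instrs[key] += row[3]
    return {key: ticks[key] / instrs[key] for key in ticks}


per_function = cpi_by(ROWS, 0)  # grouping by function: all threads combined
per_thread = cpi_by(ROWS, 1)    # grouping by thread: all functions combined
print(per_function)
print(per_thread)
```

The same samples give different CPI values depending on the grouping, which is why the number in the grid only makes sense together with the grouping that produced it.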
Hello Arik,
In the Summary, CPI is computed over the whole workload: all the cycles consumed by your application (on any core) divided by the number of instructions your application executed (on any core). From CPI alone you cannot determine how many instructions were executed per core. But if, in your example, we assume that the cores did equal work with the same efficiency, then each core executed X/4 instructions when the cumulative number of clockticks is X and CPI = 1, as far as I understand.
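As a rough worked example of that arithmetic, the sketch below uses invented per-core counters with a cumulative X = 1000 clockticks. The counter values and the helper `workload_cpi` are assumptions for illustration, not something VTune exposes in this form.

```python
def workload_cpi(per_core):
    """Whole-workload CPI: total clockticks over total instructions."""
    total_ticks = sum(ticks for ticks, _ in per_core)
    total_instr = sum(instr for _, instr in per_core)
    return total_ticks / total_instr


# Hypothetical (clockticks, instructions) per core, 1000 clockticks in total.
balanced = [(250, 250)] * 4  # equal work, equal efficiency
skewed = [(400, 700), (400, 100), (100, 100), (100, 100)]

print(workload_cpi(balanced))  # each core executed 1000 / 4 = 250 instructions
print(workload_cpi(skewed))    # same aggregate CPI, very different per-core work
```

Both distributions produce an aggregate CPI of 1, so the X/4 figure holds only under the equal-work assumption. Note that X here is the cumulative clocktick count: if all four cores are busy for the same wall-clock interval, X is four times the cycles elapsed in that interval.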
Thanks & Regards, Dmitry
ok, thank you both!
Arik,
By the way, if you use Intel OpenMP, I would highly recommend trying the HPC Performance Characterization analysis to look at OpenMP efficiency metrics such as serial time vs. parallel time, imbalance, different kinds of overhead, etc.
Thanks & Regards, Dmitry
@Dmitry
Is the HPC analysis part of VTune or a separate tool?
Also, I'm still trying to work out how to use the Intel omp.h rather than the Microsoft omp.h, as the program is being compiled in VS2015.
Arik,
The HPC Performance Characterization analysis has been part of VTune since VTune Amplifier XE 2016 Update 3, and it is also available in VTune Amplifier 2017 Beta and Beta Update 1.
On the command line, you will need to run something like this:
>amplxe-cl -collect hpc-performance -data-limit=0 -r <my_result_dir> <my_app>
In the GUI, the analysis is available in the analysis tree as "HPC Performance Characterization".
Thanks & Regards, Dmitry
I don't think I'm using Update 3, because that analysis doesn't exist for me. I am using the 2016 edition, though. I'll try to update.
EDIT: I am using Update 3. My build: Update 3 (build 464096)
So is there any other reason that option doesn't exist?
Hello,
The analysis should work with 2016 U3. Let us know if you encounter any problems.
Thanks & Regards, Dmitry
Thanks for the speedy reply.
I tried running it from the command line and got:
amplxe: Fatal error: Cannot find analysis type. Check input parameters or reinstall the product. Available analisis types:
hotspots
advanced-hotspots
concurrency
locksandwaits
general-exploration
bandwidth
memory-access
tsx-exploration
tsx-hotspots
sgx-hotspots
cpugpu-concurrency
system-overview
gpu-hotspots
disk-io
My cmd:
"C:\Program Files (x86)\IntelSWTools\VTune Amplifier 2016 for Systems\bin64\amplxe-cl" -collect hpc-performance -data-limit=0 -r c:\work\arinberg -- C:\work\arinberg\Kernels\patternMatching\PatternMatching\Debug\PatternMatching.exe C:\work\arinberg\Kernels\patternMatching\Group6 C:\work\arinberg\Kernels\patternMatching\wk1.tcpdump 8 128 1
OK, I see the point of confusion. There are two VTune products: one is "for Systems" and the other is "XE". HPC Performance Characterization was enabled in VTune Amplifier XE, and 464096 is U3 for Systems, if I'm not mistaken.
As far as I know, HPC Performance Characterization will be added to VTune Amplifier for Systems only in 2017 Gold.
Anyway, you can also find the OpenMP efficiency metrics (they work for Intel OpenMP) in Advanced Hotspots, as a special section in the Summary and in the Bottom-up grid if you choose the /OpenMP Regions/.. grouping.
You can also read the following topic on this: https://software.intel.com/en-us/node/544172
Thanks & Regards, Dmitry
I've got XE now, and the HPC analysis is available. Thanks!
Would there be a problem with the OMP analysis if I'm compiling in VS 2015 using the Microsoft compiler?
The Microsoft compiler doesn't insert the specific identification of OpenMP regions that ICL does. You could run against the Intel libiomp5 in place of the Microsoft OpenMP library. You may need to do this anyway in order to set affinity for repeatable results in VTune.
