I'm using Intel VTune Amplifier XE 2013 to profile a parallel program running on a multicore CPU; specifically, it is written in OpenCL and executed on a Xeon Phi. I would like to know exactly how to interpret the results reported by VTune, i.e.,
- Is the value of a performance counter collected for a single thread or for the whole core? (Assume there are many cores in a CPU and many threads can execute concurrently on one core, as is the case on Xeon Phi.)
- How does VTune sample on a multicore CPU? Does it sample on a single core and report that, or sample on many cores and take the average?
Thanks for replying. I take it that, for the second question, you mean VTune spawns a worker on each processor and reports the average of the results from all workers. However, I don't understand what you mean by the current and active logical CPU in the group. In the usual sense I would take that to be the processor on which the host program is executed. In my case, an OpenCL program is executed on two processors: the host program runs on the host processor under the host operating system (Linux, on a Xeon-based processor; this part of the program is sequential, is not of concern, and is not profiled), and the main computation runs in parallel on a Xeon Phi coprocessor. The physical (and also logical) cores in Xeon Phi are treated equally (I might be wrong on this point), so I don't think "active processor" is the right term here.
I think the tool collects performance data (samples) on each logical core, so every sample contains: instruction address, pid, tid, mid, and cpu id. The user can see averaged data in the Summary report, and can also view reports such as "thread\function\callstack" or "process\function\callstack" in the Bottom-up report by changing the "Grouping" item, because all the data is already there. For an event-based sampling result, the user can see results categorized by "Pack\cpu\process\..." in the report.
If your sub-task runs on the Xeon Phi(TM) coprocessor, you may use "-collect knc-lightweight-hotspots" to collect performance data from the coprocessor. Note that other tasks running on the host will not be profiled.
If you use the regular "-collect lightweight-hotspots", then all performance data comes from the host, not from the coprocessor. Yes, you have to do the two collections in two separate sessions.
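To illustrate why one set of per-core samples can back several report views, here is a minimal Python sketch. It assumes each sample is a record carrying cpu, pid, tid, and function fields; the record layout, the field names, and the `group_samples` helper are hypothetical and are not VTune's actual data format or API.

```python
from collections import Counter

# Hypothetical sample records. The field names are assumptions made for
# illustration only; they do not reflect VTune's real result format.
samples = [
    {"cpu": 0, "pid": 100, "tid": 1, "function": "kernel_a"},
    {"cpu": 0, "pid": 100, "tid": 1, "function": "kernel_a"},
    {"cpu": 1, "pid": 100, "tid": 2, "function": "kernel_a"},
    {"cpu": 1, "pid": 100, "tid": 2, "function": "kernel_b"},
    {"cpu": 2, "pid": 100, "tid": 3, "function": "kernel_b"},
]


def group_samples(records, *keys):
    """Count samples for each unique combination of the given fields."""
    return Counter(tuple(record[key] for key in keys) for record in records)


# The same raw samples, viewed under three different groupings.
by_function = group_samples(samples, "function")
by_cpu_function = group_samples(samples, "cpu", "function")
by_thread_function = group_samples(samples, "tid", "function")

for key, count in sorted(by_cpu_function.items()):
    print(key, count)
```

Because every record keeps its cpu and tid, a per-function total and a per-core breakdown are just two groupings of the same data, which is presumably what changing the "Grouping" item does in the GUI.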
Thanks Peter,
Your information is very useful, because I don't use the GUI since I have to work remotely. I assume the command-line tool and the GUI tool are consistent, meaning they produce results in basically the same way, so I understand that the result from command-line VTune (which does not include pid, tid, and cpu id) is a number averaged across all the logical cores.
I do use "-collect knc-lightweight-hotspots" to collect performance data on the coprocessor, of course. Tasks can be filtered by function name. This is very nice compared to other tools, but it implies a higher overhead than tools that monitor the logical processor only (with no awareness at the function level). Please correct me if I'm wrong.
>>> I suppose you are meaning that for the second question>>>
Yes, I tried to answer the second question.
>>>VTune spawns a worker on each processor and report the average results of all workers. However I don't understand the meaning of current and active logical CPU in the group that you mentioned>>>
Unfortunately, I do not know the exact implementation of the performance monitoring driver(s). I only suppose that the implementation I described is possible.
The current CPU is the processor executing the machine code of the KeGetCurrentProcessorNumberEx() function. Logical CPUs could be the HT cores that reside on a physical core.
As for the implementation, the worker thread, after checking the logical cores, can start accessing the performance counters by reading and writing the relevant MSRs.
>>>In my case, an OpenCL will be executed on two CPUs: a the host program will be executed on the host processor and host operating system (linux, a Xeon based processor - this part of program is sequential, which is not of concern and is not profiled) and the main computation is executed on parallel on a Xeon Phi >>>
Unfortunately, I do not know how it is done on the Xeon Phi coprocessor.
Thank you for your clarification. That sounds right to me.
Although I have searched hard for information about VTune, I'm afraid I may have missed some important details somewhere. Please let me know if you know where I can read about how it is implemented.
Thanks.
"... but it implies a higher overhead compared to tools that monitor the logical processor (not aware at function level). Please correct me if I'm wrong." - VTune(TM) Amplifier periodically collects performance data through a device driver (via interrupts). All raw data is collected during the run, but the analysis is postponed until the application exits, so the overhead is very light. You can check this by running the app with and without VTune and comparing the results.
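The with/without comparison suggested above can be scripted. The following is a minimal Python sketch, not an official procedure: `time_command` and `overhead_percent` are hypothetical helper names, and the commented usage assumes the `amplxe-cl` command-line tool is on the PATH and that `./my_app` stands in for your own host program.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Return the best wall-clock time in seconds over `runs` executions."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output so printing does not skew the timing.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def overhead_percent(baseline, profiled):
    """Relative slowdown of the profiled run, as a percentage of baseline."""
    return 100.0 * (profiled - baseline) / baseline


# Example usage (assumed command line; adjust to your own setup):
#   baseline = time_command(["./my_app"])
#   profiled = time_command(
#       ["amplxe-cl", "-collect", "knc-lightweight-hotspots", "--", "./my_app"],
#       runs=1)
#   print(f"overhead: {overhead_percent(baseline, profiled):.1f}%")
```

Note that the profiled wall-clock time also includes collector start-up and result finalization after the application exits, so on short runs this comparison will overstate the sampling overhead itself.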
>>>Although tried hard to search about VTune but I'm afraid that I missed some important details somewhere. Please let me know if you know where I can read how it is implemented>>>
You will not be able to find any information about the exact internal implementation (I mean at the code level) of the VTune software.
If you are really interested in performance monitoring software, you can download PMC and study its source code.