VTune Amplifier XE on multicore CPUs: how does it work?

thanhtuan123 · ‎03-23-2013

I'm using Intel VTune Amplifier XE 2013 to profile a parallel program running on a multicore CPU; specifically, it is written in OpenCL and executed on a Xeon Phi. I wonder how exactly the results reported by VTune should be interpreted, i.e.:

Is the reported value the performance counter collected for a single thread or for the whole core? (Assume there are many cores in a CPU and many threads can execute concurrently on a core, as on the Xeon Phi.)
How does VTune sample on a multicore CPU? Does it sample on a single core and report that, or sample on many cores and take the average?

Bernard · ‎03-24-2013

IIRC, an HT logical core does not possess its own MSRs/counters, so I suppose that performance sampling is done at the physical-core level by a kernel-mode driver (on the Windows platform). I suppose that VTune can spawn its own worker thread to run on the current CPU. It can get the current logical CPU number by calling, for example, the KeGetCurrentProcessorEx() function. On a multiprocessor system the driver can use, for example, the KeQueryGroupAffinity routine to obtain the active logical CPUs in the group.

thanhtuan123 · ‎03-24-2013

Thanks for replying. I take it you mean, for the second question, that VTune spawns a worker on each processor and reports the average of the results of all workers. However, I don't understand what you mean by the current and active logical CPUs in the group. In the usual sense, I think that would be the processor where the host program is executed. In my case, an OpenCL program is executed on two processors: the host program runs on the host processor under the host operating system (Linux, on a Xeon-based processor; this part of the program is sequential, is not of concern, and is not profiled), and the main computation runs in parallel on a Xeon Phi processor. The physical (and also logical) cores in the Xeon Phi are treated equally (I might be wrong on this point), so I don't think "active processor" is the right term in this sense.

Peter_W_Intel · ‎03-24-2013

I think the tool collects performance data (samples) on each logical core, which means that every sample includes the instruction address, pid, tid, mid, and cpu id. The user can see averaged data in the Summary report, and can also see any type of report, such as "thread\function\callstack" or "process\function\callstack", in the Bottom-up report by changing the "Grouping" item, because all the data is already there. For event-based sampling results, the user can see the results categorized by "Pack\cpu\process\..." in the report.
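To make the grouping idea concrete, here is a minimal Python sketch. It is not VTune's actual data format: the `Sample` record, the helper names, and all values are hypothetical. It only assumes that each sample carries the fields listed above, and shows how the same raw samples can be regrouped by thread or by logical CPU, or averaged per CPU, after collection.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical record mirroring the fields mentioned above.
    ip: int   # instruction address
    pid: int  # process id
    tid: int  # thread id
    mid: int  # module id
    cpu: int  # logical CPU on which the sample was taken


def group_counts(samples, *keys):
    """Count samples per unique combination of the given field names."""
    counts = Counter()
    for s in samples:
        counts[tuple(getattr(s, k) for k in keys)] += 1
    return dict(counts)


def average_per_cpu(samples):
    """Mean number of samples per logical CPU that recorded any sample."""
    per_cpu = group_counts(samples, "cpu")
    return sum(per_cpu.values()) / len(per_cpu) if per_cpu else 0.0


# Made-up samples: two threads of one process on two logical CPUs.
samples = [
    Sample(ip=0x400A10, pid=100, tid=1, mid=7, cpu=0),
    Sample(ip=0x400A10, pid=100, tid=1, mid=7, cpu=0),
    Sample(ip=0x400B20, pid=100, tid=2, mid=7, cpu=1),
    Sample(ip=0x400A10, pid=100, tid=2, mid=7, cpu=1),
]

by_thread = group_counts(samples, "tid")            # per-thread view
by_cpu_and_ip = group_counts(samples, "cpu", "ip")  # per-CPU, per-address view
```

Because every sample keeps its own `tid` and `cpu`, a per-thread view, a per-CPU view, and an averaged summary are all just different aggregations of one data set.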

If your sub-task runs on the Xeon Phi(TM) coprocessor, you can use "-collect knc-lightweight-hotspots" to collect performance data from the coprocessor. Note that other tasks running on the host will not be profiled.

If you use the regular "-collect lightweight-hotspots", then all performance data comes from the host, not from the coprocessor. Yes, you have to do them in two separate sessions.

thanhtuan123 · ‎03-24-2013

Thanks Peter,

Your information is very useful, because I work remotely and don't use the GUI. I assume the command-line tool and the GUI tool are consistent, meaning they basically produce results in the same way, so I understand that the result of command-line VTune (which does not include pid, tid and cpu id) is a number averaged across all the logical cores.

I use "-collect knc-lightweight-hotspots" to collect performance data on the coprocessor, of course. Tasks can be filtered by function name. This is very nice compared to other tools, but it implies a higher overhead compared to tools that monitor the logical processor (and are not aware of the function level). Please correct me if I'm wrong.

Bernard · ‎03-24-2013

>>> I suppose you are meaning that for the second question>>>

Yes, I tried to answer the second question.

>>>VTune spawns a worker on each processor and report the average results of all workers. However I don't understand the meaning of current and active logical CPU in the group that you mentioned>>>

Unfortunately, I do not know the exact implementation of the performance monitoring driver(s). I only suppose that the implementation I described could be possible.

The current CPU is the processor executing the machine code of the KeGetCurrentProcessorEx() function. Logical CPUs could be the HT cores that reside on a physical core.

Regarding the implementation: after checking the logical cores, the worker thread can start accessing the various performance counters by reading from and writing to the relevant MSRs.
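As a rough user-space analogue of that idea (not the kernel driver itself, and not how VTune is known to work), the following Python sketch enumerates the logical CPUs available to the process and starts one worker thread pinned to each. It assumes Linux, where `os.sched_getaffinity` and `os.sched_setaffinity` exist; the function names are made up for illustration, and the workers only record whether pinning succeeded instead of reading any counters.

```python
import os
import threading


def active_logical_cpus():
    """Logical CPUs this process may run on (Linux); else fall back to a count."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def pinned_worker(cpu, results):
    """Pin the calling thread to one logical CPU, then record the outcome."""
    pinned = False
    if hasattr(os, "sched_setaffinity"):
        try:
            # On Linux, pid 0 applies to the calling thread (sched_setaffinity(2)).
            os.sched_setaffinity(0, {cpu})
            pinned = True
        except OSError:
            pass  # pinning may be refused, e.g. in a restricted container
    # A real collector would read per-CPU counters here; this sketch only
    # notes whether the thread could be bound to the requested CPU.
    results[cpu] = pinned


def run_one_worker_per_cpu():
    """Start one worker per available logical CPU and wait for all of them."""
    results = {}
    threads = [
        threading.Thread(target=pinned_worker, args=(cpu, results))
        for cpu in active_logical_cpus()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_one_worker_per_cpu()
```

In this picture, the "current" CPU is simply whichever logical CPU a given worker is running on, and the "active" CPUs are the set returned by the affinity query.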

Bernard · ‎03-24-2013

>>>In my case, an OpenCL will be executed on two CPUs: a the host program will be executed on the host processor and host operating system (linux, a Xeon based processor - this part of program is sequential, which is not of concern and is not profiled) and the main computation is executed on parallel on a Xeon Phi >>>

Unfortunately, I do not know how it is done on the Xeon Phi coprocessor.

thanhtuan123 · ‎03-24-2013

Thank you for your clarification. That sounds right to me.

Although I have searched hard for information about VTune, I'm afraid I have missed some important details somewhere. Please let me know if you know where I can read about how it is implemented.

Thanks.

Peter_W_Intel · ‎03-24-2013

"... but it implies a higher overhead compared to tools that monitor the logical processor (not aware at function level). Please correct me if I'm wrong." - VTune(TM) Amplifier periodically collects performance data via a device driver (using interrupts). All raw data is collected during the run, but analysis is postponed until the application exits, so the overhead is very light. You can check this by running the app with and without VTune and comparing the results.
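One simple way to do that comparison is to time the application on its own and then under the profiler. The Python sketch below is only an illustration: the helper names are made up, the workload is a stand-in one-liner, and `profiler_prefix` is a hypothetical placeholder for whatever profiler command line you actually use (it is left empty here so the sketch runs anywhere).

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Return the best wall-clock time in seconds over several runs of cmd."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        best = min(best, time.perf_counter() - start)
    return best


def overhead_percent(baseline_s, profiled_s):
    """Relative slowdown of the profiled run versus the baseline, in percent."""
    return (profiled_s - baseline_s) / baseline_s * 100.0


# Stand-in workload; replace with the real application's command line.
app = [sys.executable, "-c", "sum(range(100000))"]

# Hypothetical placeholder: the profiler command line (with its "-collect ..."
# option) would go here, followed by the application.
profiler_prefix = []

baseline = time_command(app)
profiled = time_command(profiler_prefix + app)
print(f"overhead: {overhead_percent(baseline, profiled):.1f}%")
```

Taking the best of several runs reduces noise from other system activity; for short-running applications the profiler's start-up cost can dominate, so a longer workload gives a fairer picture.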

Bernard · ‎03-25-2013

>>>Although tried hard to search about VTune but I'm afraid that I missed some important details somewhere. Please let me know if you know where I can read how it is implemented>>>

You will not be able to find any information about the exact internal implementation (I mean at the code level) of the VTune software.

Bernard · ‎03-25-2013

If you are really interested in performance monitoring software, you can download PMC and study its source code.