limit profiling to a subset of cores

leggett__charles · ‎07-10-2019

Is it possible to limit the profiling / data collection to a subset of the available cores? I'm profiling a multithreaded executable on a system with 2 CPUs, each with 10 physical cores. I run the job on just one of the CPUs, limiting it to 10 hardware threads with

> taskset -c 0,2,4,6,8,10,12,14,16,18 theJob

If I try to profile it with

> amplxe-cl -collect threading taskset -c 0,2,4,6,8,10,12,14,16,18 theJob

all cores get profiled, which skews the results. Is there a way to profile just the cores the job is running on?

Vladimir_R_Intel · ‎07-11-2019

Hi Charles,

Because you are using the user-mode sampling collector, it works in the context of the target process (i.e. on the same cores the process is pinned to). But since the amplxe-cl and amplxe-runss processes were not pinned, they can end up competing for cores you did not intend to use.

You need to decide what exactly you want from this setup. If you want both amplxe-cl and theJob to have the same CPU affinity mask, just swap taskset and amplxe-cl:

taskset -c 0,2,4,6,8,10,12,14,16,18 amplxe-cl -collect threading theJob

It is also useful to set -data-limit=0, so that the collection launcher does not create a thread to monitor the result size.

If you really want to collect data only for a subset of the available cores, you will need to switch to hardware event-based sampling and use the -cpu-mask option. Details about this option are here: https://software.intel.com/en-us/vtune-amplifier-help-cpu-mask . In that case the command looks something like this:

amplxe-cl -collect threading -knob sampling-and-waits=hw -data-limit=0 -cpu-mask=0,2 taskset -c 0,2,4,6,8,10,12,14,16,18 theJob
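
To keep the -cpu-mask value and the taskset list identical, both can be built from a single variable. This is only a minimal sketch: it assumes GNU seq, assumes the even-numbered logical CPUs 0 to 18 are the ones wanted (as in the original question), and uses illustrative variable names (CORES, CMD) that are not part of any tool. It prints the command instead of running it.

```shell
#!/bin/sh
# Build the comma-separated list of even-numbered logical CPUs 0..18.
# Assumption: GNU seq, where -s sets the separator between numbers.
CORES=$(seq -s, 0 2 18)

# Compose the hardware-sampling command from this thread, using the same
# list for the collector's CPU mask and for the job's affinity.
# "theJob" is the placeholder executable name used in the thread.
CMD="amplxe-cl -collect threading -knob sampling-and-waits=hw -data-limit=0 -cpu-mask=$CORES taskset -c $CORES theJob"

# Print rather than execute, so the line can be checked before use.
echo "$CMD"
```

Changing the seq arguments is then the only edit needed to target a different set of cores.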

BR,

Vladimir

leggett__charles · ‎07-14-2019

Hi Vladimir,

Thanks for the info! With both methods I still see information about the other 30 cores beyond the 10 I want to monitor. Perhaps I'm misunderstanding what the GUI is supposed to show (see attached image)?

> amplxe-cl -collect threading -knob sampling-and-waits=hw -data-limit=0 -cpu-mask=0,2,4,6,8,10,12,14,16,18 taskset -c 0,2,4,6,8,10,12,14,16,18 theJob

cheers, Charles.

Vladimir_R_Intel · ‎07-18-2019

Hi Charles,

Could you please upload a result? It is very hard to analyze this from the picture alone.

BR,

Vladimir

leggett__charles · ‎07-19-2019

hi Vladimir -

see attached result.

cheers, Charles.

Vladimir_R_Intel · ‎07-22-2019

Hi Charles,

This looks like an inaccuracy of the sampling collection. Imagine two threads sharing a single logical core. The first thread is interrupted in the middle of a sampling interval and execution switches to the second one. The second thread then runs for some time (10 ms in your case, during which 2 samples occur) and is preempted by the first thread, which then gets a sample as well. For the first thread, the interval between samples is therefore about 15 ms, give or take a small inaccuracy from kernel work. The result shows 100% CPU utilization for the second thread during those 10 ms and about 33% utilization for the first thread, so it looks as if more than one logical core was in use during that time.
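
The numbers in this example can be reproduced with a little integer arithmetic. This is a sketch under one assumption that is not stated explicitly in the thread: each sample is credited with one full sampling period of 5 ms (inferred from "10 ms, 2 samples"). The variable names are illustrative.

```shell
#!/bin/sh
# Assumption: every sample is credited with one full sampling period.
# 10 ms containing 2 samples gives a period of 5 ms.
PERIOD_MS=5

# Second thread: 2 samples credited inside a 10 ms window.
T2_PCT=$(( 2 * PERIOD_MS * 100 / 10 ))

# First thread: 1 sample credited, but about 15 ms passed since its
# previous sample, so the same 5 ms is spread over a longer interval.
T1_PCT=$(( 1 * PERIOD_MS * 100 / 15 ))

# Apparent combined load on the single logical core both threads shared.
TOTAL_PCT=$(( T1_PCT + T2_PCT ))

echo "thread 2: ${T2_PCT}%  thread 1: ${T1_PCT}%  total: ${TOTAL_PCT}%"
```

The total exceeds 100% even though both threads ran on one logical core, which is why the timeline appears to show activity beyond the pinned cores.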

If you switch to the Process/HW Context/Function/Thread/Call Stack grouping (this is a custom grouping), you will see that no thread actually ran on cores other than the ones you set.

If you need more accurate data, I suggest disabling call stacks and decreasing the sampling interval. Disabling call stacks helps here because it switches you to a different collector, one that does not catch context switches.

BR,

Vladimir