According to the tutorials, bandwidth analysis can be performed in two ways: 1. knc-custom analysis (from the core), 2. knc-bandwidth (uncore only).
I would like to do it the first way, using the formulas below, and I have a few questions about it.
Read bandwidth (bytes/clock) = (L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED
Write bandwidth (bytes/clock) = (L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED
I run my multi-threaded application from a script which does some environment setting before calling the application.
amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED:sa=10000, (other events with their sampling frequency) -- ssh mic0 "./script.sh"
Q1: Are these the statistics of the script, or of all processes running during the collection? How can I get the statistics for just the application that my script started?
I get an event summary like this:
Hardware Event Type            Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
-----------------------------  -------------------------  --------------------------------  -----------------
HWP_L2MISS                     91000                      13                                1000
CPU_CLK_UNHALTED               49840000                   712                               10000
L2_DATA_READ_MISS_CACHE_FILL   336000                     48                                1000
L2_DATA_READ_MISS_MEM_FILL     714000                     102                               1000
L2_DATA_WRITE_MISS_CACHE_FILL  0                          0                                 1000
L2_DATA_WRITE_MISS_MEM_FILL    0                          0                                 1000
L2_VICTIM_REQ_WITH_DATA        0                          0                                 1000
SNP_HITM_L2                    0                          0                                 1000
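To make the calculation concrete, here is a minimal Python sketch that plugs the "Hardware Event Count:Self" values from the summary into the two formulas. It assumes that the Event Count column (and not the raw sample count) is the right input, which is exactly what Q3 asks about, so treat the result as illustrative. The function and variable names are my own.

```python
# Event counts copied from the "Hardware Event Count:Self" column of the summary.
counts = {
    "CPU_CLK_UNHALTED": 49_840_000,
    "L2_DATA_READ_MISS_MEM_FILL": 714_000,
    "L2_DATA_WRITE_MISS_MEM_FILL": 0,
    "HWP_L2MISS": 91_000,
    "L2_VICTIM_REQ_WITH_DATA": 0,
    "SNP_HITM_L2": 0,
}

CACHE_LINE_BYTES = 64  # the "* 64" factor in both formulas


def read_bandwidth(c):
    """Read bandwidth in bytes/clock, per the formula quoted from the tutorial."""
    lines = (c["L2_DATA_READ_MISS_MEM_FILL"]
             + c["L2_DATA_WRITE_MISS_MEM_FILL"]
             + c["HWP_L2MISS"])
    return lines * CACHE_LINE_BYTES / c["CPU_CLK_UNHALTED"]


def write_bandwidth(c):
    """Write bandwidth in bytes/clock, per the formula quoted from the tutorial."""
    lines = c["L2_VICTIM_REQ_WITH_DATA"] + c["SNP_HITM_L2"]
    return lines * CACHE_LINE_BYTES / c["CPU_CLK_UNHALTED"]


read_bw = read_bandwidth(counts)
write_bw = write_bandwidth(counts)
print(f"read:  {read_bw:.3f} bytes/clock")   # about 1.034
print(f"write: {write_bw:.3f} bytes/clock")  # 0.000, both write events are zero
```

With these counts the read bandwidth comes out at roughly 1.03 bytes/clock and the write bandwidth at 0, since both write-side events were never sampled.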
Then I view my results using:
amplxe-cl -report hw-events -format=csv -csv-delimiter=comma -report-output=output.csv -show-as=sample -r /home//bandwidth2/ -call-stack-mode=user-only -cumulative-threshold-percent=loop -group-by=process
Q2: Again, does this give the statistics of the script? They are very different from the summary. How can I get the statistics of just the application I am interested in, which was spawned by the script?
Q3: Since I collected the samples with a given sample-after value (sa:1000), when I calculate the bandwidth, should I multiply Hardware Event Sample Count:L2_DATA_WRITE_MISS_MEM_FILL:Self by the "sa" value to get the correct bandwidth value?
Q4: While collecting, there is a parameter "cpu-mask". If I set it to 0, does that mean it will monitor the hw-events only on core 0? If I set it to "all", does it monitor all 240 cores? If so, won't my statistics be distorted by information from applications other than my multi-threaded application? I would like to know how to use this parameter.
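Regarding Q3, a quick check on the summary table may help frame the question. The sketch below (Python, names are my own) compares the reported Event Count with the naive product of Sample Count and Events Per Sample for every row that has samples.

```python
# (event, Hardware Event Count, Hardware Event Sample Count, Events Per Sample)
# Only rows with a nonzero sample count are listed.
rows = [
    ("HWP_L2MISS", 91_000, 13, 1_000),
    ("CPU_CLK_UNHALTED", 49_840_000, 712, 10_000),
    ("L2_DATA_READ_MISS_CACHE_FILL", 336_000, 48, 1_000),
    ("L2_DATA_READ_MISS_MEM_FILL", 714_000, 102, 1_000),
]

ratios = {}
for name, event_count, sample_count, events_per_sample in rows:
    naive = sample_count * events_per_sample  # sample count times the "sa" value
    ratios[name] = event_count / naive        # how far the report is from the naive product
    print(f"{name:30s} reported={event_count:>10d} naive={naive:>9d} ratio={ratios[name]:.1f}")
```

In every row the reported Event Count is 7 times the naive product, so the Event Count column appears to be scaled already and is not simply sample count times "sa". I can't tell from the output alone what the extra factor represents, which is part of why I am unsure which column to feed into the formulas.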
I do not know how it is implemented on Xeon Phi, but on Intel CPUs the performance counters are not pinned to a specific OS thread, although they can be set to track user-mode or kernel-mode activity. For example, while you are measuring the performance of some application (process), the OS scheduler may decide to swap out the thread that is currently being monitored and schedule another, unrelated thread on the same core. In that situation the performance counter will record events generated by a different process. You can set the thread's affinity to a specific core and set its priority to very high in order to prevent it from being swapped out. Note, though, that this applies to Windows and to general-purpose CPUs. It is VTune's job to resolve the addresses (by tracking the IP) of the currently executing thread.