Software Archive
Read-only legacy content

Bandwidth analysis on Xeon Phi using VTune

Surya_Narayanan_N_

According to the tutorials, bandwidth analysis can be performed in two ways: 1. a knc-custom analysis (using core events), 2. knc-bandwidth (using only uncore events).

http://www.youtube.com/watch?v=vnOqpyzui_s

I would like to do it the first way, using the formula below, and I have a few questions about it.

Given formulas:

Read bandwidth (bytes/clock) = (L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED
Write bandwidth (bytes/clock) = (L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED
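
For reference, here is a minimal sketch of these formulas in Python. It is an illustration, not VTune output: the event counts are placeholders to be filled in from a report, and the 1.052 GHz core clock used for the GB/s conversion is taken from the coprocessor frequency reported later in this thread.

# KNC bandwidth formulas from the tutorial, expressed as plain Python.
def read_bw_bytes_per_clock(l2_data_read_miss_mem_fill, l2_data_write_miss_mem_fill, hwp_l2miss, cpu_clk_unhalted):
    return (l2_data_read_miss_mem_fill + l2_data_write_miss_mem_fill + hwp_l2miss) * 64 / cpu_clk_unhalted

def write_bw_bytes_per_clock(l2_victim_req_with_data, snp_hitm_l2, cpu_clk_unhalted):
    return (l2_victim_req_with_data + snp_hitm_l2) * 64 / cpu_clk_unhalted

def to_gb_per_sec(bytes_per_clock, clock_hz=1.052e9):
    # bytes/clock * clocks/second = bytes/second; divide by 1e9 for GB/s
    return bytes_per_clock * clock_hz / 1e9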

I run my multi-threaded application from a script that does some environment setup before calling the application, like this:

amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED:sa=10000, (other events with their sampling frequency) -- ssh mic0 "./script.sh"

 

Q1: Are these the statistics of the script alone, or of all processes running during collection? How can I get these statistics for just the application that my script started?

I get the event summary like this:

Event summary
-------------
Hardware Event Type            Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
-----------------------------  -------------------------  --------------------------------  -----------------
HWP_L2MISS                     91000                      13                                1000
CPU_CLK_UNHALTED               49840000                   712                               10000
L2_DATA_READ_MISS_CACHE_FILL   336000                     48                                1000
L2_DATA_READ_MISS_MEM_FILL     714000                     102                               1000
L2_DATA_WRITE_MISS_CACHE_FILL  0                          0                                 1000
L2_DATA_WRITE_MISS_MEM_FILL    0                          0                                 1000
L2_VICTIM_REQ_WITH_DATA        0                          0                                 1000
SNP_HITM_L2                    0                          0                                 1000

Then I view the result using:

amplxe-cl -report hw-events -format=csv -csv-delimiter=comma -report-output=output.csv -show-as=sample -r /home//bandwidth2/ -call-stack-mode=user-only -cumulative-threshold-percent=loop -group-by=process

Q2: Again, is this giving the statistics of the script? The numbers are very different from the summary. How can I get the statistics of just the application I am interested in, which was spawned by the script?

Q3: Since I collected the samples with a certain sample-after value (sa=1000), when I calculate the bandwidth should I multiply Hardware Event Sample Count:L2_DATA_WRITE_MISS_MEM_FILL:Self by the "sa" value to get the correct bandwidth value?

Q4: There is a collection parameter "cpu-mask". If I set it to 0, does that mean it will monitor the hardware events only on core 0? If I set it to "all", does it monitor all 240 logical CPUs? If so, won't my statistics be polluted by information from applications other than my multi-threaded application? I would like to know how to use this parameter.

6 Replies
Loc_N_Intel
Employee

Hi Surya,

I am forwarding your questions to a VTune expert, who will get back to you. Thank you.

Vladimir_T_Intel
Moderator

1. Hardware event collection is done by the Performance Monitoring Unit (PMU), which is in hardware, so you are analyzing the whole system, not just your script.

2. Yes, you can filter the result by your application/module, e.g. "-filter module=libexample.so" (see amplxe-cl -help report).

3. Your report shows both "Hardware Event Count" and "Hardware Event Sample Count", so you don't need to do the math. However, there is one subtlety: the counts include a x7 multiplier, corresponding to the number of multiplexed event groups.
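
As an illustration (a quick Python check, assuming the x7 multiplexing factor mentioned above), the counts in the first post's event summary are reproduced by sample count * sample-after value * 7:

# Hardware Event Count = sample count * sample-after value * number of multiplexed groups
events = {
    "HWP_L2MISS":       (13,  1000),    # (sample count, sample-after value)
    "CPU_CLK_UNHALTED": (712, 10000),
}
MULTIPLEXED_GROUPS = 7                   # from the reply above
for name, (samples, sa) in events.items():
    print(name, samples * sa * MULTIPLEXED_GROUPS)
# prints 91000 and 49840000, matching the "Hardware Event Count" column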

4. If you monitor uncore events, such as those used for memory bandwidth, the cpu-mask makes no difference because those events are not tied to a specific core. Reducing the number of cores on which events are counted does, however, also reduce the collection overhead. The flag is "-cpu-mask=x,y" (see amplxe-cl -help collect).

Surya_Narayanan_N_

Hello,

I was trying to compute the bandwidth using the formula while varying the number of threads. I am pretty sure there are mistakes either in the calculation or in the methodology. I ran the following command to collect the bandwidth data, with $2 being 2, 4, 8, 16, 32 in each run:

 

~/amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED:sa=2000000,HWP_L2MISS:sa=1000000,L2_DATA_READ_MISS_CACHE_FILL:sa=1000000,L2_DATA_READ_MISS_MEM_FILL:sa=1000000,L2_DATA_WRITE_MISS_CACHE_FILL:sa=1000000,L2_DATA_WRITE_MISS_MEM_FILL:sa=1000000,L2_VICTIM_REQ_WITH_DATA:sa=1000000,SNP_HITM_L2:sa=1000000 -app-working-dir ~/threads/$1 -r ~/threads/$1/bw$2 -- ssh mic0 "~/script.sh $2"

I am reading the report using this command and filtering for just the application's process. (PS: I am calling my application via the script.)

amplxe-cl -report hw-events -format=csv -csv-delimiter=comma -report-output=~/threads/$1/out$2 -show-as=event -r ~/threads/$1/bw$2/ -group-by=process

which results in 

Process,Hardware Event Count:HWP_L2MISS:Self,Hardware Event Count:CPU_CLK_UNHALTED:Self,Hardware Event Count:L2_DATA_READ_MISS_CACHE_FILL:Self,Hardware Event Count:L2_DATA_READ_MISS_MEM_FILL:Self,Hardware Event Count:L2_DATA_WRITE_MISS_CACHE_FILL:Self,Hardware Event Count:L2_DATA_WRITE_MISS_MEM_FILL:Self,Hardware Event Count:L2_VICTIM_REQ_WITH_DATA:Self,Hardware Event Count:SNP_HITM_L2:Self
bfs,7000000,8442000000,0,14000000,0,14000000,14000000,0
Pid 0x121C,0,28000000,0,0,0,0,0,0
coi_daemon,0,2086000000,0,0,0,0,0,0
sep_mic_server3.10,0,14000000,0,0,0,0,0,0
sshd,0,140000000,0,0,0,0,0,0
vmlinux,0,2926000000,0,0,0,0,42000000,0

Here, bfs is the application that was invoked by my script; I am interested only in its bandwidth consumption, so I ignore the others.

Now I compute the read and write bandwidth for the different thread counts, following the approach shown in

http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding

And the results look very weird.

No. of Threads  Read BW (bytes/clock)  Write BW (bytes/clock)  Total BW (GB/sec)
--------------  ---------------------  ----------------------  -----------------
 2              0.265339966833         0.106135986733          0.390792703151
 4              0.126149802891         0.126149802891          0.265419185283
 8              0.037558685446         0.112676056338          0.158046948357
16              0.0252565114444        0.101026045777          0.132849250197
32              0.0104200586128        0.0104200586128         0.0219238033214
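
For reference, a minimal check of the first row against the formula, using the bfs event counts from the CSV above (which appear to correspond to the 2-thread run); the 1.052 GHz core clock used for the GB/s conversion is taken from the frequency reported later in this thread:

cpu_clk  = 8442000000                                        # CPU_CLK_UNHALTED for bfs
read_bw  = (14000000 + 14000000 + 7000000) * 64 / cpu_clk    # read/write MEM_FILL + HWP_L2MISS -> ~0.265 bytes/clock
write_bw = (14000000 + 0) * 64 / cpu_clk                     # L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2 -> ~0.106 bytes/clock
total_gb = (read_bw + write_bw) * 1.052e9 / 1e9              # -> ~0.39 GB/s
print(read_bw, write_bw, total_gb)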

OK, as you can see from the command line, the events are collected with different sample-after values. Should I normalize something in the formulas to correct for this, or what is the right way of collecting? Why is the bandwidth so low?

Surya_Narayanan_N_

Another question: I would like to understand how the bandwidth is computed when using knc-bandwidth. The summary looks like this:

CPU
---
Parameter          bw_org_2
-----------------  -----------------------------
Frequency          1052000000
Logical CPU Count  240
Name               Intel(R) Xeon(R) E5 processor

Summary
-------
Elapsed Time:  2.984
CPU Usage:     2.893

Event summary
-------------
Hardware Event Type  Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
-------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED     9140000000                 914                               10000000

Uncore Event summary
--------------------
Hardware Event Type            Hardware Event Count:Self
-----------------------------  -------------------------
UNC_F_CH0_NORMAL_WRITE[UNIT0]  10103918
UNC_F_CH0_NORMAL_WRITE[UNIT1]  10109727
UNC_F_CH0_NORMAL_WRITE[UNIT2]  10095707
UNC_F_CH0_NORMAL_WRITE[UNIT3]  10102520
UNC_F_CH0_NORMAL_WRITE[UNIT4]  10095936
UNC_F_CH0_NORMAL_WRITE[UNIT5]  10100786
UNC_F_CH0_NORMAL_WRITE[UNIT6]  10109940
UNC_F_CH0_NORMAL_WRITE[UNIT7]  10100599
UNC_F_CH0_NORMAL_READ[UNIT0]   8574334
UNC_F_CH0_NORMAL_READ[UNIT1]   8588694
UNC_F_CH0_NORMAL_READ[UNIT2]   8562949
UNC_F_CH0_NORMAL_READ[UNIT3]   8611755
UNC_F_CH0_NORMAL_READ[UNIT4]   8566964
UNC_F_CH0_NORMAL_READ[UNIT5]   8573352
UNC_F_CH0_NORMAL_READ[UNIT6]   8590854
UNC_F_CH0_NORMAL_READ[UNIT7]   8589209
UNC_F_CH1_NORMAL_WRITE[UNIT0]  10100922
UNC_F_CH1_NORMAL_WRITE[UNIT1]  10101943
UNC_F_CH1_NORMAL_WRITE[UNIT2]  10105199
UNC_F_CH1_NORMAL_WRITE[UNIT3]  10100272
UNC_F_CH1_NORMAL_WRITE[UNIT4]  10106579
UNC_F_CH1_NORMAL_WRITE[UNIT5]  10123764
UNC_F_CH1_NORMAL_WRITE[UNIT6]  10115382
UNC_F_CH1_NORMAL_WRITE[UNIT7]  10100624
UNC_F_CH1_NORMAL_READ[UNIT0]   8576649
UNC_F_CH1_NORMAL_READ[UNIT1]   8566361
UNC_F_CH1_NORMAL_READ[UNIT2]   8592849
UNC_F_CH1_NORMAL_READ[UNIT3]   8591451
UNC_F_CH1_NORMAL_READ[UNIT4]   8577494
UNC_F_CH1_NORMAL_READ[UNIT5]   8624924
UNC_F_CH1_NORMAL_READ[UNIT6]   8615670
UNC_F_CH1_NORMAL_READ[UNIT7]   8585177
amplxe: Executing actions 100 % done  

But when I load the result file in the GUI, I see:

Average Bandwidth  
Package Bandwidth, GB/sec
package_0 6.414

 

How is this 6.414 GB/sec computed?

McCalpinJohn
Honored Contributor III

The Xeon Phi has 8 memory controllers, each of which has two 32-bit GDDR5 channels.   Each memory controller has multiple performance counters, and the results above appear to correspond to using four counters for each controller -- measuring writes on channel 0, reads on channel 0, writes on channel 1, and reads on channel 1, respectively.

I wrote a special driver to program the counters in exactly the same fashion and read the counts before and after running the STREAM benchmark.  From these results it is clear that each increment of each counter corresponds to a cache line transfer.  (The number of reads was about 8% higher than I expected and the number of writes was about 11% higher than I expected.   My "expected" counts did not include memory references for page table lookups or memory references for ECC, and I did not repeat the experiments with multiple iteration counts to get a good estimate of the data initialization overhead, so this is as close as I expected to get.)

For the case above, simply add up the 32 counter values, multiply the sum by 64 Bytes, and divide by the 2.984 seconds elapsed time.  This gives 6.414e9 Bytes per second, exactly matching the result from the GUI.
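
A minimal sketch of that arithmetic in Python, using the uncore counts from the summary above:

# Sum the 32 uncore counter values, multiply by the 64-byte cache line size,
# and divide by the elapsed time from the summary (2.984 s).
ch0_writes = [10103918, 10109727, 10095707, 10102520, 10095936, 10100786, 10109940, 10100599]
ch0_reads  = [8574334, 8588694, 8562949, 8611755, 8566964, 8573352, 8590854, 8589209]
ch1_writes = [10100922, 10101943, 10105199, 10100272, 10106579, 10123764, 10115382, 10100624]
ch1_reads  = [8576649, 8566361, 8592849, 8591451, 8577494, 8624924, 8615670, 8585177]

total_lines = sum(ch0_writes + ch0_reads + ch1_writes + ch1_reads)
bandwidth   = total_lines * 64 / 2.984   # bytes per second
print(bandwidth / 1e9)                   # ~6.414 GB/s, matching the GUI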

Surya_Narayanan_N_

How can I analyze the data bandwidth on the ring network of the Xeon Phi using L2_DATA_READ_MISS_CACHE_FILL / L2_DATA_WRITE_MISS_CACHE_FILL? I believe these events reflect the coherency-related traffic created by threads running on different cores.
