chou_m_
Beginner
54 Views

In the CPU_CLK_UNHALTED.THREAD event, does CLOCK mean a clock cycle or a machine cycle? @Peter Wang

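I am a beginner. While collecting memory-access events I ran into a few points of confusion, and I would like to ask for your help with the following questions.

1. In the CPU_CLK_UNHALTED.THREAD event, does CLOCK refer to clock cycles or machine cycles?

2. If CPU_CLK_UNHALTED.THREAD counts machine cycles, and textbooks say that executing one instruction takes at least 1 machine cycle, then the CPI computed as CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY should be greater than or equal to 1. In my tests, however, it is less than 1. Why is that?

3. For MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4, the description says the load latency exceeds 4 of something. Is that clock cycles or machine cycles? Where can I look up units like this?

4. When collecting memory-access events, the final statistics include an item named Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS. What is its unit? My result is 182144273216. If it were nanoseconds, that would be a little over 182 seconds, but my whole program ran for less than a minute. Why is that?

Thank you!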

 

0 Kudos
10 Replies
Peter_W_Intel
Employee
54 Views

1. CPU clock ticks. The event counts core clock cycles while the thread is not halted; a CPU running at N Hz has N such cycles per second.

2. The architecture is pipelined, so prefetch, decode, uop execution, and retirement happen simultaneously. The average number of cycles per instruction can therefore be less than 1, if you don't run SSE/AVX instructions.

3. MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 counts loads with latency greater than 4 cycles; for example, a load that misses the L1 cache incurs a delay of at least 4-6 cycles. See the VTune(TM) Amplifier help for details.

4. Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS is the accumulated latency, in cycles, of all memory loads in your program.
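
To make point 2 concrete, here is a minimal Python sketch of the CPI calculation from the question. The event counts and the retirement width of 4 are hypothetical values picked for illustration; they are not measurements from this thread, and the actual width depends on the microarchitecture.

```python
# Hypothetical event counts (NOT from the thread), chosen for illustration.
cpu_clk_unhalted_thread = 8_000_000_000   # unhalted core clock cycles
inst_retired_any = 10_000_000_000         # retired instructions


def cpi(cycles: int, instructions: int) -> float:
    """Average cycles per retired instruction over the whole run."""
    return cycles / instructions


# Assumption: a core able to retire up to 4 instructions per cycle could,
# at best, average 1/4 of a cycle per instruction.
ASSUMED_RETIRE_WIDTH = 4
best_case_cpi = 1 / ASSUMED_RETIRE_WIDTH

measured = cpi(cpu_clk_unhalted_thread, inst_retired_any)
print(f"CPI = {measured:.2f}, IPC = {1 / measured:.2f}")
print(f"best-case CPI at assumed width {ASSUMED_RETIRE_WIDTH} = {best_case_cpi:.2f}")
```

CPI here is an average over many overlapping instructions, not the latency of any single instruction, which is why a value below 1 does not contradict the textbook statement.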

chou_m_
Beginner
54 Views

 

@Peter Wang, hello Mr. Wang. I would like to confirm a few things with you.

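The cycles in your answers 3 and 4 are different from the cycle in answer 1, and in 3 and 4 they mean clock cycles, is that right? So one cycle is 1/(2.4G) seconds? (My CPU is an Intel(R) Xeon(R) CPU E5-2407 v2 @ 2.40GHz.) For item 4, my result is 182144273216, which converts to 182144273216/(2.4*10^9) ≈ 75.9s of absolute time. But my program ran for less than a minute, and the memory-access latency should not exceed one minute, should it? Where is the problem?

Thank you for your answer!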

 
Peter_W_Intel
Employee
54 Views

@chou m

The total latency is accumulated across multiple cores. Please take *parallelism* into account and don't compare it with elapsed time.
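
As a rough illustration of why an accumulated latency need not match elapsed time, the Python sketch below sums per-load latencies and compares the sum with the span during which at least one load is outstanding. The load list is hypothetical, and the overlap it models is an assumption for illustration (it could come from several cores or from several loads in flight at once); the thread does not establish which applies to the program in question.

```python
CPU_HZ = 2.4e9  # nominal frequency of the E5-2407 v2 named in the thread


def cycles_to_seconds(cycles: float, hz: float = CPU_HZ) -> float:
    """Convert a cycle count to seconds at a fixed clock frequency."""
    return cycles / hz


# Hypothetical loads as (start_cycle, latency_cycles); they overlap in time.
loads = [(0, 200), (10, 200), (20, 200), (30, 200)]

# What an accumulated "total latency" counter would report.
summed_latency = sum(latency for _, latency in loads)


def busy_cycles(intervals):
    """Cycles during which at least one load is outstanding (interval union)."""
    total = 0
    cur_start = cur_end = None
    for start, latency in sorted(intervals):
        end = start + latency
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


wall = busy_cycles(loads)
print(f"summed latency = {summed_latency} cycles, elapsed span = {wall} cycles")

# The figure from the question, converted at the nominal frequency.
print(f"{cycles_to_seconds(182_144_273_216):.1f} s of accumulated latency")
```

With these made-up loads the summed latency is several times the elapsed span, so converting an accumulated cycle count to seconds gives a total that can legitimately exceed the run time.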
 

chou_m_
Beginner
54 Views

Hello, Mr. Wang. My program is a single process and does not use multiple cores, so how can this be?

Peter_W_Intel
Employee
54 Views

Please attach your zipped VTune result. You can also submit a ticket with the data at https://premier.intel.com.

chou_m_
Beginner
54 Views

Hello, Mr. Wang. I have uploaded the VTune result; could you please take a look? The web page you gave does not open for me. :(

The command I ran was: amplxe-cl -c memory-access -knob analyze-mem-objects=true -data-limit=0 -d 60 -- ./test_demo

 

Peter_W_Intel
Employee
54 Views

Thanks for your result file. See summary report:

CPU_CLK_UNHALTED.THREAD  38,904,000,000

MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4  403,612,108

Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 4,158,524,752

Hence Average Load Latency = Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 / MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 ≈ 10 cycles

Your CPU frequency is 2.4 GHz, so CPU time (mostly serial code) = CPU_CLK_UNHALTED.THREAD count / 2.4G = 16.2s

Time spent in loads = Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT / 2.4G = 1.84s

Load memory bound = load time / CPU time = (1.84 / 16.2) * 100% = 11.4%
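
The arithmetic in this reply can be reproduced with a short Python sketch that uses only the three counts quoted in the summary report and the 2.4 GHz frequency stated in the thread. Treating the frequency as fixed at its nominal value is an assumption. Note that dividing the quoted counts directly gives about 1.73 s of load time and about 10.7%, slightly below the 1.84 s and 11.4% quoted in the reply, so those figures may reflect rounding or inputs not shown in the thread.

```python
CPU_HZ = 2.4e9  # 2.4 GHz as stated in the thread; assumed constant for the run

# Counts quoted in the summary report.
cpu_clk_unhalted_thread = 38_904_000_000
load_latency_gt_4 = 403_612_108
total_latency_gt_4 = 4_158_524_752

# Average latency, in cycles, of the loads counted by LOAD_LATENCY_GT_4.
avg_load_latency = total_latency_gt_4 / load_latency_gt_4

# Convert cycle counts to seconds at the nominal frequency.
cpu_time_s = cpu_clk_unhalted_thread / CPU_HZ
load_time_s = total_latency_gt_4 / CPU_HZ

# Share of CPU time attributed to these loads, as a percentage.
load_bound_pct = 100 * load_time_s / cpu_time_s

print(f"average load latency = {avg_load_latency:.1f} cycles")
print(f"CPU time             = {cpu_time_s:.2f} s")
print(f"time spent in loads  = {load_time_s:.2f} s")
print(f"load memory bound    = {load_bound_pct:.1f} %")
```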

 

 

chou_m_
Beginner
54 Views

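Hello, Mr. Wang. Thank you for patiently answering my questions.

But I still have some questions I would like to ask. :)

1. I noticed that when you computed the time spent in loads, you used Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 (4,158,524,752). Why not use Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS (429,400,644,100)?

2. The two values differ by a factor of about 100. What does that indicate?

3. Does Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 refer specifically to the time of loads from main memory, while Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS refers to the total load time (including accesses to the L1, L2, and L3 caches and main memory)?

4. Judging from the results above, is it possible to improve the program's performance substantially by raising the hit rate of the L1, L2, and L3 caches?

That is a lot of questions, :) thank you for your answers.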

 

Peter_W_Intel
Employee
54 Views

You need to use (trust) the formula that VTune uses. In the bottom-up report of the "Memory Usage viewpoint", find the metric named "Average Latency (cycles)" and hover the mouse over it. The tooltip says "..This metric shows average load of latency in cycles...Formula: Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 /  MEM_TRANS_RETIRED.LOAD_LATENCY_GT "