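I am a beginner. While collecting memory-access events I ran into some confusion, and I have the following questions.
1. In the CPU_CLK_UNHALTED.THREAD event, does CLOCK refer to clock cycles or machine cycles?
2. If CPU_CLK_UNHALTED.THREAD counts machine cycles, and textbooks say that executing one instruction takes at least one machine cycle, then the CPI computed as CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY should be greater than or equal to 1. But in my tests it is less than 1. Why is that?
3. MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4: this event is described as loads with a latency greater than 4, but 4 of what unit (clock cycles or machine cycles)? Where can I look up units like this?
4. When collecting memory-access events, the final statistics include an item named Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS. What is its unit? My result is 182144273216; if it were nanoseconds, that would be a little over 182 seconds, yet my whole program ran for less than a minute. Why is that?
Thanks!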
1. CLOCK means CPU clockticks, i.e. core clock cycles; a CPU running at N Hz goes through N such cycles per second.
2. The architecture is pipelined, so prefetch, decode, uop execution, and retirement happen simultaneously and several instructions can retire in the same cycle. The cycles per instruction can therefore be less than 1, if you don't run SSE/AVX instructions.
3. MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 counts loads with a latency greater than 4 cycles; for example, a load that misses the L1 cache takes longer than about 4 cycles. See the detailed description in the VTune(TM) Amplifier help.
4. Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS is the latency (in cycles) accumulated over all memory loads in your program.
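To make answer 2 concrete, here is a minimal Python sketch of the CPI calculation. The counter values are hypothetical, chosen only to show that a core retiring more than one instruction per cycle yields a CPI below 1; they are not taken from any measurement in this thread.

```python
def cpi(unhalted_cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction: CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return unhalted_cycles / instructions_retired


# Hypothetical counts: on average two instructions retire per cycle.
example_cycles = 1_000_000
example_instructions = 2_000_000

print(f"CPI = {cpi(example_cycles, example_instructions):.2f}")
```

The textbook "at least one cycle per instruction" bound applies to a single instruction's own execution; CPI is an average over instructions whose execution overlaps in the pipeline.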
@Peter Wang: Hello Mr. Wang, I would like to confirm something with you.
Are the cycles in your answers 3 and 4 different from the cycles in answer 1? Do the cycles in 3 and 4 refer to clock cycles, each lasting 1/(2.4G) seconds? (My CPU is an Intel(R) Xeon(R) CPU E5-2407 v2 @ 2.40GHz.) For item 4, my result is 182144273216, which as absolute time is 182144273216/(2.4*10^9) ≈ 75.9 s. But my program ran for less than a minute, so the memory-access latency should not exceed one minute, should it? Where is the problem?
Thank you for your answer!
@chu m
The total latency is accumulated across multiple cores, so take *parallelism* into account and don't compare it with elapsed time.
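As a toy illustration of why an accumulated latency can exceed elapsed time, the sketch below sums the latencies of loads that are in flight at the same time (for example, on different cores) and compares that sum with the span of cycles they actually cover. The `(start, end)` intervals are hypothetical numbers, and this is not a model of how VTune collects the metric.

```python
def summed_latency(loads):
    """Add up each load's own latency, as an accumulating counter would."""
    return sum(end - start for start, end in loads)


def covered_cycles(loads):
    """Cycles during which at least one load was in flight (union of intervals)."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(loads):
        if cur_end is None or start > cur_end:
            # Gap before this load: close the previous merged interval.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current merged interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


# Hypothetical loads as (start_cycle, end_cycle), heavily overlapping.
loads = [(0, 100), (10, 110), (20, 120)]

print("summed latency:", summed_latency(loads))
print("covered cycles:", covered_cycles(loads))
```

With these intervals the summed latency is 300 cycles while only 120 cycles elapse, which is the kind of gap the reply warns about.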
Hello Mr. Wang. My program is a single process and does not use multiple cores. How can this happen?
Please attach your zipped VTune result. You can also submit a ticket with the data at https://premier.intel.com.
Hello Mr. Wang. I have uploaded my VTune result; could you please take a look? The web page you gave does not open for me. :(
The command I ran was: amplxe-cl -c memory-access -knob analyze-mem-objects=true -data-limit=0 -d 60 -- ./test_demo
Thanks for your result file. See the summary report:
CPU_CLK_UNHALTED.THREAD 38,904,000,000
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 403,612,108
Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 4,158,524,752
That is why Average Load latency = Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 / MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 ≈ 10 cycles
Your CPU frequency is 2.4 GHz, so CPU time (mostly serial code) = CPU_CLK_UNHALTED.THREAD count / 2.4G = 16.2 s
Time spent in loads = Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 / 2.4G ≈ 1.73 s
Load memory bound = load time / CPU time = (1.73 / 16.2) * 100% ≈ 10.7%
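The arithmetic above can be re-derived directly from the three raw counts in the summary report. The following Python sketch does that; it assumes the core ran at its nominal 2.4 GHz for the whole collection (no frequency scaling), and the variable and function names are illustrative rather than anything VTune defines.

```python
CPU_FREQ_HZ = 2.4e9  # nominal frequency quoted in the thread (assumed constant)

# Raw counts copied from the summary report.
cpu_clk_unhalted_thread = 38_904_000_000
load_latency_gt_4_count = 403_612_108
total_latency_gt_4_cycles = 4_158_524_752


def cycles_to_seconds(cycles, freq_hz=CPU_FREQ_HZ):
    """Convert a cycle count to seconds at a fixed clock frequency."""
    return cycles / freq_hz


avg_load_latency = total_latency_gt_4_cycles / load_latency_gt_4_count
cpu_time_s = cycles_to_seconds(cpu_clk_unhalted_thread)
load_time_s = cycles_to_seconds(total_latency_gt_4_cycles)
load_bound = load_time_s / cpu_time_s

print(f"average load latency: {avg_load_latency:.1f} cycles")
print(f"CPU time:             {cpu_time_s:.2f} s")
print(f"time spent in loads:  {load_time_s:.2f} s")
print(f"load memory bound:    {load_bound * 100:.1f}%")
```

Because the load time and the CPU time are both divided by the same frequency, the memory-bound ratio is simply the ratio of the two cycle counts and does not depend on the 2.4 GHz assumption.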
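Hello Mr. Wang. Thank you for patiently answering my questions.
I still have a few more I would like to ask. :)
1. I noticed that when computing the time spent in loads you used Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 (4,158,524,752). Why not use Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS (429,400,644,100)?
2. These two values differ by a factor of about 100. What does that indicate?
3. Does Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 refer specifically to the time spent on loads from main memory, while Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS refers to the total load time (including accesses to the L1, L2, and L3 caches as well as main memory)?
4. Judging from the results above, is it possible to improve the program's performance substantially by raising the L1, L2, and L3 cache hit rates?
That is a lot of questions. :) Thank you for your answers.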
You need to use (trust) the formula that VTune uses. In the bottom-up report of the "Memory Usage viewpoint", find the metric named "Average Latency (cycles)" and hover the mouse over it; the tooltip says "...This metric shows average load latency in cycles... Formula: Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 / MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4".
