Hello,
I would like to get traces of the IPC or CPI of my applications over the course of a run: for example, the IPC at 1k instructions, at 2k instructions, and so on. I can also work with cycles or a fixed time interval instead of instructions.
Is there any way to extract this information from Intel VTune?
Or is there any other tool you can think of that I could use for this purpose?
I have an Intel Xeon and an Intel Atom to work with.
Regards,
Well, you said, "I can also work with the cycles or specific time interval instead of the instructions." That's pretty much what VTune Amplifier does. ;)
Using Advanced Hotspots, the EIP is recorded after a specified number of cycles. You can modify the sample after value of the CPU_CLK_UNHALTED events to change how often a sample is collected.
@Maria M
I don't know if I understand your requirements correctly. Mr. Anderson is right: you need to use the advanced-hotspots collector, which gives you the CPI value in the report after profiling. CPI is the average number of clocks per instruction and IPC is the average number of instructions per clock; in other words, IPC = 1/CPI.
At what level do you want to trace the CPI or IPC of the program? You can get CPI or IPC from the report at different levels. For example,
1. Application (process) level
# amplxe-cl -report hw-events -group-by process -r r002ah/
2. Module level (if you are only interested in specific modules and don't care about third-party libraries or runtime modules)
# amplxe-cl -report hw-events -group-by module -r r002ah/
3. Function level (if you care about the functions in a specific module)
# amplxe-cl -report hw-events -filter module=primes.icc -group-by function -r r002ah
Note: the instructions-retired and CPU clock counts are all available in the report, so you can write a script/analyzer to compute the CPI or IPC value.
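As a rough illustration of such a script, here is a minimal Python sketch. It assumes the report has been saved as delimited text with one column per event. The column names and numbers in SAMPLE are made up for the example, so check them against the header of your own report before using it.

```python
import csv
import io


def cpi_per_row(report_text, key_col, clk_col, inst_col, delimiter=","):
    """Return {row key: (CPI, IPC)} for each row of a delimited report.

    key_col, clk_col and inst_col are the header names of the grouping
    column, the clock-cycle column and the instructions-retired column.
    They are parameters because the real header text is an assumption here.
    """
    results = {}
    reader = csv.DictReader(io.StringIO(report_text), delimiter=delimiter)
    for row in reader:
        cycles = float(row[clk_col])
        instructions = float(row[inst_col])
        if cycles == 0 or instructions == 0:
            continue  # no counts for this row, so CPI/IPC is undefined
        results[row[key_col]] = (cycles / instructions, instructions / cycles)
    return results


# Hypothetical report content, for illustration only.
SAMPLE = """Process,CPU_CLK_UNHALTED.THREAD,INST_RETIRED.ANY
proj,8000000,10000000
helper,3000000,2000000
"""

if __name__ == "__main__":
    table = cpi_per_row(SAMPLE, "Process",
                        "CPU_CLK_UNHALTED.THREAD", "INST_RETIRED.ANY")
    for name, (cpi, ipc) in table.items():
        print(f"{name}: CPI={cpi:.2f} IPC={ipc:.2f}")
```

The same function works for the module-level or function-level reports if you pass the matching grouping column as key_col.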
>>>I don't know if I understand your requirements correctly. Mr. Anderson is right: you need to use the advanced-hotspots collector, which gives you the CPI value in the report after profiling. CPI is the average number of clocks per instruction and IPC is the average number of instructions per clock; in other words, IPC = 1/CPI>>>
In simple words, CPI is how many CPU cycles were needed to process a machine code instruction. Among the most expensive instructions in terms of CPU cycles are the x87 transcendentals fsin and fcos. Their execution can take dozens of cycles, probably because of a Horner-scheme polynomial approximation done in microcode/hardware.
>>In simple words, CPI is how many CPU cycles were needed to process a machine code instruction. Among the most expensive instructions in terms of CPU cycles are the x87 transcendentals fsin and fcos. Their execution can take dozens of cycles, probably because of a Horner-scheme polynomial approximation done in microcode/hardware.
On current x86 microarchitectures, a CPI of 0.25 is the theoretical best: it means the core executes 4 instructions in parallel per cycle, which is the processor's capability for simple instructions. In practice a CPI of 0.6-1.0 is usually acceptable. However, this does not hold for complex instructions such as x87, SSE4.1/SSE4.2, or AVX instructions. Apart from x87, these are SIMD instructions.
So CPI is mainly meaningful for scalar (single instruction, single data) code, because it counts instructions and not the amount of data each instruction processes.
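To make the numbers above concrete, here is the arithmetic written out. The symbol W (the maximum number of instructions the core can execute per cycle) is just a name introduced for this illustration; the value 4 and the 0.6-1.0 range are the ones quoted in this thread.

```latex
\[
\mathrm{CPI} = \frac{\text{clock cycles}}{\text{instructions retired}},
\qquad
\mathrm{IPC} = \frac{1}{\mathrm{CPI}}
\]
\[
W = 4 \;\Rightarrow\; \mathrm{CPI}_{\min} = \frac{1}{W} = 0.25,
\qquad
\mathrm{CPI} \in [0.6,\, 1.0] \;\Leftrightarrow\; \mathrm{IPC} \in [1.0,\, 1.67]
\]
```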
Yes you are right.
My intention was to write about the most CPU cycles consuming instructions.
>>> it means that one cycle can execute 4 instruction in parallel>>>
For example two store/loads and one branch and one int arithmetic instruction.
Thanks for replying.
I know I can get the CPI from the advanced-hotspots analysis. The problem is that the CPI from advanced-hotspots is the CPI of the application from the start to the end of execution. Suppose my application executes 1 million instructions in total. I want to see what the CPI is at 1k instructions, then at 2k instructions, then at 3k instructions, and so on up to 1 million instructions. (That is what I meant by traces of CPI.)
If traces of CPI per instruction interval are not possible, I can also work with cycles.
MrAnderson: I take it you mean that I can change the Sample After values of CPU_CLK_UNHALTED.THREAD and INST_RETIRED.ANY (500,5000,5000) to solve my problem, like this?
/opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD:sa=500 INST_RETIRED.ANY:sa=500 -knob enable-stack-collection=true -- ./proj
Peter Wang: I will be working at the application level.
iliyapolak: I do not exactly worry about what type of instructions the application has. At the start of the application CPI might be low but in the middle of the application CPI will increase (I am expecting this behavior).
>>>I want to check what is the CPI at 1k instructions then at 2k instructions then 3k instructions til 1 million instruction of the application. (that is what I meant from the traces of CPI)>>>
Bear in mind that an accurate CPI calculation should take into account the clock cycles of every counted instruction (grouped by instruction type).
> If my particular application has 1Million instructions in total. I want to check what is the CPI at 1k instructions then at 2k instructions then 3k instructions til 1 million instruction of the application. (that is what I meant from the traces of CPI).
As Mr. Anderson said, you can change the SAV (Sample After Value) of INST_RETIRED.ANY to see which instructions in your code are sampled at 1k instructions, 2k instructions, and so on. And you said:
/opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD:sa=500 INST_RETIRED.ANY:sa=500 -knob enable-stack-collection=true -- ./proj
But wait,
THIS IS NOT RECOMMENDED!!! The results will be unreliable, the overhead is huge, and some samples will be lost because the previous sample has not been processed yet. So tracing every 1K instructions should not be considered; please use 10K instructions as the sample interval, like this:
/opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD:sa=10000 INST_RETIRED.ANY:sa=10000 -knob enable-stack-collection=true -- ./proj
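Once you have samples for both events, a CPI trace over time could be estimated roughly as sketched below in Python. This is only a sketch under assumptions: it assumes you have somehow obtained per-sample records as (timestamp, event name) pairs, which is a hypothetical format and not something the commands above are known to print. Each sample is taken to stand for one Sample After Value worth of that event.

```python
from collections import defaultdict


def cpi_trace(samples, window, sav_clk, sav_inst,
              clk_event="CPU_CLK_UNHALTED.THREAD",
              inst_event="INST_RETIRED.ANY"):
    """Estimate CPI per time window from sampled events.

    samples  : iterable of (timestamp, event_name) pairs (hypothetical format)
    window   : window length, in the same unit as the timestamps
    sav_clk  : Sample After Value used for the clock event
    sav_inst : Sample After Value used for the instruction event
    Returns a list of (window start, estimated CPI), sorted by time.
    """
    # window index -> [number of clock samples, number of instruction samples]
    counts = defaultdict(lambda: [0, 0])
    for timestamp, event in samples:
        idx = int(timestamp // window)
        if event == clk_event:
            counts[idx][0] += 1
        elif event == inst_event:
            counts[idx][1] += 1

    trace = []
    for idx in sorted(counts):
        clk_samples, inst_samples = counts[idx]
        if inst_samples == 0:
            continue  # no instruction samples in this window, CPI undefined
        cycles = clk_samples * sav_clk
        instructions = inst_samples * sav_inst
        trace.append((idx * window, cycles / instructions))
    return trace
```

With sa=10000 each estimate is only as fine as a multiple of 10K events, so windows that contain very few samples will give noisy CPI values. Choose a window wide enough to hold many samples of both events.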
iliyapolak wrote:
>>>I want to check what is the CPI at 1k instructions then at 2k instructions then 3k instructions til 1 million instruction of the application. (that is what I meant from the traces of CPI)>>>
Bear in mind that an accurate CPI calculation should take into account the clock cycles of every counted instruction (grouped by instruction type).
Of course, I meant the scenario where you calculate the CPI manually for a few dozen or a few hundred instructions.
>>>I do not exactly worry about what type of instructions the application has>>>
Instruction types will affect CPI.
Thank you everyone for contributing to help me out.
@Maria M
As always you are welcome.
@MrAnderson
>>>Using Advanced Hotspots, the EIP is recorded after a specified number of cycles>>>
I thought that VTune mainly uses a clock interrupt to sample the instruction pointer.
@iliyapolak, You are correct. In the statement that you quoted, I did not specify *how* the EIP is recorded. ;)
In the case of CPU_CLK_UNHALTED, an interrupt is generated after a specific number of clock ticks/cycles and the EIP is recorded.
Thanks @MrAnderson :)
Hi,
I just found this thread similar to what I need. I have the following questions.
1. From the posts above, it seems that amplxe-cl -report always groups the results by something. How can I show *each* sample individually (sorted by time) without grouping, with the HW event counts shown for each sample? If there is no such option, I believe this information must be in the raw results. Is there a guide/tutorial for parsing the raw data?
2. I need to use instructions rather than cycles as the sampling interval, for example one sample every 1K instructions. In addition, the samples need to be taken per thread: for every 1K instructions executed by a thread, a sample is taken. I remember there being a sampling method option, but I can't find it now. Any idea how to do this?
Thank you very much.
> How can I show *each* samples individually(sorted by time) without grouping
By default, the report (from "amplxe-cl") displays samples by hot function; you can use group-by process | thread | module if you like. However, all samples are still attributed to functions. You can view samples by source line or assembly code in the GUI, or from the command line by referencing this article.
>...In addition, the sample needs to be taken per-thread. For example, for every 1K instructions of a thread, a sample is taken.
No, all samples are taken per core, not per thread. You don't actually know whether your thread will run on core 1, core 2, or another core unless you set the process affinity. You can use "amplxe-cl -report hw-events -r r000ah -group-by thread,function", for example. This displays the data per function, by thread. Is that what you expected?
MrAnderson (Intel) wrote:
@iliyapolak, You are correct. In the statement that you quoted, I did not specify *how* the EIP is recorded. ;)
In the case of CPU_CLK_UNHALTED, an interrupt is generated after a specific number of clock ticks/cycles and the EIP is recorded.
Do you know the frequency of that interrupt? Is it the clock interrupt?
Thank you, Peter, for your reply.
* By default, the report (from "amplxe-cl") displays samples by hot function; you can use group-by process | thread | module if you like. However, all samples are still attributed to functions. You can view samples by source line or assembly code in the GUI, or from the command line by referencing this article.
I am trying to avoid having the samples grouped into functions. Instead, each individual sample should be listed in order of the time at which it was taken. This is exactly what the GUI mode shows. See the following figure. We can see when the samples are taken and how many there are. What I need is to get this information in command-line mode (possibly using amplxe-cl -report) instead of the GUI, so that I can parse the results. The time at which each sample is taken is important to me, but I don't know how to get it from the command line, where all the samples are grouped by function.

* No, all samples are taken per core, not per thread. You don't actually know whether your thread will run on core 1, core 2, or another core unless you set the process affinity. You can use "amplxe-cl -report hw-events -r r000ah -group-by thread,function", for example. This displays the data per function, by thread. Is that what you expected?
Yes, thanks. This is what I need.