Can I use Intel VTune to trace the CPI or IPC of an application?

Ayam · ‎07-09-2014

Hello,

I am looking to trace the IPC or CPI of an application over its execution. For example: the IPC at 1k instructions, the IPC at 2k instructions, and so on. I can also work with cycles or a specific time interval instead of instructions.

Can I extract this information from Intel VTune?

Or is there any other tool that I could use for this purpose?

I have an Intel Xeon and an Intel Atom to work with.

Regards,

David_A_Intel1 · ‎07-09-2014

Well, you said, "I can also work with the cycles or specific time interval instead of the instructions." That's pretty much what VTune Amplifier does. ;)

Using Advanced Hotspots, the EIP is recorded after a specified number of cycles. You can modify the sample after value of the CPU_CLK_UNHALTED events to change how often a sample is collected.

Peter_W_Intel · ‎07-09-2014

@ Maria M

I don't know if I understand your requirements correctly. Mr. Anderson is right: you need to use the advanced-hotspots collector, which gives the CPI value in the report after profiling. CPI is the average number of clocks per instruction, and IPC is the average number of instructions per clock. In other words, IPC = 1/CPI.

At what level do you want to trace the CPI or IPC of the program? You can get CPI or IPC from the report at different levels. For example,

1. Application (process) level

# amplxe-cl -report hw-events -group-by process -r r002ah/

2. Module level (you may only be interested in specific modules and not care about 3rd-party libraries or runtime modules)

# amplxe-cl -report hw-events -group-by module -r r002ah/

3. Function level (for functions in a specific module)

# amplxe-cl -report hw-events -filter module=primes.icc -group-by function -r r002ah

Note: the instructions-retired and CPU-clock counts are all available in the report, so you can write a script/analyzer to compute the CPI or IPC value.
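A minimal sketch of such a script follows. It assumes the report has been saved as CSV text with one row per process and one column per event; the column names and the sample numbers are assumptions for illustration, so check them against the headers your own report prints.

```python
import csv
import io

# Assumed column names; the headers in a real report may differ.
NAME_COL = "Process"
CYCLES_COL = "CPU_CLK_UNHALTED.THREAD"
INSTR_COL = "INST_RETIRED.ANY"


def cpi_ipc_from_report(report_text):
    """Return {name: (cpi, ipc)} for each row of a CSV-formatted report."""
    results = {}
    for row in csv.DictReader(io.StringIO(report_text)):
        cycles = float(row[CYCLES_COL])
        instructions = float(row[INSTR_COL])
        if cycles == 0 or instructions == 0:
            continue  # no counts for this row, so the ratios are undefined
        results[row[NAME_COL]] = (cycles / instructions, instructions / cycles)
    return results


# Made-up numbers, for illustration only.
sample_report = """Process,CPU_CLK_UNHALTED.THREAD,INST_RETIRED.ANY
proj,8000000,10000000
helper,3000000,1500000
"""

if __name__ == "__main__":
    for name, (cpi, ipc) in cpi_ipc_from_report(sample_report).items():
        print(f"{name}: CPI={cpi:.2f} IPC={ipc:.2f}")
```

The same function works for module-level or function-level reports if NAME_COL is changed to whatever the first column is called there.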

Bernard · ‎07-09-2014

>>>I don't know if I understand your requirements correctly. Mr.Anderson is right - you need to use advanced-hotspots collector which can get CPI value in the report after profiling...CPI is average of clocks per instructions, so IPC is average of instruction per clocks - in other words, IPC = 1/CPI >>>

In simple words, CPI is how many CPU cycles were needed to process a machine code instruction. Among the most time-consuming instructions in terms of CPU cycles are the x87 transcendentals fsin and fcos. Their execution can take dozens of cycles, probably because of the Horner scheme approximation done in microcode/hardware.

Peter_W_Intel · ‎07-10-2014

>>In simple words how many CPU cycles were needed to process some machine code instruction. One of the most time consuming instructions in terms of CPU cycles are x87 transcendental fsin and fcos. Their execution can take dozens of cycles probably because of Horner scheme approximation done in micro-code/hardware.

In current x86 microarchitectures, the best theoretical CPI is 0.25, which means that one cycle can execute 4 instructions in parallel; that is the processor's capability for simple instructions. A CPI in the range 0.6-1.0 is usually acceptable. However, this does not apply to complex instructions such as x87, SSE4.x and AVX instructions. Apart from x87, these are SIMD instructions.

So CPI is mainly useful for measuring the performance of single-instruction, single-data (scalar) instructions.

Bernard · ‎07-10-2014

Yes you are right.

My intention was to write about the instructions that consume the most CPU cycles.

>>> it means that one cycle can execute 4 instruction in parallel>>>

For example, two loads/stores, one branch and one integer arithmetic instruction.

Ayam · ‎07-10-2014

Thanks for replying.

I know I can get the CPI from the advanced-hotspots analysis. The problem is that the CPI from advanced-hotspots is the CPI of the whole application, from the start to the end of execution. Say my application executes 1 million instructions in total. I want to check the CPI at 1k instructions, then at 2k instructions, then at 3k instructions, and so on until the 1 million instructions of the application are done (that is what I meant by traces of CPI).

If tracing CPI by instruction count is not possible, then I can also work with cycles.
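To make the request concrete, here is a rough sketch of what such a CPI trace would look like if cumulative cycle counts were available at each instruction checkpoint. The readings are made-up numbers, and how to obtain them from a collector is left open; this only shows the arithmetic.

```python
def cpi_trace(checkpoints):
    """Compute per-interval CPI from cumulative counter readings.

    checkpoints: list of (cumulative_instructions, cumulative_cycles),
    sorted by instruction count.
    Returns a list of (instructions_at_end_of_interval, interval_cpi).
    """
    trace = []
    prev_instr, prev_cycles = 0, 0
    for instr, cycles in checkpoints:
        d_instr = instr - prev_instr
        d_cycles = cycles - prev_cycles
        if d_instr <= 0:
            raise ValueError("instruction counts must be strictly increasing")
        trace.append((instr, d_cycles / d_instr))
        prev_instr, prev_cycles = instr, cycles
    return trace


# Hypothetical readings taken every 1k instructions.
readings = [(1000, 600), (2000, 1500), (3000, 3100)]

if __name__ == "__main__":
    for instr, cpi in cpi_trace(readings):
        print(f"up to {instr} instructions: CPI = {cpi:.2f}")
```

Each entry is the CPI of that interval alone, not the running average since program start, which is the distinction being asked about.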

MrAnderson: Did you mean that I can change the sample-after values of CPU_CLK_UNHALTED.THREAD and INST_RETIRED.ANY (500,5000,5000) to solve my problem?

/opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD:sa=500 INST_RETIRED.ANY:sa=500 -knob enable-stack-collection=true -- ./proj

Peter Wang: I will be working at the application level.

iliyapolak: I am not particularly concerned about what types of instructions the application has. At the start of the application the CPI might be low, but in the middle of the application the CPI will increase (I am expecting this behavior).

Bernard · ‎07-10-2014

>>>I want to check what is the CPI at 1k instructions then at 2k instructions then 3k instructions til 1 million instruction of the application. (that is what I meant from the traces of CPI>>>

Bear in mind that a proper and accurate formula for calculating CPI should take into account the clock cycles of every counted instruction, grouped by instruction type.

http://en.wikipedia.org/wiki/Cycles_per_instruction
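As a small illustration of that formula, CPI is the cycle total over all instruction groups divided by the instruction total, i.e. sum(count_i * cycles_i) / sum(count_i). The instruction mix and per-group cycle costs below are invented for the example and do not describe any real processor.

```python
def weighted_cpi(groups):
    """Weighted-average CPI over instruction groups.

    groups: {group_name: (instruction_count, cycles_per_instruction)}
    """
    total_instr = sum(count for count, _ in groups.values())
    if total_instr == 0:
        raise ValueError("no instructions counted")
    total_cycles = sum(count * cycles for count, cycles in groups.values())
    return total_cycles / total_instr


# Hypothetical instruction mix: (count, cycles per instruction).
mix = {
    "alu": (500, 1),
    "load_store": (300, 2),
    "branch": (200, 3),
}

if __name__ == "__main__":
    print(f"CPI = {weighted_cpi(mix):.2f}")  # (500 + 600 + 600) / 1000 = 1.70
```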

Peter_W_Intel · ‎07-10-2014

> If my particular application has 1Million instructions in total. I want to check what is the CPI at 1k instructions then at 2k instructions then 3k instructions til 1 million instruction of the application. (that is what I meant from the traces of CPI).

As Mr. Anderson said, you can change the SAV (sample-after value) of INST_RETIRED.ANY to see which instructions in your code are sampled at 1k instructions, 2k instructions, and so on. And you proposed:

/opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD:sa=500 INST_RETIRED.ANY:sa=500 -knob enable-stack-collection=true -- ./proj

But wait,

THIS IS NOT RECOMMENDED!!! The results will be unreliable: the overhead is huge, and some samples will be lost because the previous sample has not yet been processed. So 1K-instruction tracing should not be considered; please use 10K instructions as the sample interval, like this:

/opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD:sa=10000,INST_RETIRED.ANY:sa=10000 -knob enable-stack-collection=true -- ./proj
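For a rough feel of what the sample-after value means for the resulting trace, the number of samples is approximately the total event count divided by the SAV. The sketch below uses the 1-million-instruction example from earlier in the thread and ignores lost samples.

```python
def expected_samples(total_events, sample_after):
    """Approximate number of samples: one per `sample_after` events."""
    if sample_after <= 0:
        raise ValueError("sample_after must be positive")
    return total_events // sample_after


TOTAL_INSTRUCTIONS = 1_000_000  # the 1M-instruction example from the thread

if __name__ == "__main__":
    for sav in (500, 1_000, 10_000):
        n = expected_samples(TOTAL_INSTRUCTIONS, sav)
        print(f"sa={sav}: about {n} samples")
```

With sa=10000, a 1-million-instruction run yields only about 100 points on the INST_RETIRED.ANY axis, so the trace is coarser than the 1k granularity originally asked for.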

Bernard · ‎07-10-2014

iliyapolak wrote:

>>>I want to check what is the CPI at 1k instructions then at 2k instructions then 3k instructions til 1 million instruction of the application. (that is what I meant from the traces of CPI>>>

Bear in mind that proper and accurate formula for calculating CPI should take into account every counted instruction (sorted by groups) clock cycles.

http://en.wikipedia.org/wiki/Cycles_per_instruction

Of course, I meant the scenario where you manually calculate the CPI of a few dozen or a few hundred instructions.

Bernard · ‎07-10-2014

>>>I do not exactly worry about what type of instructions the application has>>>

Instruction types will affect CPI.

Ayam · ‎07-11-2014

Thank you everyone for contributing to help me out.

Bernard · ‎07-11-2014

@Maria M

As always you are welcome.

Bernard · ‎07-11-2014

@MrAnderson

>>>Using Advanced Hotspots, the EIP is recorded after a specified number of cycles>>>

I thought that VTune mainly uses the clock interrupt to sample the instruction pointer.

David_A_Intel1 · ‎07-11-2014

@iliyapolak, You are correct. In the statement that you quoted, I did not specify *how* the EIP is recorded. ;)

In the case of CPU_CLK_UNHALTED, an interrupt is generated after a specific number of clock ticks/cycles and the EIP is recorded.
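As a back-of-the-envelope sketch of the resulting interrupt rate: with a CPU_CLK_UNHALTED event, a fully busy core generates roughly (core frequency / sample-after value) interrupts per second. The 2 GHz frequency below is an assumed figure, and the estimate ignores halted time and frequency changes.

```python
def interrupts_per_second(core_hz, sample_after):
    """Approximate sampling-interrupt rate on one fully busy core."""
    if sample_after <= 0:
        raise ValueError("sample_after must be positive")
    return core_hz / sample_after


ASSUMED_CORE_HZ = 2.0e9  # hypothetical 2 GHz core, not taken from the thread

if __name__ == "__main__":
    for sav in (500, 10_000, 2_000_000):
        rate = interrupts_per_second(ASSUMED_CORE_HZ, sav)
        print(f"sa={sav}: about {rate:,.0f} interrupts/s")
```

Under these assumptions a sample-after value of 500 means millions of interrupts per second on each busy core, which gives a sense of why very small values cause heavy overhead.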

Bernard · ‎07-12-2014

Thanks @MrAnderson :)

Tommy_W_ · ‎08-13-2014

Hi,

I just found this thread, which is similar to what I need. I have the following questions.

1. In the posts above, amplxe-cl -report always seems to group by something. How can I show *each* sample individually (sorted by time), without grouping, with the HW-event counts shown for each sample? If there is no such option, I believe this information must be in the raw results. Is there a guide/tutorial for parsing the raw data?

2. I need to use instructions rather than cycles as the sampling interval, e.g. one sample every 1K instructions. In addition, the samples need to be taken per-thread: for every 1K instructions of a thread, a sample is taken. I remember there is a sampling method option, but I can't find it now. Any idea how to do this?

Thank you very much.

Peter_W_Intel · ‎08-13-2014

> How can I show *each* samples individually(sorted by time) without grouping

The report (from "amplxe-cl") displays samples on hot functions by default; you can use group-by process | thread | module if you like. However, all samples are attributed to functions. You can view samples at the source-line or assembly level in the GUI, or from the command line by referring to this article.

>...In addition, the sample needs to be taken per-thread. For example, for every 1K instructions of a thread, a sample is taken.

No, all samples are taken per-core, not per-thread. You don't actually know whether your thread will run on core 1, core 2 or another core unless you set process affinity. You can use "amplxe-cl -report hw-events -r r000ah -group-by thread,function", for example. This displays data per function by thread. Is that what you expected?

Bernard · ‎08-14-2014

MrAnderson (Intel) wrote:

@iliyapolak, You are correct. In the statement that you quoted, I did not specify *how* the EIP is recorded. ;)

In the case of CPU_CLK_UNHALTED, an interrupt is generated after a specific number of clock ticks/cycles and the EIP is recorded.

Do you know the frequency of that interrupt? Is it the clock interrupt?

Tommy_W_ · ‎08-14-2014

Thank you, Peter, for your reply.

* The report (use "amplxe-cl") is to display samples on hot functions by default, you can use group-by process | thread | module if you like. However all samples will drop on functions. You can view samples on source line or assembly code by using GUI, or use command line by referencing this article.

I am trying to avoid having the samples grouped into functions. Instead, each individual sample should be listed in order of the time at which it was taken. This is exactly what the GUI shows. See the following figure. We can see when the samples are taken and how many of them there are. What I need is to get this information from the command line (possibly using amplxe-cl -report) instead of the GUI, so that I can parse the results. The time at which each sample is taken is important to me, but I don't know how to show it from the command line: all the samples are grouped by function.

* No, all samples are taken per-core, not per-thread. Actually you don't know your thread will work on core 1 or core 2 or other, unless you use affinity-process. You can get use "amplxe-cl -report hw-events -r r000ah -group-by thread,function", for example. This can display data on function by thread, you expected?

Yes, thanks. This is what I need.