Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Instructions retired variation and memory pressure counters

halivingston
Beginner

I'm using these counters via RDPMC between two code sites. So, I call RDPMC, store this data, and then call RDPMC again.

It turns out that across multiple iterations of the code, this count (INST_RETIRED.ANY) differs. Is this because of speculation? I thought the retired-instructions counter was not affected by speculation.

But then again, I read somewhere (maybe here?) that branch mispredicts can inflate this number, or some such thing.

Next question is about memory allocations. I'm trying to figure out if I can use BUS_TRANS_MEM to somehow get a feel for them. Would a ratio of INST_RETIRED_LOADS/INST_RETIRED_STORES be helpful as well? L1/L2 cache miss events don't seem very suitable for this.

Just a reminder, this is all between two code sites; I don't use sampling. I'm going to turn on 7 counters (3 fixed and 4 programmable) and always run it on our public-facing production service.
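As a point of reference, the measurement pattern described above can be sketched as below. This is a minimal sketch, assuming a Linux x86-64 target where CR4.PCE has been enabled (otherwise RDPMC faults in user mode) and where fixed counter 0 counts INST_RETIRED.ANY; the 48-bit counter width and the helper names are assumptions, not anything from the original post:

```c
#include <stdint.h>

/* Fixed-function counters are selected via RDPMC by setting bit 30 of ECX;
 * on most Intel cores, fixed counter 0 counts INST_RETIRED.ANY. */
#define RDPMC_FIXED_INSTR_RETIRED ((1u << 30) | 0)

static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* PMU counters are narrower than 64 bits (often 48), so handle
 * wraparound when computing the delta between the two code sites. */
static uint64_t counter_delta(uint64_t start, uint64_t end, unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ull : ((1ull << width) - 1);
    return (end - start) & mask;
}
```

Between the two code sites you would then do something like `uint64_t t0 = rdpmc(RDPMC_FIXED_INSTR_RETIRED);` ... `uint64_t retired = counter_delta(t0, rdpmc(RDPMC_FIXED_INSTR_RETIRED), 48);`.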

4 Replies
Patrick_F_Intel1
Employee
There are many things that can cause instructions retired to vary.
If your code has any polling or spin loops, or contention for locks, then the amount of work (instructions) done can vary. So can a different load on the CPU (such as on servers), network traffic, etc.
Or your code gets swapped out and something else runs.

I don't think any event will tell you about memory allocations... assuming you actually are trying to measure when memory is allocated (as opposed to just using memory).
OS events (like Windows ETW tracing or Linux ftrace data) are probably the appropriate source for memory allocation monitoring. Or just instrument your code.

Although you are not using sampling, it sounds like sampling would be useful to verify, when you have 2 runs with different numbers of instructions retired, that you really are doing the same work (same code paths, etc.). You wouldn't have to run sampling all the time, but it would be helpful to check that the assumptions you are making (about your code's behavior) are correct.
Pat
halivingston
Beginner
Thanks, Pat.

Is there more to read about contention/spin loops, etc.? I didn't understand the point about different load on the CPU or network traffic.

Every time my code initiates network calls, it switches to the OS, and at every context switch I save the number of instructions retired.

I was hoping the inst. retired would work exactly for the swapped out case. It'll be different even if I save the counter?

Hmmm

It's not wildly different, and it is certainly better than RDTSC, which varies much more. I'm guessing because RDTSC is a measure of time vs. work done?

Also, I did mean memory usage, and not necessarily allocations, my bad. Knowing that, what would you recommend?
Patrick_F_Intel1
Employee
I don't have a good reference for lock contention and spin loops. The details would depend on which OS and which type of synchronization method you are using. The basic idea is that, if there is contention for a lock, then one thread may try multiple times to get the lock. This can lead to varying numbers of instructions retired.

For network traffic, if your box is hooked up to the network, it may have to deal with more network traffic than what you are expecting (such as network shares).

And I forgot to mention things like virus protection which can lead to varying instructions retired.

Unless you are actually hooking into the OS context switch logic then I doubt you are accurately seeing 'every context switch'. The OS can swap you out due to some other higher priority process needing to run or your quantum of time can be used up.

To see EVERY context switch you need to use OS tracing (as I mentioned above), but this creates tons of data (100s of MBs) and is probably more than you want to know.

But all this is just guesswork... you really need some data (such as from VTune) to verify your assumptions.
Plus, you could use VTune to select memory usage events and see which ones are useful to your needs.
Pat
Tokponnon__Parfait

I made the same observation when counting instructions in user space between two syscalls. I repeated the experiment two times, the first time with my pages read-only (so page faults expected) and the second time with my pages RW (so no page faults), and I noticed variation between the two counts. I use MSR_PERF_FIXED_CTR0 to count instructions.

Is it true that page faults may lead to an extra instruction count when using INST_RETIRED.ANY?
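One thing worth checking in this situation is whether the fixed counter is configured to count ring 0 as well as ring 3: the page-fault handler runs in the kernel, so an OS+USR configuration would include its instructions in the count. Below is a sketch of building the IA32_FIXED_CTR_CTRL (MSR 0x38D) value from its documented 4-bit per-counter fields (bit 0 = count in OS/ring 0, bit 1 = count in USR/ring 3); actually writing the MSR requires ring 0 or a driver, and the helper name is an assumption for illustration:

```c
#include <stdint.h>

/* Each fixed-function counter gets a 4-bit control field in
 * IA32_FIXED_CTR_CTRL: bit 0 enables counting in ring 0 (OS),
 * bit 1 enables counting in ring 3 (USR). */
static uint64_t fixed_ctr_ctrl(unsigned counter, int count_os, int count_usr)
{
    uint64_t field = (count_os ? 1u : 0u) | (count_usr ? 2u : 0u);
    return field << (4 * counter);
}
```

For example, `fixed_ctr_ctrl(0, 0, 1)` yields 0x2, enabling fixed counter 0 (INST_RETIRED.ANY) for user mode only, so instructions retired inside the kernel's page-fault handler would not be included.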
