Hi,
As anyone who cares about performance knows, the number of branch mispredicts and the latency to get back on the correct path are important drivers of performance. Some of my applications have a fair number of mispredicts, so I decided to explore these costs on the chips I have on hand, a Sandy Bridge (SB) and an Ivy Bridge (IB). I was very surprised by the reduction in the "minimum" redirect latency when redirecting from the uop cache (uop$), and I was also surprised by the increase when the redirect does not come from the uop$. Using a list of indirect jmps, I determined that the minimum redirect latency is 15 clks on both SB and IB from the uop$, and I'm observing 22 clks (SB) and 23 clks (IB) when redirecting from the IF/DE, which doesn't happen often due to the high hit rate in the uop$.
My question is twofold:
* Are my estimates of the minimum redirect latency from the IF/DE correct? It would be good to know whether I understand my findings above.
* If they are correct, then why is it 7-8 clks longer? I know I may not get a response to this, and it doesn't affect me; I'm just asking out of curiosity.
Overall, a very interesting endeavor. Thanks in advance for any pointers or help in understanding.
perfwise
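To make the numbers in the post above concrete, here is a minimal C sketch of the arithmetic. The 15, 22 and 23 clk figures are the ones reported above; the weighted-average model and the `uop_hit_rate` parameter are assumptions added for illustration, since the thread gives no actual uop$ hit rate.

```c
/* Redirect latencies reported in the post above, in clocks. */
#define REDIRECT_UOP_CLKS     15.0  /* from the uop$, SB and IB */
#define REDIRECT_IFDE_SB_CLKS 22.0  /* from IF/DE on Sandy Bridge */
#define REDIRECT_IFDE_IB_CLKS 23.0  /* from IF/DE on Ivy Bridge */

/* Extra clocks paid when the redirect comes from IF/DE instead of the uop$. */
double extra_ifde_clks(double from_uop, double from_ifde)
{
    return from_ifde - from_uop;
}

/* Assumed model: average redirect latency as a weighted mix of the two
 * paths. uop_hit_rate is a hypothetical input in [0, 1]; it is not a
 * number taken from the thread. */
double expected_redirect_clks(double uop_hit_rate, double from_uop, double from_ifde)
{
    return uop_hit_rate * from_uop + (1.0 - uop_hit_rate) * from_ifde;
}
```

Calling `extra_ifde_clks(REDIRECT_UOP_CLKS, REDIRECT_IFDE_SB_CLKS)` gives the 7 clk gap on SB, and the same call with `REDIRECT_IFDE_IB_CLKS` gives the 8 clk gap on IB. With a high hit rate the weighted average stays close to 15 clks, which matches the remark that the longer path is rarely taken.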
IF = Instruction Fetch and DE = Decode, essentially what the ILD (instruction length decoder) in the Intel optimization guide is doing. I just wonder why the redirect takes 7 more clocks from there than it does from the uop$.
Perfwise
Thanks for the explanation.
How did you get your measurements? Was that VTune or another Intel monitoring tool?
I wrote a directed test to measure this, which I do often to understand the drivers of my code's performance.
Perfwise
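For readers wondering what a directed test of this kind might look like, here is a rough C sketch. It is not perfwise's test, which was not shared and is described as C plus assembly with a list of indirect jmps. This version substitutes indirect calls through a function-pointer table and the portable `clock()` timer, so it only shows the shape of the experiment (a predictable target pattern versus a hard-to-predict one) and cannot resolve single-clock latencies. All names here (`fill_pattern`, `run_chain`, `time_chain`, `t0`..`t3`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical branch targets; each does trivial, distinct work. */
static uint64_t t0(uint64_t x) { return x + 1; }
static uint64_t t1(uint64_t x) { return x ^ 0x55; }
static uint64_t t2(uint64_t x) { return x * 3; }
static uint64_t t3(uint64_t x) { return x - 7; }

typedef uint64_t (*target_fn)(uint64_t);
static target_fn const targets[4] = { t0, t1, t2, t3 };

/* Fill pattern[] with target indices: all zero (easy to predict) or
 * pseudo-random in 0..3 (hard to predict). */
void fill_pattern(unsigned char *pattern, size_t n, int randomize, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++)
        pattern[i] = randomize ? (unsigned char)(rand() & 3) : 0;
}

/* Walk the pattern, dispatching through the table once per entry.
 * The result is returned so the compiler cannot drop the loop. */
uint64_t run_chain(const unsigned char *pattern, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = targets[pattern[i]](acc);
    return acc;
}

/* Time one walk in seconds of CPU time; a real test would read a cycle
 * counter or a PMC instead of clock(). */
double time_chain(const unsigned char *pattern, size_t n, uint64_t *result)
{
    clock_t start = clock();
    *result = run_chain(pattern, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

One way to use it: fill a large buffer once with `randomize = 0` and once with `randomize = 1`, call `time_chain` on each, and divide the time difference by the number of mispredicted entries for a rough per-mispredict cost. The compiler may turn the table dispatch into something other than an indirect branch, which is one reason a hand-written assembly test is the better tool for this kind of measurement.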
Yes... but the PMCs and MSRs are poorly documented.
perfwise wrote:
Yes... but the PMCs and MSRs are poorly documented.
Me too. Unfortunately, for now I have only a Core i3 CPU, so I cannot use the functionality of the Xeon uncore PMU. If you are interested, I can share my work with you.
Do you use inline assembly in your code?
My code is a combination of C and assembly, I don't use inline asm. I'd rather not share my work but I use this forum to relay my experiences.
Perfwise
perfwise wrote:
My code is a combination of C and assembly, I don't use inline asm. I'd rather not share my work but I use this forum to relay my experiences.
Perfwise
Thanks perfwise
I mostly use inline assembly where I need to access MSRs to set and read counter and register values. How do you display the gathered data? For this purpose I use the DbgPrint or KdPrint functions, with DebugView to capture and print those values.
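As an illustration of the "setting counters" step mentioned above, here is a minimal C sketch that builds the value one would write to IA32_PERFEVTSEL0 to count the architectural "branch misses retired" event (event select 0xC5, umask 0x00). The bit layout and MSR addresses follow the architectural performance monitoring description in the Intel SDM. The helper names are hypothetical, and the privileged `wrmsr`/`rdmsr` instructions themselves are deliberately left out because they only work in ring 0, for example inside a driver.

```c
#include <stdint.h>

/* Architectural performance-monitoring MSR addresses (Intel SDM). */
#define IA32_PERFEVTSEL0 0x186u  /* event select register for counter 0 */
#define IA32_PMC0        0x0C1u  /* counter 0 */

/* Selected control bits within IA32_PERFEVTSELx. */
#define PERFEVTSEL_USR (1ull << 16)  /* count while in user mode (CPL > 0) */
#define PERFEVTSEL_OS  (1ull << 17)  /* count while in kernel mode (CPL 0) */
#define PERFEVTSEL_EN  (1ull << 22)  /* enable the counter */

/* Hypothetical helper: pack event select (bits 0-7) and umask (bits 8-15)
 * with the privilege-level and enable bits. */
uint64_t perfevtsel_encode(uint8_t event, uint8_t umask,
                           int count_user, int count_kernel)
{
    uint64_t v = (uint64_t)event | ((uint64_t)umask << 8);
    if (count_user)
        v |= PERFEVTSEL_USR;
    if (count_kernel)
        v |= PERFEVTSEL_OS;
    return v | PERFEVTSEL_EN;
}

/* Architectural "branch misses retired" event, counted in user and kernel mode. */
uint64_t branch_misses_retired_sel(void)
{
    return perfevtsel_encode(0xC5, 0x00, 1, 1);
}
```

In a driver, the value from `branch_misses_retired_sel()` would be written to `IA32_PERFEVTSEL0` and the count later read back from `IA32_PMC0`; how that write is issued (inline assembly or a compiler intrinsic) depends on the toolchain and is not shown here.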