Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Event-based sampling: what can be done?

I am sampling the OS, where a method in a driver seems to take a long time. I am looking at the "instructions retired" and "clock ticks" counters, and here is the hottest spot found in this method (drill-down view):

CConnection *pConnection = pConnectionRule->GetConnection(); "1,161" "1,953" ""
mov edi, DWORD PTR [ebp+08h] "3" "6" ""
mov eax, DWORD PTR [edi+08h] "6" "5" ""
mov DWORD PTR [ebp+08h], eax "1,137" "1,916" ""
mov ebx, ecx "9" "14" ""
and eax, 0ffh
mov esi, edi "6" "12" ""

The first column is "instructions retired"; the second is "clock ticks".
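The drill-down rows agree with the source-line totals. Summing each column over the listed instructions gives exactly 1,161 and 1,953, and the third instruction alone accounts for about 98% of both. Below is a small C++ sketch of that arithmetic. The `Row` type and the helper names are made up for illustration, and the counts are the ones shown in the listing.

```cpp
#include <array>
#include <cassert>
#include <numeric>

// One row of the drill-down listing: counts as displayed by the tool.
struct Row {
    const char* instruction;
    long instructions_retired;
    long clock_ticks;
};

// Rows copied from the listing; "and eax, 0ffh" shows no counts there
// and is therefore omitted.
constexpr std::array<Row, 5> kRows{{
    {"mov edi, DWORD PTR [ebp+08h]", 3, 6},
    {"mov eax, DWORD PTR [edi+08h]", 6, 5},
    {"mov DWORD PTR [ebp+08h], eax", 1137, 1916},
    {"mov ebx, ecx", 9, 14},
    {"mov esi, edi", 6, 12},
}};

// Column totals; these should match the source-line row (1,161 / 1,953).
long total_retired() {
    return std::accumulate(kRows.begin(), kRows.end(), 0L,
                           [](long acc, const Row& r) { return acc + r.instructions_retired; });
}

long total_ticks() {
    return std::accumulate(kRows.begin(), kRows.end(), 0L,
                           [](long acc, const Row& r) { return acc + r.clock_ticks; });
}

// Ratio of the two displayed columns for one row.
double column_ratio(const Row& r) {
    assert(r.instructions_retired > 0);  // avoid dividing by zero
    return static_cast<double>(r.clock_ticks) / r.instructions_retired;
}
```

For the hot row, `column_ratio(kRows[2])` is about 1.685 (1916 / 1137). Whether a ratio of displayed counts can be read directly as CPI depends on how those counts were scaled.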

Some questions:

1. In the "selection summary" view of the same method, I see the following:

Instructions retired: 2916
Clock ticks: 4629
CPI: 10.95

How is this possible, when 4629 / 2916 ≈ 1.59?
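The following is one possible explanation, not confirmed in this thread. In event-based sampling, each sample stands for a "sample after" number of events. If the counts shown are sample counts, the event totals, and therefore the CPI, must be scaled by each event's sample-after value before dividing. When both events use the same sample-after value, the ratio of the displayed counts equals the CPI (about 1.59 here). Unequal values change the result. Here is a minimal C++ sketch. `cpi_from_samples` is a hypothetical helper, and the sample-after values are parameters, not figures taken from this profile.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: CPI from sample counts, assuming each sample of an
// event represents `*_sample_after` occurrences of that event.
double cpi_from_samples(std::uint64_t clk_samples, std::uint64_t clk_sample_after,
                        std::uint64_t inst_samples, std::uint64_t inst_sample_after) {
    assert(inst_samples > 0 && inst_sample_after > 0);
    // Scale samples back up to (estimated) event counts before dividing.
    const double clocks = static_cast<double>(clk_samples) * clk_sample_after;
    const double instructions = static_cast<double>(inst_samples) * inst_sample_after;
    return clocks / instructions;
}
```

With a common value `s`, `cpi_from_samples(4629, s, 2916, s)` returns about 1.59. Under this assumption, a reported 10.95 would correspond to a clock-tick sample-after value about 6.9 times that of instructions retired.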

2. According to the assembly drill-down, the third assembly instruction is clearly the problem (by the way, is it really the third, or the second?). Can someone explain why this move is so expensive? Maybe because the data is not cached? If so, is there any way to overcome this?

Thanks in advance.
Honored Contributor III
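The events aren't "exact" (the sampled instruction pointer can skid past the instruction that actually caused the event), so it's likely that some of the stalls belong to the preceding instruction, the load `mov eax, DWORD PTR [edi+08h]`. With apparently two levels of indirection, you have plenty of opportunities for cache misses. The cache-miss counters are there to help you verify that.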
Unless you can process the events in a more regular order, so that the hardware prefetcher kicks in, or implement some kind of software prefetch strategy, there may not be much to be done. This might be a place for the helper-thread strategy, where another thread does nothing but run ahead and touch the data, so that it is already in cache when the main thread needs it. That is not easy to program or maintain, so you would want to verify the cache misses before working in that direction.
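The software-prefetch idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the driver's code. `ConnectionRule`, `Connection`, `sum_connection_ids` and `kPrefetchDistance` are hypothetical. The loop also assumes the rules to be processed are known in advance as a list, which is what makes prefetching possible at all. Reaching a connection takes two dependent loads, so the sketch prefetches in two stages. First it prefetches a rule object far ahead. Later it prefetches the connection that rule points to, once the rule itself should already be in cache. It uses `_mm_prefetch` on MSVC and the GCC/Clang builtin `__builtin_prefetch` elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(_MSC_VER)
#include <xmmintrin.h>
#define PREFETCH(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#else
#define PREFETCH(p) __builtin_prefetch(p)
#endif

// Hypothetical stand-ins for the driver's types: each rule points at a
// connection, so reaching the connection costs two dependent loads.
struct Connection { int id; };
struct ConnectionRule { Connection* connection; };

// How many iterations ahead to prefetch: a tuning knob, not a known-good value.
constexpr std::size_t kPrefetchDistance = 8;

int sum_connection_ids(const std::vector<ConnectionRule*>& rules) {
    const std::size_t n = rules.size();
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Stage 1: start bringing a rule object two distances ahead into cache.
        if (i + 2 * kPrefetchDistance < n) {
            PREFETCH(rules[i + 2 * kPrefetchDistance]);
        }
        // Stage 2: that rule was prefetched earlier, so reading its pointer
        // should be cheap; now prefetch the connection it points to.
        if (i + kPrefetchDistance < n) {
            PREFETCH(rules[i + kPrefetchDistance]->connection);
        }
        // The real work: two dependent loads, ideally hitting cache by now.
        const Connection* c = rules[i]->connection;
        assert(c != nullptr);
        sum += c->id;
    }
    return sum;
}
```

The prefetch distance has to be tuned by measurement, for example by re-checking the cache-miss events after the change. Prefetches are only hints, so a badly chosen distance costs performance but not correctness.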