I am sampling the OS, where a method in a driver seems to take a long time. I am looking at the "instructions retired" and "clock tick" counters, and here is the hottest spot found in this method: (drill down view)
2. According to the assembly drill-down, the third assembly instruction is clearly the problem (by the way, is it the third or the second instruction?). Can someone explain why this move is so problematic? Maybe because the data is not cached? In that case, is there any way to overcome this problem?
The events aren't "exact": samples are attributed with some skid, so it's likely that some of the stalls belong to the preceding instruction. With apparently 2 levels of indirection, you have plenty of opportunities for cache misses. The cache miss counters are there to help you verify that. Unless you can order the memory accesses in a more regular way, so that hardware prefetch kicks in, or implement some kind of software prefetch strategy, there may not be much to be done. This might be a place for the helper thread strategy, where you have another thread which just tries to issue all the prefetches ahead of time. That is not easy to program or maintain, so you would want to verify the cache misses before working in that direction.
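To make the software prefetch idea a bit more concrete, here is a rough sketch of what it could look like for a doubly indirect access. I can't see your driver's data structures, so everything here is an assumption: the `node`/`payload` layout, the `sum_with_prefetch` name, and the `PREFETCH_DISTANCE` value are all made up for illustration. It uses the GCC/Clang `__builtin_prefetch` builtin; with the Microsoft compiler you would probably use `_mm_prefetch` instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout with two levels of indirection:
 * table[i] -> struct node -> struct payload. */
struct payload { long value; };
struct node    { struct payload *data; };

/* Assumed look-ahead in iterations; the right value has to be found by measurement. */
enum { PREFETCH_DISTANCE = 8 };

static long sum_with_prefetch(struct node *const *table, size_t n)
{
    long total = 0;

    assert(table != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Stage 1: request the node that will be used 2*D iterations from now. */
        if (i + 2 * PREFETCH_DISTANCE < n)
            __builtin_prefetch(table[i + 2 * PREFETCH_DISTANCE], 0, 1);

        /* Stage 2: the node used D iterations from now was requested
         * D iterations ago, so reading its pointer should usually hit the
         * cache; use that pointer to request the payload behind it. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(table[i + PREFETCH_DISTANCE]->data, 0, 1);

        /* The actual work: the doubly indirect load that was stalling. */
        total += table[i]->data->value;
    }
    return total;
}
```

The prefetches are only hints, so the result is the same with or without them. Whether this helps at all depends on the cache miss counters confirming that the load is really missing, and on there being enough independent work per iteration to cover the memory latency.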