Is there a way, using VTune, to measure the number of processor cycles a program spends stalled while waiting for the memory subsystem to satisfy memory requests? An example is a cache miss, where dependent instructions cannot execute until the required data is available.
Intel's documentation mentions that a logical processor might halt on an I/O operation, in which case the Non-Halted Clockticks counter is not incremented. I would like to know whether the same holds for cache misses that require data to be fetched from main memory. If the Non-Halted Clockticks counter were not incremented during pipeline stalls caused by such misses, we could estimate the number of stalled cycles by subtracting Non-Halted Clockticks from Clockticks or from the Time-Stamp Counter.
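To make the idea concrete, here is a minimal sketch of the arithmetic I have in mind. The function name and the counter values are made up for illustration and are not VTune output. Whether the difference actually corresponds to memory stalls is exactly what I am unsure about.

```python
def estimate_stalled_cycles(clockticks: int, non_halted_clockticks: int) -> int:
    """Return clockticks minus non-halted clockticks.

    Hypothetical helper: it only measures memory stalls IF the non-halted
    counter stops during cache-miss stalls, which is the open question.
    """
    if non_halted_clockticks > clockticks:
        raise ValueError("non-halted clockticks cannot exceed total clockticks")
    return clockticks - non_halted_clockticks


# Made-up counter readings, for illustration only.
total_clockticks = 1_000_000
non_halted = 800_000
print(estimate_stalled_cycles(total_clockticks, non_halted))
```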
Any information in this regard will be appreciated.
I hesitate to answer this, since I don't count myself an expert on VTune for Itanium. Assuming you are talking about Itanium, my answer would be: probably yes, there are ways to count all memory stall events. Rather than the method you suggested, you might look at the BE_BUBBLE_ALL event and the various more specific events. But since you say "logical processor", I wonder whether you really mean Itanium.
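If you do collect BE_BUBBLE_ALL, one way to read it is as a fraction of the total cycle count. The sketch below assumes you also sample a total-cycles event (the parameter name cpu_cycles is my own choice), and the numbers are invented. As far as I understand it, BE_BUBBLE_ALL covers back-end stalls from all causes, so treat the result as an upper bound on memory stalls and use the more specific events to narrow it down.

```python
def stall_fraction(be_bubble_all: int, cpu_cycles: int) -> float:
    """Fraction of cycles reported as back-end bubbles.

    Both arguments are raw event counts from the same sampling run.
    The parameter names are illustrative, not a VTune API.
    """
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    return be_bubble_all / cpu_cycles


# Invented counts, for illustration only.
print(stall_fraction(be_bubble_all=250_000, cpu_cycles=1_000_000))
```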