I know how to measure the L2 cache misses of my functions. Is there a way to measure the CPU stalls those misses cause (if any), in msec or CPU cycles?
If it cannot be measured directly, can I estimate it from other events or ratios, such as CPI?
L2 cache misses introduce stalls in the CPU pipeline. The number of stall cycles is roughly proportional to the number of L2 cache miss events, with the miss penalty as the factor. On a Core 2 system the penalty is about 130 CPU clockticks (worst case); check the optimization manual for your particular microarchitecture. So a rough estimate (in CPU cycles) can be made by multiplying the number of L2 cache miss events by the penalty. Please make sure you are counting events, not samples. You can see the number of events directly in the VTune Hotspot results, or multiply the number of samples by the SAV (Sample After Value) for the event.
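As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function names, the sample count, the SAV and the 2.4 GHz clock are made-up values for illustration only; the 130-cycle penalty is the worst-case Core 2 figure quoted above and should be replaced with the number for your microarchitecture.

```python
# Worst-case estimate of stall cycles caused by L2 cache misses.
# All names and example numbers are hypothetical, not VTune output.

def events_from_samples(samples: int, sav: int) -> int:
    """Convert a sample count to an event count using the Sample After Value."""
    return samples * sav


def worst_case_stall_cycles(l2_miss_events: int, penalty_cycles: int = 130) -> int:
    """Upper-bound estimate: every miss is assumed to cost the full penalty."""
    return l2_miss_events * penalty_cycles


def cycles_to_msec(cycles: int, cpu_hz: float) -> float:
    """Convert a cycle count to milliseconds for a given core clock frequency."""
    return cycles * 1000.0 / cpu_hz


if __name__ == "__main__":
    samples = 1200          # hypothetical number of L2 miss samples
    sav = 10_000            # hypothetical Sample After Value for the event
    events = events_from_samples(samples, sav)
    cycles = worst_case_stall_cycles(events)
    print(events, cycles, cycles_to_msec(cycles, cpu_hz=2.4e9))
```

Since this assumes every miss pays the full penalty with no overlap, the result is an upper bound rather than a measurement.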
Vladimir has pointed out how to estimate the worst-case impact. The out-of-order engine might hide some of the latency. An event that can give you more insight is RS_UOPS_DISPATCHED.CYCLES_NONE. It measures the cycles in which no micro-op is dispatched for execution, i.e. the execution units are waiting for work. Obviously, there might be reasons for this other than cache misses, but this event can show you whether you have an issue.
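One way to put this event to use is sketched below in Python. It is a heuristic, not an established method: it assumes you have also collected total unhalted core cycles (for example CPU_CLK_UNHALTED.CORE) over the same code region, and all function names and numbers are hypothetical.

```python
# Heuristic sanity check using cycles in which no micro-op was dispatched.
# Inputs are event counts (not samples) collected over the same code region.

def no_dispatch_fraction(cycles_none: int, total_cycles: int) -> float:
    """Fraction of cycles in which the execution units received no work."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return cycles_none / total_cycles


def bounded_l2_stall_estimate(l2_miss_events: int, cycles_none: int,
                              penalty_cycles: int = 130) -> int:
    """Cap the worst-case miss estimate at the observed no-dispatch cycles.

    Assumption: a cycle in which micro-ops were still dispatched is not
    counted as a stall, so misses cannot account for more than cycles_none.
    """
    worst_case = l2_miss_events * penalty_cycles
    return min(worst_case, cycles_none)


if __name__ == "__main__":
    total = 5_000_000_000        # hypothetical total unhalted core cycles
    none = 1_000_000_000         # hypothetical RS_UOPS_DISPATCHED.CYCLES_NONE
    misses = 12_000_000          # hypothetical L2 miss events
    print(no_dispatch_fraction(none, total))
    print(bounded_l2_stall_estimate(misses, none))
```

If the worst-case estimate is much larger than the no-dispatch cycle count, part of the miss latency was probably overlapped with other work. The capped value is still only an upper bound, because no-dispatch cycles can have other causes.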
You can also look at the cycle accounting from David Levinthal.
It will give you a better idea of where your CPU cycles are being spent.