The VTune analyzer does not measure the number of cycles that every instruction takes. That would cause your code to run at about 1/1000 of its normal speed! :-( Instead, it periodically notes where the processor is executing code and gives you a statistically accurate picture of where the processor is spending its time. This means your code runs at almost full speed (we say <5% overhead for sampling).
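To see why periodic sampling can give a statistically accurate picture without timing every instruction, here is a toy model in Python. It is only a sketch and not how the VTune analyzer is implemented: the function name `sample_profile`, the routine names, and the time shares are all made up for illustration. Each sample is assumed to land in a routine with probability equal to the share of run time that routine really takes.

```python
import random
from collections import Counter

def sample_profile(time_shares, num_samples, seed=0):
    """Estimate each routine's share of run time from periodic samples.

    time_shares: hypothetical "true" fraction of run time per routine.
    Toy assumption: each sample lands in a routine with probability
    equal to that routine's true share of the run time.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    names = list(time_shares)
    weights = [time_shares[name] for name in names]
    hits = Counter(rng.choices(names, weights=weights, k=num_samples))
    return {name: hits[name] / num_samples for name in names}

# Hypothetical program: 10% parsing, 70% computing, 20% I/O.
true_shares = {"parse": 0.10, "compute": 0.70, "io": 0.20}
estimate = sample_profile(true_shares, num_samples=10_000)

for name, share in true_shares.items():
    print(f"{name:8s} true={share:.2f} sampled={estimate[name]:.3f}")
```

With a few thousand samples the sampled shares sit close to the true ones, and the program only pays for the samples, not for instrumenting every instruction.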
On the Intel Core2 processor family, using the VTune analyzer, you can roughly calculate stalls due to cache misses. See this paper for more info.
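As a rough idea of what such a calculation can look like, here is a back-of-the-envelope sketch in Python. It is an assumption on my part and not necessarily the method the paper uses: it multiplies a sampled cache-miss count by an assumed per-miss penalty and divides by the total cycle count. The function name, the parameter names, and all the numbers are hypothetical; plug in the counts you collect and a penalty appropriate for your system.

```python
def estimate_cache_stall_fraction(l2_misses, miss_penalty_cycles, total_cycles):
    """Rough fraction of cycles lost to cache misses.

    Assumes every miss costs the full penalty with no overlap between
    misses, so treat the result as an upper-bound style estimate.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    stall_cycles = l2_misses * miss_penalty_cycles
    # Cap at 1.0: the simple model can overshoot when misses overlap.
    return min(stall_cycles / total_cycles, 1.0)

# Hypothetical numbers, not measured data.
fraction = estimate_cache_stall_fraction(
    l2_misses=2_000_000,
    miss_penalty_cycles=200,  # assumed penalty in cycles
    total_cycles=1_000_000_000,
)
print(f"Estimated share of cycles stalled on cache misses: {fraction:.0%}")
```

Because out-of-order execution and prefetching can hide part of the miss latency, the real stall time is usually lower than this simple product suggests.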