The following is a snapshot from VTune on my Haswell processor. I don't understand why the CPU time and the number of instructions retired for the highlighted instruction (vpbroadcastq) are so much greater than for the others in the same basic block. I thought the number of retired instructions should not differ much between instructions in the same block, even though there might be cache misses or TLB misses. Can someone explain some possible reasons for this? Thanks.
Thanks for the link, Peter. But the highlighted instruction is in the middle of the basic block. If this is due to hardware event skid, why are the counts recorded for the instructions before and after it not affected?