We've had some trouble interpreting performance counter results on P4/Xeon (collected with VTune 2.0 on Linux). Maybe we are just misreading the documentation, but anyway:
1) The IA-32 Architecture Optimization document says that the P4's hardware counter "2nd Level Cache Read Misses" has bugs that can cause miscounting by a factor of two. Since measurements of the same code with the same data size give reproducible counts in VTune, the bug presumably occurs only under specific circumstances. Is anything known about what those circumstances are? Some algorithms seem to produce reliable counts, while others are obviously miscounted. It would be great if a correct result could be derived from the measurements plus some assumptions or estimates.
2) If data is loaded that is not in the 2nd level cache, the cache loads two cache lines from memory. Is that counted as one event or two for "2nd Level Cache Read Misses"? And which counters count L2 cache store misses?
3) "2nd Level Cache Load Misses Retired" counts the loads that missed in the L2 cache, and "2nd Level Cache Read Misses" counts the memory load misses as seen by the bus queue (VTune Reference). Can it be assumed that the difference between the two is a measure of 2nd level cache write misses (allowing for some error from instruction loads and so on)? If not, how else could write misses be determined?
4) The P4 Architecture Optimization document notes that the P4's event "2nd Level Cache Load Misses Retired" 'is known to undercount when loads are apart'. Could you explain why and when this occurs, and give an estimate of the undercounting factor for a specific code example?
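
For reference, the estimate asked about in 3) boils down to the arithmetic below. This is only a sketch: the function and parameter names are made up for illustration (they are not VTune identifiers), and it assumes both counts come from the same run of the same code. Whether the result means anything is exactly what question 3) asks.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of VTune.
 *
 * read_misses_bus     : count reported for "2nd Level Cache Read Misses"
 * load_misses_retired : count reported for "2nd Level Cache Load Misses Retired"
 *
 * Returns the signed difference of the two counts. The result is signed
 * on purpose: a negative value would show that the assumption behind the
 * estimate does not hold for that measurement.
 */
static int64_t estimate_l2_write_misses(uint64_t read_misses_bus,
                                        uint64_t load_misses_retired)
{
    return (int64_t)read_misses_bus - (int64_t)load_misses_retired;
}
```

Any miscounting in either event (the factor of two from 1) or the undercounting from 4)) goes straight into this difference, which is why the other questions matter for it.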