I'm using the TRACKER_CYCLES_NE_ALL and TRACKER_OCCUPANCY_READS_LOCAL events to calculate how long requests wait before being processed by the memory controller. As the tracker file of the home agent has only 128 entries, occupancy / cycles should not be higher than 128, which is what I observe when running STREAM benchmarks.
However, when I run one of the SPEC OMP 2012 benchmarks (363.swim), I get a value higher than 500. Something must be wrong. Is there an explanation?
If the value is too high, then either the numerator is too high or the denominator is too low.
The numerator (the occupancy count) can be checked by dividing its change by the total elapsed uncore cycles. If it shows more than 128 increments per cycle, then there is a problem.
The denominator (the cycles-not-empty count) should also be checked against the total elapsed uncore cycles. SWIM should have outstanding memory accesses essentially every cycle, so these numbers should be quite similar.
There is a potential problem with the uncore counters in PCI configuration space. PCI configuration space reads are supposed to be performed with 32-bit load instructions, so it takes 2 instructions to load the lower and upper halves of the 48-bit counter registers. If the lower 32 bits of the counter overflow after you have read the lower half, but before you have read the upper half, the combined result will be inconsistent. In this case it will be about 2^32 counts too high. Reading the top and then the bottom does not help -- it just changes the direction of the error. This should not happen very often (only when the lower half of the counter is within a few hundred or a few thousand increments of 2^32-1), but it may be hard to identify when it does happen -- especially if you compute the differences and throw away the raw data values.
I don't know of any completely general solution to this problem, but it may be helpful to read each counter twice (lower, upper, lower, upper). If the lower half decreases from the first read to the second while the upper half stays the same, then the bottom half rolled over in the middle of the first read and you should use the second pair of values. There are probably other clever ideas that can be applied.
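The double-read idea can be sketched roughly as follows. This is only an illustration in Python, not real PCI configuration space access: `FakeCounter`, `read_lo`, `read_hi`, and `read_counter_twice` are hypothetical names, the counter is simulated so that it advances on every 32-bit read, and the sketch assumes the lower half wraps at most once during the four reads.

```python
MASK32 = 0xFFFFFFFF
MASK48 = (1 << 48) - 1


class FakeCounter:
    """Simulated 48-bit counter (hypothetical) that advances by `step` after each 32-bit read."""

    def __init__(self, value, step):
        self.value = value & MASK48
        self.step = step

    def _tick(self):
        self.value = (self.value + self.step) & MASK48

    def read_lo(self):
        v = self.value & MASK32          # lower 32 bits
        self._tick()
        return v

    def read_hi(self):
        v = (self.value >> 32) & 0xFFFF  # upper 16 bits
        self._tick()
        return v


def read_counter_twice(read_lo, read_hi):
    """Read lower, upper, lower, upper and pick a consistent pair.

    Assumes at most one wrap of the lower half during the four reads.
    """
    lo1 = read_lo()
    hi1 = read_hi()
    lo2 = read_lo()
    hi2 = read_hi()
    if lo2 < lo1 and hi1 == hi2:
        # The lower half rolled over in the middle of the first read,
        # so the first pair is inconsistent: use the second pair.
        return (hi2 << 32) | lo2
    # Otherwise the first pair was read without a wrap in between.
    return (hi1 << 32) | lo1


# Lower half is 0x10 increments away from wrapping when the read starts.
ctr = FakeCounter((5 << 32) | 0xFFFFFFF0, step=0x20)
value = read_counter_twice(ctr.read_lo, ctr.read_hi)
print(hex(value))
```

A naive single read (lower, then upper) of the same simulated counter would combine the old lower half with the already incremented upper half, which is the roughly 2^32 overcount described above.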
Hello Dr. Bandwidth,
thanks for your answer. The excessive values can be observed repeatedly. Over multiple executions of the SWIM benchmark (14 threads, 30 iterations, about 9 seconds of run time), I always got:
| Event | Counter | Values
| TRACKER_CYCLES_NE_ALL | BBOX0C0 | 26508879772
| TRACKER_OCCUPANCY_READS_LOCAL | BBOX0C1 | 2755375257019
| TRACKER_OCCUPANCY_READS_REMOTE | BBOX0C2 | 53303104001
| BBOX_CLOCKTICKS | BBOX0C3 | 26628464360
| TRACKER_CYCLES_NE_ALL | BBOX1C0 | 26509389571
| TRACKER_OCCUPANCY_READS_LOCAL | BBOX1C1 | 5316730925972
| TRACKER_OCCUPANCY_READS_REMOTE | BBOX1C2 | 47999526
| BBOX_CLOCKTICKS | BBOX1C3 | 26628464360
I'm also wondering why HA0 gives values smaller than 128, while HA1 gives values greater than 128.
Hmm... At least the new numbers are less than 500, but the HA1 result of 199.66 increments per cycle certainly should not be possible with a 128-entry tracker.
The values on HA0 look good -- about 106 entries per cycle. The HA1 values look like either a bug or an undisclosed "feature".... :-(
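For anyone who wants to redo the arithmetic, here is a small Python sketch that divides the posted occupancy counts by the posted clockticks and cycles-not-empty counts. The counter values are copied from the table above; the dictionary layout and the `per_cycle` helper are just illustrative names, and the 128 limit is the tracker size mentioned in the original question.

```python
TRACKER_ENTRIES = 128  # tracker size stated in the original question

# Raw counter values copied from the posted table.
counters = {
    "HA0": {
        "cycles_ne": 26508879772,
        "occ_local": 2755375257019,
        "occ_remote": 53303104001,
        "clockticks": 26628464360,
    },
    "HA1": {
        "cycles_ne": 26509389571,
        "occ_local": 5316730925972,
        "occ_remote": 47999526,
        "clockticks": 26628464360,
    },
}


def per_cycle(occupancy, cycles):
    """Average number of occupied tracker entries per cycle."""
    return occupancy / cycles


for ha, c in counters.items():
    local = per_cycle(c["occ_local"], c["clockticks"])
    total_ne = per_cycle(c["occ_local"] + c["occ_remote"], c["cycles_ne"])
    busy = c["cycles_ne"] / c["clockticks"]
    flag = "exceeds tracker size" if local > TRACKER_ENTRIES else "ok"
    print(f"{ha}: local reads/clocktick = {local:.2f} ({flag}), "
          f"all reads/non-empty cycle = {total_ne:.2f}, "
          f"non-empty fraction = {busy:.4f}")
```

The non-empty fraction is close to 1 on both home agents, so the cycle counts agree with each other and the suspicious quantity is the HA1 occupancy count.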
Which processor are you using?
If I were trying to figure this out, I would probably start over with very simple microbenchmarks (including cases where I could control the amount of concurrency -- e.g., pointer chasing latency tests with the HW prefetchers disabled) and collect all of the HA, CBo, and iMC counters, then re-run everything in each of the available snoop modes, then put everything into a giant spreadsheet and stare at it until it started to make sense.
I'm using a Broadwell-EP E5-2600 v4. However, similar values (for writes instead of reads) can be observed on a Haswell-EP E5-2699 processor.
I started with the simple STREAM benchmark, saw that all values were fine, and then moved to SWIM. Do you have a better micro-benchmark?