Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

R2PCIe Test

GHui
Novice

I have collected the RING_THRU_DN_BYTES and RING_THRU_UP_BYTES events.

When I run the glxgears program, DN ~= 5200 MB and UP ~= 8000 MB.

When I run stream, DN ~= 4200 MB and UP ~= 9800 MB.

I can't judge whether the values I collected are right.

 

McCalpinJohn
Honored Contributor III

The description of the RING_BL_USED event is:

Definition: Counts the number of cycles that the BL ring is being used at this ring stop. This includes when packets are passing by and when packets are being sunk, but does not include when packets are being sent from the ring stop.

So the event measures *any* traffic that is moving on the ring at this stop.  Since the ring is the only connection between the memory controllers, the L3 caches, and the cores, data moved by any mechanism will often pass by the R2PCIe unit on the BL ring.

Unfortunately, this means that you need to know the relative locations of all the units on the ring.  If you can measure traffic at the R2PCIe unit and at the next stop on the ring, the difference will be packets that were sent from the R2PCIe box minus the packets that were "sunk" at the R2PCIe box.   It looks like you would need counts at all stops on the ring to determine the sources and sinks at each ring stop.
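As a rough illustration of the adjacent-stop difference described above, here is a minimal sketch. It assumes a single ring direction and that both counts have already been converted to packets; the function name and the sample counts are hypothetical, not taken from any Intel tool or measurement.

```python
def net_sent_minus_sunk(count_at_stop, count_at_next_stop):
    """Packets sent from a ring stop minus packets sunk at it.

    Both arguments are ring-used counts for the same ring direction:
    one measured at the stop of interest (e.g. R2PCIe), the other at
    the next stop downstream. Hypothetical helper for illustration.
    """
    return count_at_next_stop - count_at_stop


# Hypothetical counts, not real measurements.
r2pcie_count = 1_000_000
next_stop_count = 1_150_000
print(net_sent_minus_sunk(r2pcie_count, next_stop_count))
```

A positive result would mean the stop injected more than it consumed; the difference alone does not say how much was sent and how much was sunk.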

GHui
Novice

Thanks for your help.

I don't clearly understand "does not include when packets are being sent from the ring stop". If the event doesn't include those packets, how can I get the packets that are sent from and sent to the ring stop?

 

McCalpinJohn
Honored Contributor III

The computed metrics show the traffic on the *inbound* side of the ring at the R2PCIe box.   This will include the traffic that is stopping at the R2PCIe box as well as the traffic that is continuing past the R2PCIe box to a different destination.    Because it measures traffic on the *inbound* side of the R2PCIe box, it will not count traffic that the R2PCIe unit injects onto the ring. 

If the traffic were measured on the *outbound* side of the ring at the R2PCIe box, it would include the traffic that is continuing past the R2PCIe unit as well as the traffic injected by the R2PCIe unit, but it would not include the traffic that stopped at the R2PCIe box.  
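The inbound/outbound distinction can be written down as a small bookkeeping sketch. The class and field names below are hypothetical, and the numbers are made up; the only point is which of the three traffic classes each measurement point would see.

```python
from dataclasses import dataclass


@dataclass
class StopTraffic:
    """Traffic classes at one ring stop (hypothetical model)."""
    passing: int   # packets continuing past the stop
    sunk: int      # packets that stop at this box
    injected: int  # packets this box places onto the ring

    def inbound(self):
        # What an inbound-side measurement would see.
        return self.passing + self.sunk

    def outbound(self):
        # What an outbound-side measurement would see.
        return self.passing + self.injected


t = StopTraffic(passing=700, sunk=200, injected=50)
print(t.inbound(), t.outbound(), t.outbound() - t.inbound())
```

In this model the outbound count at one stop is the inbound count at the next stop, so the difference between the two is the injected traffic minus the sunk traffic.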

One could imagine different performance counter events, but the events that are defined appear to be measuring total traffic on the ring at the *inbound* side of the R2PCIe box.   Given measurements of traffic at all of the boxes on the ring, it should be possible to solve a system of linear equations to derive the quantity of traffic placed onto the ring at each stop and the quantity of traffic removed from the ring at each stop.  I think that is what most of us are interested in measuring.    I have not yet determined whether Intel has provided enough documentation to derive the appropriate system of equations.
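To make the idea of a system of equations concrete, here is a toy sketch under strong assumptions: a single one-directional ring, a known stop order, and one inbound count per stop. All names and numbers are hypothetical. In this simplified model each equation is just flow conservation between adjacent stops, and the counts by themselves pin down only the net value (injected minus removed) at each stop; separating the two would need further constraints that this sketch does not attempt to model.

```python
def inbound_counts(inject, sink, arriving_at_first):
    """Forward model: inbound traffic seen at each stop of the ring.

    inject[i] / sink[i] are the packets placed on / removed from the
    ring at stop i; arriving_at_first is the inbound traffic at stop 0.
    """
    counts = [arriving_at_first]
    for i in range(len(inject) - 1):
        counts.append(counts[i] - sink[i] + inject[i])
    return counts


def net_per_stop(counts):
    """Solve the conservation equations for (injected - removed) per stop."""
    n = len(counts)
    return [counts[(i + 1) % n] - counts[i] for i in range(n)]


# Hypothetical 4-stop ring; total injected equals total removed.
inject = [50, 0, 300, 10]
sink = [200, 100, 0, 60]
counts = inbound_counts(inject, sink, arriving_at_first=900)
print(counts)
print(net_per_stop(counts))
```

A real processor has rings in both directions and more than one ring, so the actual equations would be larger and would depend on the box positions.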

To make things worse, the system of equations depends on the relative position of the boxes on the ring, which is only presented schematically in the Intel documentation. The relative positions of the boxes will be different for each processor configuration, and may be different for different instances of the *same* model if they have different cores or L3 slices disabled. E.g., 10-core parts are probably built from 12-core die, represented in Figure 1-2 of the Xeon E5 v3 uncore performance monitoring guide. The relative positions of the boxes (and therefore the structure of the system of linear equations) will be different for each combination of disabled L3 slices. There are 66 distinct pairs of disabled slices in a 12-slice system: any of the 12 slices can be the first one disabled and any of the remaining 11 slices can be the second, which gives 12 × 11 = 132 ordered choices, and each pair is counted twice. Many of these combinations are very different; for example, you might have 2, 3, or 4 slices on the secondary ring, so you might have very different traffic requirements through the SBox units.
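The slice-pair count can be checked by enumeration. This is only a counting sketch: it numbers the slices 0 to 11 (an arbitrary labeling, not Intel's) and says nothing about which pairs actually occur in shipped parts.

```python
from itertools import combinations, permutations

slices = range(12)  # arbitrary labels for the 12 L3 slices

# Order matters: "first disabled" and "second disabled" are distinguished.
ordered = list(permutations(slices, 2))

# Order ignored: only the set of two disabled slices matters.
unordered = list(combinations(slices, 2))

print(len(ordered), len(unordered))
```

The two totals differ by a factor of two because each set of two disabled slices can be picked in either order.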
