Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Counting memory stalls on Sandy Bridge

liu_x_
Beginner

I have a question about performance counters. I am counting the stall cycles caused by memory accesses on Sandy Bridge. I believe the event OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD is a good measurement. However, it may ignore the overlap among consecutive loads, so the true number of stall cycles is smaller. Are there any events that take this overlap into account? Or how can I quantify the overcounting? As far as I know, this can be done easily with MBA on AMD processors, but there is no MBA on Intel processors.
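
For reference, one way to quantify that overlap is to pair the occupancy count with its CMASK=1 form ("cycles with at least one data read outstanding"). Below is a minimal sketch of the arithmetic; the Event 0x60 / Umask 0x08 encoding for OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD and the CMASK=1 variant are assumptions taken from the SDM event tables, so they should be double-checked for the specific model.

```c
#include <stdint.h>

/* occ    = OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD (assumed Event 0x60, Umask 0x08):
 *          increments by the number of outstanding data reads each cycle.
 * cyc_rd = the same event with CMASK=1: cycles with at least one read outstanding.
 * Their ratio is the average number of overlapping reads, i.e. how much the raw
 * occupancy count overstates "stalled" time when loads overlap.
 */
static double avg_outstanding_reads(uint64_t occ, uint64_t cyc_rd)
{
    return cyc_rd ? (double)occ / (double)cyc_rd : 0.0;
}
```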

Could anyone help me?

Thanks!

7 Replies
McCalpinJohn
Honored Contributor III

I have been looking at this as well, and when you get into the details it becomes apparent that precisely defining "stall" is more difficult than one might initially expect.

The forum thread at https://software.intel.com/en-us/forums/topic/506515 discusses some of these issues, particularly relating to the differences between "stalls" at issue, execution, and retirement.  The main lesson from that thread is that some parts of the processor can show lots of stalls even if there is at least one execution unit busy every cycle.

If one chooses to define "stall" as a cycle in which no execution units dispatch uops, then the performance counter event of interest appears to be Event 0xA3, Umask 0x04, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH.    This has additional Umasks that can be used to count cycles in which there are L1D demand misses pending (Umask 0x02) and L2 demand misses pending (Umask 0x01).  Although you cannot always combine Umasks, Intel's VTune defines both:

  • CYCLE_ACTIVITY.STALL_CYCLES_L1D_PENDING, Event 0xA3, Umask 0x06 (PMC2 only)
  • CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING, Event 0xA3, Umask 0x05

The last time I checked, these were not completely understood in the user community -- see the forum discussion at https://software.intel.com/en-us/forums/topic/501512

Even if this event were known to work correctly, there are still complexities and subtleties with this definition:

  1. This definition does not count the cases in which *fewer* uops are dispatched than would have been dispatched in the absence of cache misses.
  2. Not all uops that are dispatched constitute useful work.  This includes both traditional speculatively executed instructions and instructions that are "retried" because their operands are not in the cache when they try to execute.  (See the discussion of floating-point instruction overcounting in the first forum thread referenced above.)  

I will be doing testing with CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING this week to see if it gives numbers in the right ballpark for a couple of different cases with (presumably) known activity patterns. 

McCalpinJohn
Honored Contributor III

After doing some additional testing, I found events that count stalls at two different places (the raw hex encodings quoted below are unpacked in a short sketch after this list):

  • Two events that count cycles in which uops are not sent from the RAT (Register Alias Table -- the register renaming unit) to the RS (Reservation Station -- queues uops until the instructions defining their source operands have been dispatched, then dispatches "ready" uops to the execution ports):
    • Event 0x0E, Umask 0x01: UOPS_ISSUED with the CMASK and INVERT flags: 0x01c3010e
      • Intel's VTune calls this UOPS_ISSUED.STALL_CYCLES
    • Event 0xA2, Umask 0x01: RESOURCE_STALLS.ANY
      • Consistently delivers values about 1% to 3% lower than the UOPS_ISSUED.STALL_CYCLES event in my tests.
  • Two events that count cycles in which no uops are dispatched from the RS to any of the execution units (aka "ports").
    • Event 0xA3, Umask 0x04: CYCLE_ACTIVITY.CYCLES_NO_DISPATCH with CMASK=4:  0x044304a3
      • I got the CMASK value from VTUNE -- the documentation in Vol 3 of the SW Developer's Guide is not very helpful.
    • Event 0xB1, Umask 0x02: UOPS_DISPATCHED.STALL_CYCLES_CORE:   0x01c302b1
      • This is very similar to an event used by VTune, but I use Umask 0x02 rather than 0x01.  This will only make a difference on a system with HyperThreading enabled, and I don't have any systems configured that way to test right now.
    • These two events differed by no more than a part per million in my tests.
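
For anyone programming these by hand (writing IA32_PERFEVTSELx directly or passing a raw event to a tool), the hex values quoted above are just the PERFEVTSEL fields packed into one word; the 0x43 byte is the USR+OS+EN flag bits. A small sketch that rebuilds the quoted encodings from the Event/Umask/CMASK/INV values (field positions per the SDM; treat this as illustration, not a programming reference):

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* IA32_PERFEVTSELx field layout (Intel SDM Vol. 3):
 *   bits 7:0   event select      bits 15:8  unit mask (umask)
 *   bit  16    USR               bit  17    OS
 *   bit  22    EN                bit  23    INV
 *   bits 31:24 CMASK
 * The 0x43 byte in the encodings quoted above is USR|OS|EN.
 */
static uint64_t perfevtsel(uint8_t event, uint8_t umask, uint8_t cmask, int inv)
{
    return (uint64_t)event
         | ((uint64_t)umask << 8)
         | (1ULL << 16) | (1ULL << 17) | (1ULL << 22)      /* USR | OS | EN */
         | ((inv ? 1ULL : 0ULL) << 23)
         | ((uint64_t)cmask << 24);
}

int main(void)
{
    /* These should reproduce the hex values quoted in the list above. */
    printf("UOPS_ISSUED.STALL_CYCLES           0x%08" PRIx64 "\n", perfevtsel(0x0e, 0x01, 1, 1));
    printf("CYCLE_ACTIVITY.CYCLES_NO_DISPATCH  0x%08" PRIx64 "\n", perfevtsel(0xa3, 0x04, 4, 0));
    printf("UOPS_DISPATCHED.STALL_CYCLES_CORE  0x%08" PRIx64 "\n", perfevtsel(0xb1, 0x02, 1, 1));
    return 0;
}
```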

As discussed in the forum thread at https://software.intel.com/en-us/forums/topic/506515, the first two events can easily overcount stalls in codes that have a "stall-free" IPC of less than 4.  For example, a code with a "stall-free" IPC of 1 could show 75% stall cycles using these events, with uops transferred from the RAT to the RS in one block of 4 uops every 4 cycles (leaving 3 cycles idle).

The second two events typically undercount stalls because they consider a cycle to be a "non-stall" cycle if any uops are dispatched from the RS to the execution units, even when those uops subsequently get rejected and retried because their input data is not in the cache.   Using the STREAM benchmark as my test case, I often saw that the total number of uops dispatched to the execution ports was 20%-50% higher than the number of uops issued from the RAT to the RS.   (This was based on a small number of test cases which were not intended to approach the upper bound on uop retries, so I assume that the worst case fraction of retries could be much higher.   I have seen retries of floating-point instructions exceeding 12x, and that was not intended to be a worst-case upper bound either.)
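
A hedged sketch of how that comparison could be made on Linux with perf_event_open: it counts the plain (no CMASK/INV) forms of the two events above around a memory-bound loop. The raw configs are derived from the Event/Umask values in the list above and should correspond to the events usually named UOPS_ISSUED.ANY and UOPS_DISPATCHED.CORE (an assumption, since only the Event/Umask numbers are quoted here); the test loop is only a stand-in.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open one raw core event.  With PERF_TYPE_RAW the config is just
 * event | (umask << 8); the USR/OS/EN bits of the quoted MSR encodings are
 * handled by the kernel via the attr flags, so they are not included here.
 */
static int perf_open(uint64_t raw_config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = raw_config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;            /* user-mode counts only */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

#define N (1L << 22)                    /* ~32 MB per array: larger than the LLC */
static double a[N], b[N], c[N];

int main(void)
{
    /* Plain-count forms of the two events above (no CMASK/INV qualifiers):
     * issued     = Event 0x0E, Umask 0x01  -> raw 0x010e  (uops RAT -> RS)
     * dispatched = Event 0xB1, Umask 0x02  -> raw 0x02b1  (uops RS -> ports, whole core)
     */
    int fd_issued = perf_open(0x010e);
    int fd_dispatched = perf_open(0x02b1);
    if (fd_issued < 0 || fd_dispatched < 0) {
        perror("perf_event_open");
        return 1;
    }

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    ioctl(fd_issued, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_dispatched, PERF_EVENT_IOC_ENABLE, 0);

    for (long i = 0; i < N; i++)        /* memory-bound STREAM-like kernel */
        a[i] = b[i] + 3.0 * c[i];

    ioctl(fd_issued, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_dispatched, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t issued = 0, dispatched = 0;
    read(fd_issued, &issued, sizeof(issued));
    read(fd_dispatched, &dispatched, sizeof(dispatched));

    printf("uops issued:     %llu\n", (unsigned long long)issued);
    printf("uops dispatched: %llu\n", (unsigned long long)dispatched);
    if (issued)
        printf("dispatch/issue ratio: %.2f\n", (double)dispatched / (double)issued);
    return 0;
}
```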

Unfortunately, there is no way to count these execution retries directly, and no way to determine how many cycles had instructions dispatched that were all rejected and retried. 

Note that one can also count cycles in which no instructions are retired.  This was also discussed in the forum thread above, and has the same theoretical problem as counting at issue -- the processor can retire at least four instructions per cycle, so if the non-stalled IPC is less than four, burstiness of instruction retirement can result in non-zero stall cycle counts even if there are some instructions executing every cycle.

None of this discussion so far has explicitly dealt with the cause of the stalls.   Intel provides a very interesting performance counter event that provides some insight into this issue.   Event 0xA3 CYCLE_ACTIVITY has Umasks for "CYCLES_L2_PENDING" (0x01) and "CYCLES_NO_DISPATCH" (0x04).  Again, the documentation in Vol 3 of the SW developer's guide is not adequate to understand how to program this unit, but fortunately Intel's VTune provides an example.   The VTune event CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING is created with this event by combining the two Umasks and including a CMASK value of 5, giving the encoding: 0x054305a3.   (It is not at all clear why the CMASK value should be 5 in this case, but the event is clearly non-standard since the combined Umask values are treated as a logical AND rather than the logical OR typically assumed for combined Umasks.)   

In experiments with the STREAM benchmark, where the actual number of stall cycles should be around 90%, the values produced by CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING varied between 30% and 93% of the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH counts (without the L2_PENDING qualifier).   The lower values were seen with tests using streaming (nontemporal) stores, while the higher values were seen using ordinary (allocating) stores.  This pattern makes it clear that this event counts store misses (RFO's) in the "L2_PENDING" category, but it  leaves a "hole" in the memory stall cycle identification in the case where the memory stalls are due to streaming stores. 

  • For AVX codes there is an event that catches this reasonably well: Event 0xA2, Umask 0x08: RESOURCE_STALLS.SB (cycles with no issue from the RAT to the RS because the store buffers are full) shows 70%-91% of the total cycles have issue stalls due to full store buffers.   So looking at the max of CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING and RESOURCE_STALLS.SB gives a good indication of stalls due to memory for codes with either allocating stores or streaming stores (a small sketch of this combination follows this list).
  • For SSE codes with streaming stores the RESOURCE_STALLS.SB event is only 20%-37% of the total cycles.  Even if you add the percentage stalls from this number to the percentage stalls using CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING you only get 45% - 59% of the total cycles,  so I don't yet have a set of events that can identify that all of the stall cycles are actually memory stalls.   (Adding stall cycles in this way is not generally a good idea, since cycles can be stalled for both reasons.  I only add the two here to show that they are both much too small to account for all of the stall cycles.)
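
A minimal sketch of that max-of-the-two recipe (the three counts are assumed to be collected over the same interval; the collection itself is not shown):

```c
#include <stdint.h>

/* Combine CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING (0x054305a3) and
 * RESOURCE_STALLS.SB (Event 0xA2, Umask 0x08) by taking the larger of the
 * two, not their sum, since a cycle can be stalled for both reasons at once.
 */
static double memory_stall_fraction(uint64_t stall_l2_pending,
                                    uint64_t resource_stalls_sb,
                                    uint64_t unhalted_cycles)
{
    uint64_t stalls = stall_l2_pending > resource_stalls_sb
                    ? stall_l2_pending : resource_stalls_sb;
    return unhalted_cycles ? (double)stalls / (double)unhalted_cycles : 0.0;
}
```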

 

Patrick_F_Intel1
Employee

Hello Liu,

Dr. McCalpin provides excellent info above.

Here are more events you can use: the ratio UNC_ARB_TRK_OCCUPANCY.ALL / UNC_CLOCKTICKS tells you the average number of memory requests outstanding per uncore clocktick. This gives you an idea of how many requests are simultaneously outstanding.

Also, UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL tells you the average number of uncore clockticks a memory request stays allocated in the LLC. This is usually referred to as the LLC latency (Last Level Cache miss latency, in uncore clockticks per LLC miss). It doesn't include the time to fetch the cache line from the LLC to the L1.

The effective latency is UNC_CLOCKTICKS / UNC_ARB_TRK_REQUESTS.ALL (in uncore clockticks per LLC miss).

For a single-threaded, load-to-use, dependent-load latency test, you should see the LLC latency <= the latency reported by the test, and you should see about 1 request outstanding per cycle during such a test.

If you run a memory bandwidth test (with, say, an 8-byte sequential stride) then you should see the effective LLC latency drop (significantly) and you should see multiple requests outstanding per cycle. The LLC latency will probably be larger than in the single-threaded latency test due to the many requests outstanding, but the effective latency will be much smaller.

This doesn't really tell you how stalled you are... you could always try reducing the size of your arrays until the work fits in the LLC. Then the difference in performance per unit of work (between the in-cache and out-of-cache cases) is how stalled you are per unit of work.
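
The three ratios described above are just simple arithmetic on the raw counts; a small sketch (how the uncore counts are collected is not shown here):

```c
#include <stdint.h>

struct uncore_ratios {
    double avg_outstanding;    /* requests in flight per uncore clocktick          */
    double llc_miss_latency;   /* uncore clockticks a request stays allocated      */
    double effective_latency;  /* uncore clockticks per request, overlap included  */
};

/* clockticks    = UNC_CLOCKTICKS
 * trk_occupancy = UNC_ARB_TRK_OCCUPANCY.ALL
 * trk_requests  = UNC_ARB_TRK_REQUESTS.ALL
 */
static struct uncore_ratios compute_uncore_ratios(uint64_t clockticks,
                                                  uint64_t trk_occupancy,
                                                  uint64_t trk_requests)
{
    struct uncore_ratios r = {0.0, 0.0, 0.0};
    if (clockticks)   r.avg_outstanding   = (double)trk_occupancy / (double)clockticks;
    if (trk_requests) r.llc_miss_latency  = (double)trk_occupancy / (double)trk_requests;
    if (trk_requests) r.effective_latency = (double)clockticks    / (double)trk_requests;
    return r;
}
```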

Pat

liu_x_
Beginner

Thanks for your reply.  I found and tested the event CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING. However, I found a similar event in the documentation in Vol 3 of the SW developer's guide: Event A3H, Umask 01H, CYCLE_ACTIVITY.CYCLES_L2_PENDING with Cmask=2, which is larger than CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING in my test. I am not clear about the difference between these two events.

McCalpinJohn
Honored Contributor III

The Intel documentation on Event 0xA3 is incomplete and confusing, but I did do some testing before I decided to use the Cmask value provided by Intel's VTune for my Sandy Bridge EP processors.   Smaller Cmask values produce larger counts, sometimes exceeding the actual cycle count.  This suggests that the event is incrementing multiple times per cycle (which of course is a prerequisite for the Cmask thresholding function to be useful).

The VTune event for CYCLE_ACTIVITY.CYCLES_NO_DISPATCH (0x044304a3) uses a Cmask of 4, while the VTune event for CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING (0x054305a3) uses a Cmask of 5, and the VTune event for CYCLE_ACTIVITY.STALL_CYCLES_L1D_PENDING (0x064306a3) uses a Cmask of 6.  

These specific values used for the Cmask field are a little strange -- it is hard to understand why an increment of 4 would correspond to the case for which uops are dispatched to zero ports (and not dispatched to 6 ports), but this Cmask value is required to get the counter to return the same values as the UOPS_DISPATCHED.STALL_CYCLES_CORE event, so it seems like the right answer.  

As I noted earlier, combining umasks (when allowed) usually generates a logical OR of the masks, while this event is using the combined umasks to create a logical AND function.  My hypothesis for how this event obtains logical AND functionality is that this event increments the counter by the sum of the umask values on each cycle for each of the umasks for which the corresponding condition is true.  I.e.,

  • with the "L2_PENDING" Umask bit set (0x01) the raw event count is increased by one whenever there is a pending L2 demand miss (load or RFO); 
  • with the "L1D_PENDING" Umask bit set (0x02) the raw event count is increased by two whenever there is a pending L1 Dcache demand miss; and
  • with the "CYCLES_NO_DISPATCH" Umask bit set (0x04), the raw event count is increased by four whenever there are no uop dispatches in a given cycle. 

This would account for all of the valid use cases from VTune:

  • CYCLE_ACTIVITY.CYCLES_L2_PENDING uses Cmask=1
  • CYCLE_ACTIVITY.CYCLES_L1D_PENDING uses Cmask=2
  • CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING uses Cmask=5 (4+1)
  • CYCLE_ACTIVITY.STALL_CYCLES_L1D_PENDING uses Cmask=6 (4+2)

If this interpretation is correct, then other Cmasks will not be useful.  Cmask=0 disables the thresholding feature, so the counter should increment by the sum of the "true" umasks in each cycle, which seems completely useless.   Cmask=3 should provide the same value as Cmask=1, since a demand L2 miss requires that there also be a demand L1 miss.   The same reasoning applies to Cmask=7, which should give the same value as Cmask=5.  (Of course even if my interpretation is correct there is no guarantee that the logic was implemented to provide the sum of more than two umasks, since none of the valid use cases allow more than two umask bits to be defined.)
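
If that interpretation is right, the counter behaves like the toy model below: each cycle the raw increment is the sum of the umask weights whose conditions hold, and the CMASK threshold turns that sum back into a count of cycles where the combined conditions are all true. This is only a sketch of the hypothesis above, not a description of the actual hardware.

```c
#include <stdio.h>
#include <stdint.h>

#define L2_PENDING   0x1    /* demand L2 miss outstanding this cycle  */
#define L1D_PENDING  0x2    /* demand L1D miss outstanding this cycle */
#define NO_DISPATCH  0x4    /* no uops dispatched this cycle          */

/* Hypothesized behavior: the raw per-cycle increment is the sum of the umask
 * weights (1, 2, 4) whose conditions are true; with CMASK != 0 the counter
 * adds 1 only in cycles where that sum reaches the CMASK threshold.
 */
static uint64_t model_count(const uint8_t *cycles, int n, uint8_t umask, uint8_t cmask)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++) {
        unsigned raw = cycles[i] & umask;   /* sum of the true umask weights */
        if (cmask == 0)
            total += raw;                   /* thresholding disabled */
        else if (raw >= cmask)
            total += 1;
    }
    return total;
}

int main(void)
{
    /* Four example cycles: stalled with an L2 miss pending, stalled with only
     * an L1D miss pending, busy with an L2 miss pending, busy and miss-free. */
    uint8_t cycles[] = {
        NO_DISPATCH | L1D_PENDING | L2_PENDING,
        NO_DISPATCH | L1D_PENDING,
        L1D_PENDING | L2_PENDING,
        0,
    };
    int n = (int)(sizeof(cycles) / sizeof(cycles[0]));

    printf("CYCLES_L2_PENDING        (umask 0x01, cmask 1): %llu\n",
           (unsigned long long)model_count(cycles, n, L2_PENDING, 1));
    printf("STALL_CYCLES_L2_PENDING  (umask 0x05, cmask 5): %llu\n",
           (unsigned long long)model_count(cycles, n, L2_PENDING | NO_DISPATCH, 5));
    printf("STALL_CYCLES_L1D_PENDING (umask 0x06, cmask 6): %llu\n",
           (unsigned long long)model_count(cycles, n, L1D_PENDING | NO_DISPATCH, 6));
    return 0;
}
```

With these four example cycles the model gives 2, 1, and 2 respectively, which matches the "logical AND" reading of the combined umasks described above.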

Xueqi_L_
Beginner

I know your reply was a long time ago, but it is really helpful for me. I'd like to ask another question: where can I find the documentation you refer to when you say 'The Intel documentation on Event 0xA3 is incomplete and confusing, but I did do some testing before I decided to use the Cmask value provided by Intel's VTune for my Sandy Bridge EP processors.'? Is it located in the path /opt/intel/vtune_amplifier_xe_2016.1.1.434111/target/linux64/config/sampling3? Thank you very much ^_^

 

McCalpinJohn
Honored Contributor III

Yes, that is where I found the VTune database files....

There is also useful information on the Intel hardware performance counters in the various model-specific sub-directories at https://download.01.org/perfmon/

 
