Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.



Hi all, 

After reading the Uncore Programming Guide, we want to monitor a common iMC derived metric, PCT_REQUESTS_PAGE_EMPTY, which is defined as (ACT_COUNT - PRE_COUNT.PAGE_MISS) / (CAS_COUNT.RD + CAS_COUNT.WR).

But when we collected ACT_COUNT and PRE_COUNT.PAGE_MISS, we found that PRE_COUNT.PAGE_MISS was larger than ACT_COUNT, which makes PCT_REQUESTS_PAGE_EMPTY negative and therefore unreasonable. How can this be? Can ACT_COUNT really be smaller than PRE_COUNT.PAGE_MISS?

I have read the Uncore Programming Guide, but I can't clearly understand the meaning of ACT_COUNT and PRE_COUNT.PAGE_MISS. Can anyone explain them for me? Thank you all.

3 Replies

Can you provide more details on which uncore guide (URL) and which chip (brand string, number of sockets) you are using? The more I have to dig to figure out what you are talking about, the less likely I will have time to research your question.



ACT_COUNT is the easy one to understand here.   It is the total number of DRAM "ACTIVATE" commands sent on the channel.  This is often called a "page open" command, as it causes a "page" (or "row") of data (usually 1 KiB) to be transferred from the DRAM array to the sense amps of each DRAM chip.  Pages must be "opened" before reading or writing and "ACT" is the only way to do this.  Once a page is "open", the read and write CAS ("Column Address Strobe") commands copy cache-line blocks out of or into the sense amps.

Pages can be held "open" (i.e., with a row in the sense amps) for a long time, but as more time passes, the likelihood that a subsequent access will be to the specific row that you are holding "open" decreases, and the likelihood that a subsequent access will need data from a different "row" in the same bank increases.  So Intel (like other vendors) has a "page close timer" that sends a "PRECHARGE" (i.e. page close) command to the DRAM after the open page has gone unaccessed for some amount of time.  The specific policies used by Intel processors are not clearly documented, though there are some patents that suggest sophisticated dynamically adapting policies -- keeping pages "open" for longer when there are more "page hits" and closing pages more rapidly if there are more page "conflicts".

So pages can be "closed" ("precharged") because the page idle timer timed out (PRE_COUNT.PAGE_CLOSE) or because another memory reference needed to access a different row in the same bank (PRE_COUNT.PAGE_MISS).

Intel's terminology is very confusing here!  

Usually, the term "page miss" means a reference comes to the DRAM and the desired page (row) is not in the sense amps.
Usually, the term "page conflict" means that a reference comes to the DRAM and a different page (row) is in the sense amps.
Logically, "page conflict" is a subset of "page miss", but in common usage "page miss" generally refers to the case where there is no data in the sense amps, so that all references can be divided into the additive categories of "page hit", "page miss", and "page conflict".

Intel's terminology is a bit different.  In the Xeon E5-2600 Uncore Performance Monitoring Guide, the categories are:  "page hit", "page empty", "page miss".   The "page hit" category is the same (thank goodness!), while the term "page miss" is used to refer to what most of us call "page conflicts" -- i.e., the wrong data is in the sense amps.   Then "page empty" refers to the cases for which there is no data in the sense amps.

Note that ACT commands are only needed if the access is *not* a page hit, so the total number of ACT commands should be equal to the number of PRE_COUNT.PAGE_CLOSE events plus the number of PRE_COUNT.PAGE_MISS events.  Unfortunately, this is just an approximation, because there is another way to close pages -- DRAM_PRE_ALL -- a special command that closes all open pages.  Unfortunately, there is no way to tell from the outside how many pages were actually open when the PRECHARGE_ALL command was issued -- it could be anywhere from 0 pages per rank to 8 pages per rank, and there can be anywhere from 1 to 6 (maybe 8?) ranks on each DRAM channel.   Normally the PRECHARGE_ALL command is used immediately before a DRAM REFRESH period, but it could be used at other times.    It is also possible to close pages by using the "AUTO_PRECHARGE" bit in a read or write (CAS) command.  (This is encoded as bit 10 of the column address.)  Note that the description of the PRE_COUNT events specifically says that page closing due to auto-precharge events is *not* counted -- it is only counting explicit PRECHARGE commands.   (Most systems only use auto-precharge when operating in "closed-page" mode, but it is available at any time.)
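As a rough illustration of that bookkeeping, here is a minimal sketch of the consistency check. The function name and the counts are made up for illustration; it assumes you have also collected PRE_COUNT.PAGE_CLOSE alongside ACT_COUNT and PRE_COUNT.PAGE_MISS over the same interval.

```shell
# Hypothetical helper: how many ACTIVATEs are not matched by an explicit
# PRECHARGE of either kind. Counts are passed as plain integers.
# Usage: unexplained_activates ACT_COUNT PRE_PAGE_CLOSE PRE_PAGE_MISS
unexplained_activates() {
    act=$1; pre_close=$2; pre_miss=$3
    echo $(( act - pre_close - pre_miss ))
}

# Example with made-up counts.
unexplained_activates 1000 300 600
```

A small positive result is what the approximation predicts: those pages were closed by PRECHARGE_ALL or auto-precharge, or were still open when counting stopped. A negative result would mean more explicit precharges than activates, which points at a measurement problem rather than at DRAM behavior.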

So PRECHARGE_ALL and AUTO_PRECHARGE provide cases for which the PRE_COUNT.* events *undercount* the actual number of page closings.  But the report is for PRE_COUNT.PAGE_MISS to exceed ACT_COUNT, which is certainly strange.   I have not seen this on my systems -- the values I get for the PCT_REQUESTS_PAGE_HIT, PCT_REQUESTS_PAGE_EMPTY, and PCT_REQUESTS_PAGE_MISS are all quite reasonable.

This brings up the question of the tool being used to count the events.  My initial experiments were all done by directly reading the performance counter registers in PCI configuration space, so I could be sure that all events are being counted.   Recent versions of the Linux kernel support access to these counters using the "perf" subsystem.  This appears to work on my systems, but it is certainly possible that "perf" is not counting these events all the time.  For example, the standard mode of operation only counts while your process is running, with the (virtual) counter values saved when your process is descheduled and restored when your process is rescheduled.  It is possible that the count of PRE_COUNT.PAGE_MISS (page conflicts) is elevated by the activity of code that ran while your code was descheduled -- for example, page conflicts could be due to evictions of dirty data that was written by a completely different process, and the ACTIVATE and PRE_COUNT events could be separated in time by such mechanisms.   I use the "-a" and "-A" options to "perf stat" to try to avoid missing activity that might lead to confusing results.   If you are using VTune, similar issues might apply, but I don't have any experience with the uncore counters using VTune.

My standard "perf stat" command to get memory traffic uses the same set of four events on all four memory controller channels on each Xeon E5-2680 chip.  The events are CAS_COUNT.RD, CAS_COUNT.WR, ACT_COUNT, and PRE_COUNT.PAGE_MISS.  These allow computing the derived metrics: MEMORY_BW_READS, MEMORY_BW_WRITES, PCT_REQUESTS_PAGE_HIT, PCT_REQUESTS_PAGE_EMPTY, and PCT_REQUESTS_PAGE_MISS.

# Set some variables to hold the very long list of options for "perf stat".
# The event strings are not quoted inside the variables: $SET1..$SET4 are
# expanded unquoted below, so any inner quotes would be passed to perf literally.
# SET1 = CAS_COUNT.RD, SET2 = CAS_COUNT.WR, SET3 = ACT_COUNT, SET4 = PRE_COUNT.PAGE_MISS
SET1='-e uncore_imc_0/event=0x04,umask=0x03/ -e uncore_imc_1/event=0x04,umask=0x03/ -e uncore_imc_2/event=0x04,umask=0x03/ -e uncore_imc_3/event=0x04,umask=0x03/'
SET2='-e uncore_imc_0/event=0x04,umask=0x0c/ -e uncore_imc_1/event=0x04,umask=0x0c/ -e uncore_imc_2/event=0x04,umask=0x0c/ -e uncore_imc_3/event=0x04,umask=0x0c/'
SET3='-e uncore_imc_0/event=0x01,umask=0x00/ -e uncore_imc_1/event=0x01,umask=0x00/ -e uncore_imc_2/event=0x01,umask=0x00/ -e uncore_imc_3/event=0x01,umask=0x00/'
SET4='-e uncore_imc_0/event=0x02,umask=0x01/ -e uncore_imc_1/event=0x02,umask=0x01/ -e uncore_imc_2/event=0x02,umask=0x01/ -e uncore_imc_3/event=0x02,umask=0x01/'

perf stat -o perf.output -x , -a -A -C 0,9 $SET1 $SET2 $SET3 $SET4 ./a.out

When I run the STREAM benchmark under this harness on a dedicated machine, the reported results are quite reasonable -- especially when I set the benchmark repetition count (NTIMES) to a larger value (e.g., 100) to minimize the fraction of the traffic that is associated with the data initialization phase.  (It is hard to tell how many times the OS is going to touch memory when initializing a 4 KiB page, for example.)
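For anyone who wants to turn the four raw counts into the page metrics, here is a minimal sketch using awk. The PCT_REQUESTS_PAGE_EMPTY formula is the one quoted in the original question; the PAGE_MISS and PAGE_HIT formulas are written as I recall them from the uncore guide, so treat them as assumptions and check them against your copy. The function name and the sample counts are made up.

```shell
# Hypothetical helper: derive the page hit/empty/miss fractions for one channel.
# Usage: page_metrics CAS_COUNT.RD CAS_COUNT.WR ACT_COUNT PRE_COUNT.PAGE_MISS
page_metrics() {
    awk -v rd="$1" -v wr="$2" -v act="$3" -v miss="$4" 'BEGIN {
        cas = rd + wr                      # total read + write CAS commands
        if (cas == 0) { print "no CAS commands counted"; exit 1 }
        empty = (act - miss) / cas         # formula quoted in the question
        pmiss = miss / cas                 # assumed: conflicts per request
        hit   = 1 - (empty + pmiss)        # assumed: whatever needed no ACT
        printf "hit=%.4f empty=%.4f miss=%.4f\n", hit, empty, pmiss
    }'
}

# Example with made-up counts for a single channel.
page_metrics 1000 0 300 100
```

If PRE_COUNT.PAGE_MISS exceeds ACT_COUNT, the "empty" value comes out negative, which is exactly the symptom reported at the top of this thread.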


Interesting. I ran an experiment on my Sandy Bridge platform, which has two Intel E5-2620 processors and twenty-four 4 GB DIMMs.

I monitored ACT_COUNT and PRE_COUNT.PAGE_MISS while running a memory read test with lmbench.

The test command is:

numactl -N 0 -m 1 ./bw_mem -P 6 -N 30 1024M rd

and the monitoring result is:

Channel               ACT_COUNT    PRE_COUNT.PAGE_MISS
Socket 0 Channel 0     10338532                5531546
Socket 0 Channel 1     10349095                5561072
Socket 0 Channel 2     10424098                5638015
Socket 0 Channel 3     10476117                5668632
Socket 1 Channel 0       148862                   1994
Socket 1 Channel 1        54473                   3174
Socket 1 Channel 2        49050                   1925
Socket 1 Channel 3        45774                   1759

ACT_COUNT is indeed much larger than PRE_COUNT.PAGE_MISS on every channel.

So, Qian, the problem may be in your PCM code or in your platform :)
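As a quick sanity check on the numbers above, here is a minimal sketch that computes the ratio of PRE_COUNT.PAGE_MISS to ACT_COUNT for the four busy channels on socket 0. The labels S0C0..S0C3 and the function name are just shorthand made up for this example; the counts are copied from the table.

```shell
# Hypothetical helper: reads "label ACT_COUNT PRE_COUNT.PAGE_MISS" lines on
# stdin and prints the label followed by PAGE_MISS / ACT_COUNT.
miss_to_act_ratio() {
    awk '$2 > 0 { printf "%s %.3f\n", $1, $3 / $2 }'
}

# Socket 0 counts from the table above.
miss_to_act_ratio <<'EOF'
S0C0 10338532 5531546
S0C1 10349095 5561072
S0C2 10424098 5638015
S0C3 10476117 5668632
EOF
```

A ratio below 1 on every channel is what you would expect; a ratio above 1 is the anomaly from the original question.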
