
Monitor ECC memory status?

Joe_H_1

Hello all:

    I followed an earlier post with the same title and found the Intel Xeon E7 CPU manual (actually a datasheet). The URL is listed below:

https://software.intel.com/en-us/forums/topic/393904

According to that manual, there is an FSV event related to ECC memory status. What I need, though, are the corresponding registers for the Intel Xeon E5-2650. Can anyone please help? I would appreciate it.

                             Joe

McCalpinJohn

Background: According to the big study by Google (http://research.google.com/pubs/pub35162.html), only about 8% of DIMMs experience one or more errors per calendar year, but the DIMMs that have errors sometimes have them at fairly high rates.  The average was about 4000 per year (but with a very skewed distribution) -- so if you don't see any errors in a few days of operation, your DIMMs are probably OK.

Procedure: Under Linux it is relatively easy to set up one of the Uncore iMC performance counters to count ECC_CORRECTABLE_ERRORS.  Since this event increments extremely infrequently on most systems, you won't need to worry about the counter overflowing. Just set it up and then check it every week or so (provided that you are not using the iMC performance counters for anything else).

As an example, for the Xeon E5-2650 the following code will set up iMC Counter 3 on each of the four channels on each socket to count correctable ECC errors.  First you have to figure out which buses your system uses for the Uncore performance counters.  My systems use either 3f and 7f or 7f and ff (for socket 0 and socket 1, respectively).  The easiest way to find this is to run this simple command:

# lspci | grep :10.1
7f:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)
ff:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)

The text description will vary from system to system, but the first characters of each line show the buses that correspond to the two sockets.  The examples below will use bus 7f for socket 0 and bus ff for socket 1.
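
If you would rather not hard-code the bus numbers, the lookup can be scripted. The following is only a minimal sketch: uncore_buses is a hypothetical helper name (not part of lspci or any other tool), and it makes the same assumption as the grep command above, namely that every device listed at address 10.1 belongs to an iMC.

```shell
#!/bin/bash

# Hypothetical helper: read lspci output on stdin and print the bus number
# of every device whose address ends in ":10.1".
# Handles both "7f:10.1" and domain-prefixed "0000:7f:10.1" addresses.
uncore_buses() {
	awk '$1 ~ /^([0-9a-fA-F]+:)?[0-9a-fA-F]+:10\.1$/ {
		n = split($1, part, ":")   # last part is "10.1", the one before it is the bus
		print part[n - 1]
	}'
}
```

On a live system this could be used as BUSES=$(lspci | uncore_buses), and the result substituted for the fixed "7f ff" list in the loops of the scripts in this post.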

First program the counters using this script.  (The EVENTCODE that is commented out can be used for testing the script, since it will increment on a correctly working system, while the ECC_CORRECTABLE_ERRORS will not increment unless there is really a problem.)

#!/bin/bash

# Program Counter 3 of each iMC on each chip to count ECC correctable errors

export EVENTCODE=0x00400009		# ECC_CORRECTABLE_ERRORS
#export EVENTCODE=0x00400006		# DRAM_PRE_ALL -- use as a test to make sure the counters are actually counting

export SETPCI=/sbin/setpci

# Step 1: disable each counter (control register at offset e4), then clear the count (lower word at b8, upper word at bc)
echo "Disabling and clearing iMC Counter 3 of each channel on each processor"
for BUS in 7f ff
do
	for CHANNEL in 0 1 4 5
	do
		$SETPCI -s ${BUS}:10.${CHANNEL} e4.l=0x00
		$SETPCI -s ${BUS}:10.${CHANNEL} b8.l=0x00
		$SETPCI -s ${BUS}:10.${CHANNEL} bc.l=0x00
	done
done
# Step 2: enable the counter with the new event
echo "Programming iMC Counter 3 in each channel on each processor"
for BUS in 7f ff
do
	for CHANNEL in 0 1 4 5
	do
		$SETPCI -s ${BUS}:10.${CHANNEL} e4.l=$EVENTCODE
	done
done
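
For anyone who wants to count a different event, the two EVENTCODE values in the script above can be built from their parts. This is a sketch under an assumption that should be checked against Intel's uncore performance monitoring guide for your processor: that the counter control register holds the event select in bits 7:0, the unit mask in bits 15:8, and the enable flag in bit 22. imc_ctl_value is a hypothetical helper name.

```shell
#!/bin/bash

# Hypothetical helper: build a counter control value from an event select
# and an optional unit mask, with the enable bit set.
# ASSUMED layout: event select = bits 7:0, unit mask = bits 15:8, enable = bit 22.
imc_ctl_value() {
	local event=$1
	local unit_mask=${2:-0}
	printf '0x%08x\n' $(( (1 << 22) | (($unit_mask & 0xff) << 8) | ($event & 0xff) ))
}

imc_ctl_value 0x09   # event select used for ECC_CORRECTABLE_ERRORS in the script
imc_ctl_value 0x06   # event select used for DRAM_PRE_ALL in the script
```

The two calls print 0x00400009 and 0x00400006, which match the values used in the script.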

Next you can read the counters with this script:

#!/bin/bash

export SETPCI=/sbin/setpci

echo "Reading iMC Counter 3 of each channel on each processor"
for BUS in 7f ff
do
	for CHANNEL in 0 1 4 5
	do
		echo -n "Bus $BUS Channel $CHANNEL Counter 3 low:  "
		$SETPCI -s ${BUS}:10.${CHANNEL} b8.l
		echo -n "Bus $BUS Channel $CHANNEL Counter 3 high: "
		$SETPCI -s ${BUS}:10.${CHANNEL} bc.l
	done
done
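
The read script prints the low and high words separately as hexadecimal strings. To get a single decimal count, the two words can be combined as in the sketch below. combine_counter is a hypothetical helper name, and the sketch assumes that setpci prints each word as bare hex digits without a 0x prefix and that the high word holds the bits above bit 31.

```shell
#!/bin/bash

# Hypothetical helper: combine the low and high 32-bit words (hex strings
# as printed by setpci, without a 0x prefix) into one decimal count.
combine_counter() {
	local low=$1
	local high=$2
	echo $(( (0x$high << 32) | 0x$low ))
}

# Example with made-up readings: low word 0000002a, high word 00000000.
combine_counter 0000002a 00000000
```

Inside the read loop this could be called as combine_counter "$($SETPCI -s ${BUS}:10.${CHANNEL} b8.l)" "$($SETPCI -s ${BUS}:10.${CHANNEL} bc.l)".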

I tested this with the DRAM_PRE_ALL event on a Xeon E5-2680 system and the script appears to have set things up correctly.  It is running now with the ECC_CORRECTABLE_ERRORS event, but (no surprise) has not shown any events yet.
