Hello all:
I followed a post with the same title and found the Intel Xeon E7 CPU manual (actually a datasheet). The URL is listed below:
https://software.intel.com/en-us/forums/topic/393904
According to that manual, there is a FSV event related to ECC memory status. What I need, however, are the corresponding registers for the Intel Xeon E5-2650. Can anyone please help? I would appreciate it.
Joe
Background: According to the big study by Google (http://research.google.com/pubs/pub35162.html), only about 8% of DIMMs experience one or more errors per calendar year, but the DIMMs that have errors sometimes have them at fairly high rates. The average was about 4000 per year (but with a very skewed distribution), so if you don't see any errors in a few days of operation, your DIMMs are probably OK.
Procedure: Under Linux it is relatively easy to set up one of the Uncore iMC performance counters to count ECC_CORRECTABLE_ERRORS. Since this event increments extremely infrequently on most systems, you won't need to worry about the counter overflowing. Just set it up and then check it every week or so (provided that you are not using the iMC performance counters for anything else).
As an example, for the Xeon E5-2650 the following code will set up iMC Counter 3 on each of the four channels on each socket to count correctable ECC errors. First you have to figure out which buses your system uses for the Uncore performance counters. My systems use either 3f and 7f or 7f and ff (for socket 0 and socket 1, respectively). The easiest way to find this is to run this simple command:
    # lspci | grep :10.1
    7f:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)
    ff:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)
The text description will vary from system to system, but the first characters of each line show the buses that correspond to the two sockets. The examples below will use bus 7f for socket 0 and bus ff for socket 1.
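If you would rather not read the bus numbers off the screen, the same lookup can be sketched as a small filter. This is only an illustration: the function name imc_buses is made up here, and it assumes lspci prints addresses in the short bus:device.function form shown above (no PCI domain prefix).

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts): read lspci output
# on stdin and print the two-digit bus number of every entry whose address
# is <bus>:10.1, i.e. one line per socket on the systems described above.
imc_buses() {
    awk '$1 ~ /^[0-9a-f][0-9a-f]:10\.1$/ { print substr($1, 1, 2) }'
}

# Usage on a live system (assumes lspci is installed):
#   lspci | imc_buses
```

The printed values are what you would put in the "for BUS in ..." lists of the scripts that follow.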
Then program the counters using this script. (The EVENTCODE that is commented out can be used for testing the script, since it will increment on a correctly working system, while ECC_CORRECTABLE_ERRORS will not increment unless there is really a problem.)
    #!/bin/bash
    # Program Counter 3 of each iMC on each chip to count ECC correctable errors

    export EVENTCODE=0x00400009     # ECC_CORRECTABLE_ERRORS
    #export EVENTCODE=0x00400006    # DRAM_PRE_ALL -- use as a test to make sure the counters are actually counting

    export SETPCI=/sbin/setpci

    # Step 1: disable each counter, then clear the count (lower & upper words)
    echo "Disabling and clearing iMC Counter 3 of each channel on each processor"
    for BUS in 7f ff
    do
        for CHANNEL in 0 1 4 5
        do
            $SETPCI -s ${BUS}:10.${CHANNEL} e4.l=0x00
            $SETPCI -s ${BUS}:10.${CHANNEL} b8.l=0x00
            $SETPCI -s ${BUS}:10.${CHANNEL} bc.l=0x00
        done
    done

    # Step 2: enable the counter with the new event
    echo "Programming iMC Counter 3 in each channel on each processor"
    for BUS in 7f ff
    do
        for CHANNEL in 0 1 4 5
        do
            $SETPCI -s ${BUS}:10.${CHANNEL} e4.l=$EVENTCODE
        done
    done
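To make the two EVENTCODE values a little less magic, here is a minimal sketch of how they appear to be built. It assumes the usual Uncore PMON control-register layout (event select in bits 7:0, counter enable in bit 22), which is consistent with both values in the script. The function name imc_ctl_value is hypothetical, and the unit mask field is left at zero as in the script.

```shell
#!/bin/bash
# Hypothetical helper: build a counter control value from an event number.
# Assumed layout: bits 7:0 = event select, bit 22 = enable.
imc_ctl_value() {
    local event=$1
    printf '0x%08x\n' $(( (1 << 22) | (event & 0xff) ))
}

# Usage:
#   imc_ctl_value 0x09    # ECC_CORRECTABLE_ERRORS
#   imc_ctl_value 0x06    # DRAM_PRE_ALL
```

With event 0x09 this reproduces the 0x00400009 used for ECC_CORRECTABLE_ERRORS, and with 0x06 the 0x00400006 used for DRAM_PRE_ALL.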
Next you can read the counters with this script:
    #!/bin/bash

    export SETPCI=/sbin/setpci

    echo "Reading iMC Counter 3 of each channel on each processor"
    for BUS in 7f ff
    do
        for CHANNEL in 0 1 4 5
        do
            echo -n "Bus $BUS Channel $CHANNEL Counter 3 low: "
            $SETPCI -s ${BUS}:10.${CHANNEL} b8.l
            echo -n "Bus $BUS Channel $CHANNEL Counter 3 high: "
            $SETPCI -s ${BUS}:10.${CHANNEL} bc.l
        done
    done
I tested this with the DRAM_PRE_ALL event on a Xeon E5-2680 system and the script appears to have set things up correctly. It is running now with the ECC_CORRECTABLE_ERRORS event, but (no surprise) has not shown any events yet.
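The read script prints the low and high words of each counter separately, as hexadecimal. If you want a single decimal count per channel, something like the sketch below should do it. The function name imc_count is made up for illustration, and it assumes setpci prints each 32-bit word as bare hex digits without a 0x prefix.

```shell
#!/bin/bash
# Hypothetical helper: combine the low and high 32-bit words of a counter
# (hex strings as printed by setpci) into one decimal value.
imc_count() {
    local low=$1 high=$2
    echo $(( (16#$high << 32) | 16#$low ))
}

# Usage on a live system (bus 7f, channel 0, as in the scripts above):
#   low=$(/sbin/setpci -s 7f:10.0 b8.l)
#   high=$(/sbin/setpci -s 7f:10.0 bc.l)
#   imc_count "$low" "$high"
```

For the ECC event the high word will almost certainly stay at zero, so in practice the low word alone tells you what you need. The combined value matters mostly when testing with a busy event such as DRAM_PRE_ALL.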