
mcelog on Linux doesn't show the exact DIMM location for ECC/UECC errors

Kishore__Nanda

Hi,

On a dual-socket Intel SKL platform with 24 x 64 GB DDR4-2666 DIMMs (1.5 TB in total), we were running a memory-related workload and saw a lot of DIMM ECC errors.

OS: RHEL 7.5

Apparently, after a lot of ECC errors, the DIMM encounters a UECC.

After decoding the mcelog output, the DIMM location doesn't show up correctly.

Please see the output from mcelog:

CPU 26 BANK 8 
MISC 200000c020001086 ADDR 1754d88ef40 
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
Transaction: Memory read error
M2M: MscodDataRdErr
STATUS dc0000c001010090 MCGSTATUS 0
MCGCAP f000c14 APICID 40 SOCKETID 1 
PPIN 1fc0448d77e07d88
CPUID Vendor Intel Family 6 Model 85
Fallback Socket memory error count 4 exceeded threshold: 26 in 24h
Location SOCKET:1 CHANNEL:? DIMM:? []

My question is: how, or from where, does the mcelog daemon get the channel and DIMM location? Is it ACPI?

I decoded the MCi_STATUS register and worked out the IMC and channel info, but I am unable to decode the DIMM rank (2 x 4R in the case of 2DPC).
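For reference, the manual decode of the STATUS value mentioned above can be sketched roughly as follows. This is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout and the memory-controller compound error code format (0000 0000 1MMM CCCC) described in the Intel SDM. The helper name decode_mci_status and the dictionary keys are hypothetical and are not part of mcelog. It only recovers the flags, the transaction type and the channel; it does not identify the DIMM or rank.

```python
# Sketch: decode architectural fields of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.
# decode_mci_status and the key names are hypothetical, not mcelog APIs.

# MMM field of the memory-controller compound error code
MEM_TRANSACTION = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrub",
}


def decode_mci_status(status: int) -> dict:
    """Return the flag bits and error codes packed into MCi_STATUS."""
    fields = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "enabled": bool((status >> 60) & 1),      # EN
        "misc_valid": bool((status >> 59) & 1),   # MISCV
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "pcc": bool((status >> 57) & 1),          # processor context corrupt
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,         # model-specific error code
        "mcacod": status & 0xFFFF,                # architectural error code
    }

    mcacod = fields["mcacod"]
    # Memory-controller errors use the pattern 0000 0000 1MMM CCCC.
    if (mcacod & 0xFF80) == 0x0080:
        fields["mem_transaction"] = MEM_TRANSACTION.get((mcacod >> 4) & 0x7, "reserved")
        channel = mcacod & 0xF
        # CCCC == 1111 means the channel is not specified.
        fields["channel"] = None if channel == 0xF else channel

    return fields


if __name__ == "__main__":
    # STATUS value taken from the mcelog output in this post.
    for name, value in decode_mci_status(0xDC0000C001010090).items():
        print(f"{name}: {value}")
```

For the STATUS value in the log, this reports a valid, corrected (not uncorrected) error with the overflow bit set, a memory read transaction, and channel 0, which agrees with the "RD_CHANNEL0_ERR" line that mcelog prints. Mapping the reported ADDR to a specific DIMM and rank needs the platform's address decoding, which this sketch does not attempt.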
