Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Monitor ECC memory status?

ddbug1
Beginner
662 Views
Hello there. Does anyone know of a tool for monitoring the number of errors detected by the ECC memory/controller? Thanks, -- dd
0 Kudos
10 Replies
Bernard
Valued Contributor I
662 Views
HP Integrated Lights-Out can report ECC memory errors. Link to the HP whitepaper: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02878598/c02878598.pdf
0 Kudos
ddbug1
Beginner
662 Views
Thank you. I wanted to get an ECC-enabled machine to see how often DRAM errors occur in my environment. (Interesting: you cannot tell whether you need ECC unless you already have it?) But after reading the xeon-e5-2600-uncore-guide, this HP paper and the MS WHEA documentation, the whole ECC topic looks too intimidating. I'll surrender for now... - dd
0 Kudos
Roman_D_Intel
Employee
662 Views

Hi dd,

Please look at this manual for Intel Xeon E7 processors. FVC events can be configured to count memory ECC errors (see page 2-126 for example). They can also count corrected/uncorrected memory request responses.

Best regards,

Roman

0 Kudos
Bernard
Valued Contributor I
662 Views

Low-level details of hardware and its programming interface are not easy to grasp quickly :)

0 Kudos
ddbug1
Beginner
662 Views

Thanks guys. I see your point, Ilya...  There's an anecdote about senior and junior toilet cleaners... ;)

My goal is to measure how often RAM errors occur on my machines and whether I want ECC.
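One way to approach this measurement, sketched here under assumptions: on a Linux machine with an EDAC driver loaded (the thread does not say which OS is in use), the kernel exposes per-memory-controller corrected and uncorrected error counts under `/sys/devices/system/edac/mc/`. The helper names `read_edac_counts` and `errors_per_hour` below are hypothetical, chosen only for illustration.

```python
from pathlib import Path

# Assumption: Linux with an EDAC driver loaded; the thread does not name the OS.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each mc* directory."""
    counts = {}
    if not root.is_dir():
        return counts  # EDAC not available on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def errors_per_hour(before, after, elapsed_seconds):
    """Corrected-error rate per hour between two snapshots."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = sum(after[k][0] - before.get(k, (0, 0))[0] for k in after)
    return delta * 3600.0 / elapsed_seconds


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Taking two snapshots some hours apart and passing them to `errors_per_hour` gives a rough corrected-error rate for that machine. The counters normally restart from zero at boot, so snapshots taken across a reboot are not comparable.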

But the DRAM controller of Xeons (and the ECC RAM itself, of course) looks much more complex than on "normal" non-ECC motherboards; there are more parts that may fail. Do you think a RAM error rate measured on an ECC-enabled machine can be extrapolated to a simpler non-ECC Sandy/Ivy Bridge system?

Building the PCM to get the counters is not a problem.

Regards,

-- dd

0 Kudos
Bernard
Valued Contributor I
662 Views

Does PCM measure ECC errors?

0 Kudos
Patrick_F_Intel1
Employee
662 Views

Hello ddbug,

So... is ECC worth the extra money... that is a good question.

My first response is, how much does it matter whether you can catch memory errors?

If you are doing something where you don't mind rebooting then you probably don't need ECC memory.

For mission critical applications where you absolutely need to know whether there are memory issues (yes, DIMMs do go bad) then ECC is a requirement. This is why servers always have ECC support.

I think you can monitor ECC errors on Windows through the system event log in the Event Viewer (eventvwr.msc).

Pat

0 Kudos
ddbug1
Beginner
662 Views
> Does PCM measure ECC errors?

I have not checked this yet. Even if not, the documentation explains how to get these counters.

> So... is ECC worth the extra money... that is a good question.

The ECC RAM modules do not cost much more; it is the whole new machine of a higher class that is expensive... Finally we've got approval for a Dell server. The exact model and hardware details are not known yet.

Thanks,
-- dd
0 Kudos
Bernard
Valued Contributor I
662 Views

>>>I think you can monitor ECC errors on windows in the system event log in the event viewer (eventvwr.msc).>>>

This is implemented by the Windows Hardware Error Architecture (WHEA).
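As a rough sketch of the event-log approach: WHEA reports typically land in the System log under the `Microsoft-Windows-WHEA-Logger` provider, and the built-in `wevtutil` command can query them without opening the Event Viewer. The function names below are hypothetical, and which events actually appear (and whether corrected memory errors are logged at all) depends on the platform and firmware.

```python
import subprocess

# Provider under which Windows logs hardware error events reported via WHEA.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def build_wevtutil_command(max_events=20):
    """Build a wevtutil command that lists the newest WHEA events as text."""
    query = f"*[System[Provider[@Name='{WHEA_PROVIDER}']]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{query}",               # XPath filter on the provider name
        f"/c:{max_events}",          # limit the number of events returned
        "/rd:true",                  # newest events first
        "/f:text",                   # human-readable output
    ]


def query_whea_events(max_events=20):
    """Run the query (Windows only) and return wevtutil's text output."""
    result = subprocess.run(
        build_wevtutil_command(max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(query_whea_events())
```

Building the command separately from running it keeps the query easy to inspect or paste into a console by hand.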

0 Kudos
SergeyKostrov
Valued Contributor II
662 Views
>> But, after reading the xeon-e5-2600-uncore-guide, this HP paper and MS WHEA docum, the whole ECC topic looks too intimidating. I'll surrender for now...

In 2012 I saw some Intel equipment that, as I remember, could simulate memory errors on server platforms. Honestly, I didn't dare to ask how much it cost...
0 Kudos