A system event was generated on my server, which uses an Intel® Server Board based on the Intel® Xeon® processor E5-2600.
Event information is as given below:
EvM Revision : 04
Sensor Type : Memory
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : a5ff07
Description : Correctable ECC logging limit reached
I have gone through the document given here:
https://www.intel.com/content/www/us/en/support/articles/000006888/server-products.html System Event Log Troubleshooting Guides for Intel® Server Boards
As per the "Correctable and uncorrectable ECC error sensor typical characteristics table" given in the above document, Event Data 3 would decode as
Event Data 3:
[7:5] Socket ID 0-3 = CPU1-4
[4:3] Channel 0-3 = Channel A, B, C, D for CPU1; Channel E, F, G, H for CPU2; Channel J, K, L, M for CPU3; Channel N, P, R, T for CPU4
[2:0] DIMM 0-2 = DIMM 1-3 on Channel
In my case, Event Data 3 is 07, which is 0000 0111 in binary. Could you please help me understand the DIMM location based on the above data?
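For reference, the bit-field split described in the table can be sketched in Python. This is only a minimal sketch: it assumes the logged Event Data a5ff07 is three bytes (Event Data 1 = a5, Event Data 2 = ff, Event Data 3 = 07) and that the field layout is exactly as quoted above. The function name decode_event_data3 and the CHANNELS lookup are hypothetical helpers and are not part of any Intel tool.

```python
# Channel letters per socket ID, copied from the table quoted above.
CHANNELS = {0: "ABCD", 1: "EFGH", 2: "JKLM", 3: "NPRT"}


def decode_event_data3(ed3):
    """Split Event Data 3 into socket, channel and DIMM fields.

    Fields with values outside the ranges listed in the table are
    reported as None instead of being guessed.
    """
    socket_id = (ed3 >> 5) & 0x07   # bits [7:5]
    channel_id = (ed3 >> 3) & 0x03  # bits [4:3]
    dimm_id = ed3 & 0x07            # bits [2:0]

    known_socket = socket_id in CHANNELS
    return {
        "socket_id": socket_id,
        "channel_id": channel_id,
        "dimm_id": dimm_id,
        "cpu": f"CPU{socket_id + 1}" if known_socket else None,
        "channel": CHANNELS[socket_id][channel_id] if known_socket else None,
        # The table only documents DIMM values 0-2 (DIMM 1-3).
        "dimm": f"DIMM {dimm_id + 1}" if dimm_id <= 2 else None,
    }


# Assumption: the three logged bytes are Event Data 1, 2 and 3 in order.
event_data = bytes.fromhex("a5ff07")
result = decode_event_data3(event_data[2])
print(result)
```

Taken at face value, 07 gives socket field 0 and channel field 0 (CPU1, Channel A in the table), while the DIMM field is 7, which is outside the documented range 0-2. The sketch therefore leaves the DIMM unresolved rather than guessing.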
Thank you for contacting Intel Customer Support.
Unfortunately, with the information provided it is not possible to accurately locate the DIMM in question. However, this can be done with the https://www.intel.com/content/www/us/en/support/server-products/000023940.html System Information Retrieval Utility. It would also be very helpful if you could provide the specific board model we're working with, so that we have a better idea of the setup.
For further details and instructions, please refer to: https://www.intel.com/content/www/us/en/support/articles/000024007/server-products.html How to do Basic Diagnostics when Having Correctable or Uncorrectable ECC Memory-Related Errors
Intel Customer Support
Thank you for the quick reply.
I only have the System Event Log with me, and I no longer have access to the system to run sysinfo as you suggested.
The https://www.intel.com/content/www/us/en/support/articles/000006888/server-products.html System Event Log Troubleshooting Guides for Intel® Server Boards clearly states that "In both Correctable and Uncorrectable ECC errors, the error can be narrowed down to particular DIMM(s) and the table below shows DIMM identification."
However, the document does not cover the case where Event Data 3 bits [2:0] are 111 (i.e. a decimal value of 7). What would this refer to? Please help.
You're quite correct about what is stated in the document. However, it applies only to the specific bit values listed in the table, which do not include the value "7".
For next steps and troubleshooting, you can refer to Table 73 of the same document and to the article mentioned before. According to the available documentation and previous cases reviewed, the Event Log (BMC generated) should show the specific DIMM in question; this log can be obtained through sysinfo. Another option, depending on your OS, is to run diagnostic commands to check the DIMM status.
Intel Customer Support
This is a follow-up to our last communication. Have you had the chance to review the information provided? Is further assistance needed, or is it OK to set this thread as closed?
I look forward to your comments.