Novice

R2308IP4LHPC - EDAC sbridge failed to register device (not an ECC memory error)

After installing Fedora on my new server, I'm seeing the following error messages in the log:

EDAC sbridge: Failed to register device with error -22.

EDAC sbridge: Couldn't find mci handler

As far as I can tell, everything is working fine, but I want to avoid any errors or missing functionality in the future. My searching so far has only found threads about ECC memory errors (such as https://forums.linuxmint.com/viewtopic.php?t=230579), but those are usually accompanied by other errors about ECC being disabled. I'm not sure which other log files might have information, or whether this issue even needs attention.

Does anyone know how to continue investigating this error?

Here's the output of grepping /var/log/messages for edac, sbridge, and mci:

Dec 22 13:29:03 hostname_removed kernel: ERST: Error Record Serialization Table (ERST) support is initialized.

Dec 22 13:29:03 hostname_removed kernel: pstore: using zlib compression

Dec 22 13:29:03 hostname_removed kernel: pstore: Registered erst as persistent store backend

Dec 22 13:29:03 hostname_removed kernel: ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.

Dec 22 13:29:03 hostname_removed kernel: ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.

Dec 22 13:29:03 hostname_removed kernel: ghes_edac: So, the end result of using this driver varies from vendor to vendor.

Dec 22 13:29:03 hostname_removed kernel: ghes_edac: If you find incorrect reports, please contact your hardware vendor

Dec 22 13:29:03 hostname_removed kernel: ghes_edac: to correct its BIOS.

Dec 22 13:29:03 hostname_removed kernel: ghes_edac: This system has 16 DIMM sockets.

Dec 22 13:29:03 hostname_removed kernel: EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)

Dec 22 13:29:03 hostname_removed kernel: EDAC MC1: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)

Dec 22 13:29:03 hostname_removed kernel: GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.

Dec 22 13:29:03 hostname_removed kernel: Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled

Dec 22 13:29:03 hostname_removed kernel: Non-volatile memory driver v1.3

--

Dec 22 13:29:08 hostname_removed kernel: RAPL PMU: hw unit of domain pp0-core 2^-16 Joules

Dec 22 13:29:08 hostname_removed kernel: RAPL PMU: hw unit of domain package 2^-16 Joules

Dec 22 13:29:08 hostname_removed kernel: RAPL PMU: hw unit of domain dram 2^-16 Joules

Dec 22 13:29:08 hostname_removed kernel: EDAC sbridge: Couldn't find mci handler

Dec 22 13:29:08 hostname_removed kernel: EDAC sbridge: Couldn't find mci handler

Dec 22 13:29:08 hostname_removed kernel: EDAC sbridge: Failed to register device with error -22.

Dec 22 13:29:08 hostname_removed kernel: intel_rapl: Found RAPL domain package

Dec 22 13:29:08 hostname_removed kernel: intel_rapl: Found RAPL domain core

Dec 22 13:29:08 hostname_removed kernel: intel_rapl: Found RAPL domain dram

--

Dec 23 08:24:58 hostname_removed kernel: ERST: Error Record Serialization Table (ERST) support is initialized.

Dec 23 08:24:58 hostname_removed kernel: pstore: using zlib compression

Dec 23 08:24:58 hostname_removed kernel: pstore: Registered erst as persistent store backend

Dec 23 08:24:58 hostname_removed kernel: ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.

Dec 23 08:24:58 hostname_removed kernel: ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.

Dec 23 08:24:58 hostname_removed kernel: ghes_edac: So, the end result of using this driver varies from vendor to vendor.

Dec 23 08:24:58 hostname_removed kernel: ghes_edac: If you find incorrect reports, please contact your hardware vendor

Dec 23 08:24:58 hostname_removed kernel: ghes_edac: to correct its BIOS.

Dec 23 08:24:58 hostname_removed kernel: ghes_edac: This system has 16 DIMM sockets.

Dec 23 08:24:58 hostname_removed kernel: EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)

Dec 23 08:24:58 hostname_removed kernel: EDAC MC1: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)

Dec 23 08:24:58 hostname_removed kernel: GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.

Dec 23 08:24:58 hostname_removed kernel: Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled

Dec 23 08:24:58 hostname_removed kernel: Non-volatile memory driver v1.3
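For anyone repeating this search, the filtering above can be sketched in Python. This is only an illustration: the helper names `filter_log_lines` and `claimed_controllers` are hypothetical, the sample lines are copied from the output above, and unlike the original grep it prints no surrounding context lines. The second helper pulls out which module the kernel reports each EDAC memory controller was handed to, which is useful to know when a second EDAC driver later fails to register.

```python
import re

# Search terms named in the post; matching is case-insensitive.
KEYWORDS = ("edac", "sbridge", "mci")


def filter_log_lines(lines, keywords=KEYWORDS):
    """Return the lines that contain any of the keywords."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def claimed_controllers(lines):
    """Map a controller name (e.g. 'MC0') to the module named in the
    'Giving out device to module ...' message for that controller."""
    pattern = re.compile(r"EDAC (MC\d+): Giving out device to module (\S+)")
    matches = (pattern.search(line) for line in lines)
    return {m.group(1): m.group(2) for m in matches if m}


# Sample lines taken from the log excerpt in this post.
sample = [
    "Dec 22 13:29:03 hostname_removed kernel: EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)",
    "Dec 22 13:29:03 hostname_removed kernel: Non-volatile memory driver v1.3",
    "Dec 22 13:29:08 hostname_removed kernel: EDAC sbridge: Couldn't find mci handler",
    "Dec 22 13:29:08 hostname_removed kernel: intel_rapl: Found RAPL domain dram",
]
```

To run it against a real log, pass an open file instead of `sample`, for example `filter_log_lines(open("/var/log/messages"))`; reading that file usually requires root privileges.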

10 Replies
Community Manager

Hello Joseph,

 

Thank you for contacting Intel® Technical Support.

 

Please be aware that this server system is out of warranty. At this point, the best way forward is to run the Intel® System Information Retrieval Utility. The logs it collects come from the BMC, which monitors the hardware, so we can take a close look to see whether the issue is hardware related. You can download the utility from the Download Center (https://downloadcenter.intel.com/download/26991/System-Information-Retrieval-Utility-SysInfo-). Instructions for running the tool are in the user guide (https://downloadmirror.intel.com/26991/eng/Intel_Sysinfo_UserGuide_V1.02.pdf).

 

Please try this and let me know your results.

 

Best regards,

 

Jeremiah A.

 

Intel® Technical Support

 

Novice

Thank you for the response! We're doing a clean install of a Windows Server distro for other reasons, so I'll try that utility if the problem is still evident after the migration.

Community Manager

Hello Joseph,

 

 

Please let me know your results.

 

 

regards,

 

 

Jeremiah A.

 

Intel(R) Technical Support

 

Community Manager

Hi Joseph,

 

 

I investigated this issue further and found the following:

 

 

1. This is not really an error at all; it is just a warning letting you know that ECC is not enabled in the BIOS.

2. For the warning to go away, you need to enable error-correcting code (ECC) memory in your BIOS.

3. It is an option for your RAM.

 

 

Please try the above instructions and let me know your results.

 

 

regards,

 

 

Jeremiah A.

 

Intel(R) Technical Support
Novice

Hello Jeremiah,

Thank you for looking into this issue!

The Memory Configuration page of the Advanced BIOS settings does not have an option to enable/disable ECC. The page shows the following:

Memory Configuration

Total Memory                       64 GB (grayed out)
Effective Memory                   65536 MB (grayed out)
Current Configuration              Independent (grayed out)
Current Memory Speed               DDR3-1066 (grayed out)
Memory Operating Speed Selection   [Auto]
Phase Shedding                     [Enabled]
Memory SPD Override                [Enabled]
Patrol Scrub                       [Enabled]
Demand Scrub                       [Enabled]
Correctable Error Threshold        [10]

> Memory RAS and Performance Configuration

 

 

DIMM Information

DIMM_A1   4GB   Installed&Operational
DIMM_A2   4GB   Installed&Operational
DIMM_B1   4GB   Installed&Operational
DIMM_B2   4GB   Installed&Operational
DIMM_C1   4GB   Installed&Operational
DIMM_C2   4GB   Installed&Operational
DIMM_D1   4GB   Installed&Operational
DIMM_D2   4GB   Installed&Operational
DIMM_E1   4GB   Installed&Operational
DIMM_E2   4GB   Installed&Operational
DIMM_F1   4GB   Installed&Operational
DIMM_F2   4GB   Installed&Operational
DIMM_G1   4GB   Installed&Operational
DIMM_G2   4GB   Installed&Operational
DIMM_H1   4GB   Installed&Operational
DIMM_H2   4GB   Installed&Operational

Also, we switched the OS to Ubuntu Server 16.04 LTS, and the error does not appear in dmesg there. So the problem seems to exist only in Fedora.

Should we look in other logs to see if the error message is in another location, or is this likely a non-issue?
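One other place worth checking besides the logs is the kernel's EDAC sysfs tree, which lists the memory controllers that actually registered along with their error counters. The following is a minimal sketch, assuming the standard layout under `/sys/devices/system/edac/mc` (one `mcN` directory per controller with `mc_name`, `ce_count`, and `ue_count` files); the function name `read_edac_controllers` is hypothetical, and any attribute missing on a given kernel is reported as `None` rather than guessed.

```python
from pathlib import Path


def read_edac_controllers(root="/sys/devices/system/edac/mc"):
    """Return one dict per registered EDAC memory controller.

    Each dict holds the directory name plus the contents of the
    mc_name, ce_count (corrected errors) and ue_count (uncorrected
    errors) files, or None for any file that is missing.
    """
    base = Path(root)
    if not base.is_dir():
        # EDAC core not loaded, or no controller registered.
        return []
    controllers = []
    for mc_dir in sorted(base.glob("mc[0-9]*")):
        info = {"name": mc_dir.name}
        for attr in ("mc_name", "ce_count", "ue_count"):
            attr_file = mc_dir / attr
            info[attr] = attr_file.read_text().strip() if attr_file.is_file() else None
        controllers.append(info)
    return controllers
```

Calling `read_edac_controllers()` on the server and printing the result shows which driver owns each controller. An empty list would suggest that no EDAC driver registered at all, whereas entries with zero counts suggest that error reporting is in place and simply has nothing to report.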

Novice

Also, the reply formatting doesn't seem to work for a quote with courier font family. I apologize for the misaligned text.

Community Manager

Hello Joseph,

 

 

Thank you for your quick response. It looks like this is an unknown issue related to the Fedora OS that your client is using.

As a last option, you can ask your client to run the Intel® System Information Retrieval Utility mentioned in my previous message to rule out hardware issues.

 

 

Please let me know your results.

 

 

regards,

 

Jeremiah A.

 

Intel(R) Technical Support
Community Manager

Hello Joseph,

 

 

I hope you are doing well today.

 

 

I'm following up to see whether you were able to obtain the logs from the tool I sent you.

 

 

Please let me know your results.

 

 

regards,

 

Jeremiah A.

 

Intel(R) Technical Support
Novice

Dear Jeremiah,

Thank you again for your suggestions. Unfortunately, this is a side-project for both of us engineers involved, and we needed to put it on the back-burner for other things at the moment. When we come back to it, likely in the summer, I will surely return here for the excellent support you and your colleagues provide!

Sincerely,

 

Joe
Community Manager

Hello Joe,

 

 

OK, I will proceed with closing this case. Once you are ready to continue, you can open a new case and refer to this one so that we can continue assisting you; we will be more than glad to do so.

 

 

I wish you the best in your projects and thank you for contacting Intel(R) Technical Support.

 

 

Best regards,

 

Jeremiah A.

 

Intel Technical Support