
We need clarification from Intel regarding an MCE (Machine Check Exception) error that occurred on the following Intel Broadwell-DE CPU: GG8067402569400SR2DK

VV0001
Beginner
4,364 Views

Below are the logs for the MCE event that occurred:

 

=============================================================================================
[ 2882.491953] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 19: be200000000c110a
[ 2882.595085] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8139831a> {intel_idle+0xda/0x160}
[ 2882.698427] mce: [Hardware Error]: TSC 5d6953ae81a ADDR fa000000 MISC a4fc389602402086
[ 2882.794587] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1559870123 SOCKET 0 APIC 1 microcode 7000005
[ 2882.906041] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 2882.987320] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 17: be200000000c110a
[ 2883.090448] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8139831a> {intel_idle+0xda/0x160}
[ 2883.193785] mce: [Hardware Error]: TSC 5d6953ae81a ADDR fa002000 MISC 4fc389603402086
[ 2883.288902] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1559870123 SOCKET 0 APIC 1 microcode 7000005
[ 2883.400356] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 2883.481635] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
[ 2883.567072] mce: [Hardware Error]: Machine check: Invalid
[ 2883.631700] Kernel panic - not syncing: Fatal machine check on current CPU
=====================================================================================================
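
The PROCESSOR field in the log above (0:50663) carries a vendor number and the raw CPUID signature. As a minimal sketch, assuming the value after the colon is the hexadecimal CPUID leaf 1 EAX value as the Linux kernel prints it, the following Python splits that signature into family, model, and stepping. The function name `decode_cpuid_signature` is illustrative and not part of any Intel or Linux tool.

```python
def decode_cpuid_signature(sig: int) -> dict:
    """Split a CPUID leaf 1 EAX signature into family, model, and stepping."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


# Signature taken from the "PROCESSOR 0:50663" field in the log.
info = decode_cpuid_signature(0x50663)
print(info, "model in hex:", hex(info["model"]))
```

For 0x50663 this gives family 6, model 86 (0x56), stepping 3. That agrees with the "Family 6 Model 86" line in the mcelog output further down, and it suggests that the "Model 56" in mcelog's first line is the same model number printed in hexadecimal.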

 

 

After decoding the MCE log with mcelog, the output below reports "Generic CACHE Level-2 Generic Error" and "Processor context corrupt" for both Bank 17 and Bank 19.

mcelog: Family 6 Model 56 CPU: only decoding architectural errors
Hardware event. This is not a software error.
CPU 4 BANK 17 TSC 5d6953ae81a
RIP !INEXACT! 10:ffffffff8139831a
MISC 4fc389603402086 ADDR fa002000
TIME 1559870123 Fri Jun 7 03:15:23 2019
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS be200000000c110a MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 86
RIP: intel_idle+0xda/0x160}
SOCKET 0 APIC 1 microcode 7000005

mcelog: Family 6 Model 56 CPU: only decoding architectural errors
Hardware event. This is not a software error.
CPU 4 BANK 19 TSC 5d6953ae81a
RIP !INEXACT! 10:ffffffff8139831a
MISC a4fc389602402086 ADDR fa000000
TIME 1559870123 Fri Jun 7 03:15:23 2019
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS be200000000c110a MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 86
RIP: intel_idle+0xda/0x160}
SOCKET 0 APIC 1 microcode 7000005
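
The STATUS value be200000000c110a can also be cross-checked by hand against the architectural layout of the IA32_MCi_STATUS register. The following Python is a minimal sketch of that check, not a replacement for mcelog: it only handles the architectural flag bits and the cache-hierarchy form of the MCA error code, and it ignores everything model-specific. The names `decode_mci_status`, `STATUS_FLAGS`, and the lookup tables are illustrative.

```python
# Single-bit flags in IA32_MCi_STATUS as (bit position, short name).
# S and AR are only meaningful on CPUs that support software error recovery.
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
    (56, "S"),
    (55, "AR"),
]

# Sub-fields of the cache-hierarchy error code: 000F 0001 RRRR TTLL.
CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}


def decode_mci_status(status: int) -> dict:
    """Decode the architectural part of a 64-bit IA32_MCi_STATUS value."""
    mca_code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        # Bit 12 of the MCA error code is the correction report filtering bit.
        "corrected_filtering": bool(mca_code & 0x1000),
    }
    code = mca_code & ~0x1000  # drop the filtering bit before matching
    if (code & 0xFF00) == 0x0100:
        result["error_type"] = "cache hierarchy"
        result["request"] = REQUEST.get((code >> 4) & 0xF, "reserved")
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "reserved")
        result["level"] = CACHE_LEVEL[code & 0x3]
    return result


# STATUS value reported for both Bank 17 and Bank 19.
for key, value in decode_mci_status(0xBE200000000C110A).items():
    print(key, "=", value)
```

For this value the sketch reports the VAL, UC, EN, MISCV, ADDRV, and PCC flags, the filtering bit set, and a cache-hierarchy code with a generic request, generic transaction type, and level L2. This lines up with the "Uncorrected error", "Processor context corrupt", and "Generic CACHE Level-2 Generic Error" lines in the mcelog output above.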

 

Please clarify the following:

What does this MCE error (with kernel panic) mean?

Is the method we used to decode the MCE log correct?

Does the MCE log above really decode to "Generic CACHE Level-2 Generic Error" and "Processor context corrupt" for Bank 17 and Bank 19?

What is the cause of the MCE according to the decoded log? Is it a hardware failure (internal to the CPU) or a software failure in the code that was running at the time?

What does "Generic CACHE Level-2" mean? Does it refer to cache memory internal to the CPU?

Please also let us know, based on the decoded MCE log above, whether this event will affect the health of the board in the future, as the node seems to be working fine now.

7 Replies
CarlosAM_INTEL
Moderator
3,383 Views

Hello, @VV0001​:

 

Thank you for contacting Intel Embedded Community.

 

Could you please clarify whether this thread is related to the following forum thread?

 

https://forums.intel.com/s/question/0D50P00004P3YTLSA3/we-seen-following-error-with-intel-xeon-microprocessor-gg8067402569400sr2dk-in-our-producterror-messagegeneric-cache-level2-generic-error-and-also-processor-context-corrupt-for-bank-17-and-bank-19

 

We are waiting for your answer.

 

Best regards,

@Mæcenas_INTEL​.

VV0001
Beginner
3,383 Views

Yes, both threads are about the same issue.

 

We need to know whether this single occurrence of the issue will lead to any functional problems in the future.

Or can these errors be ignored?

Please confirm.

CarlosAM_INTEL
Moderator
3,383 Views

Hello, @VV0001​:

 

Thanks for your reply.

 

Could you please tell us the results of our last suggestion (message of August 2nd, 2019) in the cited forum thread?

 

We are waiting for your answer.

 

Best regards,

@Mæcenas_INTEL​.

VV0001
Beginner
3,383 Views

We have not replaced the CPU; it is a major decision and a lot of work. Please let us know whether there is any alternative solution.

CarlosAM_INTEL
Moderator
3,383 Views

Hello, @VV0001​:

 

Thanks for your reply.

 

Reviewing the information provided in the cited forum thread, you have mentioned that just one unit is affected. Could you please confirm this and let us know how many units have been manufactured?

 

By the way, could you please verify that the affected design has been properly soldered (no cold joints, poor or non-wetting, overheating, insufficient solder, floating leads, or excess solder)?

 

We are waiting for your reply.

 

Best regards,

@Mæcenas_INTEL​.

VV0001
Beginner
3,383 Views

Around 800 units have been manufactured and deployed in the field. The problem was found at a customer site after the unit had been sold and had worked for more than 6 months.

The unit passed all in-house production tests.

Hence we do not expect any cold joints, poor or non-wetting, overheating, insufficient solder, floating leads, or excess solder.

CarlosAM_INTEL
Moderator
3,383 Views

Hello, @VV0001​:

 

Thanks for your reply.

 

We suggest that you contact the place of purchase of the affected processor and follow the process described in section 7.2.5 (pages 58 and 59) of the Intel Quality System Handbook, which can be found at:

 

https://www.intel.com/content/dam/www/public/us/en/documents/reference-guides/quality-system-handbook.pdf

 

Best regards,

@Mæcenas_INTEL​.
