Embedded Intel Atom® Processors
Technological Conversations about Intel Atom® Hardware, Software, Firmware, Graphics

How to interpret the mce dump files generated on the device when a kernel panic happens?

CharlieChenEC
Beginner
955 Views

Hi Intel Experts,

 

I am working with Edgecore's switch product 'as4630-54te', which uses an Intel Atom Processor Cxx series CPU.
An 'mce' error occurred and the device froze until it was powered off and on again.
Three 'mce' dump files were found in pstore after the 'mce' error happened (see the attached file 'mce-erst.zip').
I would like to know how to identify the details of the mce error from the 3 dump files.
The device runs SONiC.

Excerpted syslog messages related to mce dump files. The device froze on 6/21 and was powered off and on again on 6/26.

Jun 21 15:37:00.843656 2023 AS4630-54TE INFO snmp#snmp-subagent [sonic_ax_impl] INFO: vid = 2
Jun 26 12:51:00.064298 2023 AS4630-54TE INFO systemd-pstore[282]: PStore mce-erst-7247272388717445121 moved to /var/lib/systemd/pstore/mce-erst-7247272388717445121
Jun 26 12:51:00.064308 2023 AS4630-54TE INFO systemd-pstore[282]: PStore mce-erst-7247272388717445122 moved to /var/lib/systemd/pstore/mce-erst-7247272388717445122
Jun 26 12:51:00.064316 2023 AS4630-54TE INFO systemd-pstore[282]: PStore mce-erst-7247272388717445123 moved to /var/lib/systemd/pstore/mce-erst-7247272388717445123

 

Here is some output from the device for your reference.

admin@sonic:~$ show platform summary
Platform: x86_64-accton_as4630_54te-r0
HwSKU: Accton-AS4630-54TE
ASIC: broadcom
ASIC Count: 1
Serial Number: 463054TE2102006
Model Number: F0PZZ4654043A
Hardware Revision: N/A

 

 

root@sonic:~# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 95
model name : Intel(R) Atom(TM) CPU C3558 @ 2.20GHz
stepping : 1
microcode : 0x32
cpu MHz : 1500.050
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault epb cat_l2 ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts md_clear arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
bugs : spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 95
model name : Intel(R) Atom(TM) CPU C3558 @ 2.20GHz
stepping : 1
microcode : 0x32
cpu MHz : 1500.106
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 6
cpu cores : 4
apicid : 12
initial apicid : 12
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault epb cat_l2 ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts md_clear arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
bugs : spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 95
model name : Intel(R) Atom(TM) CPU C3558 @ 2.20GHz
stepping : 1
microcode : 0x32
cpu MHz : 1499.978
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 8
cpu cores : 4
apicid : 16
initial apicid : 16
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault epb cat_l2 ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts md_clear arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
bugs : spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 95
model name : Intel(R) Atom(TM) CPU C3558 @ 2.20GHz
stepping : 1
microcode : 0x32
cpu MHz : 1500.000
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 12
cpu cores : 4
apicid : 24
initial apicid : 24
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault epb cat_l2 ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts md_clear arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
bugs : spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

 

root@sonic:~# uname -r
5.10.0-8-2-amd64

 

Diego_INTEL
Moderator
929 Views

Hello @CharlieChenEC,

 

Thank you for contacting Intel Embedded Community.

 

To confirm: has this error happened only once, and are all of those files from the same error?


Best regards,

@Diego_INTEL 

CharlieChenEC
Beginner
895 Views

The error only happened once so far.

 

According to the syslog below, the files were moved to the directory '/var/lib/systemd/pstore' on 2023/6/26, after the DUT was powered off and on again. Presumably, the 3 mce dump files were generated on 2023/6/21, before the DUT froze.

Jun 21 15:37:00.843656 2023 AS4630-54TE INFO snmp#snmp-subagent [sonic_ax_impl] INFO: vid = 2
Jun 26 12:51:00.064298 2023 AS4630-54TE INFO systemd-pstore[282]: PStore mce-erst-7247272388717445121 moved to /var/lib/systemd/pstore/mce-erst-7247272388717445121
Jun 26 12:51:00.064308 2023 AS4630-54TE INFO systemd-pstore[282]: PStore mce-erst-7247272388717445122 moved to /var/lib/systemd/pstore/mce-erst-7247272388717445122
Jun 26 12:51:00.064316 2023 AS4630-54TE INFO systemd-pstore[282]: PStore mce-erst-7247272388717445123 moved to /var/lib/systemd/pstore/mce-erst-7247272388717445123

 

Diego_INTEL
Moderator
884 Views

Hello @CharlieChenEC,

 

You can try installing mcelog; it should help you read the mce files that were generated by PStore.

https://mcelog.org/installation.html


Best regards,

@Diego_INTEL 

CharlieChenEC
Beginner
857 Views

Hello @Diego_INTEL,

 

I downloaded the mcelog source code from https://git.kernel.org/pub/scm/utils/cpu/mce/mcelog.git/snapshot/mcelog-194.tar.gz and built it successfully.

I uploaded the mcelog binary and the 3 mce dump files to the hardware device AS4630-54TE.

I expected to see some register information after running mcelog on the mce dump files.

However, I get nothing useful when I execute the commands shown below on the device.

Could you please tell me what to do next to analyze the dump files using mcelog?

 

Please find the commands I executed below.

root@sonic:/home/admin/mce_dump# ls -l
total 524
-rw-r--r-- 1 admin admin 128 Jul 19 05:44 mce-erst-7247272388717445121
-rw-r--r-- 1 admin admin 128 Jul 19 05:44 mce-erst-7247272388717445122
-rw-r--r-- 1 admin admin 128 Jul 19 05:44 mce-erst-7247272388717445123
root@sonic:/home/admin/mce_dump# ./mcelog < mce-erst-7247272388717445122
root@sonic:/home/admin/mce_dump# ./mcelog < mce-erst-7247272388717445123
root@sonic:/home/admin/mce_dump# ./mcelog < mce-erst-7247272388717445123

root@sonic:/home/admin/mce_dump# ./mcelog --ascii < mce-erst-7247272388717445121
▒▒G>

root@sonic:/home/admin/mce_dump# ./mcelog --ascii < mce-erst-7247272388717445122
u

root@sonic:/home/admin/mce_dump# ./mcelog --ascii < mce-erst-724727238871744513

root@sonic:/home/admin/mce_dump# ./mcelog --raw < mce-erst-7247272388717445122
root@sonic:/home/admin/mce_dump# ./mcelog --raw < mce-erst-7247272388717445123
root@sonic:/home/admin/mce_dump# ./mcelog --raw < mce-erst-7247272388717445123

 

Here are hex dumps of the 3 mce dump files:

root@sonic:/home/admin/mce_dump# od -t x1 mce-erst-7247272388717445121
0000000 03 00 00 20 00 00 00 90 00 00 00 00 00 00 00 00
0000020 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
0000040 00 00 00 00 00 00 00 00 0a c0 9a ec 14 47 3e 00
0000060 d7 7b 93 64 00 00 00 00 00 00 07 00 f1 06 05 00
0000100 00 00 01 00 01 00 00 00 00 00 00 00 0c 00 00 00
0000120 09 0c 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000160 32 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000200
root@sonic:/home/admin/mce_dump# od -t x1 mce-erst-7247272388717445122
0000000 75 01 11 00 00 00 00 b6 00 00 00 00 00 00 00 00
0000020 00 95 9d 0f 01 00 00 00 04 00 00 00 00 00 00 00
0000040 00 00 00 00 00 00 00 00 18 b0 9a ec 14 47 3e 00
0000060 d7 7b 93 64 00 00 00 00 00 00 07 00 f1 06 05 00
0000100 00 03 02 00 02 00 00 00 00 00 00 00 10 00 00 00
0000120 09 0c 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000160 32 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000200
root@sonic:/home/admin/mce_dump# od -t x1 mce-erst-7247272388717445123
0000000 03 00 00 20 00 00 00 90 00 00 00 00 00 00 00 00
0000020 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
0000040 00 00 00 00 00 00 00 00 92 df ec ed 14 47 3e 00
0000060 d7 7b 93 64 00 00 00 00 00 00 07 00 f1 06 05 00
0000100 00 00 00 00 00 00 00 00 00 00 00 00 04 00 00 00
0000120 09 0c 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000160 32 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000200

 

 

Diego_INTEL
Moderator
839 Views

Hello @CharlieChenEC,

 

It seems it worked: we got some register values. They are given in hexadecimal and can be converted to binary, which gives rows of 64 bits that we can check. However, I'm not sure yet whether they map directly to MC0, MC1, etc.
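
As a rough sketch of that kind of decoding, the Python below assumes each 128-byte file is a raw `struct mce` as declared in arch/x86/include/uapi/asm/mce.h for Linux 5.10. That layout is an assumption and is not confirmed anywhere in this thread, although the decoded cpuid, microcode and APIC ID values do agree with the /proc/cpuinfo output posted earlier. The bytes are copied from the od output for mce-erst-7247272388717445122; `decode_mce`, `MCE_FIELDS` and `STATUS_FLAGS` are illustrative names, not part of any tool.

```python
import struct
from datetime import datetime, timezone

# ASSUMPTION: little-endian `struct mce` from Linux 5.10 (128 bytes).
# "4x" is the compiler padding between `microcode` and `kflags`.
MCE_STRUCT = struct.Struct("<7Q4BI4B3I4QI4xQ")
MCE_FIELDS = (
    "status", "misc", "addr", "mcgstatus", "ip", "tsc", "time",
    "cpuvendor", "inject_flags", "severity", "pad",
    "cpuid",
    "cs", "bank", "cpu", "finished",
    "extcpu", "socketid", "apicid",
    "mcgcap", "synd", "ipid", "ppin",
    "microcode", "kflags",
)

# Architectural flag bits of IA32_MCi_STATUS.
STATUS_FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
                "MISCV": 59, "ADDRV": 58, "PCC": 57}

# Bytes of mce-erst-7247272388717445122, copied from the od dump.
SAMPLE_HEX = """
75 01 11 00 00 00 00 b6 00 00 00 00 00 00 00 00
00 95 9d 0f 01 00 00 00 04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 18 b0 9a ec 14 47 3e 00
d7 7b 93 64 00 00 00 00 00 00 07 00 f1 06 05 00
00 03 02 00 02 00 00 00 00 00 00 00 10 00 00 00
09 0c 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
32 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""


def decode_mce(blob):
    """Unpack one 128-byte record and add a few derived fields."""
    rec = dict(zip(MCE_FIELDS, MCE_STRUCT.unpack(blob)))
    status = rec["status"]
    rec["flags"] = [n for n, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    rec["mca_code"] = status & 0xFFFF            # architectural error code
    rec["model_code"] = (status >> 16) & 0xFFFF  # model-specific error code
    rec["when_utc"] = datetime.fromtimestamp(
        rec["time"], tz=timezone.utc).isoformat()
    # Split CPUID.1:EAX into family / model / stepping.
    eax = rec["cpuid"]
    base_family = (eax >> 8) & 0xF
    rec["family"] = base_family + (((eax >> 20) & 0xFF) if base_family == 0xF else 0)
    rec["model"] = (eax >> 4) & 0xF
    if base_family in (0x6, 0xF):
        rec["model"] |= ((eax >> 16) & 0xF) << 4
    rec["stepping"] = eax & 0xF
    return rec


rec = decode_mce(bytes.fromhex("".join(SAMPLE_HEX.split())))
for key in ("bank", "extcpu", "apicid", "flags", "when_utc"):
    print(f"{key:10} {rec[key]}")
for key in ("status", "addr", "mcgstatus", "mca_code", "microcode"):
    print(f"{key:10} {rec[key]:#x}")
```

On the device, the same function could be fed `open(path, "rb").read()` for each file. Under the stated assumption, this record comes from bank 3 on the core with APIC ID 16, with VAL, UC, EN, ADDRV and PCC set and MCA error code 0x0175. The meaning of that code still has to be looked up in the machine-check chapter of the Intel SDM.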

 

I will try to search for references to understand better those MCE obtained.


Best regards,

@Diego_INTEL 

CharlieChenEC
Beginner
770 Views

Hi @Diego_INTEL 

 

Is there any update on how to interpret the dump files?

Diego_INTEL
Moderator
758 Views

Hello @CharlieChenEC,

 

Not much to add, my apologies.

 

All the mce dumps I found mentioned the bank of the error and the error code. 

I was using Table 17-15, Error Record Serialization Table (ERST), in this document to try to understand the dump file.

https://uefi.org/sites/default/files/resources/ACPI_4_Errata_A.pdf

 

There are also some old repositories, in case you want to check them.

https://github.com/intel/mce-test/tree/master

 

You may also want to check the PStore configuration so that any future error is captured, and look at using dmesg.

https://www.ais.com/understanding-pstore-linux-kernel-persistent-storage-file-system/

 

Best regards,

@Diego_INTEL 

CharlieChenEC
Beginner
715 Views

Hi @Diego_INTEL

 

Thanks for your information.

 

I've checked the ERST section of ACPI_4_Errata_A.pdf and also Appendix N of the UEFI 2.1 specification, which ACPI_4_Errata_A.pdf refers to.

 

After checking these documents and the Linux kernel source code that handles ERST (https://elixir.bootlin.com/linux/v5.10.46/source/drivers/acpi/apei/erst.c#L1047), I expected the signature CPER_SIG_RECORD (i.e. "CPER", https://elixir.bootlin.com/linux/v5.10.46/source/include/linux/cper.h#L16 ) to appear in the mce dump files, but the signature is not there.

 

Here is an excerpt of dmesg after the DUT boots up; it seems that the ERST handling in erst.c does work.

[ 0.006336] ACPI: ERST 0x000000007E3686F8 000230 (v01 INTEL VND 00000001 INTL 00000001)
[ 0.006376] ACPI: Reserving ERST table memory at [mem 0x7e3686f8-0x7e368927]
[ 1.147066] ERST: Error Record Serialization Table (ERST) support is initialized.

 

Do you have any idea why the dump file 'mce-erst-7247272388717445121' does not contain the expected signature?

 

The page https://wiki.archlinux.org/title/Machine-check_exception says that 'rasdaemon' is the package that replaces 'mcelog'. In fact, 'rasdaemon' is running on the DUT, not 'mcelog'. I'm not sure whether it is possible to run both utilities at the same time. Do you have any suggestions on using 'rasdaemon' or 'mcelog' to find the underlying mce issue?

 

Thanks and best regards,

 

Charlie Chen

 

Diego_INTEL
Moderator
693 Views

Hello @CharlieChenEC,

 

That's an interesting finding. When I looked for MCE files to compare with yours, the ones I found were much larger: all were over 10 KB, and I remember one of 18 KB that contained a lot of information. Yours are only 128 bytes each according to your ls -l output. That's why I thought there may be some size limit, and why I suggested checking the PStore configuration before a future crash. I also suspect this is related to the missing signature in these files, but I'm not sure whether this is a case where mcelog shows only system registers because there is not enough detail about the machine to do the conversion.
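
One way to check the signature question directly is to look at the leading bytes of each file. The sketch below is only an illustration: the first 16 bytes of each dump are copied from the od output posted above, `CPER_SIG_RECORD` mirrors the constant in include/linux/cper.h, and `starts_with_cper` is a made-up helper name.

```python
# Signature expected at the very start of a CPER record header.
CPER_SIG_RECORD = b"CPER"

# First 16 bytes of each dump, copied from the `od -t x1` output in this thread.
DUMP_HEADS = {
    "mce-erst-7247272388717445121": "03 00 00 20 00 00 00 90 00 00 00 00 00 00 00 00",
    "mce-erst-7247272388717445122": "75 01 11 00 00 00 00 b6 00 00 00 00 00 00 00 00",
    "mce-erst-7247272388717445123": "03 00 00 20 00 00 00 90 00 00 00 00 00 00 00 00",
}


def starts_with_cper(blob):
    """Return True if the data begins with the CPER record signature."""
    return blob[:4] == CPER_SIG_RECORD


results = {name: starts_with_cper(bytes.fromhex(head))
           for name, head in DUMP_HEADS.items()}
for name, found in results.items():
    print(f"{name}: CPER signature {'found' if found else 'not found'}")
```

None of the three files starts with the signature. A plausible reading, which is an assumption and not something confirmed here, is that the pstore ERST backend hands user space only the payload that follows the CPER headers, so a missing signature would not by itself mean the record was truncated.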

 

rasdaemon looks interesting. I checked internally and it appears to be a recommended tool, so you can try it with confidence.

 

Also, have you checked that you are using the latest microcode? Some MCEs can be resolved by updating the microcode.
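
A first step is to record which revision each core is actually running, so it can be compared against the published microcode releases. The following is a minimal sketch that parses text in /proc/cpuinfo format. The sample is trimmed from the output posted earlier in this thread, and `microcode_revisions` is a hypothetical helper name.

```python
# Trimmed sample in /proc/cpuinfo format, taken from the output in this thread.
SAMPLE_CPUINFO = """\
processor : 0
model name : Intel(R) Atom(TM) CPU C3558 @ 2.20GHz
microcode : 0x32

processor : 1
model name : Intel(R) Atom(TM) CPU C3558 @ 2.20GHz
microcode : 0x32
"""


def microcode_revisions(cpuinfo_text):
    """Map each logical processor number to its running microcode revision."""
    revisions = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "microcode" and current is not None:
            revisions[current] = int(value, 16)
    return revisions


revs = microcode_revisions(SAMPLE_CPUINFO)
for cpu, rev in sorted(revs.items()):
    print(f"cpu {cpu}: microcode {rev:#x}")
```

On the device, the text would come from `open("/proc/cpuinfo").read()`. The script only reports the running revision; whether 0x32 is the latest one for this CPU has to be checked against Intel's microcode release notes.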

 

Best regards,

@Diego_INTEL 
