Solved: Xeon 8270 & Xeon 4216 based servers reset by BMC watchdog due to MCE

OttoChow · 06-13-2025

I just got a report from a CSP customer that 4 CLX-based systems (Xeon 8270 and Xeon 4216) were rebooted by the BMC watchdog with the same trace within a month (their server base is > 50K units). According to the customer, MCE routines run regularly, and the fault was triggered during one of these regular MCE runs.

The BMC watchdog reset the system (an event is logged in the BMC server log) because the kernel dump took too long to complete (we do not know why it took so long). As a result, the recorded kernel dump file is incomplete.

The trace is as follows:

=====================

[5225781.544154] [ C60] BUG: kernel NULL pointer dereference, address: 0000000000000065

[5225781.552135] [ C60] #PF: supervisor write access in kernel mode

[5225781.558324] [ C60] #PF: error_code(0x0002) - not-present page

[5225781.564442] [ C60] PGD 166229067 P4D 166229067 PUD 16622a067 PMD 0

[5225781.571052] [ C60] Oops: 0002 [#1] SMP NOPTI

[5225781.575647] [ C60] CPU: 60 PID: 0 Comm: swapper/60 Kdump: loaded Tainted: G S OE 5.15.0-162.011-XXXXX #162~24.04

[5225781.587169] [ C60] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.21.2 02/19/2024

[5225781.595753] [ C60] RIP: 0010:mce_setup+0x117/0x1c0

[5225781.600904] [ C60] Code: 84 04 f4 00 00 00 89 43 4c 0f 32 48 c1 e2 20 48 09 c2 65 48 8b 05 11 8c 5e 78 48 0f ba e0 2e 48 89 53 50 73 51 b9 4f 00 00 00 <0f> 32 48 c1 e2 20 48 09 c2 48 89 53 68 8b 05 5e 94 5f 01 89 43 70

[5225781.621453] [ C60] RSP: 0018:ffffa36e19d64d60 EFLAGS: 00010207

[5225781.627371] [ C60] RAX: ce6f60da00000121 RBX: ffffa36e19d64db0 RCX: 00000000000000ea

[5225781.635156] [ C60] RDX: 000000000f000814 RSI: 0000000000000000 RDI: 0000000000000065

[5225781.642898] [ C60] RBP: ffffa36e19d64d98 R08: 0000000000000000 R09: 0000000000000000

[5225781.650697] [ C60] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000019460

[5225781.658939] [ C60] R13: 000000000000003c R14: 0000000000019460 R15: 0000000000000002

[5225781.666697] [ C60] FS: 0000000000000000(0000) GS:ffff9172bfb80000(0000) knlGS:0000000000000000

[5225781.675474] [ C60] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[5225781.681862] [ C60] CR2: 0000000000000065 CR3: 00000001f84a2003 CR4: 00000000007706e0

[5225781.689579] [ C60] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[5225781.697305] [ C60] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

[5225781.705042] [ C60] PKRU: 55555554

[5225781.708347] [ C60] Call Trace:

[5225781.711346] [ C60] <IRQ>

[5225781.713897] [ C60] ? show_trace_log_lvl+0x1de/0x30f

[5225781.718786] [ C60] ? show_trace_log_lvl+0x1de/0x30f

[5225781.724107] [ C60] ? machine_check_poll+0x64/0x2b0

[5225781.729714] [ C60] ? show_regs.part.0+0x23/0x31

[5225781.734296] [ C60] ? __die_body.cold+0x8/0xd

[5225781.738586] [ C60] ? __die+0x2a/0x3b

[5225781.742159] [ C60] ? page_fault_oops+0x174/0x1b0

[5225781.746834] [ C60] ? do_user_addr_fault+0x34a/0x650

[5225781.751701] [ C60] ? exc_page_fault+0x81/0x190

[5225781.756139] [ C60] ? asm_exc_page_fault+0x27/0x30

[5225781.760938] [ C60] ? mce_setup+0x117/0x1c0

[5225781.765005] [ C60] ? mce_cpu_restart+0xf0/0xf0

[5225781.769418] [ C60] machine_check_poll+0x64/0x2b0

[5225781.774000] [ C60] ? mce_cpu_restart+0xf0/0xf0

[5225781.778415] [ C60] mce_timer_fn+0xaa/0xf0

[5225781.782389] [ C60] call_timer_fn+0x2c/0x130

[5225781.786521] [ C60] __run_timers+0x23f/0x2c0

[5225781.790629] [ C60] ? tick_sched_handle+0x33/0x70

[5225781.795165] [ C60] run_timer_softirq+0x1d/0x40

[5225781.799671] [ C60] handle_softirqs+0xe0/0x300

[5225781.803961] [ C60] irq_exit_rcu+0x9e/0xd0

[5225781.807867] [ C60] sysvec_apic_timer_interrupt+0x92/0xd0

[5225781.813067] [ C60] </IRQ>

[5225781.815584] [ C60] <TASK>

[5225781.818099] [ C60] asm_sysvec_apic_timer_interrupt+0x1b/0x20

[5225781.823653] [ C60] RIP: 0010:cpuidle_enter_state+0xd9/0x630

[5225781.829074] [ C60] Code: 3d dc 90 91 78 e8 b7 28 62 ff 49 89 c7 0f 1f 44 00 00 31 ff e8 08 37 62 ff 80 7d d0 00 0f 85 72 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 7e 01 00 00 4d 63 ee 49 83 fd 0a 0f 83 f4 03 00 00

[5225781.848690] [ C60] RSP: 0018:ffffa36e18fcfe28 EFLAGS: 00000246

[5225781.854361] [ C60] RAX: 0000000000000000 RBX: ffffc2ebffb879a0 RCX: 0000000000000000

[5225781.861927] [ C60] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000

[5225781.869486] [ C60] RBP: ffffa36e18fcfe78 R08: 0000000000000000 R09: 0000000000000000

[5225781.877031] [ C60] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff88ed7fa0

[5225781.884551] [ C60] R13: 0000000000000001 R14: 0000000000000001 R15: 001290d2149bfc14

[5225781.892115] [ C60] ? cpuidle_enter_state+0xc8/0x630

[5225781.897079] [ C60] ? tick_nohz_stop_tick+0x13a/0x200

[5225781.902071] [ C60] cpuidle_enter+0x2e/0x50

[5225781.906038] [ C60] cpuidle_idle_call+0x154/0x1f0

[5225781.910517] [ C60] do_idle+0x83/0x100

[5225781.914181] [ C60] cpu_startup_entry+0x1d/0x20

[5225781.918507] [ C60] start_secondary+0x12a/0x190

[5225781.922910] [ C60] secondary_startup_64_no_verify+0xc2/0xcb

[5225781.928447] [ C60] </TASK>

[5225781.931157] [ C60] Modules linked in: XXXXX_linux_hotfix_fsync_thread(OE) xfrm_user xfrm_algo XXXXX_linux_hotfix_elastic_io_scheduler(OE) XXXXX_linux_hotfix_domain_dirty_limits(OE) tcp_diag inet_diag unix_diag br_netfilt

The infrastructure software developer traced the starting point of the fault to mce_setup() in the kernel source. The developer believes the fault occurred because the instruction indicated by the RIP register was executed and pointed to a wrong page address, but he does not know why a wrong page address would occur in this case.

No MCE events, such as memory ECC corrections, were reported.
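For anyone reading the trace, the fault type can be decoded mechanically from the error code. Below is a minimal Python sketch, not part of the customer's analysis. It assumes the standard x86 page-fault error-code layout (bit 0 = page present, bit 1 = write, bit 2 = user mode, bit 3 = reserved bit, bit 4 = instruction fetch), and the helper name `decode_pf_error_code` is made up for illustration. The values are copied from the trace above.

```python
# Hypothetical helper: decode an x86 page-fault error code as printed in an Oops.
# Each entry: (bit mask, meaning if set, meaning if clear or None to skip).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel (supervisor) mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code: int) -> list:
    """Return a human-readable description for each relevant bit."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear is not None:
            out.append(if_clear)
    return out


# Values copied from the trace above.
error_code = 0x0002
cr2 = 0x0000000000000065  # faulting address reported by the CPU
rdi = 0x0000000000000065  # RDI at the time of the fault

print(decode_pf_error_code(error_code))
print("CR2 == RDI:", cr2 == rdi)
```

For error code 0x0002 this reports a kernel-mode write to a not-present page, which agrees with the two `#PF` lines in the trace, and it shows that the faulting address in CR2 is the same small value (0x65) that RDI holds.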

The fault is traced to the kernel source file.
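The `Code:` line of the Oops marks the byte at RIP with angle brackets, so the faulting instruction can be read directly from the trace. The sketch below is only an illustration: the opcode table is deliberately tiny (an assumption, it covers just what this trace needs), the helper name `parse_code_line` is hypothetical, and it assumes the five bytes before RIP start on an instruction boundary.

```python
# Code: line copied from the first register dump in the trace above.
CODE_LINE = (
    "84 04 f4 00 00 00 89 43 4c 0f 32 48 c1 e2 20 48 09 c2 65 48 8b 05 "
    "11 8c 5e 78 48 0f ba e0 2e 48 89 53 50 73 51 b9 4f 00 00 00 <0f> 32 "
    "48 c1 e2 20 48 09 c2 48 89 53 68 8b 05 5e 94 5f 01 89 43 70"
)

# Deliberately incomplete table of two-byte x86 opcodes (illustration only).
KNOWN_OPCODES = {(0x0F, 0x30): "wrmsr", (0x0F, 0x32): "rdmsr"}


def parse_code_line(code_line):
    """Return (all bytes as ints, index of the byte marked <xx>, i.e. at RIP)."""
    tokens = code_line.split()
    rip_index = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    data = [int(t.strip("<>"), 16) for t in tokens]
    return data, rip_index


data, rip = parse_code_line(CODE_LINE)
mnemonic = KNOWN_OPCODES.get(tuple(data[rip:rip + 2]), "unknown")

# Opcode 0xB9 followed by a 32-bit little-endian immediate is "mov ecx, imm32".
ecx_imm = None
if rip >= 5 and data[rip - 5] == 0xB9:
    ecx_imm = int.from_bytes(bytes(data[rip - 4:rip]), "little")

print("instruction at RIP:", mnemonic)
print("immediate loaded into ECX just before:", hex(ecx_imm) if ecx_imm is not None else None)
```

Under those assumptions, the bytes at RIP (`0f 32`) are RDMSR, preceded by a load of 0x4f into ECX. RDMSR has no memory operand, so a write page fault reported at this instruction looks unusual, and the RCX value in the register dump (0xea) does not obviously match 0x4f. Both points may be worth raising with whoever analyzes the dump.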

Has anyone come across a similar case in the past? Is it a processor issue?

Thank you.

Kind Regards,

Otto

Sazirah · 06-13-2025

Hi OttoChow,

Thank you for posting in Intel Community Forum.

Regarding this issue, kindly give us some time as we are currently checking it at our end. We will get back to you soon with any update.

Thank you.

Regards,

Sazzy_Intel

Intel Customer Support Technician

Sazirah · 06-17-2025

Hi OttoChow,

Greetings.

Thank you for patiently waiting for our update.

Upon checking this issue, we found that it lies with the kernel running on a Dell system. We have checked the shared dump, and there is no CPU-related error. The BMC is also managed by the board, which is made by Dell. Therefore, we strongly advise you to contact Dell support regarding this BMC issue.

As for the CPU, you may use the Intel Data Center Diagnostic Tool, instead of the BMC, to check the health of the processor.

Hope this clarifies.

Regards,

Sazzy_Intel

Intel Customer Support Technician

OttoChow · 06-17-2025

Could you share more about the kernel issue on Dell systems that you mentioned? My customer has had multiple incidents with the same trace, and it leads to mce_setup() in Linux. If it is not OK to post the information on community.intel.com, please reach out to me at otto.chow@intel.com or on Teams.
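To check that the incidents really share one trace, the dmesg output from each system can be reduced to a short signature and compared. The following is a rough Python sketch under stated assumptions: the function name `oops_signature` and the regular expressions are illustrative, and they only handle the line layout seen in the trace posted above (a first `RIP:` line, then call-trace frames, where frames prefixed with `?` are unreliable and are skipped).

```python
import re

# Matches "RIP: 0010:mce_setup+0x117/0x1c0" and captures the symbol name.
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Matches a call-trace frame at the end of a line; group 1 is the optional "? ".
FRAME_RE = re.compile(r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")


def oops_signature(lines):
    """Return (faulting symbol, tuple of reliable call-trace symbols)."""
    rip = None
    frames = []
    for line in lines:
        m = RIP_RE.search(line)
        if m:
            if rip is None:  # keep only the first RIP (the faulting one)
                rip = m.group(1)
            continue
        m = FRAME_RE.search(line)
        if m and not m.group(1):  # skip "?" frames
            frames.append(m.group(2))
    return rip, tuple(frames)


# A few lines copied from the trace in the first post.
sample = [
    "[5225781.595753] [ C60] RIP: 0010:mce_setup+0x117/0x1c0",
    "[5225781.760938] [ C60] ? mce_setup+0x117/0x1c0",
    "[5225781.769418] [ C60] machine_check_poll+0x64/0x2b0",
    "[5225781.778415] [ C60] mce_timer_fn+0xaa/0xf0",
]
print(oops_signature(sample))
```

Two incidents with the same signature fault in the same function via the same call path, which is a stronger statement than "the traces look alike".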

Thank you.

Regards,

Otto

Sazirah · 06-17-2025

Hi OttoChow,

Thank you for your reply.

After checking the shared dump log, we found no CPU-related errors. The issue appears to be related to kernel behavior on a Dell system. As such, it falls outside the scope of what we can directly address.

The BMC is managed by the system board, which is manufactured by Dell. We recommend reaching out to Dell support for further assistance with the BMC-related concerns.

Hope this helps.

Regards,

Sazzy_Intel

Intel Customer Support Technician

Vik3 · 06-20-2025

Hello OttoChow,

We have reviewed the logs and it appears that the issue may be kernel-related on your Dell system. We did not find any CPU-related errors in the dump logs you provided. Since the BMC is managed by the system manufacturer (Dell), we recommend reaching out to Dell Support for further assistance with BMC-related concerns.

For CPU health verification, please use the Intel Data Center Diagnostic Tool instead of the BMC.

If you have a dedicated Field Application Engineer (FAE), we recommend raising an IPS case instead, as it would be more suitable for your request.

Regards,

Vikas

Intel Customer Support Technician

Azhari_Intel · 06-23-2025

Hi OttoChow,

Good day to you.

Just wanted to follow up with you. Kindly let us know if you have any further concerns.

Looking forward to your response.

Regards,

Azhari_Intel

Ragulan_Intel · 06-29-2025

Hi OttoChow,

Good day to you.

We’re writing to inform you that your thread will be archived, as there are currently no pending actions required from Intel Customer Support.

Should you need any further assistance in the future, please don’t hesitate to reach out. We’re always here to help.

Best regards,

Ragulan_Intel