
Kernel oops with Intel Xeon Phi 5110P

Bharath_R_

On one of our cluster nodes with the Intel Xeon Phi 5110P (stepping B01), the node is stuck in an oops/reboot loop. If we remove the Phi cards, the node boots fine. We are running CentOS 6.3 on these nodes. With the Phi cards installed, we get the following oops every single time.

EDAC sbridge: Seeking for: dev 0d.6 PCI ID 8086:3cf5
EDAC MC0: Giving out device to 'sbridge_edac.c' 'Sandy Bridge Socket#0': DEV 0000:7f:0e.0
EDAC MC1: Giving out device to 'sbridge_edac.c' 'Sandy Bridge Socket#1': DEV 0000:ff:0e.0
EDAC sbridge: Driver loaded.
[ OK ]
vnet: mode: dma, buffers: 62
mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[Hardware Error]: APEI generic hardware error status
[Hardware Error]: severity: 1, fatal
[Hardware Error]: section: 0, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 4, root port
[Hardware Error]: version: 1.16
[Hardware Error]: command: 0x4010, status: 0x0547
[Hardware Error]: device_id: 0000:00:03.0
[Hardware Error]: slot: 0
[Hardware Error]: secondary_bus: 0x03
[Hardware Error]: vendor_id: 0x8086, device_id: 0x3c08
[Hardware Error]: class_code: 000406
[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[Hardware Error]: aer_status: 0x00004000, aer_mask: 0x00100000
[Hardware Error]: Completion Timeout
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x0037f030
Kernel panic - not syncing: Fatal hardware error!
Pid: 1671, comm: work_for_cpu Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
<NMI> [<ffffffff814fd11a>] ? panic+0xa0/0x168
[<ffffffff8130012c>] ? ghes_notify_nmi+0x17c/0x180
[<ffffffff81503325>] ? notifier_call_chain+0x55/0x80
[<ffffffff8150338a>] ? atomic_notifier_call_chain+0x1a/0x20
[<ffffffff810980ae>] ? notify_die+0x2e/0x30
[<ffffffff81500fd1>] ? do_nmi+0x1a1/0x2b0
[<ffffffff815008b0>] ? nmi+0x20/0x30
[<ffffffffa0151c90>] ? mic_irq_isr+0x0/0x30 [mic]
[<ffffffff812a5cad>] ? msi_set_mask_bit+0x4d/0x90
<<EOE>> [<ffffffff812a5d00>] ? unmask_msi_irq+0x10/0x20
[<ffffffff810ddc09>] ? default_enable+0x29/0x40
[<ffffffff810ddbce>] ? default_startup+0x1e/0x30
[<ffffffff810dc89a>] ? __setup_irq+0x32a/0x3c0
[<ffffffff810dd024>] ? request_threaded_irq+0x154/0x2f0
[<ffffffffa0151c90>] ? mic_irq_isr+0x0/0x30 [mic]
[<ffffffffa015238d>] ? mic_probe+0x3fd/0x5d0 [mic]
[<ffffffff81060262>] ? default_wake_function+0x12/0x20
[<ffffffff8108cc00>] ? do_work_for_cpu+0x0/0x30
[<ffffffff81292037>] ? local_pci_probe+0x17/0x20
[<ffffffff8108cc18>] ? do_work_for_cpu+0x18/0x30
[<ffffffff81091d66>] ? kthread+0x96/0xa0
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Rebooting in 30 seconds..
ACPI MEMORY or I/O RESET_REG.

Any help in resolving this issue is appreciated.
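
For anyone reading the error record above, here is a minimal sketch that decodes the three AER register values from the log. The bit names are taken from the PCI Express AER Uncorrectable Error Status register layout; the function name `decode_aer` and the table name are just illustrative, not part of any kernel or Intel tool.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# The same layout applies to the mask and severity registers.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer(value):
    """Return the names of all known bits set in an AER uncorrectable register value."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items()) if value & (1 << bit)]


# Values copied from the log above.
aer_status = 0x00004000
aer_mask = 0x00100000
aer_uncor_severity = 0x0037F030

reported = decode_aer(aer_status)        # errors the root port reported
masked = decode_aer(aer_mask)            # errors that are masked off
fatal = decode_aer(aer_uncor_severity)   # errors configured as fatal

print("reported:", reported)
print("masked:", masked)
print("Completion Timeout treated as fatal:", "Completion Timeout" in fatal)
```

The status value decodes to a single set bit, Completion Timeout, which matches the "Completion Timeout" line in the log, and the severity register marks that error as fatal, which is consistent with the panic.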

22 Replies
Bharath_R_

BELINDA L. (Intel) wrote:

Hi,

We want to isolate whether power management on the card might be triggering this. Can you turn off power management and retry to see whether the problem can still be reproduced?

 

1) Modify mic0.conf and mic1.conf (in the /etc/sysconfig/mic/ directory) with the following change:

PowerManagement "cpufreq_on;corec6_off;pc3_off;pc6_off"

 

2) Restart the mpss service:

user_prompt> service mpss restart
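
With several nodes and two cards each, step 1 could be scripted. The following is only a sketch: it assumes each conf file carries the setting as a single `PowerManagement "..."` line, exactly as quoted in the reply, and the helper name `set_power_management` is hypothetical. It works on the file contents as a string, so reading and writing the files under /etc/sysconfig/mic/ is left to the caller.

```python
import re

# Value suggested in the reply above.
PM_VALUE = "cpufreq_on;corec6_off;pc3_off;pc6_off"

# Matches a whole PowerManagement line (assumed format: one setting per line).
PM_LINE = re.compile(r"^[ \t]*PowerManagement\b.*$", re.MULTILINE)


def set_power_management(conf_text, value=PM_VALUE):
    """Replace the PowerManagement line in conf_text, or append one if absent."""
    new_line = 'PowerManagement "{}"'.format(value)
    if PM_LINE.search(conf_text):
        # A callable replacement avoids any escape processing of new_line.
        return PM_LINE.sub(lambda _match: new_line, conf_text)
    # No existing entry: append, keeping the text newline-terminated.
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + new_line + "\n"


# Hypothetical file contents, for illustration only.
sample = '# hypothetical sample\nPowerManagement "cpufreq_on;corec6_on;pc3_on;pc6_on"\n'
updated = set_power_management(sample)
print(updated)
```

After writing the updated text back to mic0.conf and mic1.conf, the mpss service still needs the restart from step 2 for the change to take effect.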

 

If there is still an issue, would it be possible to access your system remotely for further debugging?

 

I will get the config changes done and let you know if the issue persists. With regard to remote access, I will talk to the admins and the other people concerned about setting it up for troubleshooting.

Bharath_R_

The system has not oopsed in the last 72 hours, whereas earlier it oopsed on every boot. It seems that turning off power management does help. If power management is triggering this oops, is this an issue with the hardware or with the software side of power management?
