On one of our cluster nodes with Intel Xeon Phi 5110P cards (stepping B01), the node is stuck in an oops/reboot loop. If we remove the Phi cards, the node boots fine. We are running CentOS 6.3 on these nodes. With the Phi cards installed, we get the following oops every single time.
EDAC sbridge: Seeking for: dev 0d.6 PCI ID 8086:3cf5
EDAC MC0: Giving out device to 'sbridge_edac.c' 'Sandy Bridge Socket#0': DEV 0000:7f:0e.0
EDAC MC1: Giving out device to 'sbridge_edac.c' 'Sandy Bridge Socket#1': DEV 0000:ff:0e.0
EDAC sbridge: Driver loaded.
[ OK ]
vnet: mode: dma, buffers: 62
mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[Hardware Error]: APEI generic hardware error status
[Hardware Error]: severity: 1, fatal
[Hardware Error]: section: 0, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 4, root port
[Hardware Error]: version: 1.16
[Hardware Error]: command: 0x4010, status: 0x0547
[Hardware Error]: device_id: 0000:00:03.0
[Hardware Error]: slot: 0
[Hardware Error]: secondary_bus: 0x03
[Hardware Error]: vendor_id: 0x8086, device_id: 0x3c08
[Hardware Error]: class_code: 000406
[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[Hardware Error]: aer_status: 0x00004000, aer_mask: 0x00100000
[Hardware Error]: Completion Timeout
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x0037f030
Kernel panic - not syncing: Fatal hardware error!
Pid: 1671, comm: work_for_cpu Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
<NMI> [<ffffffff814fd11a>] ? panic+0xa0/0x168
[<ffffffff8130012c>] ? ghes_notify_nmi+0x17c/0x180
[<ffffffff81503325>] ? notifier_call_chain+0x55/0x80
[<ffffffff8150338a>] ? atomic_notifier_call_chain+0x1a/0x20
[<ffffffff810980ae>] ? notify_die+0x2e/0x30
[<ffffffff81500fd1>] ? do_nmi+0x1a1/0x2b0
[<ffffffff815008b0>] ? nmi+0x20/0x30
[<ffffffffa0151c90>] ? mic_irq_isr+0x0/0x30 [mic]
[<ffffffff812a5cad>] ? msi_set_mask_bit+0x4d/0x90
<<EOE>> [<ffffffff812a5d00>] ? unmask_msi_irq+0x10/0x20
[<ffffffff810ddc09>] ? default_enable+0x29/0x40
[<ffffffff810ddbce>] ? default_startup+0x1e/0x30
[<ffffffff810dc89a>] ? __setup_irq+0x32a/0x3c0
[<ffffffff810dd024>] ? request_threaded_irq+0x154/0x2f0
[<ffffffffa0151c90>] ? mic_irq_isr+0x0/0x30 [mic]
[<ffffffffa015238d>] ? mic_probe+0x3fd/0x5d0 [mic]
[<ffffffff81060262>] ? default_wake_function+0x12/0x20
[<ffffffff8108cc00>] ? do_work_for_cpu+0x0/0x30
[<ffffffff81292037>] ? local_pci_probe+0x17/0x20
[<ffffffff8108cc18>] ? do_work_for_cpu+0x18/0x30
[<ffffffff81091d66>] ? kthread+0x96/0xa0
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Rebooting in 30 seconds..
ACPI MEMORY or I/O RESET_REG.
Any help in resolving this issue is appreciated.
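For anyone reading the log above, the three AER values can be checked by hand. This is only a rough sketch: it assumes the standard PCIe AER uncorrectable error register layout, in which bit 14 is Completion Timeout, and the helper names `bit_set` and `describe_cto` are made up for illustration. It tests bit 14 in aer_status, aer_mask and aer_uncor_severity.

```shell
# Bit 14 of the PCIe AER uncorrectable error registers is Completion Timeout
# (assumption: standard AER register layout).
CTO_BIT=14

# Succeeds if bit $2 is set in the hex value $1.
bit_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

# Hypothetical helper: args are aer_status, aer_mask, aer_uncor_severity.
describe_cto() {
    if bit_set "$1" "$CTO_BIT"; then r=yes; else r=no; fi
    if bit_set "$2" "$CTO_BIT"; then m=yes; else m=no; fi
    if bit_set "$3" "$CTO_BIT"; then f=yes; else f=no; fi
    printf 'reported=%s masked=%s fatal=%s\n' "$r" "$m" "$f"
}

# Values copied from the oops above.
describe_cto 0x00004000 0x00100000 0x0037f030
```

With the values from the oops, this prints `reported=yes masked=no fatal=yes`. That matches the "Completion Timeout" line in the log: the error is not masked, and the severity register marks it fatal, which is why the kernel panics rather than logging and continuing.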
BELINDA L. (Intel) wrote:
Hi,
We want to isolate whether power management on the card might be triggering this. Can you turn off power management and retry to see whether this can be reproduced?
1) Modify mic0.conf and mic1.conf (in the /etc/sysconfig/mic/ directory) with the following change:
PowerManagement "cpufreq_on;corec6_off;pc3_off;pc6_off"
2) Restart the mpss service:
user_prompt> service mpss restart
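The two steps above could be scripted roughly as follows. This is a minimal sketch, not an official MPSS tool: the function name `set_mic_power_management` is hypothetical, it assumes GNU sed (as shipped with CentOS), and it assumes each file holds at most one line beginning with `PowerManagement`. It keeps a `.bak` copy of each file before editing.

```shell
# The setting suggested above, as a single config line.
PM_LINE='PowerManagement "cpufreq_on;corec6_off;pc3_off;pc6_off"'

# Hypothetical helper: rewrite (or append) the PowerManagement line in
# mic0.conf and mic1.conf under the directory given as $1.
set_mic_power_management() {
    conf_dir="$1"
    for conf in "$conf_dir"/mic0.conf "$conf_dir"/mic1.conf; do
        [ -f "$conf" ] || continue          # skip cards that are not configured
        cp -p "$conf" "$conf.bak"           # keep a backup of the original
        if grep -q '^PowerManagement' "$conf"; then
            sed -i "s|^PowerManagement.*|$PM_LINE|" "$conf"
        else
            printf '%s\n' "$PM_LINE" >> "$conf"
        fi
    done
}

# Usage, as root on the host:
#   set_mic_power_management /etc/sysconfig/mic && service mpss restart
```

The restart is left as a commented usage line so that the edit can be inspected (for example by diffing against the `.bak` files) before the service is bounced.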
If there is still an issue, will it be possible to access your system remotely for further debugging?
I will get the config changes done and let you know if the issue persists. With regard to remote access, I will talk to the admins and the other people concerned about setting it up for troubleshooting.
The system hasn't oopsed in the last 72 hours, unlike before. It seems that turning off power management does help. If power management is triggering this oops, is this an issue with the hardware or with the software side of power management?