
"cat /proc/scif/resume" on card crashes host system

Ahmet_Inan
Beginner
A simple

# cat /proc/scif/resume

on the card crashes the host system instantly.

Before diving into solving the issue, I only need a simple "confirmed" to know that I am not at fault with my host-side mpss-modules port to Linux-3.11.6 or with the software or hardware settings.

I am still able to see this message from the card:

# cat /proc/scif/resume
# Resuming/Waking up node

"uname -a" on the card gives the following output:

Linux sauron-mic0 2.6.38.8+mpss3.1 #1 SMP Tue Oct 15 11:49:30 PDT 2013 k1om GNU/Linux

This installation is more or less the result of "micctrl --initdefaults" after unpacking the following SUSE RPMs to "/":

glibc2.12.2pkg-libmicmgmt0-3.1-0.1.build0.glibc2.12.2.x86_64.rpm
libscif0-3.1-0.1.build0.glibc2.12.2.x86_64.rpm
mpss-boot-files-3.1-0.1.build0.glibc2.12.2.x86_64.rpm
mpss-daemon-3.1-0.1.build0.glibc2.12.2.x86_64.rpm
mpss-miccheck-bin-3.1-r1.glibc2.12.2.x86_64.rpm
mpss-micmgmt-3.1-0.1.build0.glibc2.12.2.x86_64.rpm

I was able to salvage the following kernel messages from the log server:

Nov 6 17:15:25 sauron kernel: [16796.780345] dmar: DRHD: handling fault status reg 2
Nov 6 17:15:25 sauron kernel: [16796.780355] dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr fdbcf000
Nov 6 17:15:25 sauron kernel: [16796.780355] DMAR:[fault reason 06] PTE Read access is not set
Nov 6 17:15:25 sauron kernel: [16797.071390] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
Nov 6 17:15:25 sauron kernel: [16797.071669] {1}[Hardware Error]: APEI generic hardware error status
Nov 6 17:15:25 sauron kernel: [16797.071862] {1}[Hardware Error]: severity: 1, fatal
Nov 6 17:15:25 sauron kernel: [16797.072012] {1}[Hardware Error]: section: 0, severity: 1, fatal
Nov 6 17:15:25 sauron kernel: [16797.072197] {1}[Hardware Error]: flags: 0x01
Nov 6 17:15:25 sauron kernel: [16797.072331] {1}[Hardware Error]: primary
Nov 6 17:15:25 sauron kernel: [16797.072452] {1}[Hardware Error]: section_type: PCIe error
Nov 6 17:15:25 sauron kernel: [16797.072619] {1}[Hardware Error]: port_type: 0, PCIe end point
Nov 6 17:15:25 sauron kernel: [16797.072797] {1}[Hardware Error]: version: 1.0
Nov 6 17:15:25 sauron kernel: [16797.072931] {1}[Hardware Error]: command: 0x0407, status: 0x2810
Nov 6 17:15:25 sauron kernel: [16797.073118] {1}[Hardware Error]: device_id: 0000:02:00.0
Nov 6 17:15:25 sauron kernel: [16797.073283] {1}[Hardware Error]: slot: 4
Nov 6 17:15:25 sauron kernel: [16797.073406] {1}[Hardware Error]: secondary_bus: 0x00
Nov 6 17:15:25 sauron kernel: [16797.073559] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x225d
Nov 6 17:15:25 sauron kernel: [16797.073758] {1}[Hardware Error]: class_code: 00400b
Nov 6 17:15:25 sauron kernel: [16797.073908] Kernel panic - not syncing: Fatal hardware error!
Nov 6 17:15:25 sauron kernel: [16797.076283] ------------[ cut here ]------------
Nov 6 17:15:25 sauron kernel: [16797.076439] WARNING: CPU: 0 PID: 38489 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x54/0x60()
Nov 6 17:15:25 sauron kernel: [16797.076746] Modules linked in: mic(O) dm_mod msr fuse x86_pkg_temp_thermal joydev kvm_intel kvm iTCO_wdt lpc_ich mfd_core [last unloaded: mic]
Nov 6 17:15:25 sauron kernel: [16797.077247] CPU: 0 PID: 38489 Comm: sshd Tainted: G W O 3.11.6-ainan #16
Nov 6 17:15:25 sauron kernel: [16797.077482] Hardware name: Dell Inc. PowerEdge T620/07HNGV, BIOS 2.0.19 09/02/2013
Nov 6 17:15:25 sauron kernel: [16797.077722] 000000000000007c ffff880ff490f4b0 ffffffff81c9c556 0000000000000007
Nov 6 17:15:25 sauron kernel: [16797.077985] 0000000000000000 ffff880ff490f4f0 ffffffff8106eec2 ffff880ff490f4e0
Nov 6 17:15:25 sauron kernel: [16797.078247] 0000000000000001 ffff88203f211d40 0000000000000001 ffff88103fa11d40
Nov 6 17:15:25 sauron kernel: [16797.078510] Call Trace:
Nov 6 17:15:25 sauron kernel: [16797.078594] [] dump_stack+0x46/0x58
Nov 6 17:15:25 sauron kernel: [16797.078761] [] warn_slowpath_common+0x82/0xb0
Nov 6 17:15:25 sauron kernel: [16797.078953] [] warn_slowpath_null+0x15/0x20
Nov 6 17:15:25 sauron kernel: [16797.079140] [] native_smp_send_reschedule+0x54/0x60
Nov 6 17:15:25 sauron kernel: [16797.079355] [] trigger_load_balance+0x18b/0x250
Nov 6 17:15:25 sauron kernel: [16797.084642] [] scheduler_tick+0xa9/0xe0
Nov 6 17:15:25 sauron kernel: [16797.090027] [] update_process_times+0x64/0x80
Nov 6 17:15:25 sauron kernel: [16797.095341] [] tick_sched_handle.isra.11+0x31/0x40
Nov 6 17:15:25 sauron kernel: [16797.100578] [] tick_sched_timer+0x44/0x70
Nov 6 17:15:25 sauron kernel: [16797.105639] [] __run_hrtimer.isra.31+0x4a/0xd0
Nov 6 17:15:25 sauron kernel: [16797.110570] [] hrtimer_interrupt+0x103/0x240
Nov 6 17:15:25 sauron kernel: [16797.115423] [] ? load_balance+0xf1/0x740
Nov 6 17:15:25 sauron kernel: [16797.120271] [] local_apic_timer_interrupt+0x36/0x60
Nov 6 17:15:25 sauron kernel: [16797.125197] [] smp_apic_timer_interrupt+0x3e/0x60
Nov 6 17:15:25 sauron kernel: [16797.130160] [] apic_timer_interrupt+0x6a/0x70
Nov 6 17:15:25 sauron kernel: [16797.135188] [] ? finish_task_switch+0x4e/0xe0
Nov 6 17:15:25 sauron kernel: [16797.140235] [] __schedule+0x41b/0x990
Nov 6 17:15:25 sauron kernel: [16797.145307] [] ? __intel_map_single+0x159/0x1c0
Nov 6 17:15:25 sauron kernel: [16797.150441] [] ? start_flush_work+0x103/0x140
Nov 6 17:15:25 sauron kernel: [16797.155594] [] schedule+0x24/0x70
Nov 6 17:15:25 sauron kernel: [16797.160831] [] schedule_hrtimeout_range_clock+0x115/0x130
Nov 6 17:15:25 sauron kernel: [16797.166087] [] ? tty_ldisc_try+0x4b/0x60
Nov 6 17:15:25 sauron kernel: [16797.171378] [] ? tty_write_room+0x18/0x20
Nov 6 17:15:25 sauron kernel: [16797.176643] [] ? n_tty_poll+0x1eb/0x200
Nov 6 17:15:25 sauron kernel: [16797.181893] [] schedule_hrtimeout_range+0xe/0x10
Nov 6 17:15:25 sauron kernel: [16797.187156] [] poll_schedule_timeout+0x5a/0xc0
Nov 6 17:15:25 sauron kernel: [16797.192382] [] do_select+0x70b/0x7b0
Nov 6 17:15:25 sauron kernel: [16797.197452] [] ? __pollwait+0xf0/0xf0
Nov 6 17:15:25 sauron kernel: [16797.202342] [] ? __pollwait+0xf0/0xf0
Nov 6 17:15:25 sauron kernel: [16797.206999] [] ? __pollwait+0xf0/0xf0
Nov 6 17:15:25 sauron kernel: [16797.211413] [] ? __pollwait+0xf0/0xf0
Nov 6 17:15:25 sauron kernel: [16797.215650] [] ? check_preempt_curr+0x84/0xa0
Nov 6 17:15:25 sauron kernel: [16797.219814] [] ? ttwu_do_wakeup+0x12/0x90
Nov 6 17:15:25 sauron kernel: [16797.223875] [] ? check_preempt_curr+0x84/0xa0
Nov 6 17:15:25 sauron kernel: [16797.227967] [] ? ttwu_do_wakeup+0x12/0x90
Nov 6 17:15:25 sauron kernel: [16797.232036] [] ? try_to_wake_up+0x22e/0x2a0
Nov 6 17:15:25 sauron kernel: [16797.236002] [] ? default_wake_function+0xd/0x10
Nov 6 17:15:25 sauron kernel: [16797.239961] [] ? __wake_up_common+0x58/0x90
Nov 6 17:15:25 sauron kernel: [16797.243901] [] core_sys_select+0x1fd/0x2f0
Nov 6 17:15:25 sauron kernel: [16797.247877] [] ? set_next_entity+0x7a/0xe0
Nov 6 17:15:25 sauron kernel: [16797.251827] [] ? __schedule+0x41b/0x990
Nov 6 17:15:25 sauron kernel: [16797.255745] [] ? sock_getsockopt+0xd9/0x740
Nov 6 17:15:25 sauron kernel: [16797.259651] [] ? tty_write+0x1d0/0x2a0
Nov 6 17:15:25 sauron kernel: [16797.263529] [] ? n_tty_ioctl+0xd0/0xd0
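
The "command" and "status" values in the hardware error record above are raw PCI configuration register contents. As a reading aid, here is a minimal Python sketch that decodes them, assuming the conventional PCI Command and Status register bit layout. The table and function names are illustrative only and come from neither the MPSS sources nor the kernel.

```python
# Bit names assume the conventional PCI configuration-space layout
# (Command register at offset 0x04, Status register at offset 0x06).
# Reserved bits and the two-bit DEVSEL timing field are left out.
PCI_COMMAND_BITS = {
    0: "I/O Space Enable",
    1: "Memory Space Enable",
    2: "Bus Master Enable",
    3: "Special Cycle Enable",
    4: "Memory Write and Invalidate",
    5: "VGA Palette Snoop",
    6: "Parity Error Response",
    8: "SERR# Enable",
    9: "Fast Back-to-Back Enable",
    10: "Interrupt Disable",
}

PCI_STATUS_BITS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    5: "66 MHz Capable",
    7: "Fast Back-to-Back Capable",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode(value, names):
    """Return the names of all set bits in value, lowest bit first."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


# Values copied from the "command: 0x0407, status: 0x2810" log line.
command = decode(0x0407, PCI_COMMAND_BITS)
status = decode(0x2810, PCI_STATUS_BITS)

print("command:", ", ".join(command))
print("status: ", ", ".join(status))
```

With the values from the log, the Status register shows Received Master Abort and Signaled Target Abort set, which would be consistent with the blocked DMA read reported by the DMAR lines just before it. That is an interpretation of the register bits, not a confirmed diagnosis.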
Loc_N_Intel
Employee

Hi Ahmet,

Sorry for the delay. I can confirm that issuing only the following command on the coprocessor causes the host to stop communicating with the coprocessor:

# cat /proc/scif/resume

However, I think you need to suspend the SCIF service before resuming it. In other words, issuing the following commands in this order should not cause any problem:

# cat /proc/scif/suspend

# cat /proc/scif/resume 

Hope this helps. Thank you.

Ahmet_Inan
Beginner

I've updated to mpss-3.1.1 and applied a flash update to see if this is still an issue:

[root@sauron-mic0 ~]# cat /proc/scif/suspend 
[root@sauron-mic0 ~]# cat /proc/scif/resume

And again, the host system crashes hard and waits for the watchdog to reset it.
It really shouldn't crash the host system in any case.
At least only the root user on the mic card has access to "/proc/scif", so an unprivileged user cannot trigger the crash:

[ainan@sauron-mic0 ~]$ cat /proc/scif/resume
cat: can't open '/proc/scif/resume': Permission denied

So I will listen to the doctor and not touch it if it hurts :)

Ahmet
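
The permission observation in the post above can be checked without reading the file at all. The following is a minimal sketch, assuming a Linux system where the file mode bits reflect who may open the entry; "accessible_to_others" is a hypothetical helper name, and the demo runs on a throwaway temporary file rather than on "/proc/scif".

```python
import os
import stat
import tempfile


def accessible_to_others(path):
    """Return True if any group or other permission bit is set on path.

    Only os.stat() is used, so the file is never opened or read.
    """
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


# Demo on a temporary file so nothing sensitive is touched.
with tempfile.NamedTemporaryFile() as tmp:
    os.chmod(tmp.name, 0o600)  # owner-only, like a root-only proc entry
    owner_only = accessible_to_others(tmp.name)
    os.chmod(tmp.name, 0o644)  # readable by everyone
    world_readable = accessible_to_others(tmp.name)

print("0600 accessible to others:", owner_only)
print("0644 accessible to others:", world_readable)
```

On the card, the same function could be pointed at "/proc/scif/resume". Because it only calls stat, it should not trigger the resume action that reading the file does.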
