Tobias_K_ · ‎06-28-2017

When I try to install MPSS 3.8.2 for my Xeon Phi 31S1P coprocessor on CentOS 7.3, the system crashes. Is there anything I can try, or is there a known fix? Any help would be highly appreciated, thank you!

I downloaded mpss-3.8.2 (released April 25, 2017) from https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#lx38rel and followed the instructions in the readme file. Because the kernel on my system is slightly newer than the one the MPSS download was built for, I had to recompile MPSS, which worked fine. The RPM packages also install, but with the following error message (I am not sure whether it is related to the problem at all):

depmod: ERROR: failed to load symbols from /lib/modules/3.10.0-514.21.2.el7.x86_64/extra/nvidia-uvm.ko: Invalid argument

After having installed the mpss-software, however, I can no longer boot the system (see below).

When I execute "modprobe mic", I get the following error message three times:

NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [modprobe:17376]

After displaying this message three times, the command prompt reappears. I can execute "micctrl --initdefaults" without any messages being displayed.
If I then execute "micctrl -s" I get the error "mic0: reset failed".
If I try "/usr/bin/miccheck", the system freezes completely.

After installing MPSS, I get the errors below when rebooting; that is, the system can no longer boot. I can correct the problem by entering recovery mode and running the "uninstall.sh" script delivered with the MPSS download. After that, I can reboot the system without problems.

The coprocessor is correctly identified by "lspci" as below and large BAR support has been enabled in the BIOS ("above 4G decoding"):

09:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
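To confirm that a large BAR was actually assigned to the card, and not merely enabled in the BIOS, one option is to read the device's `resource` file in sysfs. The sketch below parses that file format (one line per region with start, end and flags in hex) and reports each BAR's size and whether it is mapped above 4 GiB. The sample addresses are made up for illustration and were not taken from this system.

```python
FOUR_GIB = 1 << 32

def parse_bars(resource_text):
    """Map BAR index -> (start, size) for the first six lines of a sysfs 'resource' file."""
    bars = {}
    for index, line in enumerate(resource_text.splitlines()[:6]):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed or empty lines
        start, end = int(fields[0], 16), int(fields[1], 16)
        if start == 0 and end == 0:
            continue  # unassigned BAR, or upper half of a 64-bit BAR
        bars[index] = (start, end - start + 1)
    return bars

# Hypothetical sample contents (NOT from the system in this thread).
SAMPLE = (
    "0x0000038000000000 0x00000381ffffffff 0x000000000014220c\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
    "0x00000000fb900000 0x00000000fb91ffff 0x0000000000140204\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
)

for index, (start, size) in sorted(parse_bars(SAMPLE).items()):
    where = "above" if start >= FOUR_GIB else "below"
    print(f"BAR{index}: {size // 1024} KiB, mapped {where} 4 GiB")
```

On a real host the text would come from a path such as `/sys/bus/pci/devices/0000:09:00.0/resource`, assuming the 09:00.0 address that lspci reports here.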

---BASIC SYSTEM INFORMATION---

ASUS X99-E WS
Intel Xeon E5-2696V3
64 GB RAM
NVIDIA GeForce GTX 1080

---ERROR LOG WHEN REBOOTING---

[   12.1884] pcieport 0000:00:02.0 PCIe Bus Error: severity: Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester-ID)
[   12.1885] pcieport 0000:00:02.0   device [8086:2f04] error status/mask=000040000/00000000
[   12.1886] pcieport 0000:00:02.0    [14] Completion Timeout    (First)
[   40.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:784]
[   68.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:784]
[   72.2060] INFO: rcu_sched self-detected stall on CPU { 0}  (t=60001 jiffies g=135 c=134 q=2018)
[  100.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:784]
[  113.4049] ETC timer compensation(-1000000ppm) is much higherthan expected
[  113.4049] pcieport 0000:00:02.0:  device [8086:2f04] error status/mask=000040000/00000000
[  113.4049] pcieport 0000:00:02.0:   [14] Completion Timeout    (First)
...
[  120.8210] mce: [Hardware Error]: CPU 16: Machine Check Exception: 0 Bank 3: fe00000000800400
[  120.8210] mce: [Hardware Error]: TSC 0 ADDR ffe0000000000000 MISC ffffffff81060ff5
[  120.8210] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1498668525 SOCKET 0 APIC 34 microcode 38
...
[  120.8210] mce: [Hardware Error]: CPU 22: Machine Check Exception: 5 Bank 18: be200000008c110a
[  120.8210] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81060fe6> {native_safe_halt+0x6/0x10}
[  120.8210] mce: [Hardware Error]: TSC e627fde4082 ADDR e0900fc0 MISC 74fc381600402086
[  120.8210] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1498661446 SOCKET 0 APIC 9 microcode 38
[  120.8210] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
[  120.8210] mce: [Hardware Error]: Machine check: Processor context corrupt
[  120.8210] Kernel panic - not syncing: Fatal machine check on current CPU
[  120.8210] Shutting down cpus with NMI
[  120.8210] Rebooting in 30 seconds..
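The 64-bit value after "Bank N:" in the mce lines is the bank's status register. As a rough aid for reading such lines, the sketch below splits a status value into the architectural flag bits (63 down to 57) and the two 16-bit error-code fields. The bit positions follow the machine-check architecture description in Intel's Software Developer's Manual; the function and field names are made up for illustration, and interpreting the model-specific code would need processor-specific documentation.

```python
# Architectural flag bits of an IA32_MCi_STATUS register.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the MISC register is valid
    58: "ADDRV",  # the ADDR register is valid
    57: "PCC",    # processor context corrupt
}

def decode_mci_status(status):
    """Split a 64-bit machine-check status value into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(MCI_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }

# The two status values from the log above.
for value in (0xFE00000000800400, 0xBE200000008C110A):
    decoded = decode_mci_status(value)
    print(f"{value:016x}: {' '.join(decoded['flags'])} "
          f"mca=0x{decoded['mca_error_code']:04x}")
```

Both values have the PCC bit set, which is consistent with the "Processor context corrupt" line that precedes the kernel panic.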

Loc_N_Intel · ‎06-30-2017

Hi Tobias,

Let me try to reproduce the problem you saw. I will get back to you.

Thanks.

Loc_N_Intel · ‎07-07-2017

I finally set up a host system running CentOS 7.3, connected to two Intel Xeon Phi x100 coprocessors. I then upgraded the kernel to 3.10.0-514.26.2.el7.x86_64 and rebuilt mpss-modules-3.10.0-514.26.2.el7.x86_64-3.8.2-1.x86_64 and mpss-modules-dev-3.10.0-514.26.2.el7.x86_64-3.8.2-1.x86_64.

I installed MPSS 3.8.2 with the rebuilt mpss-modules packages successfully. The MPSS service came up, micinfo showed the information correctly, and mpsscheck passed. Everything worked just fine.

The difference is that I did not have a graphics card in my system. I am not sure whether that caused the problem you observed.

Tobias_K_ · ‎07-09-2017

Hi Loc Q,

Thank you very much for taking the time to check this. I really appreciate it!

If I understood you correctly, the problem might be a conflict between the Xeon Phi and one of my graphics cards. I will see what happens if I replace the graphics cards.

I have one question, though: given that I can see the Xeon Phi coprocessor using "lspci" can I assume that it is properly working? Or is there some other way I might be able to check the integrity of the coprocessor itself?

Again, thank you very much for your help,

Tobias

Loc_N_Intel · ‎07-10-2017

Hi Tobias,

I would remove the graphics card and the driver if necessary, then re-do everything.

For your other question: the fact that "lspci" displays the coprocessor does not mean that it works correctly. You should install the MPSS stack, bring the MPSS service up, and run the "miccheck" utility. If all the tests report "PASS", then I would say your coprocessor is good.
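The check sequence described here could be scripted roughly as follows. This is only a sketch under assumptions: it supposes that the service is started with `systemctl start mpss` on CentOS 7 and that each tool signals failure through a non-zero exit code, neither of which is confirmed in this thread. The helper name `first_failure` is made up for illustration.

```python
import subprocess

# Assumed command sequence for a CentOS 7 host with MPSS 3.x installed.
CHECK_STEPS = [
    ["systemctl", "start", "mpss"],  # bring the MPSS service up (assumed command)
    ["micctrl", "-s"],               # show the coprocessor state
    ["miccheck"],                    # run the host-side sanity tests
]

def run_command(cmd):
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode

def first_failure(steps, run=run_command):
    """Run the steps in order; return the first failing command, or None."""
    for cmd in steps:
        if run(cmd) != 0:
            return cmd  # stop here, later steps depend on this one
    return None
```

Calling `first_failure(CHECK_STEPS)` as root would stop at the first step that fails, so a "reset failed" state would be caught before miccheck runs.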

Thank you

Tobias_K_ · ‎07-12-2017

Hi Loc Q,

Unfortunately, it did not work. I completely removed all graphics cards, leaving nothing but the Xeon Phi in the system. To make sure no other driver could cause a conflict, I installed a fresh copy of CentOS without any graphical server at all. I then connected via SSH and tried to install MPSS again. The result was the same, the only difference being that the error messages sometimes read "pcieport 0000:00:02.0 PCIe Bus Error" and at other times stayed at "NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s!".
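In the "PCIe Bus Error" lines of the original boot log, the bracketed number (such as [14]) is the index of the bit set in the port's AER uncorrectable error status register. The sketch below maps such bits to the error names used in the PCI Express specification; the exact wording printed by a given kernel version may differ, and the function name is made up for illustration.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}

def decode_aer_uncorrectable(status):
    """Return the names of all known error bits set in the status value."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if (status >> bit) & 1
    ]

# Bit 14 alone corresponds to the "[14] Completion Timeout" line.
print(decode_aer_uncorrectable(1 << 14))
```

A completion timeout on the root port means that a request sent to the device behind it was never answered, which fits both an unresponsive card and a platform-level problem, so the message by itself does not distinguish between the two.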

Am I correct in assuming that the most probable cause is not necessarily a faulty coprocessor card, but rather an incompatibility between my motherboard and the coprocessor? May I ask: in the BIOS, I enabled "above 4G decoding", enabled "VT-d support" and set the PCI-bus to "Gen2 speed". Is there anything else I should look out for?

Anyway, thank you very much for your help and patience!

Thank you,

Tobias

Tobias_K_ · ‎07-12-2017

P. S. If anyone else should encounter similar problems: it turned out to be a good idea to have two shells open at the same time. As soon as the system becomes unstable (usually after the "modprobe mic"), I execute the "uninstall.sh"-script from within the second shell. After a subsequent reboot, everything is back to normal. Hence kudos to the Intel team who wrote such a handy "uninstall/rescue"-script.

Tobias_K_ · ‎08-08-2017

I think I finally found the problem: the Xeon Phi coprocessor itself seems to have been damaged. Eventually, I put a new coprocessor into my machine and had absolutely no trouble installing MPSS.

@Loc Q: Thank you for your support!

CentOS 7.3 crashes after installation of MPSS 3.8.2