We are currently facing an issue with one of our two Xeon Phis.
After successfully setting up both cards, mic0 and mic1, using the MPSS Guide (including SSH, "Hello World" etc.), mic1 stops working after one or two reboots. In detail: it gets stuck during the transition from reset to ready.
So far we could monitor the Phi getting stuck with the following post codes:
# see dmesg.txt for whole file
[ 705.924029] mic1: Resetting (Post Code 09)
[ 705.924043] mic1: Transition from state resetting to reset failed
[ 705.924048] MIC 1 RESETFAIL postcode 09 14640
[ 745.876033] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
[ 851.203215] mic1: Transition from state reset failed to resetting
[ 853.204412] mic1: Resetting (Post Code 3C)
[ 854.205516] mic1: Resetting (Post Code 3d)
[ 855.206646] mic1: Resetting (Post Code 3d)
[ 856.207791] mic1: Resetting (Post Code 3d)
[ 857.208920] mic1: Resetting (Post Code 3d)
[ 858.210011] mic1: Resetting (Post Code 3E)
[ 859.211252] mic1: Resetting (Post Code 3E)
[ 860.212399] mic1: Resetting (Post Code 3E)
[ 861.213490] mic1: Resetting (Post Code F2)
[ 861.213534] Reattempting reset after F2/F4 failure
[ 861.213540] mic1: Transition from state resetting to resetting
[ 863.215785] mic1: Resetting (Post Code 3C)
[ 864.216949] mic1: Resetting (Post Code 3d)
[ 865.218093] mic1: Resetting (Post Code 3d)
[ 866.219241] mic1: Resetting (Post Code 3d)
[ 867.220402] mic1: Resetting (Post Code 3d)
[ 868.221492] mic1: Resetting (Post Code 3E)
[ 869.222660] mic1: Resetting (Post Code 3E)
[ 870.223798] mic1: Resetting (Post Code 3E)
[ 871.224933] mic1: Resetting (Post Code 17)
# Goes on about 2 minutes
[ 1016.391028] mic1: Resetting (Post Code 17)
[ 1017.392186] mic1: Resetting (Post Code 09)
# again about 2 minutes
[ 1153.547936] mic1: Resetting (Post Code 09)
[ 1153.547947] mic1: Transition from state resetting to reset failed
[ 1153.547952] MIC 1 RESETFAIL postcode 09 14640
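For anyone who wants to condense a log like this, here is a minimal Python sketch that collapses the repeated "Resetting (Post Code XX)" lines into runs, so it is easier to see where the card lingers. It only assumes the line format shown in the excerpt above; the function name post_code_runs and the SAMPLE string are made up for illustration and are not part of MPSS.

```python
import re
from itertools import groupby

# Matches entries such as "[ 853.204412] mic1: Resetting (Post Code 3C)".
# The pattern is derived from the dmesg excerpt in this thread only.
POST_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(mic\d+): Resetting \(Post Code ([0-9A-Fa-f]{2})\)"
)


def post_code_runs(log_text, device="mic1"):
    """Return (post_code, first_seen, last_seen) for each consecutive run."""
    hits = [
        (float(stamp), code.upper())
        for stamp, dev, code in POST_RE.findall(log_text)
        if dev == device
    ]
    runs = []
    for code, group in groupby(hits, key=lambda hit: hit[1]):
        stamps = [stamp for stamp, _ in group]
        runs.append((code, stamps[0], stamps[-1]))
    return runs


# A few lines copied from the excerpt above (hypothetical test input).
SAMPLE = """
[ 853.204412] mic1: Resetting (Post Code 3C)
[ 854.205516] mic1: Resetting (Post Code 3d)
[ 855.206646] mic1: Resetting (Post Code 3d)
[ 858.210011] mic1: Resetting (Post Code 3E)
[ 861.213490] mic1: Resetting (Post Code F2)
[ 871.224933] mic1: Resetting (Post Code 17)
[ 1016.391028] mic1: Resetting (Post Code 17)
[ 1017.392186] mic1: Resetting (Post Code 09)
[ 1153.547936] mic1: Resetting (Post Code 09)
"""

if __name__ == "__main__":
    for code, first, last in post_code_runs(SAMPLE):
        # Duration is the span between first and last sighting of the code.
        print(f"post code {code}: {first:.3f} to {last:.3f} ({last - first:.1f} s)")
```

Feeding it the full dmesg output (for example the contents of dmesg.txt read into a string) should show the long stretches at post codes 17 and 09 at a glance.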
We tried running miccheck but the two problems seem to be related, i.e., the MPSS service won't start if the mic1 is not running (or is it the other way around?)
There seems to be a similar issue here: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/535257 - Sadly, the original author never replied. We tried the suggestions, but they only resolved the issue until the first reboot after successfully installing the MPSS and testing it (SSH, "Hello World",...); as before, the card stopped working afterwards.
We already tried troubleshooting the issue using the flow chart. Obviously our problem is described by the lower left bubble. We still attached the log files.
In addition we tried:
The hardware itself does not seem to be completely broken, since after a fresh install everything seems to work fine, but there clearly is an issue that we can't find by ourselves...
Is there anything more we could try?
Thanks in advance!
Thank you for your reply!
Yes, both Phis are actively cooled and worked flawlessly for about a year (maybe should have mentioned this earlier). The current condition was triggered by a dead SSD and a following OS reinstall (including going from 7.0 to 7.2).
In the beginning we performed some stress tests and both 31S1P never reached more than 70°C, mostly being sub 65°C. Currently mic0 is idling in "ready" at about 48°C with minimal fan speed.
When you exchanged the PCI slots (I assume you swapped cards), the cards should (my guess) have changed mic numbers. Did the problem stay with (the different) mic1, or did the problem stay with the physical card (IOW move to mic0)?
Note my read of your first post is that mic0 continues to run.
When you exchanged the PCI slots (I assume you swapped cards), the cards should (my guess) have changed mic numbers.
That's exactly what happened.
Did the problem stay with (the different) mic1, or did the problem stay with the physical card (IOW move to mic0)?
Note my read of your first post is that mic0 continues to run.
The problem stayed with the same physical card (verified via the device serial number).
The mic0 in my post (which was mic1 before the swap) is and has been working fine.
Do you mean "micflash"?
If we try to run it now, it throws an error about not being able to set the device to maintenance mode.
We ran it during the setup (when mic1 always "magically" gets to the ready state). At that point, mic1 was updated successfully to the most recent version and, after a reboot, we could connect to it via SSH and execute offloaded code...
The key issue is that after another reboot it fails to reset and thus never gets to ready mode (and thus also not to maintenance mode).
I would think that either:
a) it was coincidental that the issue appeared just after the O/S change and/or the SSD failure. Thinking back to when I had installation problems with my dual KNC 5110Ps, a couple of additional suggestions/questions come to mind.
a) You didn't mention this, but did you also change your BIOS version at the same time? If so, try reverting.
b) If you did not change the BIOS, then, at least on my motherboard, there is an option to reset the PCI bus scan table to un-initialized. IOW instruct the BIOS to perform the initial full PCI bus scan. This process generally takes longer to run, so the BIOS will remember the found configuration, then on subsequent boots, it looks for differences. Possibly when the SSD died this affected the "remember the found configuration".
Sorry, it took a while to test this. Thank you again!
a) We did a BIOS update as part of narrowing down the error. The error also shows up with the older version.
b) Sadly, we don't have this option. But hopefully, changing the BIOS version has the same effect.
FWIW my motherboard is an ASUS P9X79 WS. At the time I did the initial installation I found out that the (then) newest BIOS had issues with locating both 5110Ps. It ended up that I had to roll back two BIOS versions. IOW, do not assume the newest version is the best version.
I've talked to our system administrator and we'll likely try some older BIOS versions, too. Sadly, this can't be done before the end of September.
I'm still quite unsure whether this could be the issue, as the Phis ran without problems with the old version until recently.
Is there anything more we could do to test the Phi itself? Especially as most of the suspicious post codes are memory related...
P.S.: We are using a Supermicro X10SRA. Again something I should have mentioned, sorry!
I haven't seen the inside of the 31S1P to know if the internal RAM is socketed or surface mount. You might take the can off and examine the inside. If the RAM is socketed, then try removing it, examining the contacts, burnishing them with a pencil eraser, re-inserting, re-canning, and then trying again.
If that yields nothing, there may be an MPSS diagnostic that tests for faulty RAM, then maps around it. I haven't looked for such a utility, but it would seem necessary to have one.