I am seeing long-term stability issues (over a few days) with our Xeon Phi.
First of all, the card runs fine after a fresh reboot and we can (productively) run code on it, look at the code with VTune, perf, etc. Everything is fine during the week.
However, if left unused over the weekend, it crashes and requires a system shutdown to clear. Messages from dmesg:
[204074.058879] micscif_handle_lostnode 1380 node 1
[204074.071250] micscif_handle_lostnode 1389 node 1 ready for crash dump!
[204074.071254] mic0: Transition from state online to lost
[204124.200117] micvnet_execute_stop: timeout waiting for link down message response
[204159.193852] mic0: Transition from state lost to resetting
[204161.197228] mic0: Resetting (Post Code 3C)
[204162.196934] mic0: Resetting (Post Code 3d)
[204163.196640] mic0: Resetting (Post Code 3d)
[204164.196346] mic0: Resetting (Post Code 3d)
[204165.196054] mic0: Resetting (Post Code 3d)
[204166.195760] mic0: Resetting (Post Code 3E)
[204167.195468] mic0: Resetting (Post Code 3E)
[204168.195174] mic0: Resetting (Post Code 3E)
[204169.194881] mic0: Resetting (Post Code 09)
[204170.194588] mic0: Resetting (Post Code 09)
[204171.194294] mic0: Resetting (Post Code 12)
[204171.194298] mic0: Transition from state resetting to ready
[204345.785180] mic0: Transition from state ready to booting
[204345.785193] MIC 0 Booting
[204350.889584] Waiting for MIC 0 boot 5
[204355.888117] Waiting for MIC 0 boot 10
[204360.886635] Waiting for MIC 0 boot 15
[204360.886638] MIC 0 Network link is up
[204381.435474] mic0: Transition from state booting to online
[295897.240261] micscif_handle_lostnode 1380 node 1
[295897.273333] Warning: Core image elf header not found
[295897.273337] Kdump: vmcore not initialized
[295897.273341] micscif_handle_lostnode 1392 node 1 crash dump failed status -22
[295897.273357] mic0: Transition from state online to lost
[295897.273364] micscif_handle_lostnode 1407 stopping node 1 to recover lost node!
[295901.271071] micvnet_execute_stop: timeout waiting for link down message response
[295936.276809] dma_mark_wait 1080 TO chan 0x0
[295936.276815] drain_dma_intr 1151 err -16
[295941.275334] dma_mark_wait 1080 TO chan 0x0
[295941.275340] drain_dma_intr 1151 err -16
[295946.285873] dma_mark_wait 1080 TO chan 0x1
[295946.285879] drain_dma_intr 1151 err -16
[295951.296411] dma_mark_wait 1080 TO chan 0x2
[295951.296417] drain_dma_intr 1151 err -16
[295956.310921] dma_mark_wait 1080 TO chan 0x3
[295956.310927] drain_dma_intr 1151 err -16
[295961.325468] mic0: Transition from state lost to resetting
[295963.361505] mic0: Resetting (Post Code \xffffffff\xffffffff)
[295963.361517] mic0: Transition from state resetting to reset failed
[295963.361527] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[295963.361588] micscif_handle_lostnode 1458 booting node 1 to recover lost node!
[295963.361595] adapter_start_device 1354 state 8??
It looks like the card successfully reset, but then the boot failed.
Any suggestions?
Thank you!
Vladimir Dergachev
Did you recently update to the latest MPSS? (mpss_gold_update_3-2.1.6720-13, released May 9 2013) And update the flash and bootloader? If yes, had you seen the problem with the card rebooting after being idle for a day before you updated? If not, could you try updating?
Yes, I am using mpss_gold_update_3-2.1.6720-13. I did update the flash. Are there any particular steps to update the bootloader?
I have only run mpss_gold_update_3-2.1.6720-13 on this card. It is brand new, and we are exploring what kind of performance we can get for our CPU-intensive codes.
The card runs fine otherwise.
Yes, after updating flash, you should reboot the host and make sure the card is back in the ready state (or if it is not in the ready state, bring it to the ready state with "micctrl -rw".) Then:
if the card is NOT C0 stepping:
/opt/intel/mic/bin/micflash -update -smcbootloader -device all
if the card IS C0 stepping:
/opt/intel/mic/bin/micflash -update -device all
So if the card is C0, which you can tell by running /opt/intel/mic/bin/micinfo and looking for the line "Coprocessor Stepping : C0", you will be repeating the flash command you used before; otherwise, you will be using a different command.
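The choice between the two micflash commands above could be scripted roughly as follows. This is only a sketch: the function name `choose_flash_cmd` is made up for illustration, it assumes the micinfo output contains a line of the form "Coprocessor Stepping : C0" as quoted above, and it only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: pick the micflash invocation from the stepping reported by micinfo.
# Reads micinfo output on stdin and prints the matching command from the
# post above. It never runs micflash itself.
# Assumption: only the first "Coprocessor Stepping" line is inspected, so
# hosts with mixed steppings would need per-card handling.
choose_flash_cmd() {
    # Extract the value after the colon on the "Coprocessor Stepping" line.
    stepping=$(sed -n 's/^[[:space:]]*Coprocessor Stepping[[:space:]]*:[[:space:]]*\([^[:space:]]*\).*/\1/p' | head -n 1)
    case "$stepping" in
        C0)
            echo "/opt/intel/mic/bin/micflash -update -device all"
            ;;
        "")
            echo "stepping not found in micinfo output" >&2
            return 1
            ;;
        *)
            echo "/opt/intel/mic/bin/micflash -update -smcbootloader -device all"
            ;;
    esac
}
```

It would be used as `/opt/intel/mic/bin/micinfo | choose_flash_cmd`, after which the printed command can be reviewed and run by hand.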
Thanks! This is probably it: the old bootloader was version 1.7.4172, and it is now updating to version EXT_HP2_SMC_Bootloader_1_8_4326.css_ab.
I expect this should fix the failed reboot. Do you have any suggestions on why it became lost in the first place?
Others have seen their system stabilize after updating the bootloader (not just reboot properly.) Keep an eye on things and see if the problem recurs.
Unfortunately, the update to the new bootloader did not help: even though the card was stable during the week, over the weekend it became lost with the following messages in dmesg:
[416841.007040] micscif_handle_lostnode 1380 node 1
[416841.040224] Warning: Core image elf header not found
[416841.040226] Kdump: vmcore not initialized
[416841.040228] micscif_handle_lostnode 1392 node 1 crash dump failed status -22
[416841.040236] mic0: Transition from state online to lost
[416841.040242] micscif_handle_lostnode 1407 stopping node 1 to recover lost node!
[416845.037816] micvnet_execute_stop: timeout waiting for link down message response
[416880.043489] dma_mark_wait 1080 TO chan 0x0
[416880.043495] drain_dma_intr 1151 err -16
[416885.042031] dma_mark_wait 1080 TO chan 0x0
[416885.042037] drain_dma_intr 1151 err -16
[416890.052541] dma_mark_wait 1080 TO chan 0x1
[416890.052547] drain_dma_intr 1151 err -16
[416895.063052] dma_mark_wait 1080 TO chan 0x2
[416895.063058] drain_dma_intr 1151 err -16
[416900.077581] dma_mark_wait 1080 TO chan 0x3
[416900.077587] drain_dma_intr 1151 err -16
[416905.092105] mic0: Transition from state lost to resetting
[416907.128398] mic0: Resetting (Post Code \xffffffff\xffffffff)
[416907.128404] mic0: Transition from state resetting to reset failed
[416907.128415] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[416907.128463] micscif_handle_lostnode 1458 booting node 1 to recover lost node!
[416907.128470] adapter_start_device 1354 state 8??
Relevant portion of mpssd log:
Thu May 23 10:26:36 2013: mic0: State online -> shutdown
Thu May 23 10:26:59 2013: mic0: State shutdown -> resetting
Thu May 23 10:27:11 2013: mic0: State resetting -> ready
Thu May 23 10:27:59 2013: MPSS Daemon start
Thu May 23 10:27:59 2013: Configuration version 0.6
Thu May 23 10:27:59 2013: Overlay /opt/intel/mic/sep3.10 /opt/intel/mic/sep3.10/k1om/sep.filelist declaration style is deprecated
Thu May 23 10:27:59 2013: mic0: Command line: "quiet root=ramfs console=hvc0 highres=off clocksource=tsc cgroup_disable=memory micpm=cpufreq_on;corec6_off;pc3_on;pc6_on"
Thu May 23 10:27:59 2013: mic0: log_buf_addr: ffffffff839672d0
Thu May 23 10:27:59 2013: mic0: log_buf_len: ffffffff81724c70
Thu May 23 10:27:59 2013: mic0: Booting /lib/firmware/mic/uos.img
Thu May 23 10:27:59 2013: mic0: State ready -> booting
Thu May 23 10:28:01 2013: Wait for download requests
Thu May 23 10:28:14 2013: Configure node 0
Thu May 23 10:28:14 2013: mic0: Configure Connection
Thu May 23 10:28:19 2013: mic0: Set time of day
Thu May 23 10:28:19 2013: mic0: Transfer file system /opt/intel/mic/filesystem/mic0.image
Thu May 23 10:28:22 2013: mic0: Configuration Finished
Thu May 23 10:28:35 2013: mic0: State booting -> online
Sun May 26 05:27:55 2013: mic0: State online -> lost
Sun May 26 05:27:55 2013: mic0: open /proc/mic_vmcore/mic0 failed No such file or directory
Sun May 26 05:28:59 2013: mic0: State lost -> resetting
Sun May 26 05:29:01 2013: mic0: State resetting -> reset failed
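To see how often the card drops out, the "online -> lost" transitions can be pulled out of a log in this format. A minimal sketch, assuming lines shaped exactly like the excerpt above; the function name `list_lost_events` is hypothetical, and the log path is passed as an argument because its location is not given in this thread.

```shell
#!/bin/sh
# Sketch: list every "online -> lost" transition in an mpssd-style log.
# $1: path to the log file (assumption: supply whatever path your install uses).
# Prints one line per event as "<card> lost at <timestamp>".
list_lost_events() {
    # Capture the timestamp (everything before ": micN:") and the card name.
    sed -n 's/^\(.*\): \(mic[0-9][0-9]*\): State online -> lost$/\2 lost at \1/p' "$1"
}
```

Comparing the printed timestamps against when the card was last used may help confirm whether the failures follow long idle periods.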
Are you getting any "dazed and confused" error messages before the system goes down? (See forum topic 392967)
Thank you for the pointer!
No, the host system is nice and stable. This time around I also verified that the Xeon Phi can be brought back online by a host reboot (not a shutdown); however, resetting or shutting down the Xeon Phi itself does not bring it back online.
I think the problem you are seeing actually is the same as that other forum topic. The problem seems to be related to the coprocessor returning from one of the power-saving states. I have let the developers know that you are seeing this problem consistently. As a workaround, you could try disabling power management.
For each coprocessor, in the mic
PowerManagement "cpufreq_on;corec6_off;pc3_off;pc6_off"
Then
[bash]
service mpss stop
micctrl --resetconfig
service mpss start
[/bash]
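The configuration edit described above could be applied with a small helper like the one below. This is a sketch under assumptions: the function name `set_power_management` is invented here, the configuration file path is passed in as an argument because the post does not spell it out, and the file is assumed to hold at most one line starting with `PowerManagement`.

```shell
#!/bin/sh
# Sketch: set the PowerManagement line to the value suggested above
# (pc3 and pc6 switched off).
# $1: path to the coprocessor's configuration file (assumption: caller
#     supplies it; the path is not stated in this thread).
set_power_management() {
    conf=$1
    value='cpufreq_on;corec6_off;pc3_off;pc6_off'
    if grep -q '^PowerManagement[[:space:]]' "$conf"; then
        # Rewrite the existing line in place, keeping a .bak copy of the file.
        sed -i.bak "s/^PowerManagement[[:space:]].*/PowerManagement \"$value\"/" "$conf"
    else
        # No such line yet: append one.
        printf 'PowerManagement "%s"\n' "$value" >> "$conf"
    fi
}
```

After changing the file, the stop, `micctrl --resetconfig`, and start sequence shown above is still needed for the new setting to take effect.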
The lack of NMIs (and hence the lovely "dazed and confused" messages) may be related to host firmware differences - what sort of system are you using as a host (ours are IBM dx360 M4s in an iDataplex cluster)?
All the best,
Chris
Another lost card overnight. Here is the dmesg log from card startup to when I came back (~24 hours total):
[ 327.013600] vnet: mode: dma, buffers: 62
[ 327.013746] mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
[ 327.013755] mic 0000:03:00.0: setting latency timer to 64
[ 327.013763] mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
[ 327.013874] mic 0000:03:00.0: irq 127 for MSI/MSI-X
[ 327.013914] mic0: Transition from state ready to resetting
[ 336.932327] sched: RT throttling activated
[ 336.934157] mic_probe 3:0:0 as board #0
[ 336.934285] mic: number of devices detected 1
[ 337.931932] mic0: Resetting (Post Code 12)
[ 337.931939] mic0: Transition from state resetting to ready
[ 337.931979] My Phys addrs: 0x883d1a0000 and scif_addr 0x88467fd5c0
[ 339.276852] mic0: Transition from state ready to booting
[ 339.276863] MIC 0 Booting
[ 344.374037] Waiting for MIC 0 boot 5
[ 347.337168] mic0: no IPv6 routers present
[ 349.372569] Waiting for MIC 0 boot 10
[ 354.371103] Waiting for MIC 0 boot 15
[ 359.369635] Waiting for MIC 0 boot 20
[ 364.368169] Waiting for MIC 0 boot 25
[ 366.367572] MIC 0 Network link is up
[ 386.507455] mic0: Transition from state booting to online
[35518.725307] micscif_handle_lostnode 1380 node 1
[35518.758922] Warning: Core image elf header not found
[35518.758925] Kdump: vmcore not initialized
[35518.758929] micscif_handle_lostnode 1392 node 1 crash dump failed status -22
[35518.758963] mic0: Transition from state online to lost
[35518.758970] micscif_handle_lostnode 1407 stopping node 1 to recover lost node!
[35522.756086] micvnet_execute_stop: timeout waiting for link down message response
[35557.761827] dma_mark_wait 1080 TO chan 0x0
[35557.761833] drain_dma_intr 1151 err -16
[35562.760365] dma_mark_wait 1080 TO chan 0x0
[35562.760371] drain_dma_intr 1151 err -16
[35567.770911] dma_mark_wait 1080 TO chan 0x1
[35567.770917] drain_dma_intr 1151 err -16
[35572.785412] dma_mark_wait 1080 TO chan 0x2
[35572.785418] drain_dma_intr 1151 err -16
[35577.795961] dma_mark_wait 1080 TO chan 0x3
[35577.795967] drain_dma_intr 1151 err -16
[35582.810482] mic0: Transition from state lost to resetting
[35584.847105] mic0: Resetting (Post Code \xffffffff\xffffffff)
[35584.847111] mic0: Transition from state resetting to reset failed
[35584.847127] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[35584.847176] micscif_handle_lostnode 1458 booting node 1 to recover lost node!
[35584.847183] adapter_start_device 1354 state 8??
This is a SuperMicro server from SabrePC: 2x Intel Xeon E5-2690, 64 GB RAM, and a single Xeon Phi 5110P. The server can handle up to 4, but we are first testing how much of a speedup we can get. The OS is Ubuntu 12.04 with kernel 3.2.0, which might account for the lack of "dazed and confused" messages. We use Ubuntu mainly because this is the OS installed on other computers here. It would be nice if there were some Debian (or Ubuntu) mpss packages in the future.
I'll try the powersave suggestion, but first I want to see whether it becomes lost again next night. It used to be that this only happened on weekends, with a long time between uses.
Another piece of information that might be helpful is that I was using the VTune Amplifier tool (i.e. the performance profiler) to track bottlenecks in our code. Perhaps its collection module interferes somehow.
I just found out something very useful: to bring the card back from the "reset failed" state, which cannot be cleared with micctrl alone, one just needs to write 1 to the appropriate /sys/bus/pci/devices/xxx/reset file ("echo 1 > reset" from within the device directory) and then reset as usual.
I attach the log of this, which shows some PCI registers being restored; this might be a clue to allow the driver to bring the card back from the reset failed state automatically.
[35582.810482] mic0: Transition from state lost to resetting
[35584.847105] mic0: Resetting (Post Code \xffffffff\xffffffff)
[35584.847111] mic0: Transition from state resetting to reset failed
[35584.847127] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[35584.847176] micscif_handle_lostnode 1458 booting node 1 to recover lost node!
[35584.847183] adapter_start_device 1354 state 8??
[86134.641320] mic0: Transition from state reset failed to resetting
[86136.678120] mic0: Resetting (Post Code \xffffffff\xffffffff)
[86136.678126] mic0: Transition from state resetting to reset failed
[86136.678140] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[86352.529746] mic 0000:03:00.0: restoring config space at offset 0xf (was 0x100, writing 0x10b)
[86352.529762] mic 0000:03:00.0: restoring config space at offset 0x8 (was 0x4, writing 0xdfa00004)
[86352.529770] mic 0000:03:00.0: restoring config space at offset 0x5 (was 0x0, writing 0x3c0c)
[86352.529777] mic 0000:03:00.0: restoring config space at offset 0x3 (was 0x0, writing 0x10)
[86352.529784] mic 0000:03:00.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100407)
root@ypsilon1:/sys/bus/pci/devices/0000:03:00.0# micctrl -r
mic0: resetting
root@ypsilon1:/sys/bus/pci/devices/0000:03:00.0# micctrl -w
mic0: ready
root@ypsilon1:/sys/bus/pci/devices/0000:03:00.0# /etc/init.d/mpss restart
* Restarting Start MPSS stack processing mpss Stopping MPSS Stack
Starting MPSS Stack
ifup: interface mic0 already configured
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
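Put together, the recovery sequence from this post looks roughly like the sketch below. The helper names `run` and `recover_mic` and the `DRY_RUN` switch are illustrative additions, not part of MPSS; the commands themselves are the ones shown in the session above, and the PCI address must be replaced with that of your own card.

```shell
#!/bin/sh
# Sketch: recover a card stuck in "reset failed" using the steps from the post.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: PCI address of the coprocessor, e.g. 0000:03:00.0 as in the log above.
recover_mic() {
    reset_file="/sys/bus/pci/devices/$1/reset"
    # 1. PCI-level reset through sysfs (needs root).
    run sh -c "echo 1 > '$reset_file'" &&
    # 2. Reset the card and wait for it to reach the ready state.
    run micctrl -r &&
    run micctrl -w &&
    # 3. Restart the MPSS stack so the card boots again.
    run /etc/init.d/mpss restart
}
```

Running `DRY_RUN=1; recover_mic 0000:03:00.0` first shows the four commands without touching the hardware.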
How has the stability been? More? Less? The same? Did you change the power settings?
(Looking back over the thread, I realized that there was confusion over the "dazed and confused" messages. Had you seen them, they would have been coming from the kernel on the coprocessor, not on the host. Not that it is relevant at this point but I thought I would just say.)
Thanks for asking!
I left the system as is until last Friday, which resulted in two more lockups.
On Friday I changed the power settings to what you suggested, and so far there have not been any lockups, including a long period over the weekend. I'll post another update once a week passes.
Dazed and confused: I only saw the coprocessor in the lockup state.
Frances Roth (Intel) wrote:
(Looking back over the thread, I realized that there was confusion over the "dazed and confused" messages. Had you seen them, they would have been coming from the kernel on the coprocessor, not on the host. Not that it is relevant at this point but I thought I would just say.)
Actually on our systems the "Dazed and confused" messages are on the host (where the kernel gets them before the IBM firmware resets the host system because of the NMI), not on the cards.
All the best!
Chris
So far the card has been rock-solid, with no hangs.
So the new power settings appear to have fixed it. Thank you very much!
Vladimir Dergachev