
Xeon Phi stability

Vladimir_Dergachev

I am seeing long-term (a few days) stability issues with our Xeon Phi.

First of all, the card runs fine after a fresh reboot, and we can (productively) run code on it, look at the code with VTune, perf, etc. Everything is fine during the week.

However, if left unused over the weekend, it crashes, and a system shutdown is required to clear it. Messages from dmesg:

[204074.058879] micscif_handle_lostnode 1380 node 1
[204074.071250] micscif_handle_lostnode 1389 node 1 ready for crash dump!
[204074.071254] mic0: Transition from state online to lost
[204124.200117] micvnet_execute_stop: timeout waiting for link down message response
[204159.193852] mic0: Transition from state lost to resetting
[204161.197228] mic0: Resetting (Post Code 3C)
[204162.196934] mic0: Resetting (Post Code 3d)
[204163.196640] mic0: Resetting (Post Code 3d)
[204164.196346] mic0: Resetting (Post Code 3d)
[204165.196054] mic0: Resetting (Post Code 3d)
[204166.195760] mic0: Resetting (Post Code 3E)
[204167.195468] mic0: Resetting (Post Code 3E)
[204168.195174] mic0: Resetting (Post Code 3E)
[204169.194881] mic0: Resetting (Post Code 09)
[204170.194588] mic0: Resetting (Post Code 09)
[204171.194294] mic0: Resetting (Post Code 12)
[204171.194298] mic0: Transition from state resetting to ready
[204345.785180] mic0: Transition from state ready to booting
[204345.785193] MIC 0 Booting
[204350.889584] Waiting for MIC 0 boot 5
[204355.888117] Waiting for MIC 0 boot 10
[204360.886635] Waiting for MIC 0 boot 15
[204360.886638] MIC 0 Network link is up
[204381.435474] mic0: Transition from state booting to online
[295897.240261] micscif_handle_lostnode 1380 node 1
[295897.273333] Warning: Core image elf header not found
[295897.273337] Kdump: vmcore not initialized
[295897.273341] micscif_handle_lostnode 1392 node 1 crash dump failed status -22
[295897.273357] mic0: Transition from state online to lost
[295897.273364] micscif_handle_lostnode 1407 stopping node 1 to recover lost node!
[295901.271071] micvnet_execute_stop: timeout waiting for link down message response
[295936.276809] dma_mark_wait 1080 TO chan 0x0
[295936.276815] drain_dma_intr 1151 err -16
[295941.275334] dma_mark_wait 1080 TO chan 0x0
[295941.275340] drain_dma_intr 1151 err -16
[295946.285873] dma_mark_wait 1080 TO chan 0x1
[295946.285879] drain_dma_intr 1151 err -16
[295951.296411] dma_mark_wait 1080 TO chan 0x2
[295951.296417] drain_dma_intr 1151 err -16
[295956.310921] dma_mark_wait 1080 TO chan 0x3
[295956.310927] drain_dma_intr 1151 err -16
[295961.325468] mic0: Transition from state lost to resetting
[295963.361505] mic0: Resetting (Post Code \xffffffff\xffffffff)
[295963.361517] mic0: Transition from state resetting to reset failed
[295963.361527] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[295963.361588] micscif_handle_lostnode 1458 booting node 1 to recover lost node!
[295963.361595] adapter_start_device 1354 state 8??

It looks like the card successfully reset, but then the boot failed.
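
To make a log like the one above easier to scan, the state changes can be pulled out on their own. This is only a minimal sketch: it assumes the driver keeps the "mic0: Transition from state X to Y" wording shown in this post, and the function name mic_transitions is hypothetical.

```shell
# Print one line per state change: "<timestamp> <card> <old state> -> <new state>".
# Reads dmesg-style text on stdin. Assumes the message wording seen in this thread.
mic_transitions() {
    sed -n 's/^\[ *\([0-9.]*\)] \(mic[0-9]*\): Transition from state \(.*\) to \(.*\)$/\1 \2 \3 -> \4/p'
}
```

Typical use would be `dmesg | mic_transitions`, which reduces the output to the online, lost, resetting and reset failed steps.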

Any suggestions?

Thank you!

Vladimir Dergachev

Frances_R_Intel
Employee

Did you recently update to the latest MPSS? (mpss_gold_update_3-2.1.6720-13, released May 9 2013) And update the flash and bootloader? If yes, had you seen the problem with the card rebooting after being idle for a day before you updated? If not, could you try updating? 

Vladimir_Dergachev

Yes, I am using mpss_gold_update_3-2.1.6720-13. I did update the flash. Are there any particular steps to update the bootloader?

I have only run mpss_gold_update_3-2.1.6720-13 on this card; it is brand new. We are exploring what kind of performance we can get for our CPU-intensive codes.

The card runs fine otherwise.

Frances_R_Intel
Employee

Yes, after updating flash, you should reboot the host and make sure the card is back in the ready state (or if it is not in the ready state, bring it to the ready state with "micctrl -rw".) Then:

if the card is NOT C0 stepping:

/opt/intel/mic/bin/micflash -update -smcbootloader -device all

if the card IS C0 stepping:

/opt/intel/mic/bin/micflash -update -device all

So if the card is C0, which you can tell by running /opt/intel/mic/bin/micinfo and looking for the line "Coprocessor Stepping     : C0", you will be repeating the flash command you used before; otherwise, you will be using a different command. 
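
The choice between the two commands can be sketched as a small helper that reads micinfo output and prints the matching micflash command from this post. It only prints the command and flashes nothing. The function name suggest_micflash is hypothetical, and the sketch assumes the stepping line looks like the "Coprocessor Stepping     : C0" example quoted above.

```shell
# Read `micinfo` output on stdin and print the micflash command suggested
# in this thread for the detected stepping. The command is printed, not run.
suggest_micflash() {
    # The colon must directly follow "Stepping" (after optional spaces), so
    # fields whose names merely start with "Coprocessor Stepping" are ignored.
    stepping=$(sed -n 's/^[[:space:]]*Coprocessor Stepping[[:space:]]*:[[:space:]]*\([A-Za-z0-9]*\).*$/\1/p' | head -n 1)
    if [ -z "$stepping" ]; then
        echo "could not find a 'Coprocessor Stepping' line" >&2
        return 1
    fi
    if [ "$stepping" = "C0" ]; then
        echo "/opt/intel/mic/bin/micflash -update -device all"
    else
        echo "/opt/intel/mic/bin/micflash -update -smcbootloader -device all"
    fi
}
```

It could be used as `/opt/intel/mic/bin/micinfo | suggest_micflash`, with the printed command reviewed before it is run by hand.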

Vladimir_Dergachev

Thanks! This is probably it: the old bootloader was version 1.7.4172, and it is now being updated to version EXT_HP2_SMC_Bootloader_1_8_4326.css_ab.

I expect this should fix the failed reboot. Do you have any suggestions as to why it became lost in the first place?

Frances_R_Intel
Employee

Others have seen their system stabilize after updating the bootloader (not just reboot properly.) Keep an eye on things and see if the problem recurs.

Vladimir_Dergachev

Unfortunately, the update to the new bootloader did not help. Although the card was stable during the week, over the weekend it became lost, with the following messages in dmesg:

[416841.007040] micscif_handle_lostnode 1380 node 1
[416841.040224] Warning: Core image elf header not found
[416841.040226] Kdump: vmcore not initialized
[416841.040228] micscif_handle_lostnode 1392 node 1 crash dump failed status -22
[416841.040236] mic0: Transition from state online to lost
[416841.040242] micscif_handle_lostnode 1407 stopping node 1 to recover lost node!
[416845.037816] micvnet_execute_stop: timeout waiting for link down message response
[416880.043489] dma_mark_wait 1080 TO chan 0x0
[416880.043495] drain_dma_intr 1151 err -16
[416885.042031] dma_mark_wait 1080 TO chan 0x0
[416885.042037] drain_dma_intr 1151 err -16
[416890.052541] dma_mark_wait 1080 TO chan 0x1
[416890.052547] drain_dma_intr 1151 err -16
[416895.063052] dma_mark_wait 1080 TO chan 0x2
[416895.063058] drain_dma_intr 1151 err -16
[416900.077581] dma_mark_wait 1080 TO chan 0x3
[416900.077587] drain_dma_intr 1151 err -16
[416905.092105] mic0: Transition from state lost to resetting
[416907.128398] mic0: Resetting (Post Code \xffffffff\xffffffff)
[416907.128404] mic0: Transition from state resetting to reset failed
[416907.128415] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[416907.128463] micscif_handle_lostnode 1458 booting node 1 to recover lost node!
[416907.128470] adapter_start_device 1354 state 8??

Vladimir_Dergachev

Relevant portion of mpssd log:

Thu May 23 10:26:36 2013: mic0: State online -> shutdown
Thu May 23 10:26:59 2013: mic0: State shutdown -> resetting
Thu May 23 10:27:11 2013: mic0: State resetting -> ready
Thu May 23 10:27:59 2013: MPSS Daemon start
Thu May 23 10:27:59 2013: Configuration version 0.6
Thu May 23 10:27:59 2013: Overlay /opt/intel/mic/sep3.10 /opt/intel/mic/sep3.10/k1om/sep.filelist declaration style is deprecated
Thu May 23 10:27:59 2013: mic0: Command line: "quiet root=ramfs console=hvc0 highres=off clocksource=tsc cgroup_disable=memory micpm=cpufreq_on;corec6_off;pc3_on;pc6_on"
Thu May 23 10:27:59 2013: mic0: log_buf_addr: ffffffff839672d0
Thu May 23 10:27:59 2013: mic0: log_buf_len: ffffffff81724c70
Thu May 23 10:27:59 2013: mic0: Booting /lib/firmware/mic/uos.img
Thu May 23 10:27:59 2013: mic0: State ready -> booting
Thu May 23 10:28:01 2013: Wait for download requests
Thu May 23 10:28:14 2013: Configure node 0
Thu May 23 10:28:14 2013: mic0: Configure Connection
Thu May 23 10:28:19 2013: mic0: Set time of day
Thu May 23 10:28:19 2013: mic0: Transfer file system /opt/intel/mic/filesystem/mic0.image
Thu May 23 10:28:22 2013: mic0: Configuration Finished
Thu May 23 10:28:35 2013: mic0: State booting -> online
Sun May 26 05:27:55 2013: mic0: State online -> lost
Sun May 26 05:27:55 2013: mic0: open /proc/mic_vmcore/mic0 failed No such file or directory
Sun May 26 05:28:59 2013: mic0: State lost -> resetting
Sun May 26 05:29:01 2013: mic0: State resetting -> reset failed

Frances_R_Intel
Employee

Are you getting any "dazed and confused" error messages before the system goes down? (See forum topic 392967)

Vladimir_Dergachev

Thank you for the pointer!

No, the host system is nice and stable. This time around I confirmed that the Xeon Phi can be brought back online by a host reboot (not a shutdown). However, resetting or shutting down the Xeon Phi itself does not bring it back online.

Frances_R_Intel
Employee

I think the problem you are seeing actually is the same as that other forum topic. The problem seems to be related to the coprocessor returning from one of the power-saving states. I have let the developers know that you are seeing this problem consistently. As a workaround, you could try disabling power management.

For each coprocessor, in the micn.conf file (where n is the coprocessor number, e.g. mic0.conf), change the PowerManagement entry to:

PowerManagement "cpufreq_on;corec6_off;pc3_off;pc6_off"

Then

[bash]
service mpss stop
micctrl --resetconfig
service mpss start
[/bash]
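
The config edit itself could be scripted along these lines. This is a sketch under assumptions: the function name set_mic_power_management is hypothetical, and the location of the per-card config file (for example /etc/sysconfig/mic/mic0.conf) is an assumption that should be checked against the local MPSS install. The function only rewrites an existing PowerManagement line and keeps a .bak copy; if the file has no such line, it is left unchanged.

```shell
# Rewrite the PowerManagement entry in one coprocessor config file.
# $1: path to the config file (e.g. /etc/sysconfig/mic/mic0.conf; the path is an assumption)
# $2: new value; defaults to the setting suggested in this thread
set_mic_power_management() {
    conf=$1
    value=${2:-"cpufreq_on;corec6_off;pc3_off;pc6_off"}
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
    # Keep a backup, then write the edited copy over the original file.
    cp "$conf" "$conf.bak" || return 1
    sed "s/^PowerManagement[[:space:]].*\$/PowerManagement \"$value\"/" "$conf.bak" > "$conf"
}
```

After running it for each card, the stop, resetconfig and start sequence from this post still has to be run for the change to take effect.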

Chris_Samuel
Beginner

The lack of NMIs (and hence the lovely "dazed and confused" messages) may be related to host firmware differences - what sort of system are you using as a host (ours are IBM dx360 M4s in an iDataplex cluster)?

All the best,
Chris

Vladimir_Dergachev

Another lost card overnight. Here is the dmesg log from card startup to when I came back (~24 hours total):

[  327.013600] vnet: mode: dma, buffers: 62
[  327.013746] mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
[  327.013755] mic 0000:03:00.0: setting latency timer to 64
[  327.013763] mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
[  327.013874] mic 0000:03:00.0: irq 127 for MSI/MSI-X
[  327.013914] mic0: Transition from state ready to resetting
[  336.932327] sched: RT throttling activated
[  336.934157] mic_probe 3:0:0 as board #0
[  336.934285] mic: number of devices detected 1
[  337.931932] mic0: Resetting (Post Code 12)
[  337.931939] mic0: Transition from state resetting to ready
[  337.931979] My Phys addrs: 0x883d1a0000 and scif_addr 0x88467fd5c0
[  339.276852] mic0: Transition from state ready to booting
[  339.276863] MIC 0 Booting
[  344.374037] Waiting for MIC 0 boot 5
[  347.337168] mic0: no IPv6 routers present
[  349.372569] Waiting for MIC 0 boot 10
[  354.371103] Waiting for MIC 0 boot 15
[  359.369635] Waiting for MIC 0 boot 20
[  364.368169] Waiting for MIC 0 boot 25
[  366.367572] MIC 0 Network link is up
[  386.507455] mic0: Transition from state booting to online
[35518.725307] micscif_handle_lostnode 1380 node 1
[35518.758922] Warning: Core image elf header not found
[35518.758925] Kdump: vmcore not initialized
[35518.758929] micscif_handle_lostnode 1392 node 1 crash dump failed status -22
[35518.758963] mic0: Transition from state online to lost
[35518.758970] micscif_handle_lostnode 1407 stopping node 1 to recover lost node!
[35522.756086] micvnet_execute_stop: timeout waiting for link down message response
[35557.761827] dma_mark_wait 1080 TO chan 0x0
[35557.761833] drain_dma_intr 1151 err -16
[35562.760365] dma_mark_wait 1080 TO chan 0x0
[35562.760371] drain_dma_intr 1151 err -16
[35567.770911] dma_mark_wait 1080 TO chan 0x1
[35567.770917] drain_dma_intr 1151 err -16
[35572.785412] dma_mark_wait 1080 TO chan 0x2
[35572.785418] drain_dma_intr 1151 err -16
[35577.795961] dma_mark_wait 1080 TO chan 0x3
[35577.795967] drain_dma_intr 1151 err -16
[35582.810482] mic0: Transition from state lost to resetting
[35584.847105] mic0: Resetting (Post Code \xffffffff\xffffffff)
[35584.847111] mic0: Transition from state resetting to reset failed
[35584.847127] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[35584.847176] micscif_handle_lostnode 1458 booting node 1 to recover lost node!
[35584.847183] adapter_start_device 1354 state 8??

Vladimir_Dergachev

This is a SuperMicro server from SabrePC: 2x Intel Xeon E5-2690, 64 GB RAM, and a single Xeon Phi 5110P. The server can handle up to 4 cards, but we are first testing how much of a speedup we can get. The OS is Ubuntu 12.04 with kernel 3.2.0, which might account for the lack of "dazed and confused" messages. We use Ubuntu mainly because it is the OS installed on the other computers here. It would be nice if there were some Debian (or Ubuntu) MPSS packages in the future.

I'll try the power management suggestion, but first I want to see whether the card becomes lost again tonight. Previously this only happened on weekends, with a long time between uses.

Another piece of information that might be helpful is that I was using the VTune Amplifier performance profiler to track bottlenecks in our code. Perhaps its collection module interferes somehow.

Vladimir_Dergachev

I just found out something very useful. To bring the card back from the "reset failed" state, which cannot be cleared with micctrl alone, one just needs to write 1 to the reset file of the card's PCI device (cd to the appropriate /sys/bus/pci/devices/xxx directory and run "echo 1 > reset") and then reset as usual.

I attach the log of this, which shows some PCI registers being restored - this might be a clue to allow the driver to bring the card back from the reset failed state automatically.

[35582.810482] mic0: Transition from state lost to resetting
[35584.847105] mic0: Resetting (Post Code \xffffffff\xffffffff)
[35584.847111] mic0: Transition from state resetting to reset failed
[35584.847127] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[35584.847176] micscif_handle_lostnode 1458 booting node 1 to recover lost node!
[35584.847183] adapter_start_device 1354 state 8??
[86134.641320] mic0: Transition from state reset failed to resetting
[86136.678120] mic0: Resetting (Post Code \xffffffff\xffffffff)
[86136.678126] mic0: Transition from state resetting to reset failed
[86136.678140] MIC 0 RESETFAIL postcode \xffffffff\xffffffff -1
[86352.529746] mic 0000:03:00.0: restoring config space at offset 0xf (was 0x100, writing 0x10b)
[86352.529762] mic 0000:03:00.0: restoring config space at offset 0x8 (was 0x4, writing 0xdfa00004)
[86352.529770] mic 0000:03:00.0: restoring config space at offset 0x5 (was 0x0, writing 0x3c0c)
[86352.529777] mic 0000:03:00.0: restoring config space at offset 0x3 (was 0x0, writing 0x10)
[86352.529784] mic 0000:03:00.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100407)
root@ypsilon1:/sys/bus/pci/devices/0000:03:00.0# micctrl -r
mic0: resetting
root@ypsilon1:/sys/bus/pci/devices/0000:03:00.0# micctrl -w
mic0: ready
root@ypsilon1:/sys/bus/pci/devices/0000:03:00.0# /etc/init.d/mpss restart
 * Restarting Start MPSS stack processing mpss
Stopping MPSS Stack
Starting MPSS Stack
ifup: interface mic0 already configured
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
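
The sysfs step can be wrapped in a small helper with a basic safety check. This is a minimal sketch, not a tested recovery tool: the function name mic_pci_reset is hypothetical, and the second argument (an alternative sysfs root) exists only so the function can be tried against a scratch directory.

```shell
# Trigger a PCI function reset through sysfs for a card stuck in "reset failed".
# $1: PCI address of the card (0000:03:00.0 in the log above)
# $2: sysfs root; defaults to /sys (overridable for a dry run)
mic_pci_reset() {
    addr=$1
    sysfs=${2:-/sys}
    reset_file="$sysfs/bus/pci/devices/$addr/reset"
    if [ ! -w "$reset_file" ]; then
        echo "cannot write $reset_file (wrong address, or not root?)" >&2
        return 1
    fi
    echo 1 > "$reset_file"
}
```

On the system in this thread that would be `mic_pci_reset 0000:03:00.0` as root, followed by the `micctrl -r`, `micctrl -w` and `/etc/init.d/mpss restart` steps shown in the session above.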

Frances_R_Intel
Employee

How has the stability been? More? Less? The same? Did you change the power settings?

(Looking back over the thread, I realized that there was confusion over the "dazed and confused" messages. Had you seen them, they would have been coming from the kernel on the coprocessor, not on the host. Not that it is relevant at this point but I thought I would just say.)

Vladimir_Dergachev

Thanks for asking!

I left the system as is until last Friday, which resulted in two more lockups.

On Friday I changed the power settings to what you suggested and so far there have not been any lockups, including a long period over the weekend. I'll post another update once a week passes.

As for "dazed and confused": I never saw those messages; I only saw the coprocessor in the lockup state.

Chris_Samuel
Beginner

Frances Roth (Intel) wrote:

(Looking back over the thread, I realized that there was confusion over the "dazed and confused" messages. Had you seen them, they would have been coming from the kernel on the coprocessor, not on the host. Not that it is relevant at this point but I thought I would just say.)

Actually on our systems the "Dazed and confused" messages are on the host (where the kernel gets them before the IBM firmware resets the host system because of the NMI), not on the cards.

All the best!
Chris

Vladimir_Dergachev

So far the card has been rock-solid, with no hangs.

So the new power settings have definitely fixed it. Thank you very much!

Vladimir Dergachev
