Hi!
I am just getting started with Xeon Phi and MPI. After a lot of trouble, yesterday I managed to run MPI programs in symmetric mode (on both host and coprocessor) as well as directly on mic0 after ssh.
But today I have trouble again.
MPSS starts correctly, but 3 minutes later the status of mic0 becomes "lost" and I cannot reset it.
# micctrl -s
mic0: lost
# service mpss status
mpss is running
# micctrl -rw
mic0: resetting
mic0: reset failed
Even when everything seems to run fine, I am not able to reboot or reset the card with micctrl. I have never succeeded in resetting or rebooting mic0 without rebooting my computer.
My hardware:
- motherboard: ASUS P9 X79E-WS
- processor: Xeon E5 v2
- coprocessor: Xeon Phi 7120P (with a fan)
My OS: CentOS 6.5
Part of the micinfo output:
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
Thanks in advance.
Moreover, when I shut down my computer the Xeon Phi does not stop, and I have to unplug the machine.
After 2 days without running, here is part of the dmesg output from this morning:
mic0: Transition from state ready to booting
mic image: /usr/share/mpss/boot/bzImage-knightscorner
MIC 0 Booting
Waiting for MIC 0 boot 5
Waiting for MIC 0 boot 10
Waiting for MIC 0 boot 15
MIC 0 Network link is up
mic0: Transition from state booting to online
micscif_handle_lostnode 1445 node 1
Warning: Core image elf header not found
Kdump: vmcore not initialized
micscif_handle_lostnode 1457 node 1 crash dump failed status -22
mic0: Transition from state online to lost
micscif_handle_lostnode 1472 stopping node 1 to recover lost node!
dma_mark_wait 1080 TO chan 0x0
drain_dma_intr 1151 err -16
dma_mark_wait 1080 TO chan 0x0
drain_dma_intr 1151 err -16
dma_mark_wait 1080 TO chan 0x1
drain_dma_intr 1151 err -16
dma_mark_wait 1080 TO chan 0x2
drain_dma_intr 1151 err -16
dma_mark_wait 1080 TO chan 0x3
drain_dma_intr 1151 err -16
mic0: Transition from state lost to resetting
mic0: Resetting (Post Code <FF><FF>)
mic0: Transition from state resetting to reset failed
MIC 0 RESETFAIL postcode <FF><FF> -1
micscif_handle_lostnode 1523 booting node 1 to recover lost node!
adapter_start_device 1379 state 8??
And the Xeon Phi got pretty hot while booting!
Now that the status is "reset failed", the temperature has dropped considerably.
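For anyone reading a long dmesg dump like the one above, it can help to pull out just the card's state transitions. The following is only a sketch: the function name `mic_transitions` is made up for illustration, and the pattern assumes the "Transition from state" wording seen in this log, with one kernel message per line.

```shell
#!/bin/sh
# Hypothetical helper: print only the mic state transitions found in
# kernel log text read from stdin (e.g. "ready to booting").
mic_transitions() {
    # -n suppresses normal output; the p flag prints only lines where the
    # substitution matched, keeping the text after "Transition from state ".
    sed -n 's/.*Transition from state \(.*\)$/\1/p'
}

# Typical use on the host (may need root to read the kernel log):
#   dmesg | mic_transitions
```

Applied to the log above, this would list the sequence from "ready to booting" through "resetting to reset failed", which makes the online-to-lost step easy to spot.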
While I share the dmesg output with our hardware experts, there is some more data (directly from the coprocessor) that might help.
First, turn off the watchdog:
echo 0 > /sys/class/mic/scif/watchdog_enabled
Then use the following steps to show the micro-OS kernel log buffer.
Mount debugfs on the host:
mount -t debugfs none /sys/kernel/debug
Dump the buffer:
cat /sys/kernel/debug/mic_debug/mic0/log_buf > <some file of your choice> (shows the contents of the buffer up to now)
sudo tail -f /sys/kernel/debug/mic_debug/mic0/log_buf | tee -a <some file of your choice> (collects any recent and new data as things run; also writes the contents to STDOUT)
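The dump step above could be wrapped in a small helper so that a missing or unreadable log buffer is reported instead of silently producing an empty file. This is a sketch only: the function name `capture_mic_log` and the default output file name are assumptions for illustration, and the default source path is the mic0 path quoted above.

```shell
#!/bin/sh
# Hypothetical helper: copy the coprocessor's kernel log buffer to a file.
#   $1: log buffer path (defaults to the mic0 path quoted in this thread)
#   $2: output file (the default name is an assumption)
capture_mic_log() {
    src=${1:-/sys/kernel/debug/mic_debug/mic0/log_buf}
    out=${2:-mic0_log_buf.txt}
    if [ ! -r "$src" ]; then
        # Usually means debugfs is not mounted or the caller is not root.
        echo "cannot read $src (is debugfs mounted, and are you root?)" >&2
        return 1
    fi
    cat "$src" > "$out"
}

# Typical use on the host, after mounting debugfs as described above:
#   capture_mic_log
```

Passing the paths as arguments keeps the helper usable for a second card or a different output location.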
Have you verified that all connections to power and fans are working properly? Additionally, can you share how you obtained this coprocessor (was it through an OEM or another source)?
Thanks
Also, can you confirm that you turned power management off, so we can be sure this isn't a recurrence of a known issue? And could you use micsmc to find out what temperature the coprocessor is running at? You can use micsmc without options, which will bring it up in GUI mode; this will let you look at a number of things in addition to temperature, such as cpu usage and memory, and watch how things change over time. Or you can use the -t option ("micsmc -t") which will write out just the temperature information in command line mode.
Frances
Actually, let's back up a bit. I went back to the original start of this discussion in http://software.intel.com/en-us/comment/1784271#comment-1784271 and noticed that you said you didn't flash the coprocessor when you installed the system. This could actually be the cause of both the instability and the overheating. There was a similar problem last summer (https://software.intel.com/en-us/forums/topic/402337) where the system wouldn't stay up and the coprocessor was running hot.
To tell what level of flash is installed, you will need to boot the coprocessor, then, from the host, use micinfo. The flash version will be right at the top of the output. And could you tell me what happened when you tried to flash the card?
Frances
Sorry, I spoke too quickly. Belinda pointed me to the micdebug info you had sent. You are running the correct version of Flash.
Edit -
Further apologies on my part: I came in in the middle of this thread and didn't carefully read what came before. The information from micdebug_20140403_062958utc that you posted earlier showed that you were running:
[bash]
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
[/bash]
However, at the top of this thread, you say that micinfo shows that you are running an earlier version:
[bash]
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
[/bash]
I don't know quite what happened here. I probably missed something on my rereading. Are you truly running with Flash 2.1.02.0386 and Firmware 1.14.4616? Can you get back to 2.1.02.0390 and 1.16.5078? Was this something you did along the way in order to get the coprocessor to boot?
Frances
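One way to avoid mixing up two micinfo dumps, as happened here, is to reduce each dump to the two version lines being discussed and compare those. This is a rough sketch: the function name `flash_versions` is hypothetical, and the pattern assumes the "Flash Version :" and "SMC Firmware Version :" labels exactly as they appear in the output quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print only the flash and SMC firmware version lines
# from micinfo output read on stdin.
flash_versions() {
    grep -E '^[[:space:]]*(Flash Version|SMC Firmware Version)[[:space:]]*:'
}

# Typical use on the host, with the coprocessor booted:
#   micinfo | flash_versions
```

Saving that short output with a date in the file name each time the card is reflashed makes it easier to tell which dump is current.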
Frances Roth (Intel) wrote:
And could you use micsmc to find out what temperature the coprocessor is running at? You can use micsmc without options, which will bring it up in GUI mode; this will let you look at a number of things in addition to temperature, such as cpu usage and memory, and watch how things change over time. Or you can use the -t option ("micsmc -t") which will write out just the temperature information in command line mode.
Hi Frances, and thanks for your help!
This is the output of "micsmc -t" 3 minutes after booting:
mic0 (temp):
   Cpu Temp: ................ 122.00 C
   Memory Temp: ............. 67.00 C
   Fan-In Temp: ............. 45.00 C
   Fan-Out Temp: ............ 67.00 C
   Core Rail Temp: .......... 60.00 C
   Uncore Rail Temp: ........ 61.00 C
   Memory Rail Temp: ........ 61.00 C
Then I ran micsmc in GUI mode and watched the temperature climb to 130 °C; after 3 more minutes the coprocessor stopped.
Now I must wait a long time before I can test anything else...
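Since the card only stays up for a few minutes, it may be easier to check the CPU temperature from the command line than to watch the GUI. The sketch below assumes the "Cpu Temp: ... 122.00 C" line layout shown in the "micsmc -t" output above; the function names and the 85 degree limit in the usage comment are illustrative assumptions, not values specified by Intel.

```shell
#!/bin/sh
# Hypothetical helper: read `micsmc -t` output on stdin and print the
# Cpu Temp value in degrees C (one line per card in the output).
cpu_temp() {
    # The value is the second-to-last field: "... 122.00 C".
    awk '/Cpu Temp:/ { print $(NF-1) }'
}

# Hypothetical helper: exit 0 when the temperature in $1 exceeds the limit in $2.
over_limit() {
    awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t + 0 > limit + 0) }'
}

# Typical use on the host (85 is an example threshold only):
#   t=$(micsmc -t | cpu_temp)
#   over_limit "$t" 85 && echo "mic0 CPU is at $t C, above the limit"
```

With the reading above, `cpu_temp` would print 122.00, and `over_limit 122.00 85` would succeed.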
Frances Roth (Intel) wrote:
Also, can you confirm that you turned power management off, so we can be sure this isn't a recurrence of a known issue?
Could you remind me how to check that?
I found these files. Perhaps they answer your question.
/home/myDir/.config/Intel\ Corp/MicSmcGUI.ini
[SessionSettings]
sessionStart=@Variant(\0\0\0\x10\0%|\xa8\x2\x81<\x16\0)
sessionStop=@Variant(\0\0\0\x10\0%|\xa8\x1\xe9\x94\xfa\0)
resetAllScript="micctrl -ri; micctrl -w"
restartAllScript="micctrl -ri; micctrl -w; micctrl -b; micctrl -w"
resetCardsScript="micctrl -ri %1; micctrl -w"
restartCardsScript="micctrl -ri %1; micctrl -w; micctrl -b %1; micctrl -w"
diagnosticsEnabled=false
[CardViewSettings0]
viewEnable=true
viewEnable\viewState=3
[SystemViewSettings]
viewEnable=true
viewState=2
[SettingsViewSettings]
viewEnable=true
viewState=0
[MicSettingsCpuStat]
coreUsageDialEnable=true
coreUsageGraphEnable=true
tempDialEnable=true
freqDialEnable=true
powerGraphEnable=true
coreMemoryBarEnable=true
turboModeEnable=true
ledAlertEnable=true
eccModeEnable=true
powerStatecpf=true
powerStateco6=true
powerStatepc3=true
powerStatepc6=true
cardUtilGraphEnable=true
coreUtilGraphEnable=true
coreHistogramEnable=true
powerBarEnable=true
tempDisplayEnable=true
coreTempLimit=85
[MicSettings]
logToFile=true
appendFile=true
logFilename=/tmp/CPL_GUI.log
logfileRotation=0
rotationTimestamp=@Variant(\0\0\0\x10\0%|\xa8\x1\xcb\x41H\0)
limitCores=0
/root/.config/Intel\ Corp/MicSmcGUI.ini
[SessionSettings]
sessionStart=@Variant(\0\0\0\x10\0%|\xbb\x1\xe0\x34]\x1)
sessionStop=@Variant(\0\0\0\x10\0%|\xa7\x2v\xf8\x9e\0)
resetAllScript="micctrl -ri; micctrl -w"
restartAllScript="micctrl -ri; micctrl -w; micctrl -b; micctrl -w"
resetCardsScript="micctrl -ri %1; micctrl -w"
restartCardsScript="micctrl -ri %1; micctrl -w; micctrl -b %1; micctrl -w"
diagnosticsEnabled=false
[CardViewSettings0]
viewEnable=true
viewEnable\viewState=3
[SystemViewSettings]
viewEnable=true
viewState=2
[SettingsViewSettings]
viewEnable=true
viewState=0
[MicSettingsCpuStat]
coreUsageDialEnable=true
coreUsageGraphEnable=true
tempDialEnable=true
freqDialEnable=true
powerGraphEnable=true
coreMemoryBarEnable=true
turboModeEnable=true
ledAlertEnable=true
eccModeEnable=true
powerStatecpf=true
powerStateco6=true
powerStatepc3=true
powerStatepc6=true
cardUtilGraphEnable=true
coreUtilGraphEnable=true
coreHistogramEnable=true
powerBarEnable=true
tempDisplayEnable=true
coreTempLimit=85
[MicSettings]
logToFile=true
appendFile=true
logFilename=/tmp/CPL_GUI.log
logfileRotation=0
rotationTimestamp=@Variant(\0\0\0\x10\0%|\xa7\x2s\xca\xfd\0)
limitCores=0
Frances Roth (Intel) wrote:
The information from micdebug_20140403_062958utc that you posted earlier showed that you were running:
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
However, at the top of this thread, you say that micinfo shows that you are running an earlier version:
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
Sorry! The information from micdebug_20140403_062958utc (April 3rd) is the correct one. What I posted at the top of this thread was from April 1st.
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
are the versions I actually use.
BELINDA L. (Intel) wrote:
Additionally, can you share how you obtained this coprocessor (was it through an OEM or other source?)
The Xeon Phi was bought from ITPatrner.
For anyone following this thread: we came to the conclusion that the platform was not properly equipped to cool the coprocessor. Virginie is in the process of getting a system that can.
FYI. The airflow requirements for actively and passively cooled coprocessors are in the datasheet (April 2014). See section 3.3, "Intel® Xeon Phi™ Coprocessor Thermal Solutions."
Regards
---
Taylor
