mic0: lost

Virginie_Favrat
Beginner
1,634 Views

Hi!

I am just getting started with Xeon Phi and MPI. After a lot of trouble, yesterday I managed to run MPI programs in symmetric mode (on both the host and the coprocessor) as well as directly on mic0 after ssh.

But today I have trouble again.

MPSS starts correctly, but 3 minutes later the status of mic0 is "lost" and I cannot reset it.

# micctrl -s
mic0: lost
# service mpss status
mpss is running
# micctrl -rw
          mic0: resetting
          mic0: reset failed

Even when everything seems to run fine, I am not able to reboot or reset the card with micctrl. I have never succeeded in resetting or rebooting mic0 without rebooting my computer.

My hardware:

  • motherboard: ASUS P9 X79E-WS
  • processor: Xeon E5 v2
  • coprocessor: Xeon Phi 7120P (with a fan)

My OS: CentOS 6.5

An excerpt of the micinfo output:

Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2

Thanks in advance.

0 Kudos
13 Replies
Virginie_Favrat
Beginner

Moreover, when I shut down my computer, the Xeon Phi does not power off and I have to unplug the machine.

Virginie_Favrat
Beginner

After 2 days without running the machine, here is an excerpt of this morning's dmesg output:

mic0: Transition from state ready to booting
mic image: /usr/share/mpss/boot/bzImage-knightscorner
MIC 0 Booting
Waiting for MIC 0 boot 5
Waiting for MIC 0 boot 10
Waiting for MIC 0 boot 15
MIC 0 Network link is up
mic0: Transition from state booting to online
micscif_handle_lostnode 1445 node 1
Warning: Core image elf header not found
Kdump: vmcore not initialized
micscif_handle_lostnode 1457 node 1 crash dump failed status -22
mic0: Transition from state online to lost
micscif_handle_lostnode 1472 stopping node 1 to recover lost node!
dma_mark_wait 1080 TO chan 0x0
drain_dma_intr 1151 err -16
dma_mark_wait 1080 TO chan 0x0
drain_dma_intr 1151 err -16
dma_mark_wait 1080 TO chan 0x1
drain_dma_intr 1151 err -16
dma_mark_wait 1080 TO chan 0x2
drain_dma_intr 1151 err -16
dma_mark_wait 1080 TO chan 0x3
drain_dma_intr 1151 err -16
mic0: Transition from state lost to resetting
mic0: Resetting (Post Code <FF><FF>)
mic0: Transition from state resetting to reset failed
MIC 0 RESETFAIL postcode <FF><FF> -1
micscif_handle_lostnode 1523 booting node 1 to recover lost node!
adapter_start_device 1379 state 8??

And the Xeon Phi was pretty hot while booting!

Now that the status is "reset failed", the temperature has dropped considerably.
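
The state changes in the dmesg excerpt above can be pulled out mechanically, which makes the online, lost, resetting, reset failed sequence easier to see. This is a minimal Python sketch, assuming the driver always words these lines as "micN: Transition from state X to Y" as it does in this excerpt; the function name `mic_transitions` is made up for illustration and is not part of MPSS.

```python
import re

# Matches lines such as "mic0: Transition from state online to lost".
# The wording is copied from the dmesg excerpt in this post; it is an
# assumption that other driver versions log the same text.
TRANSITION = re.compile(r"(mic\d+): Transition from state (.+?) to (.+)$")

def mic_transitions(dmesg_text):
    """Return (card, old_state, new_state) tuples in log order."""
    found = []
    for line in dmesg_text.splitlines():
        match = TRANSITION.search(line.strip())
        if match:
            found.append(match.groups())
    return found

# Shortened copy of the excerpt posted above.
sample = """\
mic0: Transition from state ready to booting
MIC 0 Booting
mic0: Transition from state booting to online
micscif_handle_lostnode 1445 node 1
mic0: Transition from state online to lost
mic0: Transition from state lost to resetting
mic0: Resetting (Post Code <FF><FF>)
mic0: Transition from state resetting to reset failed
"""

for card, old, new in mic_transitions(sample):
    print(f"{card}: {old} -> {new}")
```

In practice the text would come from something like `dmesg > dmesg.txt` on the host, read back with `open("dmesg.txt").read()`.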

BelindaLiviero
Employee

While I share the dmesg output with our hardware experts, there is some more data (directly from the coprocessor) that might help. First, run:

echo 0 > /sys/class/mic/scif/watchdog_enabled

Then use the following steps to show the micro-OS kernel log buffer.

Mount debugfs on the host:

mount -t debugfs none /sys/kernel/debug

Dump the buffer (this shows the contents of the buffer up until now):

cat /sys/kernel/debug/mic_debug/mic0/log_buf > <some file of your choice>

Collect any recent and new data as things run, while also writing the contents to STDOUT:

sudo tail -f /sys/kernel/debug/mic_debug/mic0/log_buf | tee -a <some file of your choice>

Have you verified that all power and fan connections are working properly? Additionally, can you share how you obtained this coprocessor (was it through an OEM or another source)?

Thanks

Frances_R_Intel
Employee

Also, can you confirm that you turned power management off, so we can be sure this isn't a recurrence of a known issue? And could you use micsmc to find out what temperature the coprocessor is running at? You can use micsmc without options, which will bring it up in GUI mode; this will let you look at a number of things in addition to temperature, such as cpu usage and memory, and watch how things change over time. Or you can use the -t option ("micsmc -t") which will write out just the temperature information in command line mode.

Frances

Frances_R_Intel
Employee

Actually - let's back up a bit. I went back to the original start of this thread at http://software.intel.com/en-us/comment/1784271#comment-1784271 and noticed that you said you didn't flash the coprocessor when you installed the system. This could actually be the cause of both the instability and the overheating. There was a similar problem last summer - https://software.intel.com/en-us/forums/topic/402337 - where the system wouldn't stay up and the coprocessor was running hot.

To tell what level of flash is installed, you will need to boot the coprocessor, then, from the host, use micinfo. The flash version will be right at the top of the output. And could you tell me what happened when you tried to flash the card?

Frances

Frances_R_Intel
Employee

Sorry, I spoke too quickly. Belinda pointed me to the micdebug info you had sent. You are running the correct version of Flash.

Edit - 

Further apologies on my part - I came in in the middle of this thread and didn't carefully read what came before. The information from micdebug_20140403_062958utc that you posted earlier showed that you were running:

Flash Version            : 2.1.02.0390
SMC Firmware Version     : 1.16.5078
SMC Boot Loader Version  : 1.8.4326
uOS Version              : 2.6.38.8+mpss3.2

However, at the top of this thread, you say that micinfo shows that you are running an earlier version:

Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2

I don't quite know what happened here. I probably missed something on my rereading. Are you truly running with Flash 2.1.02.0386 and Firmware 1.14.4616? Can you get back to 2.1.02.0390 and 1.16.5078? Was this something you did along the way in order to get the coprocessor to boot?
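
For comparing two such excerpts field by field, here is a small Python sketch. It only assumes the "Name : value" layout shown in the two listings above; `parse_versions` and `changed_fields` are illustrative names and are not part of micinfo or MPSS.

```python
def parse_versions(micinfo_text):
    """Map each 'Name : value' line of a micinfo excerpt to its value."""
    versions = {}
    for line in micinfo_text.splitlines():
        # Split at the first colon only; padding around it is stripped.
        name, sep, value = line.partition(":")
        if sep and value.strip():
            versions[name.strip()] = value.strip()
    return versions

def changed_fields(old_text, new_text):
    """Return {field: (old, new)} for fields present in both that differ."""
    old, new = parse_versions(old_text), parse_versions(new_text)
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}

# The excerpt from the top of this thread.
thread_top = """\
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
"""

# The excerpt from micdebug_20140403_062958utc.
micdebug = """\
Flash Version            : 2.1.02.0390
SMC Firmware Version     : 1.16.5078
SMC Boot Loader Version  : 1.8.4326
uOS Version              : 2.6.38.8+mpss3.2
"""

for field, (old, new) in changed_fields(thread_top, micdebug).items():
    print(f"{field}: {old} -> {new}")
```

Only the Flash and SMC Firmware versions differ between the two excerpts; the boot loader and uOS versions are the same in both.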

Frances

Virginie_Favrat
Beginner

Frances Roth (Intel) wrote:

And could you use micsmc to find out what temperature the coprocessor is running at? You can use micsmc without options, which will bring it up in GUI mode; this will let you look at a number of things in addition to temperature, such as cpu usage and memory, and watch how things change over time. Or you can use the -t option ("micsmc -t") which will write out just the temperature information in command line mode.

Hi Frances, and thanks for your help!

Here is the output of "micsmc -t" 3 minutes after booting:

mic0 (temp):
   Cpu Temp: ................ 122.00 C
   Memory Temp: ............. 67.00 C
   Fan-In Temp: ............. 45.00 C
   Fan-Out Temp: ............ 67.00 C
   Core Rail Temp: .......... 60.00 C
   Uncore Rail Temp: ........ 61.00 C
   Memory Rail Temp: ........ 61.00 C

Then I ran micsmc in GUI mode and watched the temperature climb to 130 °C; after 3 more minutes the coprocessor stopped.

Now I have to wait a long time before I can test anything else...
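
To keep an eye on these readings without the GUI, the "micsmc -t" text can be parsed and checked against a threshold. A minimal Python sketch, assuming the output keeps the "Name: ..... value C" layout shown above; `parse_temps`, `over_limit` and the 100 °C threshold are illustrative choices, not Intel tools or Intel limits.

```python
import re

# Matches sensor lines such as "   Cpu Temp: ................ 122.00 C".
# The layout is copied from the "micsmc -t" output in this post; it is an
# assumption that other MPSS versions print the same format.
SENSOR = re.compile(r"^\s*(.+?):\s*\.+\s*([\d.]+)\s*C\s*$")

def parse_temps(micsmc_text):
    """Return {sensor name: temperature in Celsius} from 'micsmc -t' text."""
    temps = {}
    for line in micsmc_text.splitlines():
        match = SENSOR.match(line)
        if match:
            temps[match.group(1)] = float(match.group(2))
    return temps

def over_limit(temps, limit_c):
    """Return (sensor, temperature) pairs above limit_c, hottest first."""
    hot = [(name, t) for name, t in temps.items() if t > limit_c]
    return sorted(hot, key=lambda item: item[1], reverse=True)

# Copy of the output posted above.
sample = """\
mic0 (temp):
   Cpu Temp: ................ 122.00 C
   Memory Temp: ............. 67.00 C
   Fan-In Temp: ............. 45.00 C
   Fan-Out Temp: ............ 67.00 C
   Core Rail Temp: .......... 60.00 C
   Uncore Rail Temp: ........ 61.00 C
   Memory Rail Temp: ........ 61.00 C
"""

# 100.0 is an arbitrary example threshold, not an Intel specification.
for name, temp in over_limit(parse_temps(sample), limit_c=100.0):
    print(f"{name}: {temp:.2f} C")
```

With the sample above, only the CPU sensor is reported; the memory, fan and rail readings stay well below the example threshold.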

Virginie_Favrat
Beginner

Frances Roth (Intel) wrote:

Also, can you confirm that you turned power management off, so we can be sure this isn't a recurrence of a known issue?

Could you remind me how to check that?

I found these files. Perhaps they answer your question.

/home/myDir/.config/Intel\ Corp/MicSmcGUI.ini

[SessionSettings]
sessionStart=@Variant(\0\0\0\x10\0%|\xa8\x2\x81<\x16\0)
sessionStop=@Variant(\0\0\0\x10\0%|\xa8\x1\xe9\x94\xfa\0)
resetAllScript="micctrl -ri; micctrl -w"
restartAllScript="micctrl -ri; micctrl -w; micctrl -b; micctrl -w"
resetCardsScript="micctrl -ri %1; micctrl -w"
restartCardsScript="micctrl -ri %1; micctrl -w; micctrl -b %1; micctrl -w"
diagnosticsEnabled=false

[CardViewSettings0]
viewEnable=true
viewEnable\viewState=3

[SystemViewSettings]
viewEnable=true
viewState=2

[SettingsViewSettings]
viewEnable=true
viewState=0

[MicSettingsCpuStat]
coreUsageDialEnable=true
coreUsageGraphEnable=true
tempDialEnable=true
freqDialEnable=true
powerGraphEnable=true
coreMemoryBarEnable=true
turboModeEnable=true
ledAlertEnable=true
eccModeEnable=true
powerStatecpf=true
powerStateco6=true
powerStatepc3=true
powerStatepc6=true
cardUtilGraphEnable=true
coreUtilGraphEnable=true
coreHistogramEnable=true
powerBarEnable=true
tempDisplayEnable=true
coreTempLimit=85

[MicSettings]
logToFile=true
appendFile=true
logFilename=/tmp/CPL_GUI.log
logfileRotation=0
rotationTimestamp=@Variant(\0\0\0\x10\0%|\xa8\x1\xcb\x41H\0)
limitCores=0

/root/.config/Intel\ Corp/MicSmcGUI.ini

[SessionSettings]
sessionStart=@Variant(\0\0\0\x10\0%|\xbb\x1\xe0\x34]\x1)
sessionStop=@Variant(\0\0\0\x10\0%|\xa7\x2v\xf8\x9e\0)
resetAllScript="micctrl -ri; micctrl -w"
restartAllScript="micctrl -ri; micctrl -w; micctrl -b; micctrl -w"
resetCardsScript="micctrl -ri %1; micctrl -w"
restartCardsScript="micctrl -ri %1; micctrl -w; micctrl -b %1; micctrl -w"
diagnosticsEnabled=false

[CardViewSettings0]
viewEnable=true
viewEnable\viewState=3

[SystemViewSettings]
viewEnable=true
viewState=2

[SettingsViewSettings]
viewEnable=true
viewState=0

[MicSettingsCpuStat]
coreUsageDialEnable=true
coreUsageGraphEnable=true
tempDialEnable=true
freqDialEnable=true
powerGraphEnable=true
coreMemoryBarEnable=true
turboModeEnable=true
ledAlertEnable=true
eccModeEnable=true
powerStatecpf=true
powerStateco6=true
powerStatepc3=true
powerStatepc6=true
cardUtilGraphEnable=true
coreUtilGraphEnable=true
coreHistogramEnable=true
powerBarEnable=true
tempDisplayEnable=true
coreTempLimit=85

[MicSettings]
logToFile=true
appendFile=true
logFilename=/tmp/CPL_GUI.log
logfileRotation=0
rotationTimestamp=@Variant(\0\0\0\x10\0%|\xa7\x2s\xca\xfd\0)
limitCores=0

Virginie_Favrat
Beginner

Frances Roth (Intel) wrote:

The information from micdebug_20140403_062958utc that you posted earlier showed that you were running:

Flash Version            : 2.1.02.0390
SMC Firmware Version     : 1.16.5078
SMC Boot Loader Version  : 1.8.4326
uOS Version              : 2.6.38.8+mpss3.2

However, at the top of this thread, you say that micinfo shows that you are running an earlier version:

Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2

Sorry! The information from micdebug_20140403_062958utc (April 3rd) is the correct one. What I posted at the top of this thread was from April 1st. My apologies.

Flash Version            : 2.1.02.0390
SMC Firmware Version     : 1.16.5078

are the versions I am actually using.

Virginie_Favrat
Beginner

BELINDA L. (Intel) wrote:

Additionally, can you share how you obtained this coprocessor (was it through an OEM or other source?)

The Xeon Phi was bought from ITPatrner.

BelindaLiviero
Employee

For anyone following this thread: we came to the conclusion that the platform was not properly equipped to cool the coprocessor. Virginie is in the process of getting a system that can.

TaylorIoTKidd
New Contributor I

FYI. The airflow requirements for actively and passively cooled coprocessors are in the datasheet (April 2014). See section 3.3, "Intel® Xeon Phi™ Coprocessor Thermal Solutions."

Regards
---
Taylor