
Xeon Phi stops working after a while

AWeis7
Beginner
720 Views

I am running SUSE with kernel 3.0.76-0.11-default. I installed the RPMs with zypper, and they seemed to install fine. The flash command did not work:

/usr/bin/micflash -update -device all -smcbootloader

There was some issue finding the SMC boot loader.

I then set up an SSH key and was able to ssh to mic0.

After a while, say 15 minutes, I lost my connection to the Phi: the ssh session froze, and I could no longer ssh to the card. I have a fan pointed directly at the card, and it was warm but not hot to the touch. When I reboot my system, I can ssh to mic0 again.

lspci | grep 225
04:00.0 Co-processor: Intel Corporation Device 225e (rev ff)

So, the card appears to be recognized.

ssh mic0
ssh: connect to host mic0 port 22: No route to host

8 Replies
AWeis7

Also,

sudo service mpss stop
just hangs.

sudo micctrl -s
mic0: reset failed

AWeis7

From IPMI, the peripheral temp. is 39C.

When I shut down the computer and turn it back on, I can ssh to mic0 fine, but I lose the connection again after 10 minutes. I don't believe this is a cooling issue: the card is not hot to the touch (I tried touching the metal plate; it feels warm, but not hot), and nothing in IPMI looks abnormal.

AWeis7

sudo micsmc -t

mic0 (temp):
   Cpu Temp: ................ 0.00 C (SMC reports sensor read invalid)
   Memory Temp: ............. 0.00 C
   Fan-In Temp: ............. 0.00 C (SMC reports sensor read invalid)
   Fan-Out Temp: ............ 0.00 C (SMC reports sensor read invalid)
   Core Rail Temp: .......... 0.00 C (SMC reports sensor read invalid)
   Uncore Rail Temp: ........ 0.00 C (SMC reports sensor read invalid)
   Memory Rail Temp: ........ 0.00 C (SMC reports sensor read invalid)

So I am not getting temperature readings, and I am not sure where the problem is.
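
The all-zero readings above are easier to catch in a script than by eye. Here is a minimal Python sketch, assuming the `micsmc -t` output looks exactly like the paste in this post (other MPSS versions may print it differently). The function name `parse_micsmc_temps` is hypothetical and not part of MPSS. It returns each sensor's value and whether the SMC flagged the read as invalid.

```python
import re

# Matches lines such as:
#    Cpu Temp: ................ 0.00 C (SMC reports sensor read invalid)
# Assumption: the layout is the one pasted in this thread.
LINE_RE = re.compile(
    r"^\s*(?P<name>[A-Za-z\- ]+?):\s*\.+\s*(?P<value>-?\d+(?:\.\d+)?)\s*C\b(?P<note>.*)$"
)

def parse_micsmc_temps(text):
    """Return {sensor name: (temperature in C, valid flag)}."""
    readings = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skips blank lines and the "mic0 (temp):" header
        # A reading counts as invalid only when the tool says so on that line.
        valid = "invalid" not in m.group("note").lower()
        readings[m.group("name").strip()] = (float(m.group("value")), valid)
    return readings

sample = """mic0 (temp):
   Cpu Temp: ................ 0.00 C (SMC reports sensor read invalid)
   Memory Temp: ............. 0.00 C
"""
for name, (temp, valid) in parse_micsmc_temps(sample).items():
    print(name, temp, "ok" if valid else "INVALID")
```

Note that a 0.00 C reading without the "invalid" note (Memory Temp in the paste) still counts as valid under this check, so it may be worth treating exact zeros as suspect too.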

sudo micinfo
MicInfo Utility Log
Created Sun Dec 28 11:02:46 2014


    System Info
        HOST OS                  : Linux
        OS Version               : 3.0.76-0.11-default
        Driver Version           : 3.4.2-1
        MPSS Version             : 3.4.2
        Host Physical Memory     : 16302 MB

Device No: 0, Device Name: mic0

    Version
        Flash Version            : 2.1.02.0381
        SMC Firmware Version     : 1.8.4326
        SMC Boot Loader Version  : NotAvailable
        uOS Version              : 2.6.38.8+mpss3.4.2
        Device Serial Number     : ADKC31600467

    Board
        Vendor ID                : 0x8086
        Device ID                : 0x225e
        Subsystem ID             : 0x2500
        Coprocessor Stepping ID  : 3
        PCIe Width               : x16
        PCIe Speed               : 5 GT/s
        PCIe Max payload size    : 256 bytes
        PCIe Max read req size   : 512 bytes
        Coprocessor Model        : 0x01
        Coprocessor Model Ext    : 0x00
        Coprocessor Type         : 0x00
        Coprocessor Family       : 0x0b
        Coprocessor Family Ext   : 0x00
        Coprocessor Stepping     : B1
        Board SKU                : B1PRQ-31S1P
        ECC Mode                 : Enabled
        SMC HW Revision          : Product 300W Passive CS

    Cores
        Total No of Active Cores : 57
        Voltage                  : 0 uV
        Frequency                : 1100000 kHz

    Thermal
        Fan Speed Control        : N/A
        Fan RPM                  : N/A
        Fan PWM                  : N/A
        Die Temp                 : 0 C

    GDDR
        GDDR Vendor              : Elpida
        GDDR Version             : 0x1
        GDDR Density             : 2048 Mb
        GDDR Size                : 7936 MB
        GDDR Technology          : GDDR5
        GDDR Speed               : 5.000000 GT/s
        GDDR Frequency           : 2500000 kHz
        GDDR Voltage             : 0 uV
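
The micinfo dump above has a few suspicious zero readings (Voltage, Die Temp, GDDR Voltage) buried among healthy fields. As a rough sketch, assuming the "key : value" layout pasted here, the following Python groups the fields by section and lists the ones that read zero. The names `parse_micinfo` and `zero_readings` are made up for illustration and are not part of MPSS.

```python
import re

ZERO_RE = re.compile(r"^0(?:\.0+)?\s+\S+$")  # "0 uV", "0 C"; not "0x8086"

def parse_micinfo(text):
    """Group 'key : value' lines of micinfo output under their section headings."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Assumption: fields use "key <spaces>: value"; headings do not.
        parts = re.split(r"\s+:\s*", line, maxsplit=1)
        if len(parts) == 2:
            if current is not None:
                sections[current][parts[0]] = parts[1]
        else:
            current = line  # a heading such as "Cores" or "Thermal"
            sections[current] = {}
    # Drop banner lines that were taken for headings but hold no fields.
    return {name: fields for name, fields in sections.items() if fields}

def zero_readings(sections):
    """Return (section, key, value) for every field whose value is zero plus a unit."""
    return [
        (name, key, value)
        for name, fields in sections.items()
        for key, value in fields.items()
        if ZERO_RE.match(value)
    ]
```

Calling `zero_readings(parse_micinfo(text))` on the dump above should point at the same fields that look wrong by eye, which are the ones that depend on the SMC reporting sensor data.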

I did try the flash update. It got most of the way through, then reported an error in maintenance mode before the connection to the card broke. Is the card overheating, and if so, does Intel offer options for cooling it?

AWeis7

Here is my attempt to update the flash:

sudo /usr/bin/micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0390-02.rom.smc
mic0: SMC boot-loader image: /usr/share/mpss/flash/EXT_HP2_SMC_Bootloader_1_8_4326.css_ab
mic0: SMC boot-loader update started
mic0: SMC boot-loader update done
mic0: Transitioning to ready state
mic0: Flash update started

micflash: mic0: Flash operation timed out

micflash: mic0: Failed to reset: read: /sys/class/mic/mic0/post_code: No such device or address
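
To see at a glance which stage of the update never finished, a small Python sketch like the one below can pair up the "started" and "done" lines. It assumes the log format matches the paste above (status lines ending in "started" or "done", error lines prefixed with "micflash:"). The helper name `unfinished_steps` is hypothetical.

```python
def unfinished_steps(log_text):
    """Return (steps that started but never logged 'done', error lines)."""
    open_steps = []
    errors = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("micflash:"):
            errors.append(line)  # assumption: errors carry this prefix
        elif line.endswith(" started"):
            open_steps.append(line[: -len(" started")])
        elif line.endswith(" done"):
            step = line[: -len(" done")]
            if step in open_steps:
                open_steps.remove(step)
    return open_steps, errors
```

For the log in this post, the boot-loader update is matched by its "done" line, so only the flash update itself is left open, followed by the two error lines.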

AWeis7

I aimed a floor fan directly at the Xeon Phi, and was able to get the flash to complete, and also to successfully pass miccheck. Here are the current temperature readings, from micsmc -t:

mic0 (temp):
   Cpu Temp: ................ 66.00 C
   Memory Temp: ............. 42.00 C
   Fan-In Temp: ............. 29.00 C
   Fan-Out Temp: ............ 42.00 C
   Core Rail Temp: .......... 38.00 C
   Uncore Rail Temp: ........ 39.00 C
   Memory Rail Temp: ........ 39.00 C

What solutions or suggestions do you have for addressing the cooling issue on the passively cooled 3100?

AWeis7

I set up the floor fan to blow directly onto the card, and started my neural network program. The temperature on the card went up to 85 C, and micsmc gave a warning that the card was overheating, so I shut the system down.

The card is a 31S1P. How do I cool it down so that it is usable?
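
While experimenting with airflow, it may help to poll the temperature and stop the workload before the card gets that hot again. Below is a minimal Python sketch that calls `micsmc -t` (the same command used earlier in this thread) and watches the "Cpu Temp" line. The `watch` function and its parameters are hypothetical, and the limit is whatever value you pass in; the 85 C figure is only where the warning showed up in this post, not an official threshold.

```python
import re
import subprocess
import time

# Assumption: the "Cpu Temp" line is formatted as in the output pasted in this thread.
CPU_TEMP_RE = re.compile(r"Cpu Temp:\s*\.+\s*(-?\d+(?:\.\d+)?)\s*C")

def fetch_micsmc():
    """Run `micsmc -t` and return its text output (may need sudo on some setups)."""
    result = subprocess.run(
        ["micsmc", "-t"], capture_output=True, text=True, timeout=30
    )
    return result.stdout

def watch(limit_c, samples, interval_s, fetch=fetch_micsmc, sleep=time.sleep):
    """Poll Cpu Temp up to `samples` times; stop early once it reaches `limit_c`."""
    history = []
    for _ in range(samples):
        m = CPU_TEMP_RE.search(fetch())
        temp = float(m.group(1)) if m else None  # None: line missing from output
        history.append(temp)
        if temp is not None and temp >= limit_c:
            print(f"Cpu Temp {temp:.1f} C reached the {limit_c:.1f} C limit")
            break
        sleep(interval_s)
    return history
```

A call such as `watch(limit_c=80.0, samples=120, interval_s=5)` would sample every 5 seconds for up to 10 minutes; the 80 C value is just an example margin below the point where the warning appeared. The actual thermal limits should come from the datasheet.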

TaylorIoTKidd
New Contributor I

There have been several discussions in this forum on how to prevent a passively cooled coprocessor from overheating when installed on a host that isn't designed to support the coprocessor. (OEM hosts that support the coprocessor ensure that the airflow provided in the host is adequate for cooling the coprocessor.) Jim has been one of the key contributors to these discussions.

Specifics concerning the coprocessor cooling requirements can be found in the "Intel® Xeon Phi™ Coprocessor: Datasheet."

I suggest you search for these other posts. As an example: https://software.intel.com/en-us/forums/topic/498452.

Regards
--
Taylor