I am running SUSE with kernel 3.0.76-0.11-default. I installed the MPSS RPMs with zypper, and they seemed to install fine. The flash command did not work:
/usr/bin/micflash -update -device all -smcbootloader
There was some issue finding the SMC bootloader.
I then set up a key and was able to ssh to mic0.
After a while, say 15 minutes, I lost my connection to the Phi: the ssh session froze, and I could no longer ssh to the card. I have a fan blowing directly on the card, and it was warm but not hot to the touch. When I reboot my system, I can ssh to mic0 again.
lspci | grep 225
04:00.0 Co-processor: Intel Corporation Device 225e (rev ff)
So, the card appears to be recognized.
ssh mic0
ssh: connect to host mic0 port 22: No route to host
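To pin down when the card drops off the network, a small probe like the one below could be run periodically from the host. This is only a sketch: the function name `ssh_port_open` and the 3-second timeout are illustrative choices, and it only checks whether a TCP connection to port 22 can be opened, not whether the card itself is healthy.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Failures such as "No route to host" or "Connection refused"
    surface as OSError and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "mic0" is the host name used in this thread.
    print("mic0 reachable:", ssh_port_open("mic0"))
```

Logging the result with a timestamp every minute or so would show how long after boot the connection is lost.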
Also,
sudo service mpss stop
just hangs.
sudo micctrl -s
mic0: reset failed
From IPMI, the peripheral temp. is 39C.
When I shut down the computer and turn it back on, I can ssh to mic0 fine, but I lose the connection again after about 10 minutes. I don't believe this is a cooling issue: the card is not hot to the touch (I tried touching the metal plate; it feels warm, but not hot), and nothing in IPMI looks abnormal.
sudo micsmc -t
mic0 (temp):
Cpu Temp: ................ 0.00 C (SMC reports sensor read invalid)
Memory Temp: ............. 0.00 C
Fan-In Temp: ............. 0.00 C (SMC reports sensor read invalid)
Fan-Out Temp: ............ 0.00 C (SMC reports sensor read invalid)
Core Rail Temp: .......... 0.00 C (SMC reports sensor read invalid)
Uncore Rail Temp: ........ 0.00 C (SMC reports sensor read invalid)
Memory Rail Temp: ........ 0.00 C (SMC reports sensor read invalid)
So, I am not getting temperature readings, and I am not sure where the problem is.
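For anyone who wants to log these readings, here is a minimal sketch that parses text in the format shown above. It assumes the `micsmc -t` output always looks like the lines pasted in this post; the function name `parse_micsmc_temps` is made up for illustration, and readings that the SMC flags as invalid are returned as `None` rather than as 0.00.

```python
import re

# Matches lines such as "Cpu Temp: ................ 66.00 C",
# with an optional trailing note in parentheses.
LINE_RE = re.compile(
    r"^\s*(?P<name>[A-Za-z][A-Za-z -]*?):\s*\.+\s*"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*C\b\s*(?:\((?P<note>[^)]*)\))?"
)


def parse_micsmc_temps(text):
    """Return {sensor name: temperature in C, or None if flagged invalid}."""
    readings = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skips the "mic0 (temp):" header and blank lines
        note = m.group("note")
        invalid = note is not None and "invalid" in note
        readings[m.group("name")] = None if invalid else float(m.group("value"))
    return readings
```

Note that a line without the "invalid" note (Memory Temp above) is still reported as 0.0, so a value of exactly zero should probably be treated with suspicion as well.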
sudo micinfo
MicInfo Utility Log
Created Sun Dec 28 11:02:46 2014
System Info
HOST OS : Linux
OS Version : 3.0.76-0.11-default
Driver Version : 3.4.2-1
MPSS Version : 3.4.2
Host Physical Memory : 16302 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0381
SMC Firmware Version : 1.8.4326
SMC Boot Loader Version : NotAvailable
uOS Version : 2.6.38.8+mpss3.4.2
Device Serial Number : ADKC31600467
Board
Vendor ID : 0x8086
Device ID : 0x225e
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-31S1P
ECC Mode : Enabled
SMC HW Revision : Product 300W Passive CS
Cores
Total No of Active Cores : 57
Voltage : 0 uV
Frequency : 1100000 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 0 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 0 uV
I did try the flash update, which got most of the way through, then hit some error in maintenance mode before the connection to the card broke. Is the card overheating, and if so, does Intel offer options to cool this card?
Here is my attempt to update the flash:
sudo /usr/bin/micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0390-02.rom.smc
mic0: SMC boot-loader image: /usr/share/mpss/flash/EXT_HP2_SMC_Bootloader_1_8_4326.css_ab
mic0: SMC boot-loader update started
mic0: SMC boot-loader update done
mic0: Transitioning to ready state
mic0: Flash update started
micflash: mic0: Flash operation timed out
micflash: mic0: Failed to reset: read: /sys/class/mic/mic0/post_code: No such device or address
I aimed a floor fan directly at the Xeon Phi, and was able to get the flash to complete, and also to successfully pass miccheck. Here are the current temperature readings, from micsmc -t:
mic0 (temp):
Cpu Temp: ................ 66.00 C
Memory Temp: ............. 42.00 C
Fan-In Temp: ............. 29.00 C
Fan-Out Temp: ............ 42.00 C
Core Rail Temp: .......... 38.00 C
Uncore Rail Temp: ........ 39.00 C
Memory Rail Temp: ........ 39.00 C
What solutions or suggestions do you have for addressing the cooling issue on the passively cooled 3100?
I set up the floor fan to blow directly onto the card, and started my neural network program. The temperature on the card went up to 85C, and micsmc gave a warning that the card was over-heating, so I shut the system down.
The card is a 31S1P. How do I cool it down so that it is usable?
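Until the airflow is sorted out, a watchdog along these lines could at least stop a job before the card reaches the temperature at which the warning appeared. It is a rough sketch: it shells out to `micsmc -t` (which may need root, as in the earlier posts), the 85 C stop level is simply where the warning was seen here and not an Intel specification, and the 75 C warning level and the names `read_cpu_temp` and `classify` are assumptions chosen for illustration.

```python
import re
import subprocess

# Captures the Cpu Temp value and the rest of that line (any SMC note).
CPU_RE = re.compile(r"Cpu Temp:\s*\.+\s*(\d+(?:\.\d+)?)\s*C(.*)")


def read_cpu_temp(run=subprocess.run):
    """Run `micsmc -t` and return the Cpu Temp in C.

    Returns None if the line is missing or flagged as an invalid read.
    The `run` parameter exists only so the command can be faked in tests.
    """
    result = run(["micsmc", "-t"], capture_output=True, text=True)
    m = CPU_RE.search(result.stdout)
    if m is None or "invalid" in m.group(2):
        return None
    return float(m.group(1))


def classify(temp_c, warn_at=75.0, stop_at=85.0):
    """Map a temperature to 'unknown', 'ok', 'warn', or 'stop'.

    warn_at is an assumed margin; stop_at is the level observed in this thread.
    """
    if temp_c is None:
        return "unknown"
    if temp_c >= stop_at:
        return "stop"
    if temp_c >= warn_at:
        return "warn"
    return "ok"
```

Calling `classify(read_cpu_temp())` in a loop with a sleep of a few seconds, and killing the workload on "stop", would be one way to use it. It does not replace proper airflow through the card.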
You need to push air through the unit, not onto the outside of the card case.
Jim Dempsey
There have been several discussions in this forum on how to prevent a passively cooled coprocessor from overheating when installed in a host that isn't designed to support the coprocessor. (OEM hosts that support the coprocessor ensure that the airflow provided in the host is adequate for cooling the coprocessor.) Jim has been one of the key contributors to these discussions.
Specifics concerning the coprocessor cooling requirements can be found in the "Intel® Xeon Phi™ Coprocessor: Datasheet."
I suggest you search for these other posts. As an example: https://software.intel.com/en-us/forums/topic/498452.
Regards
--
Taylor