Hi,
It seems like I can't get my MIC card working. I followed the instructions in the readme to install MPSS on a fresh CentOS 6.3 (which is pretty much the same as RHEL 6.3, I think). The errors I get are not consistent, which makes debugging quite hard, but right now, starting mpss using
# service mpss start
fails with this in /var/log/mpssd:
Mon Mar 25 12:10:48 2013: mic0: log_buf_addr: ffffffff832332d0
Mon Mar 25 12:10:48 2013: mic0: log_buf_len: ffffffff81724c70
Mon Mar 25 12:10:48 2013: mic0: Current state "reset failed" cannot boot card
Mon Mar 25 12:10:50 2013: Wait for download requests
The output of miccheck doesn't look good either:
[root@semperphi ~]# /opt/intel/mic/bin/miccheck
miccheck 2.1.5889-14, created 18:10:54 Feb 28 2013
Copyright 2011-2013 Intel Corporation All rights reserved
Test 1 Ensure installation matches manifest : OK
Test 2 Ensure host driver is loaded : OK
Test 3 Ensure driver matches manifest : OK
Test 4 Detect all listed devices : OK
MIC 0 Test 1 Find the device : OK
MIC 0 Test 2 Check the POST code via PCI : FAILED
MIC 0 Test 2> Current POST code is �� (not FF) for MIC 0
MIC 0 Test 3 Connect to the device : SKIPPED
MIC 0 Test 3> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 3> The device is not online
MIC 0 Test 4 Check for normal mode : SKIPPED
MIC 0 Test 4> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 4> The device is not online
MIC 0 Test 5 Check the POST code via SCIF : SKIPPED
MIC 0 Test 5> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 5> The device is not online
MIC 0 Test 6 Send data to the device : SKIPPED
MIC 0 Test 6> Prerequisite 'Check for normal mode' failed:
MIC 0 Test 6> The device is not in normal mode
MIC 0 Test 7 Compare the PCI configuration : OK
MIC 0 Test 8 Ensure Flash version matches manifest : SKIPPED
MIC 0 Test 8> Prerequisite 'Check for normal mode' failed:
MIC 0 Test 8> The device is not in normal mode
Status: The POST code was not "FF"
The output of micinfo:
[root@semperphi ~]# /opt/intel/mic/bin/micinfo
MicInfo Utility Log
Created Mon Mar 25 12:13:53 2013
System Info
HOST OS : Linux
OS Version : 2.6.32-279.el6.x86_64
Driver Version : 5889-14
MPSS Version : 2.1.5889-14
Host Physical Memory : 16300 MB
Device No: 0, Device Name: mic0
Version
Flash Version : NotAvailable
SMC Boot Loader Version : NotAvailable
uOS Version : NotAvailable
Device Serial Number : NotAvailable
Board
Vendor ID : ffff
Device ID : ffff
Subsystem ID : ffff
Coprocessor Stepping ID : f
PCIe Width : x63
PCIe Speed : Unknown
PCIe Max payload size : 16384 bytes
PCIe Max read req size : 16384 bytes
Coprocessor Model : 0x0f
Coprocessor Model Ext : 0x0f
Coprocessor Type : 0x03
Coprocessor Family : 0x0f
Coprocessor Family Ext : 0x0ff
Coprocessor Stepping : B1
Board SKU : NotAvailable
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable
Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable
Thermal
Fan Speed Control : NotAvailable
SMC Firmware Version : NotAvailable
FSC Strap : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable
GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable
Do you have any ideas about what steps I can take to start debugging this?
Simon
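For anyone triaging similar micinfo output: register reads from a PCIe device that is not responding usually come back as all ones, which would explain the ffff IDs above. Here is a minimal Python sketch that flags such fields in a saved micinfo dump. The function name `all_ones_fields` and the `min_digits` threshold are my own assumptions for illustration; this is not part of MPSS.

```python
# Hypothetical helper (not part of MPSS): flag micinfo fields whose hex
# value is all 'f', which usually means the read never reached the card.
def all_ones_fields(micinfo_text, min_digits=4):
    flagged = {}
    for line in micinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # section headers such as "Board" have no colon
        value = value.strip().lower()
        digits = value[2:] if value.startswith("0x") else value
        # Require several digits so a lone "f" is not reported.
        if len(digits) >= min_digits and set(digits) == {"f"}:
            flagged[name.strip()] = value
    return flagged


# Lines copied from the micinfo output in this post.
SAMPLE = """\
Board
Vendor ID : ffff
Device ID : ffff
Subsystem ID : ffff
Coprocessor Stepping ID : f
Coprocessor Type : 0x03
Coprocessor Stepping : B1
"""

print(all_ones_fields(SAMPLE))
```

If the IDs come back flagged like this, the problem is more likely below the driver (link, power, or the card itself) than in the MPSS configuration.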
This log appears to be empty.
[root@semperphi mic0]# cat /sys/class/mic/scif/watchdog_enabled
0
[root@semperphi mic0]# mount | grep debugfs
/sys/kernel/debug on /debugfs type debugfs (rw)
[root@semperphi mic0]# pwd
/debugfs/mic_debug/mic0
[root@semperphi mic0]# cat log_buf > /tmp/mic0_log
[root@semperphi mic0]# ls -l /tmp/mic0_log
-rw-r--r-- 1 root root 0 Apr 12 10:18 /tmp/mic0_log
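The same check can be scripted, which is handy if you want to capture the log repeatedly. This is only a sketch: `snapshot_log` is a made-up helper name, and it simply copies whatever the file returns and reports the byte count.

```python
from pathlib import Path


# Hypothetical helper: copy a log file and report how many bytes it held.
def snapshot_log(src, dst):
    data = Path(src).read_bytes()  # read everything the file returns
    Path(dst).write_bytes(data)    # keep a copy for later inspection
    return len(data)               # 0 means the log was empty
```

With the paths from the transcript above, the call would be `snapshot_log("/debugfs/mic_debug/mic0/log_buf", "/tmp/mic0_log")`, run as root. A return value of 0 matches the empty file shown by `ls -l`.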
Hmm, still no result.
If you prefer, I could find a way to give you access to the system itself and diagnose it directly. Alternatively, if we could have a chat on IRC or something, it could be more efficient.
Thank you for your help!
Hi,
I have the same problem on this system:
dmidecode -s system-product-name
X9DRG-HF
Under heavy load the system reports:
HPL[mic-server] : MIC No. 0 exceeds allowed temperature ( 96 degree).
HPL Warning [mic-server] : Memory load is too heavy. Performance may be low.
After a couple of minutes the MIC card "hangs" and a power cycle is necessary to recover the system.
Simon, please send us the lspci -vvv output. In the meantime, I will try to arrange a call to discuss this case offline (and have Intel access your system). Is it safe to assume you've not seen any indication of temperature issues?
Pedro: do you also have a Supermicro system?
I haven't seen any temperature related messages.
Here is the lspci -vvv output: http://paste.ubuntu.com/5711230/
Yes, the X9DRG-HF is a Supermicro system.
These are the last lines of /sys/kernel/debug/mic_debug/mic1/log_buf. When the card hangs, there is no information about what happened.
<4>[ 5.034067] Loading RAS module ver 0.9c. Build date: Mar 8 2013
<4>[ 5.036794] RAS: card 22508086:25008086:11 SKU is "B1 SKU2" (60 cores, 16 memch, 0 txs)
<4>[ 5.039416] RAS.elog: rev 1, size 3276, head 3275, tail 3275
<4>[ 5.039472] RAS.elog: init complete
<4>[ 5.039479] RAS.core: init complete
<4>[ 5.039523] RAS.uncore: init complete
<6>[ 5.049389] mic_pm: micpm: Freq/volt table returned to RAS
<6>[ 5.049398] mic_pm: index freq voltage
<6>[ 5.049408] mic_pm: 0 842104 1040000
<6>[ 5.049418] mic_pm: 1 947367 1045000
<6>[ 5.049427] mic_pm: 2 1052630 1050000
<4>[ 5.049434] micpm: RAS module registered
<4>[ 5.049447] RAS module load completed
<4>[ 5.540110] RAS.init: module operational
<4>[ 21.707071] S01fileperms used greatest stack depth: 5752 bytes left
<4>[ 21.761012] ip used greatest stack depth: 5504 bytes left
<4>[ 21.908814] ip used greatest stack depth: 5208 bytes left
<4>[ 22.321287] Module pm_scif loaded at 0xffffffffa001c000
<1>[ 22.328387] [ pm_scif_init : 343 ]:==> pm_scif_init
<1>[ 22.328420] [ pm_scif_init : 344 ]:pm_scif insmoded
<1>[ 22.328472] [ pm_scif_init : 372 ]: scif_bind successfull. Local port number = 1088, ep =
<1>[ 22.329328] [ pm_recv_from_host : 191 ]:==> pm_recv_from_host
<1>[ 22.329425] [ pm_handle_get_latencies : 99 ]:==> pm_handle_get_latencies
<1>[ 22.329504] [ pm_recv_from_host : 191 ]:==> pm_recv_from_host
<7>[ 32.052823] mic0: no IPv6 routers present
<4>[11073.684542] sshd used greatest stack depth: 5152 bytes left
<4>[11158.724079] mount used greatest stack depth: 4128 bytes left
If the problem is a temperature issue, how does the card behave in that situation, and how can I check the temperature of the card?
I just upgraded the BIOS and IPMI firmware to the latest versions, but the command
ipmitool sensor shows
CPU1 Temp | 01h | ok | 3.1 | 36 degrees C
CPU2 Temp | 02h | ok | 3.2 | 30 degrees C
System Temp | 11h | ok | 7.1 | 20 degrees C
Peripheral Temp | 12h | ok | 7.2 | 39 degrees C
PCH Temp | 0Ah | ok | 7.3 | 49 degrees C
10G Temp | 0Bh | ok | 7.4 | 59 degrees C
P1-DIMMA TEMP | B0h | ok | 32.64 | 25 degrees C
P1-DIMMB TEMP | B4h | ok | 32.68 | 27 degrees C
P1-DIMMC TEMP | B8h | ok | 32.72 | 29 degrees C
P1-DIMMD TEMP | BCh | ok | 32.76 | 30 degrees C
P2-DIMME TEMP | D0h | ok | 32.80 | 19 degrees C
P2-DIMMF TEMP | D4h | ok | 32.84 | 19 degrees C
P2-DIMMG TEMP | D8h | ok | 32.88 | 21 degrees C
P2-DIMMH TEMP | DCh | ok | 32.92 | 19 degrees C
GPU1 Temp | 71h | ns | 11.1 | No Reading
GPU2 Temp | 72h | ns | 11.2 | No Reading
GPU3 Temp | 73h | ns | 11.3 | No Reading
GPU4 Temp | 74h | ns | 11.4 | No Reading
I'm waiting for an answer about that from SM.
-Pedro
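In case it helps others reading a table like this, here is a minimal Python sketch that splits the rows into sensors with a temperature and sensors that report nothing. It assumes the five-column layout shown above (name, ID, status, entity, reading); `parse_sensor_table` and `hottest` are made-up names, not ipmitool features.

```python
# Hypothetical helpers for the five-column sensor table pasted above.
def parse_sensor_table(text):
    readings, missing = {}, []
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) != 5:
            continue  # skip anything that is not a table row
        name, _sensor_id, status, _entity, reading = cols
        if reading.endswith("degrees C"):
            readings[name] = float(reading.split()[0])
        elif status == "ns":
            missing.append(name)  # "ns": the sensor gives no reading
    return readings, missing


def hottest(readings):
    # Return the (name, temperature) pair with the highest value.
    return max(readings.items(), key=lambda kv: kv[1])


# Rows copied from the table above.
SAMPLE = """\
CPU1 Temp | 01h | ok | 3.1 | 36 degrees C
PCH Temp | 0Ah | ok | 7.3 | 49 degrees C
10G Temp | 0Bh | ok | 7.4 | 59 degrees C
GPU1 Temp | 71h | ns | 11.1 | No Reading
GPU2 Temp | 72h | ns | 11.2 | No Reading
"""

readings, missing = parse_sensor_table(SAMPLE)
print("hottest:", hottest(readings))
print("no reading:", missing)
```

On my output, all four GPU sensors land in the "no reading" list, so this table says nothing about the card temperature either way. Whether the board is supposed to report the coprocessor there is something I can't confirm.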
The problem was solved by upgrading the BIOS and IPMI firmware of the system.
It now works well.
I just checked, the BIOS is up to date on my server (X9DRGQF2_C21). IPMI as well (although I don't use it).
It may be a compatibility problem.
Hi Simon,
The output of your lspci -vvv was reviewed by a PCIe expert. The recommendation at this point is to contact the OEM from whom you acquired the platform and card, as this appears to be a hardware problem. Below are some rough notes about what the cause could be and what one could try in order to narrow it down, if so inclined. At this point it could be anything: an electrical problem, a dead coprocessor, an issue with the PCI slot, or an issue with cooling. Ultimately, any of these needs to result in a conversation with your OEM so that they can fix it (e.g. by sending you a replacement).
(I am logging this below for posterity, for others, including OEMs, who may be watching this thread. I am not expecting you to actually do this :) )
Here are the debug notes from looking at the "lspci -vvv" dump:
1. LnkSta from the bridge shows the following, which indicates that the PCIe link is not trained:
  - Width is x0
  - DLActive is low
2. In the AERCap register, the first error pointer is 5. Nothing is flagged in the uncorrectable error status register (most likely because an AER was generated and the interrupt handler cleared the error after logging it somewhere, I hope). Bit 5 of the uncorrectable error status is "Surprise Down", which indicates that the link was trained at some point and then dropped.
3. Item 2 also explains why lspci actually shows the card but can't get any more information from it.
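The checks in these notes can be sketched as a small Python script that pulls the same fields out of saved "lspci -vvv" text. Treat it as an illustration only: `summarize_link` is a made-up name, the sample lines are invented to mirror the values described above (they are not taken from Simon's actual dump), and the bit names follow the AER Uncorrectable Error Status layout in the PCIe specification as I understand it.

```python
import re

# Bit positions in the AER Uncorrectable Error Status register.
UNCORRECTABLE_ERROR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def summarize_link(lspci_text):
    """Extract link width, DLActive and the AER first error from lspci text."""
    summary = {}
    m = re.search(r"LnkSta:.*\bWidth\s+x(\d+)", lspci_text)
    if m:
        summary["width"] = int(m.group(1))  # x0 means the link is not trained
    m = re.search(r"LnkSta:.*\bDLActive([+-])", lspci_text)
    if m:
        summary["dl_active"] = m.group(1) == "+"
    m = re.search(r"First Error Pointer:\s*([0-9a-fA-F]+)", lspci_text)
    if m:
        bit = int(m.group(1), 16)  # lspci prints the pointer in hex
        summary["first_error"] = UNCORRECTABLE_ERROR_BITS.get(bit, f"bit {bit}")
    return summary


# Invented sample lines (assumption), shaped like lspci -vvv output.
SAMPLE = """\
LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
AERCap: First Error Pointer: 05, GenCap+ CGenEn- ChkCap+ ChkEn-
"""

print(summarize_link(SAMPLE))
```

A width of 0 together with DLActive low and a first error of "Surprise Down Error" is the combination the notes describe: the link came up at some point and then went away.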
Usual suspects for behavior like this
- Bad PCIe training or electrical problems with PCIe
- Thermtrip
- VR Fault
Possible things to try
- Limit the link to Gen1 and see if it trains
- Force de-emphasis to -6 dB and try Gen2
- Try slot 8 or slot 10 on the board which have shorter trace lengths
- Ensure the card has sufficient cooling
- Ensure the power rails are not glitchy. Try to read the VR status using the IPMI tool.
