
MIC reset failed

Simon_M_1
Beginner
8,091 Views

Hi,

It seems like I can't get my MIC card working. I followed the instructions in the readme to install MPSS on a fresh CentOS 6.3 (which is pretty much the same as RHEL 6.3, I think). The errors I get are not consistent, which makes debugging quite hard, but right now, starting mpss using

# service mpss start

fails with this in /var/log/mpssd:

Mon Mar 25 12:10:48 2013: mic0: log_buf_addr: ffffffff832332d0
Mon Mar 25 12:10:48 2013: mic0: log_buf_len: ffffffff81724c70
Mon Mar 25 12:10:48 2013: mic0: Current state "reset failed" cannot boot card
Mon Mar 25 12:10:50 2013: Wait for download requests

The output of miccheck doesn't look good either:

[root@semperphi ~]# /opt/intel/mic/bin/miccheck

miccheck 2.1.5889-14, created 18:10:54 Feb 28 2013
Copyright 2011-2013 Intel Corporation All rights reserved

Test 1 Ensure installation matches manifest : OK
Test 2 Ensure host driver is loaded : OK
Test 3 Ensure driver matches manifest : OK
Test 4 Detect all listed devices : OK
MIC 0 Test 1 Find the device : OK
MIC 0 Test 2 Check the POST code via PCI : FAILED
MIC 0 Test 2> Current POST code is �� (not FF) for MIC 0
MIC 0 Test 3 Connect to the device : SKIPPED
MIC 0 Test 3> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 3> The device is not online
MIC 0 Test 4 Check for normal mode : SKIPPED
MIC 0 Test 4> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 4> The device is not online
MIC 0 Test 5 Check the POST code via SCIF : SKIPPED
MIC 0 Test 5> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 5> The device is not online
MIC 0 Test 6 Send data to the device : SKIPPED
MIC 0 Test 6> Prerequisite 'Check for normal mode' failed:
MIC 0 Test 6> The device is not in normal mode
MIC 0 Test 7 Compare the PCI configuration : OK
MIC 0 Test 8 Ensure Flash version matches manifest : SKIPPED
MIC 0 Test 8> Prerequisite 'Check for normal mode' failed:
MIC 0 Test 8> The device is not in normal mode
Status: The POST code was not "FF"
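
Most of the SKIPPED entries above follow from the single FAILED test, so a quick tally can make the root failure easier to spot. A minimal sketch, assuming the "<test name> : OK/FAILED/SKIPPED" line layout shown above; the function name miccheck_summary is made up for illustration:

```shell
# Summarise miccheck output read from stdin.
# Assumes each result line ends in " : OK", " : FAILED" or " : SKIPPED".
miccheck_summary() {
    awk '
        / : OK[ \t]*$/      { ok++ }
        / : SKIPPED[ \t]*$/ { skipped++ }
        / : FAILED[ \t]*$/  { failed++; if (first == "") first = $0 }
        END {
            printf "ok=%d skipped=%d failed=%d\n", ok, skipped, failed
            if (first != "") print "first failure: " first
        }
    '
}
```

Possible usage: /opt/intel/mic/bin/miccheck | miccheck_summary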

The output of micinfo:

[root@semperphi ~]# /opt/intel/mic/bin/micinfo
MicInfo Utility Log

Created Mon Mar 25 12:13:53 2013


System Info
HOST OS : Linux
OS Version : 2.6.32-279.el6.x86_64
Driver Version : 5889-14
MPSS Version : 2.1.5889-14
Host Physical Memory : 16300 MB

Device No: 0, Device Name: mic0

Version
Flash Version : NotAvailable
SMC Boot Loader Version : NotAvailable
uOS Version : NotAvailable
Device Serial Number : NotAvailable

Board
Vendor ID : ffff
Device ID : ffff
Subsystem ID : ffff
Coprocessor Stepping ID : f
PCIe Width : x63
PCIe Speed : Unknown
PCIe Max payload size : 16384 bytes
PCIe Max read req size : 16384 bytes
Coprocessor Model : 0x0f
Coprocessor Model Ext : 0x0f
Coprocessor Type : 0x03
Coprocessor Family : 0x0f
Coprocessor Family Ext : 0x0ff
Coprocessor Stepping : B1
Board SKU : NotAvailable
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable

Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable

Thermal
Fan Speed Control : NotAvailable
SMC Firmware Version : NotAvailable
FSC Strap : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable

GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable
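
The ffff IDs above are what all-ones reads look like, which is typically what a host sees when a PCIe device does not respond at all. A minimal sketch that checks saved micinfo output for this pattern, assuming the "Vendor ID : ffff" layout shown above; the function name is hypothetical:

```shell
# Exit status 0 if every Vendor/Device/Subsystem ID in micinfo output (stdin)
# reads as ffff, i.e. the card is probably not answering on PCIe.
# Assumes the "<field> : <value>" layout shown above.
micinfo_ids_all_ones() {
    awk -F':' '
        /^[ \t]*(Vendor|Device|Subsystem) ID[ \t]*:/ {
            value = $2
            gsub(/[ \t]/, "", value)
            seen++
            if (tolower(value) != "ffff") responsive = 1
        }
        END { exit ((seen > 0 && !responsive) ? 0 : 1) }
    '
}
```

Possible usage: /opt/intel/mic/bin/micinfo | micinfo_ids_all_ones && echo "card is not responding on PCIe"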

Do you have any ideas what steps I can take to start debugging this ?

Simon

0 Kudos
32 Replies
Simon_M_1
Beginner
2,318 Views

This log appears to be empty.

[root@semperphi mic0]# cat /sys/class/mic/scif/watchdog_enabled
0
[root@semperphi mic0]# mount | grep debugfs
/sys/kernel/debug on /debugfs type debugfs (rw)
[root@semperphi mic0]# pwd
/debugfs/mic_debug/mic0
[root@semperphi mic0]# cat log_buf > /tmp/mic0_log
[root@semperphi mic0]# ls -l /tmp/mic0_log
-rw-r--r-- 1 root root 0 Apr 12 10:18 /tmp/mic0_log

 

0 Kudos
BelindaLiviero
Employee
2,318 Views
This is probably because the card was already in the 'hung' state, right? We need to actually introduce a step 0 and 0.5 to the above instructions, as what we are trying to do here is log what the card says as it reaches the point where it stops responding, i.e.:

  1. Restart the mpss service
  2. Disable watchdog_enabled
  3. Mount the debugfs filesystem
  4. Capture log_buf

For best results, it might make sense to capture it this way:

script /tmp/micbuf
tail -f /sys/kernel/debug/mic_debug/mic0/log_buf

Press Ctrl-D to terminate the script session, and send us what went into /tmp/micbuf.

In the meantime, I'm probing around to see if there are debug options for ofed and/or relevant daemons that we should also consider turning on to capture the activities/errors leading up to the hang.

THANK YOU for your patience and for being our eyes on this problem.
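
The capture sequence described above could be wrapped in one function. This is only a sketch under stated assumptions: it assumes that writing 0 to watchdog_enabled disables the watchdog (the thread only shows the file being read), that debugfs belongs at /sys/kernel/debug (the original poster had it mounted at /debugfs), and it uses tee in place of a script session. The names run, capture_mic_log, DRY_RUN and MIC_LOG_BUF are made up for illustration.

```shell
# Path to the card's kernel log buffer; override if debugfs is mounted elsewhere.
MIC_LOG_BUF="${MIC_LOG_BUF:-/sys/kernel/debug/mic_debug/mic0/log_buf}"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Restart mpss, disable the watchdog, mount debugfs, then follow log_buf
# into a file until interrupted with Ctrl-C. Must be run as root.
capture_mic_log() {
    _out="${1:-/tmp/micbuf}"
    run service mpss restart
    # Assumption: the sysfs attribute accepts a write of 0.
    run sh -c 'echo 0 > /sys/class/mic/scif/watchdog_enabled'
    run sh -c 'mountpoint -q /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug'
    run sh -c "tail -f '$MIC_LOG_BUF' | tee '$_out'"
}
```

Running DRY_RUN=1 capture_mic_log first prints the four steps without touching the system.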
0 Kudos
Simon_M_1
Beginner
2,318 Views

Hmm still no result.

If you prefer, I could find a way to give you access to the system itself and diagnose it directly. Alternatively, if we could have a chat on IRC or something, it could be more efficient.

Thank you for your help !

0 Kudos
Pedro_Cruz
Beginner
2,318 Views

Hi.

I have the same problem on this system:

dmidecode -s system-product-name
X9DRG-HF

Under heavy load, the system reports:

HPL[mic-server] : MIC No. 0 exceeds allowed temperature ( 96 degree).
HPL Warning [mic-server] : Memory load is too heavy. Performance may be low.

After a couple of minutes the MIC card "hangs", and a power cycle is necessary to recover the system.

0 Kudos
BelindaLiviero
Employee
2,318 Views

Simon, please send us the lspci -vvv output, and in the meantime I will try to arrange a call to discuss this case offline (and have Intel access your system). Is it safe to assume you've not seen any indication of temperature issues?

Pedro: do you also have a Supermicro system?

0 Kudos
Simon_M_1
Beginner
2,318 Views

I haven't seen any temperature related messages.

Here is the lspci -vvv output: http://paste.ubuntu.com/5711230/

0 Kudos
Pedro_Cruz
Beginner
2,318 Views

Yes, the X9DRG-HF is a Supermicro system.

These are the last lines of /sys/kernel/debug/mic_debug/mic1/log_buf. When the card hangs, there is no information about what happened.

<4>[    5.034067] Loading RAS module ver 0.9c. Build date: Mar  8 2013
<4>[    5.036794] RAS: card 22508086:25008086:11 SKU is "B1 SKU2" (60 cores, 16 memch, 0 txs)
<4>[    5.039416] RAS.elog: rev 1, size 3276, head 3275, tail 3275
<4>[    5.039472] RAS.elog: init complete
<4>[    5.039479] RAS.core: init complete
<4>[    5.039523] RAS.uncore: init complete
<6>[    5.049389] mic_pm: micpm: Freq/volt table returned to RAS
<6>[    5.049398] mic_pm: index       freq       voltage
<6>[    5.049408] mic_pm: 0      842104       1040000
<6>[    5.049418] mic_pm: 1      947367       1045000
<6>[    5.049427] mic_pm: 2      1052630       1050000
<4>[    5.049434] micpm: RAS module registered
<4>[    5.049447] RAS module load completed
<4>[    5.540110] RAS.init: module operational
<4>[   21.707071] S01fileperms used greatest stack depth: 5752 bytes left
<4>[   21.761012] ip used greatest stack depth: 5504 bytes left
<4>[   21.908814] ip used greatest stack depth: 5208 bytes left
<4>[   22.321287] Module pm_scif loaded at 0xffffffffa001c000
<1>[   22.328387] [ pm_scif_init : 343 ]:==> pm_scif_init
<1>[   22.328420] [ pm_scif_init : 344 ]:pm_scif insmoded
<1>[   22.328472] [ pm_scif_init : 372 ]: scif_bind successfull. Local port number = 1088, ep = 
<1>[   22.329328] [ pm_recv_from_host : 191 ]:==> pm_recv_from_host
<1>[   22.329425] [ pm_handle_get_latencies : 99 ]:==> pm_handle_get_latencies
<1>[   22.329504] [ pm_recv_from_host : 191 ]:==> pm_recv_from_host
<7>[   32.052823] mic0: no IPv6 routers present
<4>[11073.684542] sshd used greatest stack depth: 5152 bytes left
<4>[11158.724079] mount used greatest stack depth: 4128 bytes left

If the problem is a temperature issue, how does the card behave in that situation? And how can I check the temperature of the card?

I just upgraded the BIOS and IPMI to the latest versions, but the command

ipmitool sensor shows:

CPU1 Temp        | 01h | ok  |  3.1 | 36 degrees C  
CPU2 Temp        | 02h | ok  |  3.2 | 30 degrees C  
System Temp      | 11h | ok  |  7.1 | 20 degrees C  
Peripheral Temp  | 12h | ok  |  7.2 | 39 degrees C  
PCH Temp         | 0Ah | ok  |  7.3 | 49 degrees C  
10G Temp         | 0Bh | ok  |  7.4 | 59 degrees C  
P1-DIMMA TEMP    | B0h | ok  | 32.64 | 25 degrees C 
P1-DIMMB TEMP    | B4h | ok  | 32.68 | 27 degrees C 
P1-DIMMC TEMP    | B8h | ok  | 32.72 | 29 degrees C 
P1-DIMMD TEMP    | BCh | ok  | 32.76 | 30 degrees C 
P2-DIMME TEMP    | D0h | ok  | 32.80 | 19 degrees C 
P2-DIMMF TEMP    | D4h | ok  | 32.84 | 19 degrees C 
P2-DIMMG TEMP    | D8h | ok  | 32.88 | 21 degrees C 
P2-DIMMH TEMP    | DCh | ok  | 32.92 | 19 degrees C 
GPU1 Temp        | 71h | ns  | 11.1 | No Reading    
GPU2 Temp        | 72h | ns  | 11.2 | No Reading    
GPU3 Temp        | 73h | ns  | 11.3 | No Reading    
GPU4 Temp        | 74h | ns  | 11.4 | No Reading    
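
In the listing above, the sensors with status "ns" return no reading, so this output alone gives no temperature for whatever sits in those slots. A small sketch that lists such sensors, assuming the five-column, pipe-separated layout shown above; the function name is made up:

```shell
# Print the names of sensors whose status column is "ns" (no reading).
# Reads ipmitool output on stdin; assumes "name | id | status | entity | reading".
ipmi_no_reading() {
    awk -F'|' '
        {
            status = $3
            gsub(/[ \t]/, "", status)
        }
        status == "ns" {
            name = $1
            sub(/[ \t]+$/, "", name)
            print name
        }
    '
}
```

Possible usage: ipmitool sensor | ipmi_no_reading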
                                                    
I'm waiting for an answer about that from Supermicro.

-Pedro

0 Kudos
Pedro_Cruz
Beginner
2,318 Views

The problem was solved by upgrading the system's BIOS and IPMI versions.

It works well now.

0 Kudos
Simon_M_1
Beginner
2,318 Views

I just checked, the BIOS is up to date on my server (X9DRGQF2_C21). IPMI as well (although I don't use it).

0 Kudos
PONRAM
Beginner
2,318 Views

Maybe a compatibility problem.

0 Kudos
BelindaLiviero
Employee
2,318 Views

Hi Simon,

The output of your lspci -vvv was reviewed by a PCIe expert, and the recommendation at this point is for you to contact the OEM provider from whom you acquired the platform and card, as this appears to be a hardware problem. Here I will include some rough notes about what it could potentially be, and what one could try (to narrow down the problem) if so inclined, but at this point it could be anything: an electrical problem, a dead coprocessor, issues with the PCI slot, or issues with cooling. Ultimately, any of these needs to result in a conversation with your OEM, for them to fix (e.g. by sending you a replacement).

(I am logging this below for posterity (others, including OEMs who may be watching this thread), but not expecting you to actually do this :) )

Here are the debug notes from looking at the "lspci -vvv" dump:

  1. LnkSta from the bridge shows the following, which indicates that the PCIe link is not trained:
    1. Width is x0
    2. DlActive is low
    3. From the AERCap register, the first error pointer is 5. Nothing is flagged in the uncorrectable error status register (most likely because an AER was generated and the interrupt handler cleared the error after logging it somewhere, I hope), but bit 5 of the uncorrectable error status is "Surprise Down", which indicates the link was trained at some point and then dropped.
    4. #2 also explains why lspci actually shows the card but can't get any more information from it
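
The Width check in the notes above can be repeated on any saved lspci -vvv dump. A minimal sketch, assuming the usual lspci layout where device header lines start in column 0 and "LnkSta:" lines contain "Width xN"; the function name is hypothetical. It keys on Width x0 only, because DLActive- is also normal on ports that do not report data link layer state.

```shell
# Read `lspci -vvv` output on stdin and print each device whose
# LnkSta line reports a negotiated width of x0 (link not trained).
pcie_untrained_links() {
    awk '
        /^[0-9a-fA-F]/ { dev = $0 }          # device header, e.g. "00:02.0 PCI bridge: ..."
        /LnkSta:/ && /Width x0([^0-9]|$)/ {
            sub(/^[ \t]+/, "")               # drop the leading indentation
            print dev
            print "  " $0
        }
    '
}
```

Possible usage: lspci -vvv | pcie_untrained_links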

 

Usual suspects for behavior like this

  1. Bad PCIe training or electrical problems with PCIe
  2. Thermtrip
  3. VR Fault

Possible things to try

  1. Limit the link to gen1 and see if it trains
  2. Force de-emphasis to -6 dB and try gen2
  3. Try slot 8 or slot 10 on the board, which have shorter trace lengths
  4. Ensure the card has sufficient cooling
  5. Ensure the power rails are not glitchy. Try to read the VR status using the IPMI tool.
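
For the first item in the list above, one possible way to cap the link at gen1 from the host (instead of a BIOS option) is to set Target Link Speed on the bridge above the card and retrain. This is a sketch only, not something the thread confirms works on this platform: it assumes the bridge honours the standard PCIe Link Control 2 and Link Control registers, and the function name is made up. The function only prints the commands so they can be reviewed before being run as root.

```shell
# Print (do not run) setpci commands that would cap a bridge's link at
# 2.5 GT/s and retrain it. Argument: the bridge address as shown by lspci.
gen1_limit_commands() {
    bridge="${1:-}"
    if [ -z "$bridge" ]; then
        echo "usage: gen1_limit_commands <bus:dev.fn>" >&2
        return 1
    fi
    # Link Control 2 (PCIe capability + 0x30): bits 3:0 = Target Link Speed, 1 = 2.5 GT/s.
    printf 'setpci -s %s CAP_EXP+30.w=0001:000f\n' "$bridge"
    # Link Control (PCIe capability + 0x10): bit 5 = Retrain Link.
    printf 'setpci -s %s CAP_EXP+10.w=0020:0020\n' "$bridge"
    # Check the resulting link status.
    printf 'lspci -s %s -vvv | grep LnkSta:\n' "$bridge"
}
```

Possible usage: gen1_limit_commands 00:02.0, where 00:02.0 is a placeholder for the bridge that the card sits behind.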

0 Kudos