
MIC reset fails with POST code F2

Shereef_S_
Beginner
399 Views

I just got my code running on the MIC and was performing some benchmarks when the ssh terminal froze and the 'micinfo' command never returned.

I shut down the machine (as in, powered off the computer), but the shutdown never completed, so I forced it off by pressing and holding the power button.

The next day, I powered it up and the MIC failed to reset. I ran 'micdebug.sh' and inspected the dmesg output file. It shows the card failing repeatedly with POST code F2.

I did the whole 'unplug/wait/plug in' routine. Same problem. I even unplugged the power and held the power button to make sure the entire system was fully discharged. Still the same problem.

Did I brick my card?

Thanks in advance.

4 Replies
JJK
New Contributor III

A POST code of F2 is listed as "GDDR failed memory training". I'd suggest powering down the box, taking out the Xeon Phi, waiting an hour or so (to let the card cool down), and then trying it in a different server, if at all possible. Was the Xeon Phi running at high temperatures? Is there enough airflow to the card?

Shereef_S_
Beginner

Thanks for getting back to me.

I don't have another machine available to test the card in (yet). However, I had just added "custom" ductwork to improve airflow across the heatsink. While running my benchmark I was monitoring the die temperature, which hovered between 79-81C throughout the benchmark. When idle, the die temperature is between 45-50C. I installed server grade 4000 RPM fans in my case (just to be sure). Aside from the noise, they seem to do well to keep the system cool.

I was ssh'd into the MIC, and after the benchmark completed, the ssh terminal froze. I ran 'micinfo' again, and that command hung. So I closed my terminal and issued a shutdown. But the shutdown hung. So I forced the machine to power down. I waited overnight to let the system cool, and tried again the next day. That's when the problem started.

I have two MICs installed, but I've only been working with one (as I slowly build my way up). The first MIC (the one I've been working with) has had trouble starting at boot (no blue LED), but the "unplug/wait/plug in" cycle always solves the problem. I noticed in the output of 'micdebug.sh' that MIC0 is assigned PCI #2 and MIC1 is assigned PCI #1, which could be the source of the problem at boot. But again, that has never been a real issue, just a nuisance. This POST code problem is a real issue: I can't start the MPSS service or interrogate the card at all. MIC1 is still fine and has always worked well.

I do have another motherboard on the way for a machine that I'm building for my adviser (I've been evangelizing these cards to him), so I'll test my card on that machine when it arrives next week.

But, I fear that I bricked my card when I forced the system to power off.

Is there another way to check the health of the card?

Thanks again.

Shereef_S_
Beginner

OK, I tested my failed MIC on a different motherboard and it transitioned out of the reset state and into the ready state!

According to the "MPSS User Guide", Sec. 15.1.5:

"The driver will perform a soft reset on the card by setting the correct card PCI mapped register."

I have two MICs, and I noted earlier that the PCI mapping was reversed for them. So it seems that this reversal of register mappings was causing the card to fail the memory test because it was never really accessing the on-board memory (that reasoning makes sense, right?).

So, danger averted. I'll need to go back and tighten up my system configuration, but I'm glad to see that the MIC isn't as fragile as I thought and is perfectly capable of surviving my mistakes.

Thanks for the tips, JJK.

Shereef_S_
Beginner

Yeah, still having this problem.

Basic info:

I'm hosting two 31S1P MICs and one very weak video card on an ASUS X99-PRO with an i7-5930K.

Basic problem:

MIC A --> PCIe Slot 1

MIC B --> PCIe Slot 3

After system boot, I 'sudo modprobe mic', followed by 'sudo micctrl -s'. No MPSS. Just the driver.

MIC A: ready

MIC B: reset failed
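
For anyone who wants to repeat this check, the two commands and the status readout can be scripted. This is only a minimal sketch: it assumes 'micctrl -s' prints one "micN: state" line per card, which is my assumption about the output format, and parse_mic_status, not_ready_cards and query_cards are hypothetical helper names, not part of MPSS.

```python
import re
import subprocess

# Assumed line format (not confirmed): "mic0: ready" or "mic1: reset failed".
STATUS_LINE = re.compile(r"^\s*(mic\d+):\s*(.+?)\s*$")


def parse_mic_status(text):
    """Map card name -> reported state for each line matching the assumed format."""
    states = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            states[match.group(1)] = match.group(2).lower()
    return states


def not_ready_cards(states):
    """Return the sorted names of cards whose state is anything other than 'ready'."""
    return sorted(name for name, state in states.items() if state != "ready")


def query_cards():
    """Load the driver only (no MPSS service) and read back the card states.

    Needs sudo and an installed mic driver, so it is not called on import.
    """
    subprocess.run(["sudo", "modprobe", "mic"], check=True)
    result = subprocess.run(["sudo", "micctrl", "-s"], check=True,
                            capture_output=True, text=True)
    return parse_mic_status(result.stdout)
```

Calling not_ready_cards(query_cards()) after each hardware change would give a quick list of the cards that did not come out of reset.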

Attempts to fix:

I switched the cards.

MIC B --> PCIe Slot 1

MIC A --> PCIe Slot 3

Run same commands as before.

MIC A: ready

MIC B: reset failed.

OK. How about just one MIC?

MIC A --> PCIe Slot 1

Run same commands as before.

MIC A: ready

Now swap MIC A for (the problem child) MIC B --> PCIe Slot 1

Run same commands as before.

MIC B: ready ... ?!

OK. Run 'sudo micflash -update -device all' to reflash MIC B, and hopefully reset to an original state. Now add MIC A --> PCIe Slot 3.

Run same commands as before.

MIC A: ready

MIC B: reset failed ... ??

I ran 'lspci -vv' in all cases, and in all cases each card is recognized and memory mapped. For cases with just one MIC, I ran 'micsmc -t' and found the core temperature to be 47C, for both MIC A and MIC B. So the MICs are recognized and allocated in all cases, and temperature is a non-issue.
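
The "recognized and memory mapped" check can also be done mechanically on saved 'lspci -vv' output. The sketch below assumes the cards show up with "Co-processor" in their device header line and that lspci marks a missing mapping as "<unassigned>" or "<ignored>" on the "Region N: Memory at ..." lines; coprocessor_regions and unmapped_coprocessors are hypothetical names I made up for illustration.

```python
import re

# Device header lines start at column 0 with a bus address such as "03:00.0"
# (optionally prefixed by a PCI domain such as "0000:").
DEVICE_HEADER = re.compile(
    r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$", re.IGNORECASE
)
# Indented detail lines describing a memory BAR.
REGION_LINE = re.compile(r"^\s+Region \d+: Memory at (\S+)")


def coprocessor_regions(lspci_text):
    """Return {bus_address: [region base strings]} for 'Co-processor' devices."""
    devices = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_HEADER.match(line)
        if header:
            # Track the device only if its description mentions a co-processor.
            current = header.group(1) if "Co-processor" in header.group(2) else None
            if current:
                devices[current] = []
            continue
        region = REGION_LINE.match(line)
        if region and current:
            devices[current].append(region.group(1))
    return devices


def unmapped_coprocessors(devices):
    """Return addresses of devices with no regions or an unassigned/ignored region."""
    bad = []
    for address, regions in devices.items():
        if not regions or any(r in ("<unassigned>", "<ignored>") for r in regions):
            bad.append(address)
    return sorted(bad)
```

An empty list from unmapped_coprocessors would match what I saw by eye: every card had its memory regions assigned in every configuration.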

I thought it could be a PCI allocation issue, so I disabled the built-in WiFi controller which shares a lane with my video card. Same problem. Though I still haven't ruled out PCI allocation as the culprit.

As it stands, I have two cards that work independently, but not together. My setup was working last week for all of 48 hours. Now it's crippled.
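
To make the pattern explicit, here are the trials above written down as data, with a small helper that groups the conditions under which a card failed or came up. The trial list is copied from my results; the layout and the failure_conditions name are just my own illustration, not any tool's output.

```python
# Each trial: which card sat in which PCIe slot, and the state each card reported.
TRIALS = [
    ({"A": 1, "B": 3}, {"A": "ready", "B": "reset failed"}),
    ({"B": 1, "A": 3}, {"A": "ready", "B": "reset failed"}),
    ({"A": 1}, {"A": "ready"}),
    ({"B": 1}, {"B": "ready"}),
    ({"B": 1, "A": 3}, {"A": "ready", "B": "reset failed"}),  # after reflashing B
]


def failure_conditions(trials, card):
    """Collect (slot, number of other cards installed) pairs, split by outcome."""
    summary = {"failed": set(), "ready": set()}
    for slots, states in trials:
        if card not in states:
            continue  # card was not installed in this trial
        outcome = "ready" if states[card] == "ready" else "failed"
        summary[outcome].add((slots[card], len(slots) - 1))
    return summary
```

For MIC B this gives failures in both slot 1 and slot 3, always with one other card installed, and a single success in slot 1 on its own. MIC A has no failures at all. That is why I suspect the pairing rather than a particular slot or a dead card.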

I have read through as many posts on the forum as I can, looking for new ideas or a similar problem, and I've run out of things to try. Does anyone have a suggestion?

Thanks in advance.
