Vyacheslav_A_ · ‎06-16-2015

31S1P: BSOD or Device Connection Lost

Hi.

My Xeon Phi 31S1P is unstable. With or WITHOUT(!) a job running on the coprocessor, I see events like these:

june 16 2015 21:02:31: Warning: mic0: Device connection lost!
june 16 2015 21:03:47: Information: mic0: Device connection restored
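To see how long each dropout lasts, events like the two above can be paired up with a small script. This is only a sketch: the line layout is assumed from those two sample lines, and `parse_line` and `outages` are hypothetical helper names, not part of MPSS.

```python
import re
from datetime import datetime

# English month names mapped to numbers, so parsing does not depend on locale.
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Assumed layout (taken from the two sample lines only):
# "<month> <day> <year> <HH:MM:SS>: <Level>: <device>: <message>"
LINE_RE = re.compile(
    r"^(\w+) (\d{1,2}) (\d{4}) (\d{2}):(\d{2}):(\d{2}): \w+: (\w+): (.*)$")


def parse_line(line):
    """Return (timestamp, device, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    month, day, year, hh, mm, ss, device, message = m.groups()
    month_no = MONTHS.get(month.lower())
    if month_no is None:
        return None
    stamp = datetime(int(year), month_no, int(day), int(hh), int(mm), int(ss))
    return stamp, device, message


def outages(lines):
    """Pair each 'connection lost' with the next 'connection restored' per device.

    Returns a list of (device, lost_at, duration_in_seconds).
    """
    lost_at = {}
    result = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, device, message = parsed
        text = message.lower()
        if "connection lost" in text:
            lost_at[device] = stamp
        elif "connection restored" in text and device in lost_at:
            start = lost_at.pop(device)
            result.append((device, start, (stamp - start).total_seconds()))
    return result
```

Feeding the two lines above into `outages` gives one entry for mic0 with a duration of 76 seconds. Collecting these over several days would show whether the dropouts cluster around a particular interval or load pattern.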

With an OpenCL job the situation gets worse: it may lose the connection or raise a blue screen of death :) IRQL_LESS_OR_EQUAL in some MPSS dll (I will take a screenshot later).

In the BSOD case the micras log contains nothing. I will try to reproduce the connection issue and attach the micras log.

There is no overheating. When idle the Phi is at 63 C, with a job at 75 C.

Nothing changes when I downgrade or upgrade between MPSS 3.3.4 and 3.5. The OpenCL runtime is 14.2.

I'm afraid it's a hardware issue with the coprocessor...

Any ideas? How can I investigate this? Maybe it's local overheating? micctrl -t reports safe values.

Thanks!

Vyacheslav_A_ · ‎06-16-2015

Attached:

- BSOD information

- some MPSS information

- device connection lost screenshots

Vyacheslav_A_ · ‎06-16-2015

Attached:

- micras log from when the 'connection lost' event occurs

- screenshot taken just before the 'connection lost' event occurs

Vyacheslav_A_ · ‎06-16-2015

Sorry for the 4th post with attachments...

Windows Event screenshot attached.

Vyacheslav_A_ · ‎06-17-2015

BSOD screenshot attached.

Anyone here? :) I have absolutely no idea what to do next...

Vyacheslav_A_ · ‎06-17-2015

I have upgraded my hand-made active cooling system with a vacuum cleaner. Through a flexible adapter, the vacuum cleaner creates a very powerful airflow. :) As a result, temperatures have dropped to 47/55 C.

The same problem persists, so it's not overheating.

I have no idea what to do next... :\

Frances_R_Intel · ‎06-17-2015

I must applaud your inventiveness on your active cooling system. I saw a similar issue in the past with the connection being lost but that was on a Linux box and, as I recall, only happened when the coprocessor was idle for long enough for it to go into a deep sleep. Being Linux, there was no BSOD - I don't recall the problem having caused the host to crash but it was a while ago. But you are seeing this with the coprocessor actively running a program. I will dig up the old problem and see if it can cast any light on this problem.

Vyacheslav_A_ · ‎06-17-2015

Frances, thank you for your reply!

I think the BSOD is not an additional issue but the other side of my problem. Sorry for my English; I mean that the single problem ("connection lost") may cause either just a "connection lost" event or a BSOD. Please take a look: I've attached a BSOD memory dump. Maybe it will help the investigation.

I'm ready to provide any debug or system information that helps the investigation. Please don't leave me without a final verdict (host<->coprocessor incompatibility, coprocessor hardware problem, or something else) :)

Thanks!

Frances_R_Intel · ‎06-17-2015

OK, found the old issue. It only occurs on some B0 and B1 steppings - you have B1. It did, at least in one case, cause a panic (Linux equivalent of BSOD) on the host. The recommendation is to disable the pc3 and pc6 power states. Those states basically shut down the entire card. You can find more information on the power states in https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states. Some people disabled only pc3. The pc3 state can kick in if the code you are running on the coprocessor goes to sleep - which is what I suspect happened with your OpenCL code. Disabling pc3 and pc6 does not stop the coprocessor from turning down the power on individual cores. You can modify the power states by editing C:\Program Files\Intel\MPSS\ d\mic0.xml.

Vyacheslav_A_ · ‎06-21-2015

Good idea, Frances.

My mic0.xml is located at C:\Program Files\Intel\MPSS\mic0.xml and does NOT contain a section about pc3+pc6, but C:\Program Files\Intel\MPSS\global.xml does. So I changed global.xml this way:

And copied this section into mic0.xml.

Because of the stochastic nature of my problem, it took a few days to check your hypothesis. Unfortunately it didn't help. I also tried different PCIe slots, with no effect.

Is there another way to find the cause of my problem? Maybe some logs in the uOS? A debug version of MPSS?

I don't know where the root of the problem is: in the coprocessor or in the motherboard. How can I make sure that the coprocessor is in good condition? Unfortunately I don't have a second coprocessor.

Thanks for your help!