31S1P: BSOD or Device Connection Lost

Vyacheslav_A_
Beginner

Hi.


My Xeon Phi 31S1P is unstable. With or WITHOUT(!) a job running on the coprocessor, I see events like these:


june 16 2015 21:02:31: Warning: mic0: Device connection lost!
june 16 2015 21:03:47: Information: mic0: Device connection restored

 

With an OpenCL job the situation gets worse: the host may lose the connection or hit a blue screen of death :) with IRQL_LESS_OR_EQUAL in some MPSS DLL (I will take a screenshot later).


When the BSOD occurs, the micras log contains nothing. I will try to reproduce the connection issue and attach the micras log.

There is no overheating: the Phi sits at 63 C when idle and 75 C with a job running.

Downgrading or upgrading between MPSS 3.3.4 and 3.5 changes nothing. The OpenCL runtime is 14.2.


I'm afraid it's a hardware issue with the coprocessor...


Any ideas? How can I investigate? Could it be local overheating? micctrl -t reports safe values.
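
For anyone who wants to quantify events like the two quoted above, here is a minimal Python sketch that pairs each "Device connection lost" line with the following "Device connection restored" line for the same device and reports how long the outage lasted. It assumes the log lines have exactly the format quoted in this post (lower-case English month name, day, year, time, level, device, message); the function names `parse_line` and `outages` are illustrative and not part of MPSS, and getting the text out of the Windows event log is not covered.

```python
import re
from datetime import datetime

# English month names, hardcoded so parsing does not depend on the system locale.
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# Assumed line format, taken from the two lines quoted in the post:
#   june 16 2015 21:02:31: Warning: mic0: Device connection lost!
LINE_RE = re.compile(
    r"^(?P<month>[A-Za-z]+) (?P<day>\d{1,2}) (?P<year>\d{4}) "
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}): "
    r"(?P<level>\w+): (?P<dev>mic\d+): (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, device, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    month_name = match["month"].lower()
    if month_name not in MONTHS:
        return None
    timestamp = datetime(
        int(match["year"]), MONTHS.index(month_name) + 1, int(match["day"]),
        int(match["h"]), int(match["m"]), int(match["s"]),
    )
    return timestamp, match["dev"], match["msg"]


def outages(lines):
    """Pair lost/restored events per device; return (device, lost_at, seconds)."""
    lost_at = {}
    result = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, device, message = parsed
        text = message.lower()
        if "connection lost" in text:
            lost_at[device] = timestamp
        elif "connection restored" in text and device in lost_at:
            start = lost_at.pop(device)
            result.append((device, start, (timestamp - start).total_seconds()))
    return result


if __name__ == "__main__":
    sample = [
        "june 16 2015 21:02:31: Warning: mic0: Device connection lost!",
        "june 16 2015 21:03:47: Information: mic0: Device connection restored",
    ]
    for device, start, seconds in outages(sample):
        print(f"{device} lost at {start}, restored after {seconds:.0f} s")
```

For the two lines quoted above this reports a 76-second outage on mic0. Collecting such durations over several days, together with the job state at the time, may show whether the events cluster around idle periods.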


Thanks!

 

 

9 Replies
Vyacheslav_A_
Beginner

Attached:

- BSOD information

- some MPSS information

- device connection lost screenshots

Vyacheslav_A_
Beginner

Attached:

- micras log when 'connection lost' event occurs

- a screenshot taken just before the 'connection lost' event occurs
 

Vyacheslav_A_
Beginner

Sorry for the 4th post with attachments...

 

Windows Event Screenshot attached.

Vyacheslav_A_
Beginner

BSOD screenshot attached.

 

Anyone here? :) I have absolutely no idea what to do next...

Vyacheslav_A_
Beginner

I have upgraded my hand-made active cooling system with a vacuum cleaner. Through a flexible adapter, the vacuum cleaner creates a very powerful airflow. :) As a result, temperatures have dropped to 47/55 C.

The same problem persists, so it's not overheating.

I have no idea what to do next...    :\

Frances_R_Intel
Employee

I must applaud your inventiveness on your active cooling system. I saw a similar issue in the past with the connection being lost but that was on a Linux box and, as I recall, only happened when the coprocessor was idle for long enough for it to go into a deep sleep. Being Linux, there was no BSOD - I don't recall the problem having caused the host to crash but it was a while ago. But you are seeing this with the coprocessor actively running a program. I will dig up the old problem and see if it can cast any light on this problem.

Vyacheslav_A_
Beginner

Frances, thank you for your reply!

I think the BSOD is not an additional issue but the other side of my problem. Sorry for my English; I mean that a single underlying problem ("connection lost") may show up either as a plain "connection lost" event or as a BSOD. Please take a look: I've attached a BSOD memory dump. Maybe it will help the investigation.

 

I'm ready to provide any debug or system information that would help the investigation. Please don't leave me without a final verdict (host<->coprocessor incompatibility, coprocessor hardware problem, or something else) :)

 

Thanks!

Frances_R_Intel
Employee

OK, found the old issue. It only occurs on some B0 and B1 steppings - you have B1. It did, at least in one case, cause a panic (Linux equivalent of BSOD) on the host. The recommendation is to disable the pc3 and pc6 power states. Those states basically shut down the entire card. You can find more information on the power states in https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states. Some people disabled only pc3. The pc3 state can kick in if the code you are running on the coprocessor goes to sleep - which is what I suspect happened with your OpenCL code. Disabling pc3 and pc6 does not stop the coprocessor from turning down the power on individual cores. You can modify the power states by editing C:\Program Files\Intel\MPSS\ d\mic0.xml.

Vyacheslav_A_
Beginner

Good idea, Frances.

My mic0.xml is located at C:\Program Files\Intel\MPSS\mic0.xml and DOES NOT contain a section about pc3 and pc6, but C:\Program Files\Intel\MPSS\global.xml does, so I changed global.xml as follows:

  <PowerManagement>
    <cpufreq>on</cpufreq>
    <corec6>on</corec6>
    <pc3>off</pc3>
    <pc6>off</pc6>
  </PowerManagement>
 

I also copied this section into mic0.xml.
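
To double-check that both files really carry the intended settings, a small Python sketch like the following can read the `PowerManagement` section and confirm that pc3 and pc6 are off. It assumes only the fragment shown above: the element names come from this post, while the rest of the file layout (including the root element) is unknown, so the code searches for `PowerManagement` anywhere in the tree. The function names `power_states` and `package_states_disabled` are illustrative, not part of MPSS.

```python
import sys
import xml.etree.ElementTree as ET


def power_states(root):
    """Collect tag -> value pairs from every PowerManagement element under root."""
    states = {}
    for section in root.iter("PowerManagement"):
        for child in section:
            states[child.tag] = (child.text or "").strip().lower()
    return states


def package_states_disabled(states):
    """True only if both pc3 and pc6 are present and set to 'off'."""
    return states.get("pc3") == "off" and states.get("pc6") == "off"


if __name__ == "__main__":
    # Pass the config file paths on the command line, for example the
    # global.xml and mic0.xml files mentioned in this post.
    for path in sys.argv[1:]:
        states = power_states(ET.parse(path).getroot())
        verdict = "disabled" if package_states_disabled(states) else "NOT disabled"
        print(f"{path}: pc3/pc6 {verdict} {states}")
```

A file without a `PowerManagement` section, as described for the original mic0.xml, yields an empty dictionary and is reported as not disabled, which makes a missing section easy to spot.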

Because of the stochastic nature of my problem, it took a few days to test your hypothesis. Unfortunately, it didn't help. I also tried different PCIe slots, with no effect.

 

Is there another way to find the cause of my problem? Maybe some logs in the uOS? A debug version of MPSS?

I don't know where the root of the problem is: in the coprocessor or in the motherboard. How can I make sure that the coprocessor is in good condition? Unfortunately, I don't have a second coprocessor.

 

Thanks for your help!
