31S1P: BSOD or Device Connection Lost
Hi.
My Xeon Phi 31S1P is unstable. With or WITHOUT(!) a job running on the coprocessor, I see events like these:
june 16 2015 21:02:31: Warning: mic0: Device connection lost!
june 16 2015 21:03:47: Information: mic0: Device connection restored
With an OpenCL job the situation gets worse: the card may lose the connection or trigger a blue screen of death :) with IRQL_LESS_OR_EQUAL in some MPSS DLL (I will take a screenshot later).
In the BSOD case the micras log contains nothing. I will try to reproduce the connection issue and attach the micras log.
There is no overheating. The Phi sits at 63 C when idle and 75 C under a job.
Nothing changes when I downgrade or upgrade between MPSS 3.3.4 and 3.5. The OpenCL runtime is 14.2.
I'm afraid it's a hardware issue with the coprocessor...
Any ideas? How can I investigate this? Could it be local overheating? micctrl -t reports safe values.
Thanks!
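The lost/restored events quoted above can be paired up to see how long each outage lasted, which may help correlate drops with idle periods or job activity. The following is a minimal Python sketch, assuming the events have been saved as text lines in exactly the format shown in this post; the function names `parse_event` and `outages` are hypothetical and not part of MPSS.

```python
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

LOST = "Device connection lost"
RESTORED = "Device connection restored"


def parse_event(line):
    """Split one event line into (timestamp, severity, device, message)."""
    # Fields are separated by ": "; the colons inside the clock time are
    # not followed by a space, so they survive this split.
    stamp, severity, device, message = line.strip().split(": ", 3)
    month, day, year, clock = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    when = datetime(int(year), MONTHS[month.lower()], int(day),
                    hour, minute, second)
    return when, severity, device, message


def outages(lines):
    """Return (device, lost_at, duration) for each lost/restored pair."""
    pending = {}  # device name -> time its connection was lost
    result = []
    for line in lines:
        if not line.strip():
            continue
        when, _severity, device, message = parse_event(line)
        if LOST in message:
            pending[device] = when
        elif RESTORED in message and device in pending:
            lost_at = pending.pop(device)
            result.append((device, lost_at, when - lost_at))
    return result


sample = [
    "june 16 2015 21:02:31: Warning: mic0: Device connection lost!",
    "june 16 2015 21:03:47: Information: mic0: Device connection restored",
]

for device, lost_at, duration in outages(sample):
    print(f"{device}: lost at {lost_at}, down for {duration.total_seconds():.0f} s")
# prints "mic0: lost at 2015-06-16 21:02:31, down for 76 s"
```

A "lost" event with no matching "restored" event (for example, when the host blue-screens) is simply left unpaired in this sketch.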
I have upgraded my hand-made active cooling system with a vacuum cleaner. Through a flexible adapter, the vacuum cleaner creates a very powerful airflow. :) As a result, temperatures have dropped to 47/55 C.
The same problem persists, so it's not overheating.
I have no idea what to do next... :\
I must applaud your inventiveness on your active cooling system. I saw a similar issue in the past with the connection being lost but that was on a Linux box and, as I recall, only happened when the coprocessor was idle for long enough for it to go into a deep sleep. Being Linux, there was no BSOD - I don't recall the problem having caused the host to crash but it was a while ago. But you are seeing this with the coprocessor actively running a program. I will dig up the old problem and see if it can cast any light on this problem.
Frances, thank you for your reply!
I think the BSOD is not an additional issue but the other side of my problem. Sorry for my English; I mean that the single underlying problem ("connection lost") may show up either as just a "connection lost" event or as a BSOD. Please take a look: I've attached a BSOD memory dump. Maybe it will help the investigation.
I'm ready to provide any debug or system information that would help. Please don't leave me without a final verdict (host<->coprocessor incompatibility, coprocessor hardware problem, or something else) :)
Thanks!
OK, found the old issue. It only occurs on some B0 and B1 steppings - you have B1. It did, at least in one case, cause a panic (Linux equivalent of BSOD) on the host. The recommendation is to disable the pc3 and pc6 power states. Those states basically shut down the entire card. You can find more information on the power states in https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states. Some people disabled only pc3. The pc3 state can kick in if the code you are running on the coprocessor goes to sleep - which is what I suspect happened with your OpenCL code. Disabling pc3 and pc6 does not stop the coprocessor from turning down the power on individual cores. You can modify the power states by editing C:\Program Files\Intel\MPSS\ d\mic0.xml.
Good idea, Frances.
My mic0.xml is located at C:\Program Files\Intel\MPSS\mic0.xml and does NOT contain a section about pc3 and pc6, but C:\Program Files\Intel\MPSS\global.xml does. So I changed global.xml as follows:
<PowerManagement>
    <cpufreq>on</cpufreq>
    <corec6>on</corec6>
    <pc3>off</pc3>
    <pc6>off</pc6>
</PowerManagement>
I also copied this section into mic0.xml.
Because of the stochastic nature of my problem, it took a few days to check your hypothesis. Unfortunately, it didn't help. I also tried different PCIe slots, with no effect.
Is there another way to find the cause of my problem? Maybe some logs in the uOS? A debug version of MPSS?
I don't know where the root of the problem is: in the coprocessor or in the motherboard. How can I make sure the coprocessor is in good condition? Unfortunately, I don't have a spare coprocessor.
Thanks for your help!
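For anyone who wants to script the pc3/pc6 change described in this post rather than edit the file by hand, here is a minimal Python sketch. It works only on the `<PowerManagement>` fragment quoted above; the starting values of `on` for pc3 and pc6 are an assumption, the layout of the rest of global.xml or mic0.xml is not shown in this thread, and `disable_package_states` is a hypothetical helper name, not an MPSS tool.

```python
import xml.etree.ElementTree as ET

# Assumed starting point: the fragment from the post, with pc3/pc6 still on.
FRAGMENT = """<PowerManagement>
<cpufreq>on</cpufreq>
<corec6>on</corec6>
<pc3>on</pc3>
<pc6>on</pc6>
</PowerManagement>"""


def disable_package_states(xml_text, states=("pc3", "pc6")):
    """Return xml_text with the given package C-states set to 'off'."""
    root = ET.fromstring(xml_text)
    # The section may be the root element or nested somewhere below it.
    if root.tag == "PowerManagement":
        section = root
    else:
        section = root.find(".//PowerManagement")
    if section is None:
        raise ValueError("no <PowerManagement> section found")
    for name in states:
        node = section.find(name)
        if node is None:
            # Add the element if the section does not list it yet.
            node = ET.SubElement(section, name)
        node.text = "off"
    return ET.tostring(root, encoding="unicode")


print(disable_package_states(FRAGMENT))
```

The other settings (`cpufreq`, `corec6`) are left untouched, which matches the advice earlier in the thread that disabling pc3 and pc6 does not stop per-core power reduction. Keeping a backup copy of the original configuration file before writing any changes back is a sensible precaution.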