Are you having problems with your hardware (Cannot see your Intel(R) Xeon Phi(tm) coprocessor? Sporadic accessibility?) or with the Intel(R) Manycore Platform Software Stack (Intel(R) MPSS) running reliably?
Attached to this post are PDF "flowcharts" that explain how you can troubleshoot the problem (note: both Linux and Windows flowcharts are available) and show what information you will want to collect if you need to escalate your issue to your OEM provider or Intel.
We hope this is useful to you! Please let us know if you have found a boundary condition that this "flow" does not handle properly.
Hi Belinda !
I managed to start MPSS service and to update micflash.
The main problem was a thermal one. We have installed one more fan just for the Xeon Phi coprocessor.
But there is still a message that is not correct when I try to get the version of micflash.
Here it is :
micflash -getversion -device 0
mic0: Flash read started
mic0: Read done
mic0: Version: 2.1.02.0390
mic0: Transitioning to ready state
micflash: mic0: Failed to read post code: read: /sys/class/mic/mic0/post_code: No such device or address
and then :
micctrl -s
mic0: reset failed
Could you help me, please?
Edit :
After rebooting output for miccheck :
miccheck
MicCheck 3.2-r1
Copyright 2013 Intel Corporation All Rights Reserved
Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
Status: OK
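If you run miccheck repeatedly (for example after each reboot), it can help to pick out the non-passing tests automatically. The following is only a rough sketch: it assumes one "Test N ... result" entry per line, as miccheck prints it on a terminal, and it treats any result other than "pass" as a problem, because the exact wording miccheck uses for a failure is not shown in this thread. The function names are made up for this example.

```python
import re

# One result line per test, e.g.
# "Test 4 (mic0): Check device is in online state and its postcode is FF ... pass"
TEST_LINE = re.compile(
    r"^[ \t]*Test[ \t]+(\d+)(?:[ \t]+\((\w+)\))?:[ \t]+(.*?)[ \t]+\.\.\.[ \t]+(\S+)[ \t]*$",
    re.MULTILINE,
)


def parse_miccheck(text):
    """Return a list of (number, device, description, result) tuples.

    Host-level tests carry no "(micN)" tag, so they are labelled "host".
    """
    return [
        (int(num), dev or "host", desc, result)
        for num, dev, desc, result in TEST_LINE.findall(text)
    ]


def not_passed(text):
    """Return every test whose result is anything other than 'pass'."""
    return [t for t in parse_miccheck(text) if t[3] != "pass"]
```

Feeding it the saved output of a miccheck run and printing `not_passed(...)` gives an empty list when every test reports pass.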
BELINDA L. (Intel) wrote:
the 'lspci -vvv' output on your host shows some weird things for the coprocessor (look for Co-processor in the output).
Here is the new one :
08:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev 20)
    Subsystem: Intel Corporation Device 7d95
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 11
    Region 0: Memory at 380800000000 (64-bit, prefetchable) [size=16G]
    Region 4: Memory at d3200000 (64-bit, non-prefetchable) [size=128K]
    Capabilities: [44] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [4c] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [98] MSI-X: Enable- Count=16 Masked-
        Vector table: BAR=4 offset=00017000
        PBA: BAR=4 offset=00018000
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
BELINDA L. (Intel) wrote:
I found someone else in this forum who has similar hardware to yours:
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: P9X79 WS
He uses CentOS (6.4) vs. yours (6.5), using an older MPSS (3.1.x) vs. yours (3.2).
Is it possible to know who he is, so that I can ask him questions if necessary?
BELINDA L. (Intel) wrote:
Let me ask a couple of questions:
- is this the first time you've installed this coprocessor? (that seems to be the case based on what you've said before)
I tried several times but it is the first time I installed it.
BELINDA L. (Intel) wrote:
- have you tried plugging the co-processor into any other slot in your system
I tried 3 of the 7 slots. Finally I chose the one in the middle to avoid too much heat.
BELINDA L. (Intel) wrote:
- did you change anything in your system's BIOS? (i.e. you need to enable BIOS support for memory mapped I/O address ranges above 4GB? )
I enabled addresses above 4 GB as soon as I installed the new motherboard.
Moreover I have just boosted the speed of the fan for the Xeon Phi co-processor.
BELINDA L. (Intel) wrote:
- we may have to look further into the BIOS -- I have some BIOS update files from someone who, like I said before, had his ASUS functioning. I could forward these to you. The version he has working is P9x79-WS-ASUS-4306.CA. What is yours?
I do not know the BIOS version yet. I will report it later.
Thanks for your help.
As you can read above, I still need help!
Hi Virginie -
1. what are the results of 'ls /sys/class/mic/mic0'
2. You indicated that you managed to start mpss -- does the process table show mpssd running (ps auxw | grep mpss)? were there any errors resulting from the startup (service mpss start)?
3. can you obtain and send another capture of micdebug.sh (now that you corrected the thermal issue - hopefully ); we're specifically interested in what micinfo and dmesg commands say. dmesg or /var/log/messages may have some indication of what is happening. micdebug.sh collects all of this data in one shot.
4. Are you pretty sure that there aren't any lingering thermal issues even after your changes in fan speeds and coprocessor/slot positioning?
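If it helps with items 1 to 3, the driver's view of the card can also be captured with a few lines of Python. This is only an illustrative sketch and not a replacement for micdebug.sh: it assumes the driver exposes plain-text attribute files under /sys/class/mic/mic0 (the path from the micflash error earlier in this thread), and the attribute names `state` and `post_code` as well as the function name `read_mic_attrs` are assumptions to check against your own `ls` output.

```python
import os


def read_mic_attrs(base="/sys/class/mic/mic0", names=("state", "post_code")):
    """Read a few sysfs attributes and return them as a dict.

    A read error (such as the "No such device or address" seen for
    post_code) is recorded as text instead of aborting the whole run.
    """
    result = {}
    for name in names:
        path = os.path.join(base, name)
        try:
            with open(path) as fh:
                result[name] = fh.read().strip()
        except OSError as exc:
            result[name] = "<unreadable: %s>" % exc
    return result


if __name__ == "__main__":
    for key, value in read_mic_attrs().items():
        print("%s = %s" % (key, value))
```

Recording the error text rather than raising is deliberate here: an attribute that cannot be read is itself useful information when escalating the issue.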
Hi Belinda !
1.
ls /sys/class/mic/mic0
active_cores      flash_update       memoryvoltage  scif_status
boot_count        flashversion       memsize        serialnumber
cmdline           fuse_config_rev    mode           sku
crash_count       image              model          state
dev               initramfs          pc3_enabled    stepping
device            interface_version  pc6_enabled    stepping_data
extended_family   kernel_cmdline     pc6_timeout    substepping_data
extended_model    log_buf_addr       platform       subsystem
fail_safe_offset  log_buf_len        post_code      uevent
family            meminfo            power          virtblk_file
family_data       memoryfrequency    processor
2. There are no more errors when I start the mpss service.
ps auxw | grep mpss
root 3736 0.0 0.0 194864 932 pts/0 Sl 08:14 0:00 /usr/sbin/mpssd
root 3866 0.0 0.0 105320 912 pts/0 S+ 08:21 0:00 grep mpss
3. The output of micdebug.sh is attached.
4. I cannot be sure, but yesterday I stayed logged in to mic0 over ssh for the whole morning without any interruption. It was the first time I managed to keep the mpss service running for longer than 1 minute. Moreover, I touched the co-processor many times before shutting down my computer yesterday morning without burning myself, which used to happen every time I booted the computer. It was still hot, but not burning. Previously it was burning hot for a few minutes after each boot, and when the mic status became "reset failed" it cooled down again. I am sending you a picture of the co-processor and the fan, taken before I installed it.
Thanks and have a nice day.
Hi Virginie,
if your coprocessor came up and was accessible for a short amount of time, then it's quite possibly a cooling or power management issue.
Do you know whether the ASUS system you are using is qualified to run this coprocessor? Is this something you can check with your hardware provider?
One thing to try (temporarily -- I would not recommend leaving power management off for extended periods of time) is to do this:
sudo service mpss stop
sudo micctrl --pm=off
sudo micctrl --resetconfig (shouldn’t be necessary but won’t hurt anything)
sudo service mpss start
and see if the coprocessor manages to stay up for a while. Let me know how that goes.
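To see how long the card actually stays up, one option is to poll `micctrl -s` at a fixed interval and log what it reports. The sketch below is an assumption-laden illustration rather than an official tool: it relies only on the "micN: state" line format quoted earlier in this thread (for example "mic0: reset failed"), the function names and the default interval are invented for the example, and micctrl may need to be run as root on your system.

```python
import re
import subprocess
import time

# Matches lines such as "mic0: reset failed" or "mic0: lost".
STATUS_LINE = re.compile(r"^(mic\d+):[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def parse_micctrl_status(text):
    """Map each 'micN: <state>' line to {'micN': '<state>'}."""
    return dict(STATUS_LINE.findall(text))


def watch(interval_s=30, samples=10):
    """Print the reported state every interval_s seconds.

    Requires micctrl on PATH; nothing here changes the card's state.
    """
    for _ in range(samples):
        out = subprocess.run(
            ["micctrl", "-s"], capture_output=True, text=True
        ).stdout
        print(time.strftime("%H:%M:%S"), parse_micctrl_status(out))
        time.sleep(interval_s)
```

Calling `watch()` right after `service mpss start` and keeping the timestamps makes it easier to tell whether the card drops out after a roughly constant time, which would point toward heat or power management.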
Hi Belinda !
I can now keep MPSS running for an entire day without any problem, although for the moment I do not put much load on it!
I did not try what you explained last time, because when I stop the MPSS service I cannot restart it without rebooting my computer.
Now that the hardware problems seem to have been resolved, I have a new problem.
When I try to run an MPI program on the Xeon Phi co-processor, I get this message:
mpirun -n 20 -host mic0 /tmp/myprog.mic
pmi_proxy: /bin/pmi_proxy: cannot execute binary file
pmi_proxy: /bin/pmi_proxy: Success
I can run myprog.host on my computer without any problem, but I cannot run myprog.mic on the MIC, either from my PC or from mic0 after an ssh login.
Thank you for your help.
(I am not sure this is the right place for this new question, but I did not find a better one.)
Hi Belinda !
Today I succeeded in running several programs on the Xeon Phi co-processor, but only after logging in to it with ssh.
When I try from the host, I get this error:
mpirun -n 2 -host mic0 /Essais_MPI/myprog.mic
[proxy:0:0@mic0] HYDU_sock_connect (./utils/sock/sock.c:264): unable to connect from "mic0" to "myIPaddress" (No route to host)
[proxy:0:0@mic0] main (./pm/pmiserv/pmip.c:396): unable to connect to server myIPaddress at port 48973 (check for firewalls!)
The port is different each time.
Thanks.
Virginie,
I wonder if there is a problem with the configuration on the coprocessor. When you installed the MPSS, did you configure the network using micctrl or did you edit the mic0.conf directly? If you edited mic0.conf directly, did you use micctrl to push changes out afterward? ssh sets up a tunnel from the host to the coprocessor when it connects, which may be why it works and mpi does not.
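Since the error is about the coprocessor side failing to connect back to the host, a plain TCP reachability check can help separate a routing or firewall problem from an MPI problem. The helper below is just a sketch under assumptions: the function name is made up, the host and port you pass in are yours to choose (mpirun picks a different port each run, so you would start your own listener on a known port), and it has to run somewhere a Python interpreter is available, which this thread does not establish for the coprocessor itself.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Any socket-level failure (no route to host, connection refused,
    timeout, name resolution error) is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

If this returns False for the host's address while ssh from the host to mic0 works, that is consistent with the direction of the connection being the issue rather than the MPI program.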
By the way, it might be easier to make sure your problem doesn't get lost if you start a new thread. Even though your problems are troubleshooting problems, when we scan back through forum posts to see if there are issues that never got addressed, it is easier to find these issues when each thread deals with a separate issue and has its own title.
Frances
Hi Belinda !
Yesterday I was able to run MPI programs in symmetric mode (on both host and co-processor) and directly on mic0 after an ssh login.
But today I have trouble again.
MPSS starts correctly, but 3 minutes later the status of mic0 is "lost" and I cannot reset it.
# micctrl -s
mic0: lost
# service mpss status
mpss is running
# micctrl -rw
mic0: resetting
mic0: reset failed
Even when everything seems to run fine, I am not able to reboot or reset the card with micctrl.
I tried the steps you gave me in your message #26, but they failed.
Hi Frances !
Frances Roth (Intel) wrote:
I wonder if there is a problem with the configuration on the coprocessor. When you installed the MPSS, did you configure the network using micctrl or did you edit the mic0.conf directly? If you edited mic0.conf directly, did you use micctrl to push changes out afterward? ssh sets up a tunnel from the host to the coprocessor when it connects, which may be why it works and mpi does not.
When I installed the MPSS I used micctrl to configure the network.
Frances Roth (Intel) wrote:
By the way, it might be easier to make sure your problem doesn't get lost if you start a new thread. Even though your problems are troubleshooting problems, when we scan back through forum posts to see if there are issues that never got addressed, it is easier to find these issues when each thread deals with a separate issue and has its own title.
OK. Next time I will use a new thread to post.
Thanks.
hi,
I cannot access MPSS 3.2.1 for Windows (update of 10 April 2014):
https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#32rel
Version 3.2.1 for Windows
The link does not seem to be working?
thanks
bertrand
hi
The download works now; it was my mistake.
Sorry for my useless post(!)
regards
bertrand
we've added a Windows troubleshooting flow to this post - for those who are getting Intel Xeon Phi coprocessors working on Microsoft* Windows.
happy computing!
Sorry, I just came to listen and learn from this discussion.
Hi,
I have :
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4.1
Device Serial Number : ADKC32100318
Where do I find the new Firmware ? I have looked and searched for it with little success.
Can you provide a web page with the firmware ?
Regards,
Ole
The firmware is delivered with the MPSS. When you install a new MPSS, one of the instructions in the readme.txt file tells you to update the flash with the micflash command. This will update Flash, SMC Firmware and SMC Boot Loader.
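To confirm that an update really changed something, you can save the version report before and after flashing and compare the two. The snippet below is a minimal sketch that assumes the report uses "Name : value" lines like the ones quoted in the question above; the function names are invented for this example, and it does not run micflash or any other tool itself.

```python
def parse_versions(text):
    """Turn 'Name : value' lines into a dict, ignoring anything else."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def changed_fields(before, after):
    """Return {name: (old, new)} for every field whose value differs."""
    old, new = parse_versions(before), parse_versions(after)
    return {
        name: (old.get(name), new[name])
        for name in new
        if old.get(name) != new[name]
    }
```

An empty result from `changed_fields(...)` after an update would suggest the flash step did not take effect, or that the card already had the versions shipped with that MPSS release.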