
Troubleshooting HOWTO: Bad hardware? MPSS? Configuration?

BelindaLiviero
Employee
4,285 Views

Are you having problems with your hardware (cannot see your Intel(R) Xeon Phi(tm) coprocessor? sporadic accessibility?) or with getting the Intel(R) Manycore Platform Software Stack (Intel(R) MPSS) to run reliably?

Attached to this post are PDF "flowcharts" that explain how to troubleshoot the problem (note: both Linux and Windows flowcharts are available) and show what information to collect if you need to escalate your issue to your OEM provider or Intel.

We hope this is useful to you! Please let us know if you find a boundary condition that this "flow" does not handle properly.

37 Replies
Virginie_Favrat
Beginner

Hi Belinda !

I managed to start the MPSS service and to update the flash with micflash.

The main problem was a thermal one. We installed an additional fan just for the Xeon Phi coprocessor.

But I still get an error message when I try to read the flash version with micflash.

Here it is:

micflash -getversion -device 0
mic0: Flash read started
mic0: Read done
mic0: Version: 2.1.02.0390
mic0: Transitioning to ready state
micflash: mic0: Failed to read post code: read: /sys/class/mic/mic0/post_code: No such device or address

and then :

micctrl -s
mic0: reset failed

Could you help me, please?

 

Edit:

After rebooting, the output of miccheck is:

miccheck
MicCheck 3.2-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass

Status: OK

 

Virginie_Favrat
Beginner

BELINDA L. (Intel) wrote:

the 'lspci -vvv' output on your host shows some weird things for the coprocessor (look for Co-processor in the output).

Here is the new one :

08:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev 20)
        Subsystem: Intel Corporation Device 7d95
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 11
        Region 0: Memory at 380800000000 (64-bit, prefetchable) [size=16G]
        Region 4: Memory at d3200000 (64-bit, non-prefetchable) [size=128K]
        Capabilities: [44] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [4c] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+ 
                Address: 0000000000000000  Data: 0000 
        Capabilities: [98] MSI-X: Enable- Count=16 Masked- 
                Vector table: BAR=4 offset=00017000 
                PBA: BAR=4 offset=00018000 
        Capabilities: [100 v1] Advanced Error Reporting 
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- 
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- 
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- 
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- 
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ 
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- 

 

BELINDA L. (Intel) wrote:

 

I found someone else in this forum who has similar hardware to yours:

Manufacturer: ASUSTeK COMPUTER INC.

    Product Name: P9X79 WS

he uses CentOS (6.4) vs. yours (6.5), using an older MPSS (3.1.x) vs yours (3.2). 

Is it possible to know who he is, so that I can ask him questions if necessary?

BELINDA L. (Intel) wrote:

Let me ask a couple of questions:

   - is this the first time you've installed this coprocessor? (that seems to be the case based on what you've said before)

I tried several times but it is the first time I installed it.

BELINDA L. (Intel) wrote:

   - have you tried plugging the co-processor into any other slot in your system

I tried 3 of the 7 slots. In the end I chose the one in the middle to avoid too much heat.

BELINDA L. (Intel) wrote:

   - did you  change anything in your system's BIOS? (i.e. you need to enable BIOS support for memory mapped I/O address ranges above 4GB? )

I enabled addresses above 4 GB as soon as I installed the new motherboard.

Moreover, I have just increased the speed of the fan for the Xeon Phi coprocessor.

BELINDA L. (Intel) wrote:

    - we may have to look further into the BIOS -- I have some BIOS update files from someone who, like I said before, had his ASUS functioning. I could forward these to you. The version he has working is P9x79-WS-ASUS-4306.CA. What is yours?

I do not know which BIOS version I have. I will tell you later.

Thanks for your help.

As you can read above, I still need help!

BelindaLiviero
Employee

Hi Virginie -

1. What are the results of 'ls /sys/class/mic/mic0'?

2. You indicated that you managed to start mpss. Does the process table show mpssd running (ps auxw | grep mpss)? Were there any errors resulting from the startup (service mpss start)?

3. Can you obtain and send another capture of micdebug.sh (now that you have corrected the thermal issue, hopefully)? We're specifically interested in what the micinfo and dmesg commands say. dmesg or /var/log/messages may have some indication of what is happening. micdebug.sh collects all of this data in one shot.

4. Are you pretty sure that there aren't any lingering thermal issues even after your changes in fan speeds and coprocessor/slot positioning?
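
The sysfs part of the checks above can be scripted. The following is a minimal shell sketch, not part of MPSS: the helper name `mic_sysfs_summary` is hypothetical, and it assumes that attributes such as state, mode, post_code and flashversion under /sys/class/mic/mic0 are small text files readable by the current user.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPSS): print a few sysfs attributes for one card.
# Assumption: each attribute is a small text file under the given directory.
mic_sysfs_summary() {
    dir="${1:-/sys/class/mic/mic0}"
    for attr in state mode post_code flashversion; do
        if value=$(cat "$dir/$attr" 2>/dev/null); then
            printf '%s: %s\n' "$attr" "$value"
        else
            # A failed read is reported rather than hidden.
            printf '%s: <unreadable>\n' "$attr"
        fi
    done
}
```

Calling `mic_sysfs_summary` with no argument reads /sys/class/mic/mic0. A failed read is worth noting in itself, since the micflash error quoted earlier in this thread was a failure to read post_code.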

Virginie_Favrat
Beginner

Hi Belinda !

1.

ls /sys/class/mic/mic0
active_cores      flash_update       memoryvoltage  scif_status
boot_count        flashversion       memsize        serialnumber
cmdline           fuse_config_rev    mode           sku
crash_count       image              model          state
dev               initramfs          pc3_enabled    stepping
device            interface_version  pc6_enabled    stepping_data
extended_family   kernel_cmdline     pc6_timeout    substepping_data
extended_model    log_buf_addr       platform       subsystem
fail_safe_offset  log_buf_len        post_code      uevent
family            meminfo            power          virtblk_file
family_data       memoryfrequency    processor

2. There are no more errors when I start the mpss service.

ps auxw | grep mpss
root      3736  0.0  0.0 194864   932 pts/0    Sl   08:14   0:00 /usr/sbin/mpssd
root      3866  0.0  0.0 105320   912 pts/0    S+   08:21   0:00 grep mpss

3. The output of micdebug.sh is in attachment.

4. I cannot be sure, but yesterday I was logged in to mic0 over ssh for the whole morning without any interruption. It was the first time I managed to run the mpss service for longer than 1 minute. Moreover, I touched the coprocessor many times before shutting down my computer yesterday morning without burning myself, as I did previously every time I booted my computer. It was still hot but not burning. Before, it was burning hot for a few minutes every time, and when the mic status became "reset failed" it cooled down again. I am sending you a picture of the coprocessor and the fan, taken before I installed it.

Thanks and have a nice day.

BelindaLiviero
Employee

Hi Virginie,

if your coprocessor came up and was accessible for a short amount of time, then it's quite possibly a cooling or power management issue.

Do you know whether the ASUS system you are using is qualified to run this coprocessor?  Is this something you can check with your hardware provider?

One thing to try (temporarily -- I would not recommend leaving power management off for extended periods of time) is to do this:

sudo service mpss stop

sudo micctrl --pm=off

sudo micctrl --resetconfig  (shouldn’t be necessary but won’t hurt anything)

sudo service mpss start

 

and see if the coprocessor manages to stay up for a while.   Let me know how that goes.   
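
To see how long the coprocessor stays up, one option is to poll `micctrl -s` and log each state change. This is only a sketch: the function names are hypothetical, and it assumes the output lines look like "mic0: <state>", as in the outputs quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: extract one card's state from `micctrl -s` output on stdin.
# Assumption: each line has the form "mic0: <state>".
mic_state_from_status() {
    card="${1:-mic0}"
    sed -n "s/^[[:space:]]*${card}:[[:space:]]*//p"
}

# Hypothetical polling loop: print a timestamped line whenever the state changes.
# Runs until interrupted (Ctrl-C).
watch_mic() {
    card="${1:-mic0}"; interval="${2:-30}"; last=""
    while :; do
        now=$(micctrl -s | mic_state_from_status "$card")
        if [ "$now" != "$last" ]; then
            printf '%s %s: %s\n' "$(date '+%H:%M:%S')" "$card" "$now"
            last="$now"
        fi
        sleep "$interval"
    done
}
```

For example, `watch_mic mic0 30 | tee mic0-states.log` keeps a record that can be attached to a later post.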

Virginie_Favrat
Beginner

Hi Belinda !

I can now run MPSS for an entire day without any problem, although for the moment I am not giving it much work to do.

I did not try what you explained last time because when I stop the MPSS service I cannot restart it without rebooting my computer.

Now that the hardware problems seem to have been resolved, I have a new problem.

When I try to run an MPI program on the Xeon Phi coprocessor, I get this message:

mpirun -n 20 -host mic0 /tmp/myprog.mic
pmi_proxy: /bin/pmi_proxy: cannot execute binary file
pmi_proxy: /bin/pmi_proxy: Success

I can run myprog.host on my computer without any problem, but I cannot run myprog.mic on the MIC, either from my PC or from mic0 after an ssh login.

Thank you for your help.

(I am not sure this is the right place for this new question, but I did not find a better one.)

Virginie_Favrat
Beginner

Hi Belinda !

Today I succeeded in running several programs on the Xeon Phi coprocessor, but only after connecting to it with ssh.

When I try from the host I get this error:

mpirun -n 2 -host mic0 /Essais_MPI/myprog.mic
[proxy:0:0@mic0] HYDU_sock_connect (./utils/sock/sock.c:264): unable to connect from "mic0" to "myIPaddress" (No route to host)
[proxy:0:0@mic0] main (./pm/pmiserv/pmip.c:396): unable to connect to server myIPaddress at port 48973 (check for firewalls!)

The port is different each time.
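
Since the error says that mic0 cannot reach the host, a first step could be to test the route back from the coprocessor. The sketch below is illustrative only: both function names are hypothetical, the parser assumes the Hydra message format shown above, and `check_route_back` assumes that ssh to the card works and that a `ping` command exists in the coprocessor's image.

```shell
#!/bin/sh
# Hypothetical helper: pull "<server> <port>" out of a Hydra proxy error on stdin.
# Assumption: the message format matches the "unable to connect to server" line above.
hydra_target() {
    sed -n 's/.*unable to connect to server \([^ ]*\) at port \([0-9]*\).*/\1 \2/p'
}

# Hypothetical check: ask the card ($1) to ping the host address ($2) once.
# Assumption: ping is available on the coprocessor.
check_route_back() {
    ssh "$1" "ping -c 1 $2"
}
```

A possible use is `mpirun ... 2>&1 | hydra_target` to capture the address mpirun advertised, followed by `check_route_back mic0 <that address>`. If the ping fails, the address the host advertises is probably not routable from the card.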

Thanks.

Frances_R_Intel
Employee

Virginie,

I wonder if there is a problem with the configuration on the coprocessor. When you installed the MPSS, did you configure the network using micctrl or did you edit the mic0.conf directly? If you edited mic0.conf directly, did you use micctrl to push changes out afterward? ssh sets up a tunnel from the host to the coprocessor when it connects, which may be why it works and mpi does not.

By the way, it might be easier to make sure your problem doesn't get lost if you start a new thread. Even though your problems are troubleshooting problems, when we scan back through forum posts to see if there are issues that never got addressed, it is easier to find these issues when each thread deals with a separate issue and has its own title.

Frances

Virginie_Favrat
Beginner

Hi Belinda !

Yesterday I was able to run MPI programs in symmetric mode (on both host and coprocessor) and, after an ssh login, directly on mic0.

But today I have trouble again.

MPSS starts correctly, but 3 minutes later the status of mic0 is "lost" and I cannot reset it.

# micctrl -s
mic0: lost
# service mpss status
mpss is running
# micctrl -rw
          mic0: resetting
          mic0: reset failed

Even when everything seems to run fine, I am not able to reboot or reset the coprocessor with micctrl.

I tried the steps you gave me in your message #26, but they failed.

Virginie_Favrat
Beginner

Hi Frances !

Frances Roth (Intel) wrote:

I wonder if there is a problem with the configuration on the coprocessor. When you installed the MPSS, did you configure the network using micctrl or did you edit the mic0.conf directly? If you edited mic0.conf directly, did you use micctrl to push changes out afterward? ssh sets up a tunnel from the host to the coprocessor when it connects, which may be why it works and mpi does not.

When I installed the MPSS I used micctrl to configure the network.

Frances Roth (Intel) wrote:

By the way, it might be easier to make sure your problem doesn't get lost if you start a new thread. Even though your problems are troubleshooting problems, when we scan back through forum posts to see if there are issues that never got addressed, it is easier to find these issues when each thread deals with a separate issue and has its own title.

OK. Next time I will use a new thread to post.

Thanks.

bertrand__remy
Beginner

 

 

Hi,

I cannot access MPSS 3.2.1 for Windows (update of 10 April 2014):

https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#32rel

The link for version 3.2.1 for Windows does not seem to work.

Thanks,

Bertrand

 

 

 

remy_b_
Beginner

Hi,

The download works now. It was my mistake.

Sorry for my useless post!

Regards,

Bertrand

BelindaLiviero
Employee

We've added a Windows troubleshooting flow to this post, for those who are getting Intel Xeon Phi coprocessors working on Microsoft* Windows.

Happy computing!

Ole_Saastad
Beginner
This comment has been moved to its own thread
kecoro
Beginner

Sorry, I have just come to listen and learn from this discussion.

Ole_Saastad
Beginner

Hi,

I have :


                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
                uOS Version              : 2.6.38.8+mpss3.4.1
                Device Serial Number     : ADKC32100318

 

Where do I find the new firmware? I have searched for it with little success.

Can you provide a web page with the firmware?

 

Regards,

Ole

 

Frances_R_Intel
Employee

The firmware is delivered with the MPSS. When you install a new MPSS, one of the instructions in the readme.txt file tells you to update the flash with the micflash command. This will update Flash, SMC Firmware and SMC Boot Loader.
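
To confirm that the update took effect, the version fields can be compared before and after running micflash. The helper below is a hypothetical sketch, not an MPSS tool. It assumes micinfo prints lines of the form "Label : value", as in the excerpt posted above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one "Label : value" field read from stdin.
# Assumption: the input is formatted like the micinfo excerpt quoted in this thread.
micinfo_field() {
    # $1 is the exact field label, e.g. "Flash Version"
    sed -n "s/^[[:space:]]*$1[[:space:]]*:[[:space:]]*//p"
}
```

For example, `micinfo | micinfo_field "Flash Version"` and `micinfo | micinfo_field "SMC Firmware Version"` can be run once before and once after the flash update, and the results compared.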
