Troubleshooting HOWTO: Bad hardware? MPSS? Configuration?

BelindaLiviero
Employee
4,608 Views

Are you having problems with your hardware (for example, you cannot see your Intel(R) Xeon Phi(TM) coprocessor, or it is only sporadically accessible) or with getting the Intel(R) Manycore Platform Software Stack (Intel(R) MPSS) to run reliably?

Attached to this post are PDF "flowcharts" that explain how to troubleshoot the problem (note: both Linux and Windows flowcharts are available) and show what information to collect if you need to escalate your issue to your OEM provider or Intel.

We hope this is useful to you! Please let us know if you find a boundary condition that this "flow" does not cover properly.

0 Kudos
37 Replies
Paul_C_7
Beginner
2,945 Views

I have experienced a problem that looks like a bug in the 64-bit memory-to-stack push instruction.

I am porting the Glasgow Pascal compiler to the MIC and have run into an error that looks very much like a bad implementation of PUSH.

It appears that a push instruction of the form:

 push QWORD[   r8* 8+ label]

actually pushes the quadword at

 push QWORD[   r8* 8+ label140ba08d9aadf+8]

Here are the relevant source lines, along with the assembler lines that they translate into.

First we have a call to a run-time library function written in C, using C parameter passing.

;writeln( shiftindex[d,0]);
; note that shiftindex is declared as array[0..4,0..1] of integer
 mov  rcx,         5                          ; field width info
 mov  rdx,         12                         ; field width info
 mov bl,BYTE  ptr  [  rbp+         -49]
 movsx r8,     bl
 imul   r8,        8
 movsx rsi,  dword   ptr  [ r8+ label140ba08d9aadf] ; the parameter for the value to be printed
 movsx rdi,  dword   ptr  [  unit$system$base+         -24] ; the file it will be sent to
.ifndef definedprintint
definedprintint=1
.extern                  printint
.endif
 call printint;#imported
;--------
; this correctly prints out the 0th element of the row of the array shiftindex

 

 

Now we call a Pascal function, passing a row of the array by value, using a push instruction to place the row on the stack.

 

;compareImagePair (shiftindex)
; d is a byte
; #297
 mov bl,BYTE  ptr  [  rbp+         -49]
 movsx r8,     bl
 push QWORD[   r8* 8+ label140ba08d9aadf]
 call label140ba08d9abe3
; this passes to the function the d+1 th element of the array shiftindex;
; in other words the push instruction fetches the wrong element from the array
; as compared to the mov instruction used earlier

 

Printout from programme:

First, the contents of the shiftindex array:

           0           0
           0          -1
           0           1
          -1           0
           1           0

           d          shiftindex[d,0]
           3          -1

What we get inside the function compareImagePair when we print the parameter:

dirvec =           1           0

0 Kudos
Paul_C_7
Beginner
2,945 Views

I have now concluded that this is a bug in the assembler distributed with the MIC. If you replace the line

 push QWORD[   r8* 8+ label140ba08d9aadf]

with

 push QWORD ptr [   r8* 8+ label140ba08d9aadf]

it fetches the correct value, not the value 8 bytes past the correct address.
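
To make the 8-byte offset concrete, here is a minimal Python sketch of the address arithmetic. It is only a model, not the assembler's behaviour: it assumes each row of shiftindex occupies 8 bytes (two 4-byte integers, as the r8*8 scaling suggests), and the names SHIFTINDEX, ROW_BYTES and row_at are hypothetical helpers invented for illustration. The array contents are copied from the printout earlier in the thread.

```python
# Model of the 5x2 integer array shown in the printout.
# Assumption: each row is two 4-byte integers, i.e. 8 bytes per row.
SHIFTINDEX = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]
ROW_BYTES = 8


def row_at(byte_offset):
    """Return the row that starts at this byte offset from the array label."""
    return SHIFTINDEX[byte_offset // ROW_BYTES]


d = 3
intended = row_at(d * ROW_BYTES)      # the address the source line asks for
observed = row_at(d * ROW_BYTES + 8)  # the address 8 bytes further on

print(intended)  # (-1, 0)
print(observed)  # (1, 0)
```

With d = 3, the intended row is (-1, 0), which matches the value printed by writeln, while the row 8 bytes further on is (1, 0), which matches the dirvec value seen inside compareImagePair.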

0 Kudos
James_C_Intel2
Employee
2,945 Views

Thanks for isolating this bug. It has been reported to the team that owns the assembler.

0 Kudos
BelindaLiviero
Employee
2,945 Views

It appears that the x86_64 assembler does the same thing.

The error is that "QWORD ptr" must be used here, as Paul realized. The fact that QWORD alone is accepted may be a bug, which we need to discuss internally. What happens if AT&T syntax is used?

0 Kudos
Yue_H_
Beginner
2,945 Views

I am trying to install a Xeon Phi card in a Supermicro server (http://www.supermicro.com/products/superblade/module/sbi-7127rg.cfm). According to the flowchart, I need to "Enable support for mapping >4GB MMIO in the host BIOS". However, I cannot see the MMIO setup option in the BIOS, even after upgrading to the latest version. Could anyone please give me some suggestions?

0 Kudos
Matthias_H_Intel
Employee
2,945 Views

Hi Yue,

From a quick look at your link, it appears to be an older server that might not support Xeon Phi at all. You could check with Supermicro whether this server can host a Xeon Phi.

0 Kudos
Matthias_H_Intel
Employee
2,945 Views

Otherwise, it may be that your BIOS enables this option by default. Is your card detected at all?

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,945 Views

Belinda,

I've installed MPSS for Windows on a Windows 7 Pro x64 system. I can get the two Xeon Phi 5510P cards up and running: firmware updates work, the cards boot, micinfo shows both cards as ready, and I can ping both cards. MicSmc-gui.exe shows both cards idling along, ...

I can compile the first project I selected in the tutorials coi folder, "hello_world". The project compiles and runs fine up to the point where it wants to launch the native-side app hello_world_sink_mic, which is not built by the solution (it is a separate project).

I launched an Intel Parallel Studio XE 2013 command prompt for use with Visual Studio 2012, navigated to the demo folder (under "C:\Program Files\...") and issued:

icl -Qmic hello_world_sink.cpp -o hello_world_sink_mic

I receive an error stating that stdio.h cannot be found, and to check the MPSS environment variables.

If I remove -Qmic (not what I want, as this compiles a host app), I get an error writing the .obj file (because the folder is under "C:\Program Files\...").

If I copy the MPSS folder elsewhere (not under a protected folder):

the compile with -Qmic fails with stdio.h not found;

the compile without -Qmic succeeds.

In other words, -Qmic expects a different set of environment variables (with respect to INCLUDE).

How do I properly set the environment variable(s) for compiling the coprocessor side (-Qmic) of the demo programs under Windows?

Jim Dempsey

0 Kudos
Alex_R_
Beginner
2,945 Views

We installed the 3120A card in a Windows 7 box, and the card is blinking blue. We installed MPSS 3.1.2, but the card is not displayed in Device Manager. The "micctrl -s" command results in this error:

Error manipulating coprocessor: Intel(R) Xeon Phi(TM) coprocessor driver is not loaded or you have insufficient access

0 Kudos
BelindaLiviero
Employee
2,945 Views

Hi Alex, can you check the following? (This is based on similar forum posts reported earlier this month.)

  1. Physically inspect the card installation: is the card inserted properly, and are all power connectors on the card plugged in properly?

  2. If you are working with a NUMA machine where some of the PCI slots are enabled and others disabled, make sure the coprocessor is installed in an enabled PCI slot.

0 Kudos
Alex_R_
Beginner
2,945 Views

Thank you, Belinda! Switching to another PCI slot worked!

0 Kudos
Jess
Beginner
2,945 Views

Thanks for this.  Really helpful.  Unfortunately the problem we are seeing is the driver crashing when the machine boots.  I've attached the stacktrace from the logs.  This is from the latest MPSS on RHEL 6.

 

0 Kudos
Jess
Beginner
2,945 Views

The other problem we are having is with NFS-exporting GPFS shares. Since the GPFS drivers and client software do not support MIC, we NFS-export the drives from each host to its MICs. This is very unreliable: the MICs sometimes fail to mount the drives, citing "stale NFS filehandle" as the cause, which is untrue. It seems related to the order of the mounts in /etc/fstab, as the first one will mount and the second won't.

Ideally we'd like GPFS binaries for MIC, as this is a kludge anyway.  In the current state we can't really say to users that the systems are ready to use.

(We'd also really like MPSS to support OFED 2.x, since that is what the rest of the machine is using.  Only the nodes with MICs in are on 1.5.x, and that's entirely due to needing it to support the IPoIB software provided with MPSS.)

0 Kudos
BelindaLiviero
Employee
2,945 Views

Hi Zaniyah, 

Is your MIC stack trace from a consistently failing coprocessor? And can you send the tarball that gets created by the micdebug.sh script?

As for GPFS, are you in a position to ask IBM about their plans to support GPFS on Intel Xeon Phi coprocessors? You can even tell them that there is now a Lustre client (it was recently released; we'll provide a write-up on that soon).

I will make sure to pass on your comments about wanting OFED 2.x support.

 

0 Kudos
Virginie_Favrat
Beginner
2,945 Views

Hi Belinda,

I have installed mpss-3.2 for my first use of Xeon Phi, but I cannot tell which flash version is installed, and I cannot update it.

Nor can I start the mpss service.

Here is the output of several commands:

sudo micinfo
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.

Created Thu Mar 27 08:18:47 2014


	System Info
		HOST OS			: Linux
		OS Version		: 2.6.32-431.5.1.el6.x86_64
		Driver Version		: 3.2-1
		MPSS Version		: 3.2
		Host Physical Memory	: 65918 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : NotAvailable
		SMC Firmware Version	 : NotAvailable
		SMC Boot Loader Version	 : NotAvailable
		uOS Version 		 : NotAvailable
		Device Serial Number 	 : NotAvailable
...
sudo micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: No valid image found
micsmc
DEBUG: ***** MicSettings(parent)::fileName():  "/home/vivi/.config/Intel Corp/MicSmcGUI.ini" 
DEBUG: ***** SessionSettings(parent)::fileName():  "/home/vivi/.config/Intel Corp/MicSmcGUI.ini" 
Warning: mic0: Connection to the device lost!
Info: mic0: Connection to the device restored.
Warning: mic0: Connection to the device lost!
Info: mic0: Connection to the device restored.
Warning: mic0: Connection to the device lost!
sudo service mpss start
Starting Intel(R) MPSS:                                    [FAILED]
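
For anyone comparing logs, the micinfo output above reports every version field as NotAvailable. The following Python sketch pulls such fields out of a saved micinfo log. It assumes only the "name : value" layout visible in the output above; the function name unavailable_fields and the sample text are made up for illustration and are not part of MPSS.

```python
def unavailable_fields(micinfo_text):
    """Return the names of fields whose value is exactly NotAvailable."""
    fields = []
    for line in micinfo_text.splitlines():
        # Split on the first colon only; lines without a colon are skipped.
        name, sep, value = line.partition(":")
        if sep and value.strip() == "NotAvailable":
            fields.append(name.strip())
    return fields


# Shortened sample in the same layout as the micinfo output in this post.
sample = """
    System Info
        Driver Version          : 3.2-1
        MPSS Version            : 3.2
    Version
        Flash Version           : NotAvailable
        SMC Firmware Version    : NotAvailable
        uOS Version             : NotAvailable
"""

print(unavailable_fields(sample))
```

If every field under Version comes back in this list, the host tools are probably not able to talk to the card at all, which is consistent with the "connection lost" warnings from micsmc.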

Could you help me find out what the problem is?

Thanks in advance.

Virginie

0 Kudos
Virginie_Favrat
Beginner
2,945 Views

I thought the problem might be caused by the kernel version.

So I rebooted with the original kernel version and reinstalled MPSS. But here is the result of micctrl --initdefaults:

micctrl --initdefaults
micctrl(segv_handler+0x18) [0x4070c8]
/lib64/libpthread.so.0() [0x34cb40f710]
/usr/lib64/libmpssconfig.so.0.0.1(_add_miclist_not_present+0xb8) [0x7f5c84a35b98]
/usr/lib64/libmpssconfig.so.0.0.1(mpss_get_miclist+0x4d) [0x7f5c84a35e7d]
micctrl(create_miclist+0x1cd) [0x42123d]
micctrl(parse_config_args+0x370) [0x40db60]
micctrl(main+0x236) [0x40df56]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x34cb01ed1d]
micctrl() [0x406d29]

 

0 Kudos
BelindaLiviero
Employee
2,946 Views

Hi Virginie, can you send us the output of /usr/bin/micdebug.sh (just attach the tarball to this thread).   That would be most helpful.

0 Kudos
Virginie_Favrat
Beginner
2,946 Views

Hi Belinda !

I am sending you the latest one; I also have two others (made on Tuesday and Wednesday) if you want them.

The commands I usually run, as root:

  1. lspci | grep proc
  2. setenforce 0
  3. modprobe mic
  4. service mpss start

Thanks in advance.

0 Kudos
BelindaLiviero
Employee
2,758 Views

Hi Virginie,

The 'lspci -vvv' output on your host shows something unusual for the coprocessor (look for Co-processor in the output).

Here is what it shows for you:

 

------

04:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: mic

------

Here is what it should normally show: (as an example)

-----

01:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
    Subsystem: Intel Corporation Device 2500
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 32
    Region 0: Memory at 380c00000000 (64-bit, prefetchable) [size=8G]
    Region 4: Memory at fb700000 (64-bit, non-prefetchable) [size=128K]
    Capabilities: [44] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [4c] Express (v2) Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
        Vector table: BAR=4 offset=00017000
        PBA: BAR=4 offset=00018000
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Kernel driver in use: mic

----
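
The difference between the two listings can be checked mechanically. Below is a rough Python sketch that scans saved 'lspci -vvv' text for Co-processor entries showing "(rev ff)" or an "Unknown header type" line, the two symptoms visible in the first listing. These values usually mean the host is reading all ones from the device's configuration space, but treat that interpretation as an assumption rather than a diagnosis. The function name find_unreadable_coprocessors is hypothetical and the sample is abbreviated from the two listings above.

```python
import re


def find_unreadable_coprocessors(lspci_text):
    """Return PCI addresses of Co-processor entries that look unreadable."""
    bad = []
    current = None  # address of the Co-processor entry being scanned, if any
    for line in lspci_text.splitlines():
        match = re.match(r"^([0-9a-fA-F:.]+)\s+Co-processor:", line)
        if match:
            current = match.group(1)
            if "(rev ff)" in line:
                bad.append(current)
            continue
        if line and not line[0].isspace():
            # An unindented line starts a different device entry.
            current = None
        elif current and "Unknown header type" in line and current not in bad:
            bad.append(current)
    return bad


sample = """04:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: mic
01:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
    Subsystem: Intel Corporation Device 2500
    Kernel driver in use: mic
"""

print(find_unreadable_coprocessors(sample))  # ['04:00.0']
```

Only the 04:00.0 entry is flagged; the healthy 01:00.0 entry, with a normal revision and a full capability list, is not.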

 

 

I found someone else in this forum who has hardware similar to yours:

Manufacturer: ASUSTeK COMPUTER INC.
    Product Name: P9X79 WS

He uses CentOS 6.4 (you have 6.5) and an older MPSS, 3.1.x (you have 3.2).

Let me ask a couple of questions:

   - Is this the first time you've installed this coprocessor? (That seems to be the case, based on what you've said before.)

   - Have you tried plugging the coprocessor into any other slot in your system?

   - Did you change anything in your system's BIOS? (For example, did you enable BIOS support for memory-mapped I/O address ranges above 4 GB?)

   - We may have to look further into the BIOS. I have some BIOS update files from the person I mentioned above, who got his ASUS board working, and I could forward these to you. The version he has working is P9x79-WS-ASUS-4306.CA. What is yours?

 

0 Kudos