
Device visible in OS until an attempt to load the kernel module (or for a while)

lejeczek
Beginner
1,032 Views

lspci -vv -s 05:00.0
05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
    Subsystem: Intel Corporation Device 2500
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 0
    Region 0: Memory at <unassigned> (64-bit, prefetchable) [disabled] [size=8G]
    Region 4: Memory at dff00000 (64-bit, non-prefetchable) [disabled] [size=128K]
    Capabilities: [44] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [4c] Express (v2) Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [98] MSI-X: Enable- Count=16 Masked-
        Vector table: BAR=4 offset=00017000
        PBA: BAR=4 offset=00018000
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
 

After a while (after system boot), the device quietly disappears from the OS.

It also disappears when the kernel module is not loaded at boot time but is loaded manually, with this feedback:

[  155.158118] mic: module verification failed: signature and/or required key missing - tainting kernel
[  155.170604] vnet: mode: dma, buffers: 62
[  155.170651] mic 0000:05:00.0: enabling device (0140 -> 0142)
[  155.170797] mic 0: failed to reserve aperture space
[  155.170838] mic: No MIC boards present.  SCIF available in loopback mode

Any help is greatly appreciated.

5 Replies
lejeczek
Beginner

and then:

lspci -vv -s 05:00.0
05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev ff) (prog-if ff)
    !!! Unknown header type 7f

Frances_R_Intel
Employee

There are two types of problems people have been having with their new 31S1 cards: finding a suitable system board and overheating. I think you may be experiencing both.

The lines:

Region 0: Memory at <unassigned> (64-bit, prefetchable) [disabled] [size=8G]
Region 4: Memory at dff00000 (64-bit, non-prefetchable) [disabled]

generally mean that the BIOS settings are wrong. The first thing to check is that the BIOS is set to allow large (>4G) BAR addresses. If you can't change the BAR size in the BIOS you need a new BIOS (or a different board). It is possible there are other problems as well. You might want to follow the discussion in https://software.intel.com/en-us/forums/topic/538897 in which another user is trying to solve this issue.
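For anyone who wants to check for this symptom automatically, here is a minimal sketch that scans `lspci -vv` text for memory regions reported as `<unassigned>` or `[disabled]`. The function name `find_bar_problems` and the regular expression are illustrative assumptions, not part of any Intel tool, and the sample text is copied from the output posted above.

```python
import re

# Matches lspci -vv lines such as:
#   Region 0: Memory at <unassigned> (64-bit, prefetchable) [disabled] [size=8G]
REGION_RE = re.compile(
    r"Region (?P<index>\d+): Memory at (?P<addr>\S+) "
    r"\((?P<attrs>[^)]*)\)(?P<rest>.*)"
)


def find_bar_problems(lspci_text):
    """Return (region index, problem) tuples for suspicious memory regions.

    Hypothetical helper: it only looks at the text lspci printed.
    """
    problems = []
    for line in lspci_text.splitlines():
        match = REGION_RE.search(line)
        if not match:
            continue
        index = int(match.group("index"))
        if match.group("addr") == "<unassigned>":
            # No address was assigned to this BAR at all.
            problems.append((index, "unassigned"))
        elif "[disabled]" in match.group("rest"):
            # An address exists, but memory decoding is switched off.
            problems.append((index, "disabled"))
    return problems


SAMPLE = """\
    Region 0: Memory at <unassigned> (64-bit, prefetchable) [disabled] [size=8G]
    Region 4: Memory at dff00000 (64-bit, non-prefetchable) [disabled] [size=128K]
"""

if __name__ == "__main__":
    for index, problem in find_bar_problems(SAMPLE):
        print(f"Region {index}: {problem}")
```

In practice you could feed it the output of `lspci -vv -s 05:00.0` captured to a file. An unassigned 8G region is the case that points at the large-BAR BIOS setting.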

The card disappearing after the system is on for a while often means that the card is overheating. The Intel Xeon Phi coprocessor 31S1P is passively cooled. Some smaller systems do not have sufficient cooling for these cards. You might want to follow the discussion in https://software.intel.com/en-us/forums/topic/537661 where the cooling issues are being discussed.
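The second lspci output in this thread (`rev ff`, `Unknown header type 7f`) is what lspci typically prints when reads of the device's configuration space come back as all ones, which usually means the card has stopped responding. A rough sketch of a check for that pattern follows; the function name `device_dropped_off_bus` is a made-up helper, and the check is a heuristic on lspci's text, not a diagnosis of the cause.

```python
def device_dropped_off_bus(lspci_text):
    """Heuristic: True if the lspci text looks like all-ones config space.

    Hypothetical helper; it cannot tell overheating from other failures.
    """
    markers = ("(rev ff)", "Unknown header type 7f")
    return any(marker in lspci_text for marker in markers)


# Sample lines copied from the outputs posted in this thread.
DROPPED = """\
05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev ff) (prog-if ff)
    !!! Unknown header type 7f
"""
HEALTHY = "05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)"

if __name__ == "__main__":
    print(device_dropped_off_bus(DROPPED))
    print(device_dropped_off_bus(HEALTHY))
```

Running a check like this periodically while watching the card temperature may help correlate the disappearance with heat.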

Let us know how it goes.

lejeczek
Beginner

Many thanks, Frances.

I'm not an engineer, but from what I gather LBAR is intrinsic to x86_64, a native feature/mechanism. Before I knock on AMD's door, would you comment on the claims that this is not the case with AMD's latest x86 CPUs, i.e. that it simply does not work there? Just a comment; I would not expect a statement.

On the thermal subject: even though it all sits in a Supermicro server case, I'll try to give the Phis more air and will share my findings.

 

Frances_R_Intel
Employee

Pawel,

As for the possibility of overheating, there is a program called micsmc that runs on the host, either as a GUI or from the command line. It lets you monitor the temperature. I should have mentioned this before. If you think your system has enough cooling, you might want to bring up the GUI and just watch the temperature for a while. There is a man page that shows all the options.

JJK
New Contributor III

"LBAR" is a PCI feature and is not Intel-specific; it means that 64-bit PCI registers are allowed (and available). Some motherboards only support 32-bit PCI registers (which is OK from a PCI point of view).

I've got a Supermicro server (X9DRG-HF motherboard) with two Xeon Phis in it, and I had to turn up the fans (using a BIOS setting) to stop the Phis from overheating. The fans in the Supermicro board now blow continuously at high speed, and the Phis (5110Ps) stay at a nice cool 40 degrees C.

HTH,

JJK
