Phi 7120P initialization issues and invalid revision ff

Nathan_S_
Novice
813 Views

I am working to get an Intel 7120P up and running in an HP Z800 workstation with a 1110 W power supply. The board is plugged into a PCIe x16 Gen 2 slot and connected to one of the power supply's 6-pin connectors. Because the power supply has no 8-pin connectors, we are using a pair of IDE hard drive power cables joined in an 8-pin adapter. All of this is running an (unfortunately) out-of-date Fedora Core 15 distribution with kernel 2.6.43.8-1.fc15.x86_64.

At first when booted, the board shows up via lspci as:

$ lspci -vvv
0f:00.0 Co-processor: Intel Corporation Device 225c (rev 20)
        Subsystem: Intel Corporation Device 7d95
        Physical Slot: 2
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 255
  ...

After about 6 minutes, lspci changes:

$ lspci -vvv
0f:00.0 Co-processor: Intel Corporation Device 225c (rev ff) (prog-if ff)
        !!! Unknown header type 7f
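
A revision of ff together with header type 7f is what lspci prints when every byte it reads from the device's configuration space comes back as 0xFF, which is typical of a device that has stopped responding on the bus. As a minimal sketch (not part of the original post), the check below reads the raw configuration space from sysfs. The path assumes PCI domain 0000 for the 0f:00.0 address shown above, and the function names are illustrative only.

```python
from pathlib import Path

# Assumed sysfs location for the card at 0f:00.0; the "0000" domain is a guess.
CONFIG_PATH = Path("/sys/bus/pci/devices/0000:0f:00.0/config")


def looks_dropped(config: bytes) -> bool:
    """Return True if the first 16 header bytes all read back as 0xFF."""
    header = config[:16]
    return len(header) == 16 and all(b == 0xFF for b in header)


def describe(config: bytes) -> str:
    """Summarise vendor, device and revision, or report a dead device."""
    if looks_dropped(config):
        return "device not responding (config space reads as all ones)"
    vendor = int.from_bytes(config[0:2], "little")   # offset 0x00
    device = int.from_bytes(config[2:4], "little")   # offset 0x02
    revision = config[8]                             # offset 0x08
    return f"vendor {vendor:04x} device {device:04x} rev {revision:02x}"


if __name__ == "__main__":
    if CONFIG_PATH.exists():
        print(describe(CONFIG_PATH.read_bytes()))
    else:
        print(f"{CONFIG_PATH} not found; adjust the address for your system")
```

Running this periodically (for example once a minute after boot) would show roughly when the card stops answering, which may help separate a thermal shutdown from a configuration problem.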

My first thought is that this is thermal in nature, but we have a large fan at high revs pointing down the board and an external fan blowing over the whole card.  We've never gotten as far as getting the mpss service started. The power supply should also be sufficient.

Has anyone experienced the same or a similar issue in the past? Does anyone have any suggestions on how to resolve it?

 

0 Kudos
6 Replies
JJK
New Contributor III

The line

Interrupt: pin A routed to IRQ 255

could point to the root cause: IRQ 255 actually means 'NOT SET', i.e. the card is never assigned an IRQ by the motherboard or kernel, and hence the card "disappears" after a while. Try playing with the 'pci=....' and/or 'acpi=...' kernel options.
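
To make that check repeatable across reboots with different kernel options, a small sketch like the following could scan saved `lspci -vvv` output for the routed IRQ. It relies only on the line format shown in the original post; the function names are made up for illustration, and treating 255 as "unassigned" follows the explanation in this reply.

```python
import re
from typing import Optional

# Matches lines such as "Interrupt: pin A routed to IRQ 255".
IRQ_PATTERN = re.compile(r"Interrupt: pin (\w) routed to IRQ (\d+)")


def routed_irq(lspci_text: str) -> Optional[int]:
    """Return the first routed IRQ number found, or None if no line matches."""
    match = IRQ_PATTERN.search(lspci_text)
    return int(match.group(2)) if match else None


def irq_is_unassigned(lspci_text: str) -> bool:
    """Treat IRQ 255 as 'not set', as described in the reply above."""
    return routed_irq(lspci_text) == 255


if __name__ == "__main__":
    sample = "        Interrupt: pin A routed to IRQ 255"
    print(routed_irq(sample), irq_is_unassigned(sample))
```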

 

Frances_R_Intel
Employee

I am not sure why you are not getting a valid IRQ assigned. My first suspicion was that it might be a power issue but I asked around among those who know more than I. Their suspicion is that you are, indeed, having thermal issues. Other users have tried using a passively cooled card in a workstation like this. It has almost invariably gone badly, even with a large fan blowing air over the card. The air needs to be directed into the card and over the heat sink fins.

A second issue is that the system you are using most likely does not have a BIOS that supports the amount of MMIO mapping required for the card. There is no workaround for that.

I would recommend that you check the BIOS to ensure it can support the card. (See section 2.6 in the readme.txt for either MPSS 3.3 or 3.4.) If it does not, you cannot use the card with that system. Even if the BIOS is acceptable, I would still recommend that you use a different system rather than trying to play around with the cooling. Given that you cannot get the card far enough up to run micsmc or miccheck, you have no idea how hot the card is getting. You could permanently damage the card while you play around with trying to guide the air where it needs to be. Even if you do somehow manage to work around the cooling issue, there is still no guarantee that there is not also a power issue.

I'm sorry to be so negative, but I have watched people try to fit a card into a system that isn't suitable, and I would rather not see you suffer through that.

Nathan_S_
Novice

It does appear that there are three problems. The first is heat. With additional fans, I can get the card to keep responding longer, but not indefinitely, even at idle. The second is that the system was not assigning it an IRQ. That was resolved by using the pci=noacpi kernel option at boot. Finally, the board was not mapping the memory. This suggests that even though HP says the Z800 supports more than 4 GB of memory-mapped I/O (MMIO), and the host system has 96 GB of RAM (if that even makes a difference), it doesn't support enough MMIO to map the 7120P's 16 GB of RAM (I'm not sure whether it has to map all of it, but it is not successfully mapping what's required).
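
When experimenting with boot options like this, it is easy to lose track of which ones the running kernel actually received. The following is a minimal sketch that lists the values of any `pci=` parameter on the kernel command line; `/proc/cmdline` is the standard Linux location, while the helper name is just for illustration.

```python
from pathlib import Path
from typing import List


def pci_options(cmdline: str) -> List[str]:
    """Return the comma-separated values of every pci= parameter."""
    options: List[str] = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(token[len("pci="):].split(","))
    return options


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")
    if cmdline_path.exists():
        active = pci_options(cmdline_path.read_text())
        print("pci=noacpi active:", "noacpi" in active)
    else:
        print("/proc/cmdline not available on this system")
```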

A neighbor of ours has a server that they put the board into, which resolved all three problems at once. Our next step is figuring out where to go from here. Clearly the Z800/7120P is not a combination that is going to work for us.

Nathan_S_
Novice

We have now also tried using an Intel Phi 3120A in the system, since it provides active cooling, thus resolving the thermal issues, and also has only 6 GB of RAM. This board is assigned an IRQ, but the Z800 system cannot map its memory. In fact, the system reports that it cannot map the 16 GB memory region, despite the fact that this board has only 6 GB. The other change is that we're now testing under CentOS 7.

Frances_R_Intel
Employee

CentOS 7 should not be an issue. I regularly use a system running RHEL 7.0 with no problem and I don't believe there is a significant difference between the two as far as the Intel Xeon Phi coprocessor is concerned.

As to what the mapped memory is for, it is not to map the memory on the card but to support the PCIe interface/driver. The message about not being able to map the 16 GB memory region sounds almost like the total mapped memory for all the devices in your system has exceeded the amount of memory available for mapping. If this is the case, you might want to contact HP, or whoever provides support for your host system, and see if this limit can be increased. If you would like us to look at this some more, it would be helpful to have the output from /usr/bin/micdebug.sh, which goes through your system, grabbing various log and configuration files and tarring them up. Look through the output it produces and, if it is ok with you, attach it to a private message to me. (The Send Author a Message link.)
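
One way to see how much address space a device asks for, independent of how much RAM it carries, is to read its `resource` file in sysfs, where each line holds a start address, an end address, and flags in hexadecimal. The sketch below sums the region sizes for one device. The sample values are invented for illustration (a 16 GiB region chosen to echo the message in this thread, not taken from a real card), and the function names are hypothetical.

```python
from typing import List, Tuple


def region_sizes(resource_text: str) -> List[Tuple[int, int]]:
    """Return (line index, size in bytes) for each non-empty resource line."""
    sizes = []
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0 and flags == 0:
            continue  # all-zero lines are skipped as unused
        sizes.append((index, end - start + 1))
    return sizes


def total_mmio(resource_text: str) -> int:
    """Sum the sizes of all listed regions for one device."""
    return sum(size for _, size in region_sizes(resource_text))


if __name__ == "__main__":
    # Invented sample in the sysfs "resource" layout: start, end, flags.
    sample = (
        "0x0000000800000000 0x0000000bffffffff 0x000000000014220c\n"
        "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
        "0x00000000fbd00000 0x00000000fbd1ffff 0x0000000000140204\n"
    )
    for index, size in region_sizes(sample):
        print(f"region {index}: {size / 2**20:.3f} MiB")
    print(f"total: {total_mmio(sample) / 2**30:.3f} GiB")
```

Applying `total_mmio` to the `resource` file of every device under `/sys/bus/pci/devices/` would give a rough picture of how much mapping space the whole system is requesting, which is the quantity a BIOS limit would constrain.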

Nathan_S_
Novice

HP has now come back and said that the Z800 does not support the Intel Phi 3120A for two reasons: insufficient power in the stock 800 W power supply, and a BIOS that cannot support the memory-mapped I/O needs of the card.
