N6000/PL-1 SmartNIC image deployment error

Dang_Tran__Frederic

Hello,

I’ve installed an Intel N6000/1-PL SmartNIC on a Lenovo SR650v2 server with the following stack:

  • N6000 SKU1
  • CentOS Stream release 8
  • OPAE v2.1.1
  • kernel 5.15.92-dfl

Server BIOS settings: card tested on two slots (1 and 7) with PCIe bifurcation set to x8x8. Fan speed set to maximum.

The server BIOS reports the following warning:

PCIe error recovery has occurred in slot number 1. The adapter may not work correctly.

And dmesg contains:

[22638.864360] intel-m10bmc-sec-update n6000bmc-sec-update.3.auto: SDM trigger failure: 4
[22638.877250] dfl-pci 0000:c5:00.1: enabling device (0140 -> 0142)
[22638.877568] dfl-pci 0000:c5:00.1: PCIE AER unavailable -5.
[22638.890287] dfl-pci 0000:c5:00.2: enabling device (0140 -> 0142)
[22638.890607] dfl-pci 0000:c5:00.2: PCIE AER unavailable -5.
[22638.904091] dfl-pci 0000:c5:00.3: enabling device (0140 -> 0142)
[22638.904377] dfl-pci 0000:c5:00.3: PCIE AER unavailable -5.
[22638.916944] dfl-pci 0000:c5:00.4: enabling device (0140 -> 0142)
[22638.917231] dfl-pci 0000:c5:00.4: PCIE AER unavailable -5.

Trying to deploy an image results in the error included below.
Otherwise, the PCIe inventory and the fpgainfo command seem to work fine, as shown below.

Any help would be appreciated. Is this a hardware problem, an on-card BMC problem, or a software problem?

 

fpgasupdate --log-level debug ofs_top_page1_pacsign_user1.bin 0000:C5:00.0
[2024-01-29 05:07:27.46] [DEBUG ] fw file: ofs_top_page1_pacsign_user1.bin
[2024-01-29 05:07:27.46] [DEBUG ] addr: 0000:C5:00.0
[2024-01-29 05:07:27.46] [DEBUG ] hash256: b'e026976389252b8a746943f351e8f149e5f0415f620cd1e0618229eb79e01bb8'
[2024-01-29 05:07:27.46] [DEBUG ] hash384: b'bb04ea12557ce23f2cb75685669d794fb6a06bf7b590430aa8bfdb4c765c6e15ecdb38200e1599aa8a7e52a2958e20db'
[2024-01-29 05:07:27.46] [DEBUG ] file type: Static Region (Update)
[2024-01-29 05:07:27.47] [DEBUG ] found device at 0000:c5:00.3 -tree is
[pci_address(0000:c2:04.0), pci_id(0x8086, 0x347c)] (pcieport)
[pci_address(0000:c5:00.3), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.1), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.4), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.2), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.0), pci_id(0x8086, 0xbcce)] (dfl-pci)

[2024-01-29 05:07:27.47] [DEBUG ] found device at 0000:c5:00.1 -tree is
[pci_address(0000:c2:04.0), pci_id(0x8086, 0x347c)] (pcieport)
[pci_address(0000:c5:00.3), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.1), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.4), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.2), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.0), pci_id(0x8086, 0xbcce)] (dfl-pci)

[2024-01-29 05:07:27.47] [DEBUG ] found device at 0000:c5:00.0 -tree is
[pci_address(0000:c2:04.0), pci_id(0x8086, 0x347c)] (pcieport)
[pci_address(0000:c5:00.3), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.1), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.4), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.2), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.0), pci_id(0x8086, 0xbcce)] (dfl-pci)

[2024-01-29 05:07:27.47] [DEBUG ] found device at 0000:c5:00.4 -tree is
[pci_address(0000:c2:04.0), pci_id(0x8086, 0x347c)] (pcieport)
[pci_address(0000:c5:00.3), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.1), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.4), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.2), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.0), pci_id(0x8086, 0xbcce)] (dfl-pci)

[2024-01-29 05:07:27.47] [DEBUG ] found device at 0000:c5:00.2 -tree is
[pci_address(0000:c2:04.0), pci_id(0x8086, 0x347c)] (pcieport)
[pci_address(0000:c5:00.3), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.1), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.4), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.2), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.0), pci_id(0x8086, 0xbcce)] (dfl-pci)

[2024-01-29 05:07:27.47] [DEBUG ] found device at 0000:c5:00.0 -tree is
[pci_address(0000:c2:04.0), pci_id(0x8086, 0x347c)] (pcieport)
[pci_address(0000:c5:00.3), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.1), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.4), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.2), pci_id(0x8086, 0xbcce)] (dfl-pci)
[pci_address(0000:c5:00.0), pci_id(0x8086, 0xbcce)] (dfl-pci)

[2024-01-29 05:07:27.48] [DEBUG ] could not find: "/sys/class/fpga_region/region0/dfl-fme.0/dfl*.*/*spi*/spi_master/spi*/spi*"
[2024-01-29 05:07:27.48] [DEBUG ] could not find: "/sys/class/fpga_region/region0/dfl-fme.0/dfl*.*/spi_master/spi*/spi*"
[2024-01-29 05:07:27.48] [DEBUG ] could not find: "/sys/class/fpga_region/region0/dfl-fme.0/spi*/spi_master/spi*/spi*"
[2024-01-29 05:07:27.48] [DEBUG ] could not find: "/sys/class/fpga_region/region0/dfl-fme.0/dfl_dev.4/n6000bmc-sec-update.3.auto/*fpga_sec_mgr*/*fpga_sec*"
[2024-01-29 05:07:27.48] [DEBUG ] could not find: "/sys/class/fpga_region/region0/dfl-fme.0/dfl_dev.4/n6000bmc-sec-update.3.auto/fpga_image_load/fpga_image*"
Traceback (most recent call last):
  File "/usr/bin/fpgasupdate", line 33, in <module>
    sys.exit(load_entry_point('opae.admin===1.4.1-', 'console_scripts', 'fpgasupdate')())
  File "/usr/lib/python3.6/site-packages/opae/admin/tools/fpgasupdate.py", line 789, in main
    if pac.upload_dev.find_one(os.path.join('update', 'filename')):
AttributeError: 'NoneType' object has no attribute 'find_one'

 

lspci -vt

| +-02.0-[c3-c4]--+-00.0 Intel Corporation Ethernet Controller E810-C for backplane
| |               +-00.1 Intel Corporation Ethernet Controller E810-C for backplane
| |               +-00.2 Intel Corporation Ethernet Controller E810-C for backplane
| |               +-00.3 Intel Corporation Ethernet Controller E810-C for backplane
| |               +-00.4 Intel Corporation Ethernet Controller E810-C for backplane
| |               +-00.5 Intel Corporation Ethernet Controller E810-C for backplane
| |               +-00.6 Intel Corporation Ethernet Controller E810-C for backplane
| |               \-00.7 Intel Corporation Ethernet Controller E810-C for backplane
| \-04.0-[c5]--+-00.0 Intel Corporation Device bcce
|              +-00.1 Intel Corporation Device bcce
|              +-00.2 Intel Corporation Device bcce
|              +-00.3 Intel Corporation Device bcce
|              \-00.4 Intel Corporation Device bcce


fpgainfo fme
Intel Acceleration Development Platform N6001
Board Management Controller NIOS FW version: 3.14.0
Board Management Controller Build version: 3.14.0
//****** FME ******//
Object Id : 0xEF00000
PCIe s:b:d.f : 0000:C5:00.0
Vendor Id : 0x8086
Device Id : 0xBCCE
SubVendor Id : 0x8086
SubDevice Id : 0x1771
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x5010202FAB46E6A
Bitstream Version : 5.0.1
Pr Interface Id : 00bc56cf-9e1f-5bf0-8011-48736ec862c9
Boot Page : user1
Factory Image Info : 801148736ec862c900bc56cf9e1f5bf0
User1 Image Info : 801148736ec862c900bc56cf9e1f5bf0
User2 Image Info : 801148736ec862c900bc56cf9e1f5bf0

 

 

khtan
Employee

Hi Frederic, 

Sorry for the delay in replying to your post. I have a few questions:

1) Does the FPGA card work (e.g. running an AFU test or your own program) after you see the error and have run the commands you listed (lspci, fpgainfo fme)?

2) Did you do any flashing on the FPGA card before rebooting?

3) Does the issue happen only once, or do you see the warning "PCIe error recovery has occurred in slot number 1. The adapter may not work correctly." on every reboot?

 

It might be related to this dmesg line: intel-m10bmc-sec-update n6000bmc-sec-update.3.auto: SDM trigger failure: 4
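For readers triaging a similar log, here is a minimal sketch that pulls the two message types quoted in this thread out of a saved dmesg dump. The regular expressions are derived only from the lines posted above, and `scan_kernel_log` is a hypothetical helper name, not part of OPAE or the kernel tooling. It does not interpret the numeric codes.

```python
import re

# Patterns derived from the dmesg excerpt posted in this thread (assumption:
# other kernel builds may word these messages differently).
SDM_FAILURE = re.compile(r"intel-m10bmc-sec-update (\S+): SDM trigger failure: (\d+)")
AER_UNAVAILABLE = re.compile(r"dfl-pci (\S+): PCIE AER unavailable (-?\d+)")


def scan_kernel_log(text):
    """Return (sdm_failures, aer_warnings) as lists of (device, code) tuples."""
    sdm = [(m.group(1), int(m.group(2))) for m in SDM_FAILURE.finditer(text)]
    aer = [(m.group(1), int(m.group(2))) for m in AER_UNAVAILABLE.finditer(text)]
    return sdm, aer


# Sample input copied from the original post.
sample = """\
[22638.864360] intel-m10bmc-sec-update n6000bmc-sec-update.3.auto: SDM trigger failure: 4
[22638.877250] dfl-pci 0000:c5:00.1: enabling device (0140 -> 0142)
[22638.877568] dfl-pci 0000:c5:00.1: PCIE AER unavailable -5.
"""

sdm, aer = scan_kernel_log(sample)
print("SDM trigger failures:", sdm)
print("AER unavailable:", aer)
```

In practice the text would come from something like the output of `dmesg` saved to a file, rather than the inline sample used here.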

 

If you are flashing SDM firmware, our engineering database notes the following:

Downloading the SDM provisioning firmware requires a power cycle (this is an SDM requirement).

Once the SDM provisioning firmware download and key provisioning are done, a power cycle is needed.

 

Thanks

Regards

Kian

Dang_Tran__Frederic

Hi Kian,

1) It's a chicken-and-egg problem: since I cannot deploy any image on the board, I haven't been able to test it with any program (my end goal is to use the Intel P4 SDK with this card).

2) The only thing I flashed on the card is a more recent BMC firmware (using a USB/JTAG cable). The initial version was 3.1; I upgraded it to 3.14, but to no avail:

Board Management Controller NIOS FW version: 3.14.0
Board Management Controller Build version: 3.14.0

3) The problem occurs systematically, after any number of (cold) reboots.

I'm not aware of the SDM firmware. Is it distinct from the BMC firmware?


Regards

Frederic

 

khtan
Employee

Hi Frederic , 

Thanks for the reply. So basically the FPGA board does not have any image on it yet, other than the BMC firmware on the MAX 10.

I was trying to find which version is associated with Pr Interface Id : 00bc56cf-9e1f-5bf0-8011-48736ec862c9

 

Anyway, I discussed this issue with my colleague here, and we should focus on why fpgasupdate fails with missing files. My thinking was that, because the card is non-functional without a valid image, it triggers the SDM (Secure Device Manager) to try to reconfigure the FPGA, which fails. The SDM firmware is separate from the BMC firmware but has some interface with it.
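To check the "missing files" part independently of fpgasupdate, a small sketch like the one below can test the same sysfs glob patterns that appear in the "could not find" lines of the debug log. The patterns are copied from that log and are specific to that system (region0, dfl_dev.4, n6000bmc-sec-update.3.auto); `find_update_nodes` is a hypothetical helper, and this is not how fpgasupdate itself is implemented. If every pattern comes back empty, that would be consistent with the `'NoneType' object has no attribute 'find_one'` traceback, where no upload device was found.

```python
import glob
import os

# Glob patterns copied from the "could not find" lines of the fpgasupdate
# debug log in this thread, relative to the FME directory. Instance numbers
# (dfl_dev.4, .3.auto) are specific to that machine and may differ elsewhere.
PATTERNS = [
    "dfl*.*/*spi*/spi_master/spi*/spi*",
    "dfl*.*/spi_master/spi*/spi*",
    "spi*/spi_master/spi*/spi*",
    "dfl_dev.4/n6000bmc-sec-update.3.auto/*fpga_sec_mgr*/*fpga_sec*",
    "dfl_dev.4/n6000bmc-sec-update.3.auto/fpga_image_load/fpga_image*",
]


def find_update_nodes(fme_dir, patterns=PATTERNS):
    """Return {pattern: sorted list of matching paths} under fme_dir."""
    return {p: sorted(glob.glob(os.path.join(fme_dir, p))) for p in patterns}


if __name__ == "__main__":
    # FME path as it appears in the log; adjust for other systems.
    fme = "/sys/class/fpga_region/region0/dfl-fme.0"
    for pattern, matches in find_update_nodes(fme).items():
        status = "found" if matches else "missing"
        print(f"{status:8s} {pattern} {matches}")
```

Running it on a working and a failing kernel and comparing the output would show which sysfs node differs between the two.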

 

Do you know which OFS version you installed on your system (I only saw that OPAE is 2.1.1; the DFL version is unknown, except that you are running kernel 5.15.92), and also which Quartus version is installed?

Could you try using Quartus to program/flash the FPGA and see whether the FPGA is working?

 

Thanks

Regards

Kian

 

Dang_Tran__Frederic

Hi Kian,

Regarding the OFS version: I did not use an OFS installer script. I compiled the kernel from this branch of the linux-dfl project:

git clone https://github.com/OPAE/linux-dfl.git -b fpga-ofs-dev-5.15-lts

The Quartus version is 22.1.0 Build 174 03/30/2022 SC Pro Edition.

My knowledge of Quartus (and low-level FPGA programming) being limited, I'm afraid I won't be able to program the card using Quartus unless a ready-to-use project is available.

 

Regards,

Frederic

 

khtan
Employee

Hi Frederic,

Sorry for the delay in replying; I have been setting up a server on my end to test the configuration.

Would you mind providing the file that you tried to flash with this command: "fpgasupdate --log-level debug ofs_top_page1_pacsign_user1.bin 0000:C5:00.0"?

I will try it on my end and see whether I get the same result.

 

Thanks

Regards

Kian

Dang_Tran__Frederic

Hi Kian,

Please find the image attached.

Regards,

Frederic

khtan
Employee

Hi Frederic , 

I've set up a similar system running the same OPAE and DFL versions as yours and tried the fpgasupdate command. I see the same error as you, so I will debug this on my end.

The error is not related to the bin file you provided; I got a similar result using the release bin.

Is upgrading the OFS version (both OPAE and DFL) possible for you? If yes, let me try out the new version first.

 

Thanks

Regards

Kian

Dang_Tran__Frederic

Hi Kian,

Upgrading the OPAE and DFL versions is not possible because I need to stick to versions compatible with the P4 toolchain.
Actually, the linux-dfl version I've used so far (5.15.92-dfl) has a patch revision (92) more recent than the one I was advised to use.

To be safe, I downgraded the kernel to 5.15.45-dfl, reinstalled OPAE 2.1.1 and ... the problem went away! fpgasupdate now completes successfully.
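For anyone hitting the same symptom, a quick sanity check is to compare the running kernel release with the two builds mentioned in this thread. The sketch below is only illustrative: the `KNOWN_GOOD` and `KNOWN_BAD` sets and the `classify` helper are hypothetical names, and the sets contain only what was reported here, not an official compatibility list.

```python
import platform

# Kernel releases as reported in this thread only (assumption: the release
# string of a locally built linux-dfl kernel ends in "-dfl" as shown here).
KNOWN_GOOD = {"5.15.45-dfl"}  # fpgasupdate completed successfully
KNOWN_BAD = {"5.15.92-dfl"}   # fpgasupdate failed with the AttributeError


def classify(release):
    """Classify a kernel release string against the reports in this thread."""
    if release in KNOWN_GOOD:
        return "reported working"
    if release in KNOWN_BAD:
        return "reported failing"
    return "unknown"


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    print(f"{release}: {classify(release)}")
```

An "unknown" result says nothing either way; it only means the release was not one of the two discussed here.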

 

Regards,

Frederic

khtan
Employee

Hi Frederic , 

That's good to hear. I upgraded DFL and OPAE to the OFS 2023.3-2 release, and fpgasupdate works there as well. I had been trying to find the missing files reported in the logs but couldn't find them; previously I was on 5.15.92-dfl as well. There is probably some issue with that particular version, as I remember testing an OFS 2022 version more than a year ago and it worked.

 

Anyway, thanks for the update. Does this error still pop up?

Quote:

"The server BIOS reports the following warning:

PCIe error recovery has occurred in slot number 1. The adapter may not work correctly.

And dmesg contains:

[22638.864360] intel-m10bmc-sec-update n6000bmc-sec-update.3.auto: SDM trigger failure: 4"

 

Thanks

Regards

Kian

Dang_Tran__Frederic

Hi Kian,

The BIOS warning is gone as well as the kernel error message related to the N6000.

Regards,

Frederic

khtan
Employee

Hi Frederic,

Thanks for the info. Is there anything else I can help you with? Otherwise, I would like to close the forum case as resolved and transition it to community support.

 

Thanks

Regards

Kian

Dang_Tran__Frederic

Hi Kian,

As far as I'm concerned, the problem is solved and the case can be closed.

Regards,

Frederic
