
Issues with updating Arria10 PAC for AFU

jyoung
Novice

Hello,

Platform info: Arria 10 GX PAC

Host System: Ubuntu 18.04 ( 4.15.0 kernel), Xeon Gold 6226R CPU dual-socket server

We have two Arria10 PAC cards on which we are trying to run the AFU Getting Started examples (UG 20166), but we first need to update the cards to the latest 1.2.1 firmware. On one of the cards, the fpgaotsu update finished correctly, but super-rsu fails with the following error:

 

sudo super-rsu --log-level trace /usr/share/opae/a10-gx-pac/super-rsu/base/rsu-09c4.json
[2020-09-02 21:29:41,652] [DEBUG   ] [MainThread  ] - found fpga objects: ['/sys/class/fpga/intel-fpga-dev.0']
[2020-09-02 21:29:41,653] [DEBUG   ] [MainThread  ] - found device at 0000:89:00.0 -tree is
 [pci_address(0000:85:00.0), pci_id(0x8086, 0x2030)]
    [pci_address(0000:86:00.0), pci_id(0x10b5, 0x8747)]
        [pci_address(0000:87:08.0), pci_id(0x10b5, 0x8747)]
        [pci_address(0000:87:10.0), pci_id(0x10b5, 0x8747)]
            [pci_address(0000:89:00.0), pci_id(0x8086, 0x09c4)]

[2020-09-02 21:29:41,654] [DEBUG   ] [MainThread  ] - could not find: "/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/ifpga_sec_mgr/ifpga_sec*"
[2020-09-02 21:29:41,654] [WARNING ] [MainThread  ] - [0000:89:00.0] does not support secure update
[2020-09-02 21:29:41,654] [ERROR   ] [MainThread  ] - missing one or more items required by rsu config
[2020-09-02 21:29:41,654] [INFO    ] [MainThread  ] - super-rsu exiting with code '78'

############ FME info ################
fpgainfo fme
Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** FME ******//
Object Id                     : 0xEE00000
PCIe s:b:d:f                  : 0000:89:00:0
Device Id                     : 0x09C4
Socket Id                     : 0x00
Ports Num                     : 01
Bitstream Id                  : 0x123000200000185
Bitstream Version             : 1.2.3
Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6
Boot Page                     : user

 

It seems possible that we are missing some driver here, but I'm not sure what the next thing to check might be.  Does anyone have any suggestions?

1 Solution
jyoung
Novice

I was finally able to update this using super-rsu after completely shutting off power to the server (cold reboot):

[]$ super-rsu --log-level trace /usr/share/opae/a10-gx-pac/super-rsu/base/rsu-09c4.json
[2020-12-28 16:33:37,086] [DEBUG   ] [MainThread  ] - found fpga objects: ['/sys/class/fpga/intel-fpga-dev.0']
[2020-12-28 16:33:37,088] [DEBUG   ] [MainThread  ] - found device at 0000:3d:00.0 -tree is
 [pci_address(0000:3a:00.0), pci_id(0x8086, 0x2030)]
    [pci_address(0000:3b:00.0), pci_id(0x10b5, 0x8747)]
        [pci_address(0000:3c:08.0), pci_id(0x10b5, 0x8747)]
            [pci_address(0000:3d:00.0), pci_id(0x8086, 0x09c4)]
        [pci_address(0000:3c:10.0), pci_id(0x10b5, 0x8747)]
            [pci_address(0000:3e:00.0), pci_id(0x198a, 0x385c)]

[2020-12-28 16:33:37,096] [WARNING ] [MainThread  ] - Update starting. Please do not interrupt.
[2020-12-28 16:33:37,097] [DEBUG   ] [MainThread  ] - [3d:00.0] version (0x0124000200000367) up to date for sr
[2020-12-28 16:33:37,098] [DEBUG   ] [MainThread  ] - bmc_fw is being force flashed
[2020-12-28 16:33:37,098] [DEBUG   ] [MainThread  ] - bmc_fw versions not equal (system:0x0000000000026889 != manifest:0x0000000000026895)
[2020-12-28 16:33:37,098] [DEBUG   ] [MainThread  ] - bmc_fw versions not equal (system:0x0000000000026889 != manifest:0x0000000000026895)
[2020-12-28 16:33:37,099] [DEBUG   ] [MainThread  ] - [3d:00.0] update timeout set to: 1200.0
[2020-12-28 16:33:37,099] [DEBUG   ] [3d:00.0     ] - update of board at [pci_address(0000:3d:00.0), pci_id(0x8086, 0x09c4)] started
[2020-12-28 16:33:37,099] [DEBUG   ] [MainThread  ] - max timeout set to: 0:20:00
[2020-12-28 16:33:37,100] [DEBUG   ] [3d:00.0     ] - starting task: fpgasupdate /usr/share/opae/a10-gx-pac/super-rsu/base/a10sa4_bootloader-26895-fw_Release.bin 0000:3d:00.0
[2020-12-28 16:33:37,222] [WARNING ] Update starting. Please do not interrupt.
[2020-12-28 16:33:37,223] [INFO    ] updating from file /usr/share/opae/a10-gx-pac/super-rsu/base/a10sa4_bootloader-26895-fw_Release.bin with size 38016
[2020-12-28 16:33:37,331] [INFO    ] writing to staging area
[2020-12-28 16:34:36,173] [DEBUG   ] [MainThread  ] - waiting (0:19:00.927721) for threads: 3d:00.0
[2020-12-28 16:34:36,674] [DEBUG   ] [MainThread  ] - waiting (0:19:00.426487) for threads: 3d:00.0
(100%) [____________________] [38016/38016 bytes][Time:0:01:34.404933]
[2020-12-28 16:35:11,747] [INFO    ] applying update to 0000:3d:00.0
(100%) [____________________][Time:0:00:08.010363]
[2020-12-28 16:35:19,757] [INFO    ] update of 0000:3d:00.0 complete
[2020-12-28 16:35:19,758] [INFO    ] Secure update OK
[2020-12-28 16:35:19,758] [INFO    ] Total time: 0:01:42.536032
[2020-12-28 16:35:19,809] [DEBUG   ] [3d:00.0     ] - task completed in 0:01:42.707920
[2020-12-28 16:35:19,809] [DEBUG   ] [3d:00.0     ] - starting task: fpgasupdate /usr/share/opae/a10-gx-pac/super-rsu/base/a10sa4-26895-fw_Release.bin 0000:3d:00.0
[2020-12-28 16:35:19,932] [WARNING ] Update starting. Please do not interrupt.
[2020-12-28 16:35:19,934] [INFO    ] updating from file /usr/share/opae/a10-gx-pac/super-rsu/base/a10sa4-26895-fw_Release.bin with size 244864
[2020-12-28 16:35:20,039] [INFO    ] writing to staging area
(100%) [____________________] [244864/244864 bytes][Time:0:00:01.575939]
[2020-12-28 16:35:21,626] [INFO    ] applying update to 0000:3d:00.0
[2020-12-28 16:35:36,247] [DEBUG   ] [MainThread  ] - waiting (0:18:00.853465) for threads: 3d:00.0
[2020-12-28 16:35:36,748] [DEBUG   ] [MainThread  ] - waiting (0:18:00.352268) for threads: 3d:00.0
(100%) [____________________][Time:0:00:43.055355]
[2020-12-28 16:36:04,681] [INFO    ] update of 0000:3d:00.0 complete
[2020-12-28 16:36:04,682] [INFO    ] Secure update OK
[2020-12-28 16:36:04,682] [INFO    ] Total time: 0:00:44.749368
[2020-12-28 16:36:04,702] [DEBUG   ] [3d:00.0     ] - task completed in 0:00:44.892688
[2020-12-28 16:36:05,283] [INFO    ] [MainThread  ] - 1 board updated. A power-cycle is required.
[2020-12-28 16:36:05,284] [INFO    ] [MainThread  ] - super_rsu.pyc completed in: 0:02:28.187105
[2020-12-28 16:36:05,284] [INFO    ] [MainThread  ] - super-rsu exiting with code '0'

#Check the fme with fpgainfo to make sure it is updated
[]$ fpgainfo fme
Board Management Controller, microcontroller FW version 26895
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** FME ******//
Object Id                     : 0xEB00000
PCIe s:b:d:f                  : 0000:3D:00:0
Device Id                     : 0x09C4
Socket Id                     : 0x00
Ports Num                     : 01
Bitstream Id                  : 0x124000200000367
Bitstream Version             : 1.2.4
Pr Interface Id               : 38d782e3-b612-5343-b934-2433e348ac4c
Boot Page                     : user

 

I'm not totally sure why the fpgaotsu command originally failed and then finally completed. My best guess is that either a slightly different kernel minor version or a cold reboot (powering the server off completely) helped to reinitialize the FPGA state and the devices under /sys/. Warm reboots (a normal restart without removing power) may cause some weirdness with the FPGA device initialization, which is why I recommend cold reboots (turning off power completely for 20-30 seconds).

For the original issue with super-rsu my suggested solution is the following:

1) Follow the instructions to run fpgaotsu on pg. 40 of the AFU Quick Start Guide. If it fails, power the server off completely for ~30 seconds (cold reboot), power it back on, initialize the AFU devstack (`. /opt/inteldevstack/init_env.sh`), and rerun the command. Repeat until it succeeds.

2) Once fpgaotsu completes, perform a cold reboot again. This step should probably replace Step 2 on pg. 41, which says "2. Power cycle the server." and can be read as either a warm or a cold reboot.

3) Check that the ifpga_sec_mgr module is loaded correctly and that the ifpga_sec_mgr device exists. If the device does not exist, try a cold reboot, and check again each time after initializing the AFU devstack.

`ls /sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/ifpga_sec_mgr/
ifpga_sec0`

4) If this device exists, then the super-rsu command should complete successfully (or at least fail elsewhere).
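
Steps 3 and 4 can be scripted as a pre-flight check before running super-rsu. This is a minimal sketch that assumes the sysfs layout shown in the logs in this thread; the function name `sec_mgr_ready` and the variable `FME_DIR` are my own and are not part of OPAE.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPAE): succeeds if at least one
# ifpga_sec* entry exists under the given FME sysfs directory.
sec_mgr_ready() {
    for sec_entry in "$1"/ifpga_sec_mgr/ifpga_sec*; do
        # If the glob matched nothing, sec_entry holds the literal
        # pattern and the -e test fails.
        [ -e "$sec_entry" ] && return 0
    done
    return 1
}

# Path taken from the super-rsu log; adjust the indices for your system.
FME_DIR=/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0

if sec_mgr_ready "$FME_DIR"; then
    echo "ifpga_sec device found: super-rsu should get past this check"
else
    echo "ifpga_sec device missing: cold reboot, then check again"
fi
```

The check mirrors the `ifpga_sec_mgr/ifpga_sec*` pattern that super-rsu reports as "could not find" in the failing runs, so it should fail in the same situations.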

12 Replies
JohnT_Intel
Employee

Hi,


Could you try removing one of the boards and make sure that the board firmware is able to update to 1.2.1 correctly?


jyoung
Novice

Hi John,

I should clarify that we have 2x Arria10 PAC cards, but they are in two separate servers, both running the same version of Linux. I'm not sure that removing one card would change anything here since there is only one card in each server. Is there something else I could try, such as using the 1.2.0 firmware with an older stack, or the latest GitHub release if one is available?

For more information - one card finishes the initial fpgaotsu update but fails at the super-rsu command (as in this post). The second card in a separate server won't even finish the initial fpgaotsu update process. I tried posting on the OPAE Github for this second issue (link here) but have not heard back yet.

[]$ sudo fpgaotsu /usr/share/opae/a10-gx-pac/fpgaotsu/base/otsu-09C4.json
[2020-09-02 21:12:14,385] [INFO    ] [MainThread] Intel PAC with Intel Arria 10 GX FPGA 0000:01:00.0 is not secure.
[2020-09-02 21:12:14,393] [WARNING ] [MainThread] Update starting. Please do not interrupt.
[2020-09-02 21:12:14,394] [INFO    ] [0000:01:00.0] Updating Intel PAC with Intel Arria 10 GX FPGA : 0000:01:00.0
[2020-09-02 21:12:14,396] [INFO    ] [0000:01:00.0] Erasing flash@0x7fd0000 for 196608 bytes
[2020-09-02 21:12:14,591] [INFO    ] [0000:01:00.0] Writing flash@0x7ff0000 for 104 bytes (RushCreek_Release_key_blk2)
(100%) [████████████████████] [104/104 bytes][Time:0:00:00.000068]              
[2020-09-02 21:12:14,593] [INFO    ] [0000:01:00.0] Reading flash@0x7ff0000 for 104 bytes for verification
(100%) [████████████████████] [104/104 bytes][Time:0:00:00.006021]              
[2020-09-02 21:12:14,600] [INFO    ] [0000:01:00.0] Verified flash@0x7ff0000 for 104 bytes (RushCreek_Release_key_blk2)
[2020-09-02 21:12:14,600] [INFO    ] [0000:01:00.0] Erasing flash@0x1800000 for 41943040 bytes
[2020-09-02 21:12:54,742] [INFO    ] [0000:01:00.0] Writing flash@0x1800000 for 41943040 bytes (dcp_1_2_1_rot_reversed.rpd)
(100%) [████████████████████] [41943040/41943040 bytes][Time:0:01:06.238681]    
[2020-09-02 21:14:01,045] [INFO    ] [0000:01:00.0] Reading flash@0x1800000 for 41943040 bytes for verification
(100%) [████████████████████] [41943040/41943040 bytes][Time:0:00:58.330970]    
[2020-09-02 21:14:59,404] [INFO    ] [MainThread] Total time: 0:02:45.010901
[2020-09-02 21:14:59,405] [ERROR   ] [MainThread] Intel PAC with Intel Arria 10 GX FPGA 0000:01:00.0: Verification of dcp_1_2_1_rot_reversed.rpd @0x25165824 failed
[2020-09-02 21:14:59,405] [ERROR   ] [MainThread] One-Time Secure Update failed
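
One detail in the error above: the offset printed as `@0x25165824` looks like a decimal value with a hex prefix. Decimal 25165824 equals 0x1800000, which is the flash offset of the region being written in the log. This reading is my assumption from the numbers, not something confirmed from the tool's source. A quick check:

```shell
# Convert the decimal offset from the error message to hex.
printf '0x%x\n' 25165824            # prints 0x1800000

# Region size from the log, for comparison: 41943040 bytes in MiB.
echo $((41943040 / 1024 / 1024))    # prints 40
```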

 

 

jyoung
Novice

Hi John,

As I mentioned above, we only have one card in the machine, and it is running the listed supported OS, Ubuntu 18.04.

Do you have any other suggestions we could try? For example, should we just use the 1.2.0 acceleration stack until Intel releases a better fix? Alternatively, can we use the 2.0.1 stack for the D5005 PAC to see if that works appropriately?

Thanks,

Jeff

JohnT_Intel
Employee

Hi,


Have you run "systemctl stop pacd.service" before performing the update? After that, run only "sudo fpgaotsu /usr/share/opae/a10*/fpgaotsu/base/otsu-09C4.json".


Could you provide the "sudo fpgainfo fme" output for both systems?


EricMunYew_C_Intel
Moderator

Hi, Young


Can you point me to the document and page you followed for installation?


Thanks.


Eric






jyoung
Novice

Hi @EricMunYew_C_Intel - we are using the AFU Getting Started Guide (UG 20166). I'm not sure if the process has changed with recent OPAE announcements in November, to be honest.

 

@JohnT_Intel- apologies for the long delay. The student working on this claimed 1 of the boards in "server3" was working ok for his designs. Server2 we've never been able to get working.

Server2 still can't be upgraded to a newer firmware version.

[server2]$sudo fpgainfo fme
Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: External reset
Power-on-reset
//****** FME ******//
Object Id                     : 0xEC00000
PCIe s:b:d:f                  : 0000:01:00:0
Device Id                     : 0x09C4
Socket Id                     : 0x00
Ports Num                     : 01
Bitstream Id                  : 0x123000200000185
Bitstream Version             : 1.2.3
Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6
Boot Page                     : user

 

Server3 shows the correct bitstream version

[server3]$sudo fpgainfo fme
Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: External reset
Power-on-reset
//****** FME ******//
Object Id                     : 0xEC00000
PCIe s:b:d:f                  : 0000:89:00:0
Device Id                     : 0x09C4
Socket Id                     : 0x00
Ports Num                     : 01
Bitstream Id                  : 0x124000200000367
Bitstream Version             : 1.2.4
Pr Interface Id               : 38d782e3-b612-5343-b934-2433e348ac4c
Boot Page                     : user

 

jyoung
Novice

I was reviewing my last post and noticed that the FW version on our "working" server, server3, was not yet updated to 26895. I reran the OTSU and super-rsu process, and the BMC firmware now reports 26895:

[server3$]fpgainfo fme
Board Management Controller, microcontroller FW version 26895
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** FME ******//
Object Id : 0xEC00000
PCIe s:b:d:f : 0000:89:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x124000200000367
Bitstream Version : 1.2.4
Pr Interface Id : 38d782e3-b612-5343-b934-2433e348ac4c
Boot Page : user
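
To compare the two servers without reading the full output by eye, the relevant fields can be pulled out of `fpgainfo fme`. This is a minimal sketch; `fme_field` and `bmc_fw_version` are hypothetical helper names, and the parsing assumes the `Name : value` layout shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one "Name : value" field
# from `fpgainfo fme` output read on stdin.
fme_field() {
    awk -v name="$1" '
    {
        p = index($0, " : ")            # first " : " separates name and value
        if (p == 0) next                # skip lines without a separator
        key = substr($0, 1, p - 1)
        sub(/[ \t]+$/, "", key)         # drop column padding after the name
        if (key == name) { print substr($0, p + 3); exit }
    }'
}

# The BMC firmware version sits on a line without a separator,
# so it is matched on its own and the last word is printed.
bmc_fw_version() {
    awk '/microcontroller FW version/ { print $NF; exit }'
}

# Sample lines copied from the server3 output above.
sample='Board Management Controller, microcontroller FW version 26895
Bitstream Id : 0x124000200000367
Bitstream Version : 1.2.4'

printf '%s\n' "$sample" | fme_field "Bitstream Version"   # prints 1.2.4
printf '%s\n' "$sample" | bmc_fw_version                  # prints 26895
```

On a real system the same helpers would be fed from the tool directly, for example `sudo fpgainfo fme | fme_field "Bitstream Version"`.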

 

However, the non-working server still won't run the OTSU update. I have also checked that the pacd service is not running (it's not enabled currently on either server). I've attached the debug output for the OTSU update command.

server2:~# systemctl stop pacd
Failed to stop pacd.service: Unit pacd.service not loaded.
root@server2:~# systemctl status pacd.service
Unit pacd.service could not be found.
root@server2:~# fpgaotsu --log-level debug /usr/share/opae/a10*/fpgaotsu/base/otsu-09C4.json
[2020-11-24 13:22:04,809] [DEBUG   ] [MainThread] found fpga objects: ['/sys/class/fpga/intel-fpga-dev.0']
[2020-11-24 13:22:04,810] [DEBUG   ] [MainThread] found device at 0000:01:00.0 -tree is
 [pci_address(0000:00:1c.0), pci_id(0x8086, 0xa190)]
    [pci_address(0000:01:00.0), pci_id(0x8086, 0x09c4)]

[2020-11-24 13:22:04,810] [DEBUG   ] [MainThread] could not find: "/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/spi*/spi_master/spi*/spi*"
[2020-11-24 13:22:04,810] [DEBUG   ] [MainThread] could not find: "/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/ifpga_sec_mgr/ifpga_sec*"
[2020-11-24 13:22:04,810] [INFO    ] [MainThread] Intel PAC with Intel Arria 10 GX FPGA 0000:01:00.0 is not secure.
[2020-11-24 13:22:04,819] [WARNING ] [MainThread] Update starting. Please do not interrupt.
[2020-11-24 13:22:04,820] [INFO    ] [0000:01:00.0] Updating Intel PAC with Intel Arria 10 GX FPGA : 0000:01:00.0
[2020-11-24 13:22:04,820] [DEBUG   ] [0000:01:00.0] could not find: "/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/spi*/spi_master/spi*/spi*"
[2020-11-24 13:22:04,822] [INFO    ] [0000:01:00.0] Erasing flash@0x7fd0000 for 196608 bytes
[2020-11-24 13:22:04,822] [DEBUG   ] [0000:01:00.0] erasing 196608 bytes starting at 0x07fd0000
[2020-11-24 13:22:05,023] [INFO    ] [0000:01:00.0] Writing flash@0x7ff0000 for 104 bytes (RushCreek_Release_key_blk2)
[2020-11-24 13:22:05,023] [DEBUG   ] [0000:01:00.0] (100%) [____________________] [104/104 bytes][Time:0:00:00.000055]

[2020-11-24 13:22:05,025] [INFO    ] [0000:01:00.0] Reading flash@0x7ff0000 for 104 bytes for verification
[2020-11-24 13:22:05,025] [DEBUG   ] [0000:01:00.0] copying 104 bytes from mtd device: /dev/mtd0
[2020-11-24 13:22:05,031] [DEBUG   ] [0000:01:00.0] (100%) [____________________] [104/104 bytes][Time:0:00:00.005958]

[2020-11-24 13:22:05,032] [INFO    ] [0000:01:00.0] Verified flash@0x7ff0000 for 104 bytes (RushCreek_Release_key_blk2)
[2020-11-24 13:22:05,032] [INFO    ] [0000:01:00.0] Erasing flash@0x1800000 for 41943040 bytes
[2020-11-24 13:22:05,032] [DEBUG   ] [0000:01:00.0] erasing 41943040 bytes starting at 0x01800000
[2020-11-24 13:22:15,753] [ERROR   ] [0000:01:00.0] [Errno 110] Connection timed out
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/opae/admin/tools/fpgaotsu.py", line 509, in process_flash_item
    self.erase(flash, mtd_dev)
  File "/usr/lib/python2.7/dist-packages/opae/admin/tools/fpgaotsu.py", line 317, in erase
    (erase_end + 1) - erase_start)
  File "/usr/lib/python2.7/dist-packages/opae/admin/utils/mtd.py", line 149, in erase
    fcntl.ioctl(fp.fileno(), self.IOCTL_MTD_MEMERASE, iodata)
IOError: [Errno 110] Connection timed out
[2020-11-24 13:22:15,764] [INFO    ] [MainThread] Total time: 0:00:10.944948
[2020-11-24 13:22:15,765] [ERROR   ] [MainThread] Intel PAC with Intel Arria 10 GX FPGA 0000:01:00.0: [Errno 110] Connection timed out
[2020-11-24 13:22:15,765] [ERROR   ] [MainThread] One-Time Secure Update failed

 

EricMunYew_C_Intel
Moderator

Have you tried on Ubuntu 16.04, kernel version 4.4 ?


Do you have 48 GB of free memory ?


Did you manually install OPAE ?


Which document did you follow for upgrading to 1.2.1?


Are you running on a virtual machine or on bare metal?


Can you check if your server is in the list below:

https://www.intel.com/content/www/us/en/programmable/products/boards_and_kits/dev-kits/altera/acceleration-card-arria-10-gx/buy.html


Did you get the BDF (bus, device, function) of your Intel PAC card correctly during the upgrade? For example:

lspci | grep 09c4

Output:

d8:00.0 Processing accelerators: Intel Corporation Device 09c4

./setup_fim_and_bmc -b <bus id> -d <device id> -f <function id> -p $OPAE_PLATFORM_ROOT

bus id d8

device id 00

function id 0
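
The mapping from the `lspci` address to the three IDs can be sketched as below. `split_bdf` is a hypothetical helper name, and the last line only prints the resulting command instead of running it, since the script is not present in every stack release.

```shell
#!/bin/sh
# Hypothetical helper: split a PCI address such as "d8:00.0" (or
# "0000:d8:00.0" with a domain prefix) into bus, device and function.
split_bdf() {
    bdf="${1#????:}"        # drop a leading 4-character domain if present
    bus="${bdf%%:*}"        # text before the first colon
    devfn="${bdf#*:}"       # text after the first colon, e.g. "00.0"
    device="${devfn%.*}"    # text before the dot
    func="${devfn#*.}"      # text after the dot
    echo "$bus $device $func"
}

split_bdf "d8:00.0"         # prints: d8 00 0

# Build the command line from the example address without executing it.
set -- $(split_bdf "d8:00.0")
echo "./setup_fim_and_bmc -b $1 -d $2 -f $3 -p \$OPAE_PLATFORM_ROOT"
```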


May I know where you got fpgaotsu and super-rsu from?


jyoung
Novice

Hi @EricMunYew_C_Intel - answers inline below.

 

Have you tried on Ubuntu 16.04, kernel version 4.4 ?

  • No. However, the acceleration stack user guide (UG 20166) indicates that the validated release is Ubuntu 18.04, kernel 4.15. It's unfortunately not feasible to reload the server just to test this since it is a production machine.

Do you have 48 GB of free memory ?

  • Yes. This server has 384 GB of RAM.

Did you manually install OPAE ?

  • No. We installed it from the 1.2.1 acceleration stack for development from this Intel site.


Which document you followed for upgrading to 1.2.1 ?

Are you running virtual machine or non-virtual machine ?

  • This is a standard Cascade Lake server with SuperMicro MB (server model SYS-4029GP-TRT with X11DPG-OT-CPU MB).

Can you check if your server is in the list below:

https://www.intel.com/content/www/us/en/programmable/products/boards_and_kits/dev-kits/altera/accele...

  • It is not but we have specific vendors we have to purchase our servers from at our institution.

Did you get your bdf of your Intel PAC card correctly during upgrade,

lspci | grep 09c4

Output:

d8:00.0 Processing accelerators: Intel Corporation Device 09c4

./setup_fim_and_bmc -b <bus id> -d <device id> -f <function id> -p $OPAE_PLATFORM_ROOT

bus id d8

device id 00

function id 0

  • I don't see this command (setup_fim_and_bmc) in the 1.2.1 acceleration stack release as it seems to be specific to the 1.2 acceleration stack release (it shows up in an older tarball as  a10_gx_pac_ias_1_2_pv_rte_installer/components/setup_fim_and_bmc.sh). Section 5 of the user guide indicates we correctly have the 1.2 firmware, and Appendix A.1 seems to indicate we should not use this script to do a firmware upgrade since we already have the 1.2 firmware installed. Is there a benefit to downloading this additional script?

May I know where did you get fpgaotsu and super-rsu from ?

  • These are from the standard 1.2.1 acceleration stack for development from this Intel site.
EricMunYew_C_Intel
Moderator

Hi, Young


Can you try upgrading from 1.2 to 1.2.1 (according to page 39), and follow page 40 for upgrading the FIM and BMC? You can also follow page 46 to change the permissions permanently.


https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-qs-ias-v1-2-1.pdf


jyoung
Novice

@EricMunYew_C_Intel I tried the fpgaotsu command, and this finished correctly this time. However, I'm back to the initial super-rsu error from my original post.

 

fpgaotsu finishes correctly. As far as I can tell, the only change is the * wildcard in the path, so I'm not sure what made the difference. I am using the exact same AFU stack as previously.

 

fpgaotsu /usr/share/opae/a10*/fpgaotsu/base/otsu-09C4.json
[2020-12-28 14:40:06,356] [INFO    ] [MainThread] Intel PAC with Intel Arria 10 GX FPGA 0000:3d:00.0 is not secure.
[2020-12-28 14:40:06,368] [WARNING ] [MainThread] Update starting. Please do not interrupt.

...
[2020-12-28 14:49:05,510] [INFO    ] [0000:3d:00.0] Writing flash@0x3000000 for 71200768 bytes (dcp_1_2_1_rot_reversed.rpd)
(100%) [____________________] [71200768/71200768 bytes][Time:0:01:32.556300]
[2020-12-28 14:50:38,162] [INFO    ] [0000:3d:00.0] Reading flash@0x3000000 for 71200768 bytes for verification
(100%) [____________________] [71200768/71200768 bytes][Time:0:01:36.341767]
[2020-12-28 14:52:14,553] [INFO    ] [0000:3d:00.0] Verified flash@0x3000000 for 71200768 bytes (dcp_1_2_1_rot_reversed.rpd)
[2020-12-28 14:52:14,562] [INFO    ] [MainThread] Total time: 0:12:08.193506
[2020-12-28 14:52:14,563] [INFO    ] [MainThread] One-Time Secure Update OK

 


However, super-rsu does not finish correctly:

 

super-rsu --log-level trace /usr/share/opae/a10-gx-pac/super-rsu/base/rsu-09c4.json
[2020-12-28 16:14:13,362] [DEBUG   ] [MainThread  ] - found fpga objects: ['/sys/class/fpga/intel-fpga-dev.0']
[2020-12-28 16:14:13,363] [DEBUG   ] [MainThread  ] - found device at 0000:3d:00.0 -tree is
 [pci_address(0000:3a:00.0), pci_id(0x8086, 0x2030)]
    [pci_address(0000:3b:00.0), pci_id(0x10b5, 0x8747)]
        [pci_address(0000:3c:08.0), pci_id(0x10b5, 0x8747)]
            [pci_address(0000:3d:00.0), pci_id(0x8086, 0x09c4)]
        [pci_address(0000:3c:10.0), pci_id(0x10b5, 0x8747)]
            [pci_address(0000:3e:00.0), pci_id(0x198a, 0x385c)]

[2020-12-28 16:14:13,363] [DEBUG   ] [MainThread  ] - could not find: "/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/ifpga_sec_mgr/ifpga_sec*"
[2020-12-28 16:14:13,364] [WARNING ] [MainThread  ] - [0000:3d:00.0] does not support secure update
[2020-12-28 16:14:13,364] [ERROR   ] [MainThread  ] - missing one or more items required by rsu config
[2020-12-28 16:14:13,364] [INFO    ] [MainThread  ] - super-rsu exiting with code '78'

 


Looking at the working card/server it seems that the modules are the same - ifpga_sec_mgr module is loaded correctly on both systems, but the ifpga_sec_mgr device folder does not exist on the "non-working" server. I don't see any messages in dmesg that indicate any particular errors.

 

[]$ lsmod | grep ifpga
ifpga_sec_mgr 16384 1 intel_fpga_fme

#Check to see if the ifpga security manager device is loaded.
[]$ ls -all /sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/ifpga_sec*
ls: cannot access '/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/ifpga_sec*': No such file or directory

 

I don't see anything that sticks out in dmesg or /var/log/messages with respect to the security manager module.

 
