System crashes when installing BD-NVV-N3000-3 on Dell R720

Daehyeok_Kim
New Contributor I

Hello,

We're trying to install the BD-NVV-N3000-3 card on our Dell R720 machine. 

We installed the card to one of x16 PCIe slots and connected it with a 6-pin power connector.

The system can boot with the card installed and it shows up in lspci output, but after a very short time - usually a matter of minutes - the system crashes with this log:

 

h11 login: [ 174.287195] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32992
[ 174.296815] {1}[Hardware Error]: event severity: fatal
[ 174.302550] {1}[Hardware Error]: Error 0, type: fatal
[ 174.308285] {1}[Hardware Error]: section_type: PCIe error
[ 174.314504] {1}[Hardware Error]: port_type: 4, root port
[ 174.320625] {1}[Hardware Error]: version: 1.0
[ 174.325682] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ 174.332579] {1}[Hardware Error]: device_id: 0000:40:02.0
[ 174.338698] {1}[Hardware Error]: slot: 0
[ 174.343266] {1}[Hardware Error]: secondary_bus: 0x42
[ 174.348999] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x0e04
[ 174.356381] {1}[Hardware Error]: class_code: 000406
[ 174.362018] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 174.370662] Kernel panic - not syncing: Fatal hardware error!
[ 174.377153] Kernel Offset: 0x3d600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 174.391984] Rebooting in 30 seconds..
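
For anyone reading the register values in the log above, here is a minimal sketch that decodes the `status` and `secondary_status` fields. It assumes the standard PCI-to-PCI bridge (type 1 header) bit layout; the table and function names are made up for illustration and are not part of any tool.

```python
# Bit names assumed from the standard PCI Status register layout.
STATUS_BITS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}

# Secondary Status (bridge, secondary side) shares the upper error bits,
# except that bit 14 means "Received System Error".
SECONDARY_STATUS_BITS = {
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Received System Error",
    15: "Detected Parity Error",
}


def decode_bits(value, names):
    """Return the names of all set bits in a 16-bit register value."""
    return [
        names.get(bit, f"bit {bit}")
        for bit in range(16)
        if (value >> bit) & 1
    ]


print(decode_bits(0x4010, STATUS_BITS))
print(decode_bits(0x2000, SECONDARY_STATUS_BITS))
```

Under that assumed layout, `secondary_status: 0x2000` is Received Master Abort, which would mean a request sent to the device behind the root port got no response. That fits a card that has stopped responding, but it does not say why.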

 

We noticed that this can happen even when the kernel is not running, like if you're sitting at a grub prompt or even during the UEFI pre-boot environment (see attached log).

We tried the card in each of the x16 slots and also tried removing the other adapters but to no avail.

 

Does anyone know why this happens? 

Please let me know if you need any further information.

 

Thanks,

Daehyeok

JonWay_C_Intel
Employee

When the issue happens, check the LEDs on the back panel. If they are blinking at about 1-second intervals, the card most likely does not have enough airflow to cool it. Try setting the server fans to maximum.

Usually, the card will run at first, but as the temperature increases, the board will shut down to prevent overheating. You can run "watch -d -n 1 fpgainfo bmc".

Monitor the FPGA Die Temperature reading. If it keeps increasing and surpasses 100 °C, the card will shut down.
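
If you would rather log the temperature than watch the terminal, the check can be scripted. The sketch below is only an illustration: it assumes `fpgainfo bmc` prints a line of the form `FPGA Die Temperature : 62.50 Celsius` (the exact format may differ between OPAE versions), and the function names and the 90 °C warning margin are hypothetical. Only the 100 °C shutdown point comes from this thread.

```python
import re
import subprocess
import time

SHUTDOWN_C = 100.0  # shutdown point mentioned in this thread
WARN_C = 90.0       # hypothetical early-warning margin, adjust as needed

# Assumed line format: "... FPGA Die Temperature : 62.50 Celsius"
TEMP_LINE = re.compile(r"FPGA Die Temperature\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_die_temperature(text):
    """Extract the die temperature in Celsius, or None if not found."""
    match = TEMP_LINE.search(text)
    return float(match.group(1)) if match else None


def read_die_temperature():
    """Run `fpgainfo bmc` and return the parsed die temperature."""
    result = subprocess.run(
        ["fpgainfo", "bmc"], capture_output=True, text=True, check=True
    )
    return parse_die_temperature(result.stdout)


def monitor(interval_s=1.0):
    """Print the temperature every interval_s seconds until interrupted."""
    while True:
        temp = read_die_temperature()
        if temp is None:
            print("FPGA Die Temperature not found in fpgainfo output")
        elif temp >= WARN_C:
            print(f"WARNING: {temp:.1f} C (card shuts down above {SHUTDOWN_C:.0f} C)")
        else:
            print(f"{temp:.1f} C")
        time.sleep(interval_s)
```

Call `monitor()` on the host with the card installed, and stop it with Ctrl-C. A reading that climbs steadily toward the warning margin points to the airflow problem described above.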

If you do not want the shutdown to cause a kernel panic, you can enable the daemon that performs a graceful shutdown. The steps are described here: https://www.intel.com/content/www/us/en/programmable/documentation/xgz1560360700260.html#zqb1564607955079

Even so, you still need to power cycle the server to bring the card back from the shutdown state.

Daehyeok_Kim
New Contributor I

Thanks for your reply.

As you said, the high temperature was the issue, and we were able to resolve it.

Thanks,

Daehyeok
