Hello experts,
We are using the Arria 10 (A10) HIP AVMM IP in a design that acts as a PCIe device plugged into an x86 Xeon server.
In the Linux driver, we program the FPGA AVMM DMA engine to transfer bulk data between FPGA SRAM and x86 DDR4. Every transfer has the same length (256 KB of IQ data per DMA transfer, one transfer every 71 us), and these transfers run back to back continuously.
The whole system works as expected for 2-3 hours and keeps running correctly. We measure each DMA transfer at around 31 us, so the DMA normally completes well within every 71 us period.
But at rare, random moments a DMA transfer takes more than 3977 us, far longer than the normal case, and the system then runs into trouble and crashes. Since it is a very rare exception and involves the AVMM DMA engine inside the A10 FPGA, we have no idea how to move forward with debugging it.
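To catch the slow transfers the moment they happen, we can timestamp each DMA in the driver and log the outliers. Below is a minimal sketch (start_dma(), wait_for_dma_done(), and struct my_dev are placeholder names, not our real driver API):

#include <linux/device.h>
#include <linux/ktime.h>

/* Minimal sketch: time one DMA transfer and warn when it blows the
 * 71 us period budget. The kick and completion-wait calls are
 * hypothetical stand-ins for whatever mechanism the driver actually
 * uses (e.g. an MSI handler completing a struct completion). */
static int timed_dma_transfer(struct my_dev *mdev)
{
	ktime_t start = ktime_get();
	s64 us;
	int ret;

	start_dma(mdev);               /* kick the AVMM DMA engine     */
	ret = wait_for_dma_done(mdev); /* hypothetical completion wait */

	us = ktime_us_delta(ktime_get(), start);
	if (us > 71)                   /* longer than the 71 us budget */
		dev_warn(mdev->dev, "DMA took %lld us (normal ~31 us)\n", us);

	return ret;
}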
Is there any advice on how to debug this, or any debug status registers we can check in the FPGA HIP AVMM module? Thanks very much for any help and advice.
Continuing to dig into where the problem might be:
I ran lspci to check the AER status reported by the PCIe HIP. Before the final DMA-stuck issue occurs, lspci shows no errors; after the problem occurs, lspci reports a few AER errors from the FPGA HIP core, such as the following:
"
[root@localhost ~]# lspci -s 0000:17:00.0 -vv
17:00.0 Non-VGA unclassified device: Altera Corporation Device 1001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 355
NUMA node: 0
Region 0: Memory at 38007ff00000 (64-bit, prefetchable) [size=512]
Region 2: Memory at c5800000 (32-bit, non-prefetchable) [size=4M]
Capabilities: [50] MSI: Enable- Count=1/4 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <4us, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [200 v1] Vendor Specific Information: ID=1172 Rev=0 Len=044 <?>
Capabilities: [300 v1] #19
Capabilities: [800 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP+ BadDLLP- Rollover- Timeout+ NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Kernel driver in use: nr_device_driver
"
This means some errors occur during the A10 AVMM DMA transfers. But here is what puzzles me: each data move uses only 8 descriptor entries, and those 8 entries have the same content every time. Why does it keep running for 3-4 hours and then suddenly get stuck in the PCIe HIP core? That is weird!
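For reference, a sketch of what one descriptor entry looks like, plus a cheap integrity check we could run before each kick. The field layout is my reading of the Avalon-MM DMA descriptor-controller table in the Intel user guide, so please verify the exact bit fields there before relying on it:

#include <linux/string.h>
#include <linux/types.h>

/* Sketch of one descriptor-table entry (8 dwords / 32 bytes); layout
 * is an assumption based on the Avalon-MM DMA descriptor controller
 * documentation -- verify against the A10 Avalon-MM DMA user guide. */
struct avmm_dma_desc {
	u32 src_addr_lo;   /* source address [31:0]      */
	u32 src_addr_hi;   /* source address [63:32]     */
	u32 dst_addr_lo;   /* destination address [31:0] */
	u32 dst_addr_hi;   /* destination address [63:32]*/
	u32 ctrl;          /* assumed: [17:0] size in dwords, [24:18] ID */
	u32 reserved[3];
};

/* Since the 8 entries should never change, compare them against a
 * golden copy before each kick to rule out host-side corruption of
 * the descriptor table. */
static bool descs_intact(const struct avmm_dma_desc *tbl,
			 const struct avmm_dma_desc *golden)
{
	return memcmp(tbl, golden, 8 * sizeof(*tbl)) == 0;
}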
Does anyone have any idea? Thanks in advance.
Hi,
Could you provide a Signal Tap capture?
Thanks
Best regards,
KhaiY
Hi @KhaiChein_Y_Intel,
1) Regarding the Signal Tap capture, which signal(s) do you want us to capture?
2) Also, regarding "RxErr+" and "BadTLP+ / Timeout+": I believe these originate in the PCIe physical and data link layers, correct? Since they are not related to PCIe application data, does that mean this may be a hardware-related issue? (A sketch for logging these bits from the driver follows below.)
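In the meantime, to tie these correctable bits to a specific DMA cycle, we could have the driver read and clear the AER Correctable Error Status register each cycle. A minimal sketch using the standard Linux PCI config-space helpers (only the helper is shown; the surrounding driver structure is assumed):

#include <linux/pci.h>

/* Sketch: read the endpoint's AER Correctable Error Status and clear
 * whatever is set (the register is RW1C), so the next RxErr/BadTLP/
 * Timeout hit can be matched to the DMA cycle in which it appeared. */
static u32 read_and_clear_aer_cor(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
	u32 status = 0;

	if (!pos)
		return 0;	/* no AER capability exposed */

	pci_read_config_dword(pdev, pos + PCI_ERR_COR_STATUS, &status);
	if (status) {
		dev_warn(&pdev->dev, "AER correctable status 0x%08x\n", status);
		pci_write_config_dword(pdev, pos + PCI_ERR_COR_STATUS, status);
	}
	return status;
}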
Thanks for the feedback.
Hello @KhaiChein_Y_Intel,
Which signals should we capture to investigate this "stuck" issue? Is there any guidance describing this? Thanks in advance.
Hi,
Could you share the STP capture for the signals below, along with the .ip file? Please use the "transitional" storage qualifier setting.
Txs
dma_rd_master
dma_wr_master
wr_dts_slave
rd_dts_slave
wr_dcm_master
rd_dcm_master
Rxm_BAR*
tx_out0[<n>-1:0]
rx_in0[<n>-1:0]
hip_reconfig_clk
hip_reconfig_rst_n
hip_reconfig_address[9:0]
hip_reconfig_read
hip_reconfig_readdata[15:0]
hip_reconfig_write
hip_reconfig_writedata[15:0]
hip_reconfig_byte_en[1:0]
ser_shift_load
interface_sel
npor
nreset_status
pin_perst
refclk
RdDmaWrite_o
RdDmaAddress_o[63:0]
RdDmaWriteData[<w>-1:0]
RdDmaBurstCount_o[<n>-1:0]
RdDmaByteEnable_o[<w>-1:0]
RdDmaWaitRequest_i
WrDmaRead_o
WrDmaAddress_o[63:0]
WrDmaReadData_i[<w>-1:0]
WrDmaBurstCount_o[<n>-1:0]
WrDmaWaitRequest_i
WrDmaReadDataValid_i
cfg_par_err
derr_cor_ext_rcv
derr_cor_ext_rpl
derr_rpl
dlup
dlup_exit
ev128ns
ev1us
hotrst_exit
ins_status[3:0]
ko_cpl_spc_data[11:0]
ko_cpl_spc_header[7:0]
l2_exit
lane_act[3:0]
ltssmstate[4:0]
rx_par_err
tx_par_err[1:0]
currentspeed[1:0]
Cra*
Thanks
Best regards,
KhaiY
Hi,
We have not received a response from you to the previous question/reply/answer that I provided. This thread will now be transitioned to community support. If you have a new question, feel free to open a new thread to get support from Intel experts; otherwise, community users will continue to help you on this thread. Thank you.
Best regards,
KhaiY
