Hello experts,
We are using the Arria 10 (A10) HIP AVMM IP in a design that acts as a PCIe device plugged into an x86 Xeon server.
In the Linux driver, we program the FPGA AVMM DMA engine to transfer bulk data between FPGA SRAM and x86 DDR4. Every transfer has the same length (256 KB of IQ data per DMA transfer, one transfer every 71 us), and these transfers run back to back continuously.
The whole system works as expected for 2-3 hours and keeps running correctly. We measure each DMA transfer at around 31 us, so the DMA normally completes well within every 71 us period.
But at rare, random moments a DMA transfer takes more than 3977 us, far longer than the normal case, and the system then runs into trouble and crashes. Since it is a very rare exception and involves the AVMM DMA engine inside the A10 FPGA, we have no idea how to move forward with debugging it.
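To catch the slow transfers the moment they happen, we can timestamp each DMA in the driver and log the outliers. Below is a minimal sketch (start_dma(), wait_for_dma_done(), and struct my_dev are placeholder names, not our real driver API):

#include <linux/device.h>
#include <linux/ktime.h>

/* Minimal sketch: time one DMA transfer and warn when it blows the
 * 71 us period budget. The kick and completion-wait calls are
 * hypothetical stand-ins for whatever mechanism the driver actually
 * uses (e.g. an MSI handler completing a struct completion). */
static int timed_dma_transfer(struct my_dev *mdev)
{
	ktime_t start = ktime_get();
	s64 us;
	int ret;

	start_dma(mdev);               /* kick the AVMM DMA engine     */
	ret = wait_for_dma_done(mdev); /* hypothetical completion wait */

	us = ktime_us_delta(ktime_get(), start);
	if (us > 71)                   /* longer than the 71 us budget */
		dev_warn(mdev->dev, "DMA took %lld us (normal ~31 us)\n", us);

	return ret;
}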
Is there any advice on how to debug this, or any debug status registers we can check in the FPGA HIP AVMM module? Thanks very much for any help and advice.
Continuing to dig into where the problem might be:
I ran lspci to check the AER status reported by the PCIe HIP. Before the final DMA-stuck issue occurs, lspci shows no errors; after the problem occurs, lspci reports a few AER errors from the FPGA HIP core, such as the following:
"
[root@localhost ~]# lspci -s 0000:17:00.0 -vv
17:00.0 Non-VGA unclassified device: Altera Corporation Device 1001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 355
NUMA node: 0
Region 0: Memory at 38007ff00000 (64-bit, prefetchable) [size=512]
Region 2: Memory at c5800000 (32-bit, non-prefetchable) [size=4M]
Capabilities: [50] MSI: Enable- Count=1/4 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <4us, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [200 v1] Vendor Specific Information: ID=1172 Rev=0 Len=044 <?>
Capabilities: [300 v1] #19
Capabilities: [800 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP+ BadDLLP- Rollover- Timeout+ NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Kernel driver in use: nr_device_driver
"
This means some errors occur during the A10 AVMM DMA transfers. But here is what puzzles me: each data move uses only 8 descriptor entries, and those 8 entries have the same content every time. Why does it keep running for 3-4 hours and then suddenly get stuck in the PCIe HIP core? That is weird!
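For reference, a sketch of what one descriptor entry looks like, plus a cheap integrity check we could run before each kick. The field layout is my reading of the Avalon-MM DMA descriptor-controller table in the Intel user guide, so please verify the exact bit fields there before relying on it:

#include <linux/string.h>
#include <linux/types.h>

/* Sketch of one descriptor-table entry (8 dwords / 32 bytes); layout
 * is an assumption based on the Avalon-MM DMA descriptor controller
 * documentation -- verify against the A10 Avalon-MM DMA user guide. */
struct avmm_dma_desc {
	u32 src_addr_lo;   /* source address [31:0]      */
	u32 src_addr_hi;   /* source address [63:32]     */
	u32 dst_addr_lo;   /* destination address [31:0] */
	u32 dst_addr_hi;   /* destination address [63:32]*/
	u32 ctrl;          /* assumed: [17:0] size in dwords, [24:18] ID */
	u32 reserved[3];
};

/* Since the 8 entries should never change, compare them against a
 * golden copy before each kick to rule out host-side corruption of
 * the descriptor table. */
static bool descs_intact(const struct avmm_dma_desc *tbl,
			 const struct avmm_dma_desc *golden)
{
	return memcmp(tbl, golden, 8 * sizeof(*tbl)) == 0;
}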
Does anyone have any idea? Thanks in advance.
Hi,
Could you provide a Signal Tap capture?
Thanks
Best regards,
KhaiY
Hi @KhaiChein_Y_Intel,
1) Regarding the Signal Tap capture, which signal(s) do you want us to capture?
2) Also, regarding "RxErr+" and "BadTLP+ / Timeout+": I believe these originate in the PCIe physical and data link layers, correct? Since they are not related to PCIe application data, does that mean this may be a hardware-related issue? (A sketch for logging these bits from the driver follows below.)
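In the meantime, to tie these correctable bits to a specific DMA cycle, we could have the driver read and clear the AER Correctable Error Status register each cycle. A minimal sketch using the standard Linux PCI config-space helpers (only the helper is shown; the surrounding driver structure is assumed):

#include <linux/pci.h>

/* Sketch: read the endpoint's AER Correctable Error Status and clear
 * whatever is set (the register is RW1C), so the next RxErr/BadTLP/
 * Timeout hit can be matched to the DMA cycle in which it appeared. */
static u32 read_and_clear_aer_cor(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
	u32 status = 0;

	if (!pos)
		return 0;	/* no AER capability exposed */

	pci_read_config_dword(pdev, pos + PCI_ERR_COR_STATUS, &status);
	if (status) {
		dev_warn(&pdev->dev, "AER correctable status 0x%08x\n", status);
		pci_write_config_dword(pdev, pos + PCI_ERR_COR_STATUS, status);
	}
	return status;
}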
Thanks for the feedback.
Hello @KhaiChein_Y_Intel,
Which signals should we capture to investigate this "stuck" issue? Is there any guidance describing this? Thanks in advance.
Hi,
Could you share the STP capture for the signals below, along with the .ip file? Please use the "transitional" storage qualifier setting.
Txs
dma_rd_master
dma_wr_master
wr_dts_slave
rd_dts_slave
wr_dcm_master
rd_dcm_master
Rxm_BAR*
tx_out0[<n>-1:0]
rx_in0[<n>-1:0]
hip_reconfig_clk
hip_reconfig_rst_n
hip_reconfig_address[9:0]
hip_reconfig_read
hip_reconfig_readdata[15:0]
hip_reconfig_write
hip_reconfig_writedata[15:0]
hip_reconfig_byte_en[1:0]
ser_shift_load
interface_sel
npor
nreset_status
pin_perst
refclk
RdDmaWrite_o
RdDmaAddress_o[63:0]
RdDmaWriteData[<w>-1:0]
RdDmaBurstCount_o[<n>-1:0]
RdDmaByteEnable_o[<w>-1:0]
RdDmaWaitRequest_i
WrDmaRead_o
WrDmaAddress_o[63:0]
WrDmaReadData_i[<w>-1:0]
WrDmaBurstCount_o[<n>-1:0]
WrDmaWaitRequest_i
WrDmaReadDataValid_i
cfg_par_err
derr_cor_ext_rcv
derr_cor_ext_rpl
derr_rpl
dlup
dlup_exit
ev128ns
ev1us
hotrst_exit
ins_status[3:0]
ko_cpl_spc_data[11:0]
ko_cpl_spc_header[7:0]
l2_exit
lane_act[3:0]
ltssmstate[4:0]
rx_par_err
tx_par_err[1:0]
currentspeed[1:0]
Cra*
Thanks
Best regards,
KhaiY
Hi,
We have not received a response from you to the previous question/reply/answer that I provided. This thread will now be transitioned to community support. If you have a new question, feel free to open a new thread to get support from Intel experts; otherwise, community users will continue to help you on this thread. Thank you.
Best regards,
KhaiY
