Solved: Re:PCIE example issues

PVanL · ‎02-23-2021

I am using the PCIe example as described in UG-20234. Because I want to connect another application behind the hard-IP EP, I am modifying the testbench and APPS to allow transfers larger than the 4 bytes the example is written for. The idea is to verify in simulation before mapping the design onto the FPGA, since debugging a mapped design is far more time-consuming than debugging in simulation.

It is configured for 64 bit / 250 MHz, with the Avalon-ST interface between Hard IP and APPS.

I managed to send 8 bytes to APPS, which writes those 8 bytes to memory and returns them again. I then add 1 to each byte, to be sure I do not confuse the returned data with the sent data. The comparator in the testbench detects the differences and carries on (in the original example it stops when sent != received). I adapted the header for the CplD TLP from

| 77 RX | MWr | 0004 | 60000001_0000000F_00000001_00010000 |

| 77 RX | MRd (00) | 0004 | 20000001_0000000F_00000001_00010000 |

| 77 TX | CplD (00) | 0004 | 4A000001_01080004_00000000 |

to the following (for 8 bytes):

| 77 RX | MWr | 0008 | 60000002_000000FF_00000001_00010000 |

| 77 RX | MRd (00) | 0008 | 20000002_000000FF_00000001_00010000 |

| 77 TX | CplD (00) | 0008 | 4A000002_01080008_00000000 |
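As a cross-check on the two logs above, here is a minimal Python sketch (not taken from the Intel example or from the modified tlp_parser) that packs a CplD header from its fields. The function name `cpld_header` and its parameters are hypothetical, and the default Completer ID 0x0108 is simply copied from the logged DW1. It assumes a successful completion (Status = 000b, BCM = 0). Changing Length in DW0 and Byte Count in DW1 from 1 DW / 4 bytes to 2 DW / 8 bytes reproduces both logged headers.

```python
def cpld_header(length_dw, byte_count, completer_id=0x0108,
                requester_id=0x0000, tag=0x00, lower_addr=0x00):
    """Pack the three header DWORDs of a successful Completion with Data."""
    fmt, tlp_type = 0b010, 0b01010            # 3DW header with data, Completion
    dw0 = (fmt << 29) | (tlp_type << 24) | (length_dw & 0x3FF)
    # Completion Status and BCM are left at 0 (assumption: successful completion)
    dw1 = (completer_id << 16) | (byte_count & 0xFFF)
    dw2 = (requester_id << 16) | (tag << 8) | (lower_addr & 0x7F)
    return dw0, dw1, dw2


def as_log(dws):
    """Format DWORDs the way the testbench log prints them."""
    return "_".join(f"{dw:08X}" for dw in dws)


print(as_log(cpld_header(length_dw=1, byte_count=4)))  # 4A000001_01080004_00000000
print(as_log(cpld_header(length_dw=2, byte_count=8)))  # 4A000002_01080008_00000000
```

If the packed values match the log, the header itself is unlikely to be the cause, and attention can shift to how the payload is presented to the Hard IP.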

The header for 8 bytes has the following TLP fields, which are correct to my knowledge:

For MWr:

DW0 60000002: length = 2, Fmt = 2'b11, Type = 5'b00000 => strange: Fmt should be 2'b10 for MWr!!!??? (this is made by INTEL IP)

DW1 000000FF: 1stBE = F, Last_BE = F, Tag = 0, ReqID = 0 (Made by Intel IP)

DW2 00000001: why is DW1[0] = R=1, this should be 0 !!!!???? Address = 0 (Made by Intel IP) In reality it is 0000_0000 !!! (bug)

DW3 00010000 = ???? (should be data???!!!). But the data arrives correctly in the memory (Made by Intel IP) In reality it is data[31:0]

DW4: (not depicted in this table) data[63:32] works correctly

For MRd:

DW0 20000002: Length = 2, Type = 5'b00000, Fmt = 2'b01 => strange: Fmt should be 2'b00 for MRd (Made by Intel IP)

DW1 000000FF: 1stBE = F, Last_BE = F, Tag = 0, ReqID = 0 (Made by Intel IP)

DW2: 00000000: why is DW1[0] = R=1, this should be 0 !!!!???? Address = 0 (Made by Intel IP)

For CplD:

DW0 4A000002: Length = 2, Type = 5'b01010 (correct), Fmt = 2'b10 (correct) (Made by me)

DW1 000000FF: 1stBE = F, Last_BE = F, Tag = 0, ReqID = 0 (Made by me)

DW2: 00000001: why is DW1[0] = R=1, this should be 0 !!!!???? Address = 0 (Made by Intel IP), in reality it is 0000_0000

DW3 data[31:0]

DW4 data[63:32]
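The field breakdown above can be checked mechanically. The sketch below is a hypothetical helper, not part of the Intel testbench. It decodes DW0 (Fmt, Type, Length) and, for memory requests, DW1 (Requester ID, Tag, byte enables). It treats Fmt as the 3-bit field in bits [31:29], as in recent PCIe base specification revisions; older revisions describe a 2-bit Fmt with bit 31 reserved, which gives the same values for the headers in this thread.

```python
def decode_dw0(dw0):
    """Decode the common first header DWORD of a TLP."""
    return {
        "fmt": (dw0 >> 29) & 0x7,              # header size / data present
        "type": (dw0 >> 24) & 0x1F,
        "length_dw": (dw0 & 0x3FF) or 1024,    # a Length field of 0 means 1024 DW
    }


def decode_request_dw1(dw1):
    """Decode DW1 of a memory request (MWr / MRd)."""
    return {
        "requester_id": (dw1 >> 16) & 0xFFFF,
        "tag": (dw1 >> 8) & 0xFF,
        "last_be": (dw1 >> 4) & 0xF,
        "first_be": dw1 & 0xF,
    }


for name, dw0 in (("MWr", 0x60000002), ("MRd", 0x20000002), ("CplD", 0x4A000002)):
    fields = decode_dw0(dw0)
    print(name, f"fmt={fields['fmt']:03b}", f"type={fields['type']:05b}",
          f"length={fields['length_dw']}")

print(decode_request_dw1(0x0000000F))  # 1 DW request: Last BE is 0
print(decode_request_dw1(0x000000FF))  # 2 DW request: both byte enables are F
```

Note that DW1 only has this layout for requests. In a completion, DW1 carries the Completer ID, status and byte count instead of byte enables.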

For some reason this does not work:

The DUT PCIe hard IP does not produce clean, continuous tx0..tx3 signals: they are interrupted by 'x' values.

In the testbench, the signal is routed via rx0..rx3 to altpcietb_bfm_rpvar_64b_x8_pipen1b. The output of rpvar:

rx_data0 0000000000000000
rx_desc0 044a000002010800080000000078561011
rx_be f4
rx_dv 0
rx_dfr 0
rx_ack 0
rx_abort 0
rx_retry 0
rx_mask 0
rx_ws 0

Where the last 8 hex digits (78561011) are correct = data[31:0] sent!
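To read the rx_desc0 value more easily, it can be split into 32-bit words. This is only a sketch under an assumption: that the low 128 bits of rx_desc0 are four DWORDs with the most significant first, and that anything above bit 127 is sideband information. The helper name `split_rx_desc` is made up for illustration.

```python
def split_rx_desc(hex_str):
    """Split an rx_desc dump into (bits above 127, four 32-bit DWORDs)."""
    value = int(hex_str, 16)
    upper = value >> 128                       # assumed sideband bits
    dws = [(value >> shift) & 0xFFFFFFFF for shift in (96, 64, 32, 0)]
    return upper, dws


upper, dws = split_rx_desc("044a000002010800080000000078561011")
print(f"upper bits: {upper:02X}")
for i, dw in enumerate(dws):
    print(f"DW{i}: {dw:08X}")
```

Under that assumption the first three words are 4A000002, 01080008 and 00000000, which is the modified CplD header, and the fourth is 78561011, the value identified above as data[31:0]. So the header and the first data DWORD arrive, while the second data DWORD (expected on rx_data0) does not.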

My questions:

Why does this not work? Options:

The header for MRd is not correct, but this is generated by the Intel testbench?
The header for CplD is not correct (this is generated by my modifications)?
The Hard IP in the DUT has a configuration that only allows 4 bytes?
The rpvar mimic can only handle 4 bytes?
The rpvar needs a modification of parameters?
And how do I make it work?

Thanks in advance! Pieter

KhaiChein_Y_Intel · ‎04-13-2021

Hi Pieter,

And yes, indeed, if you run the software test as in figure 12, then the application accepts traffic load up to MAX.PAYLOAD. BUT NOT THE TESTBENCH / APPS in simulation!

>> Yes. As in my previous reply, the testbench provides a simple method to do basic testing and does not cover all traffic profile stimuli. If a user wants to simulate something the testbench does not cover, the user has to modify the testbench to suit their requirements.

So, I think I am on the edge of improving the testbench and APPS to payload = max.payload. My question: if I would share this code with you, would Intel be willing to compensate me for this effort? (I think Intel does not deliver what it says it does as described in the manual)

>> I apologize for the miscommunication, if there is any. The reason I asked earlier whether you are willing to share the modification is that some hobbyists share information in this public forum, such as what they have created or methods to solve problems. Please do not share the content if it is confidential or not publicly accessible.

Thanks for your understanding. Do let me know if you have any questions or concerns.

Best regards,

KhaiY

KhaiChein_Y_Intel · ‎02-24-2021

Hi,

Could you share both the settings/.qsys file you used to create the example design and the files after modification?

Thanks.

Best regards,

KhaiY

PVanL · ‎02-25-2021

Dear KhaiChein_Y, thanks for reaching out.

I assume the names of the zip files speak for their content.

Main changes:

split altpcietb_bv_rp_gen2_x8 into two files, altpcie_bfm_rpvar and altpcietb_bv_rp_gen2_x8_ex_rpvar, to make editing the latter easier
added many $fdisplay statements to more easily track bitstreams (I always get lost in the wave screen)
changed a lot in tlp_parser to generate headers for an 8-byte CplD. The result of that can be seen, e.g., in outparsout.txt
added 8-byte-wide write to and read from memory (interconnect_low and ~high, and mem_low and ~high) in pcie_ed
changed a bit in downstream_driver to generate 8 bytes
in pcie_ed, lines 546 and 566, 1 is added to each read byte
In the downstream_drive file, line 465, Length = 4, as written by Intel. If you run the testbench, it fails of course because of the adding of 1 to each byte. But downstream_drive is modified such that it does not stop; it just reports a difference.
Change this line 465 to 8, and the tb sends 8 bytes; APPS writes them to MEM in pcie_ed, reads them back, and adapts the TLP headers (you can see the result in pcie_ed_tx.txt), and eventually sends seemingly correct data to the Hard IP.
On receipt in the TB, in rpvar, rx_data remains zero, but rx_desc0 has some recognizable values. Also, there are 'x' values on the serial wires tx0..3 from the DUT to rx0..3 of rpvar. This suggests that the PCIe Hard IP is not controlled correctly: by the TLP headers? By a configuration setting?
I hope this clarifies things; please do not hesitate to ask for clarification if needed.
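For readers without the attachments, the two behavioural changes in the list (adding 1 to each returned byte, and a comparator that reports instead of stopping) can be modelled in a few lines. This is a Python model for illustration only, not the Verilog in pcie_ed or the downstream driver. The function names are hypothetical, and wrapping 0xFF to 0x00 without carrying into the next byte is an assumption.

```python
def add_one_per_byte(word, nbytes=8):
    """Add 1 to every byte of `word` independently (0xFF wraps to 0x00)."""
    out = 0
    for i in range(nbytes):
        byte = (word >> (8 * i)) & 0xFF
        out |= ((byte + 1) & 0xFF) << (8 * i)
    return out


def compare_all(sent, received):
    """Compare every word and collect mismatches instead of stopping at the first."""
    mismatches = []
    for index, (s, r) in enumerate(zip(sent, received)):
        if s != r:
            mismatches.append((index, s, r))
    return mismatches


sent = [0x0011223344556677, 0x00000000000000FF]
received = [add_one_per_byte(w) for w in sent]
for index, s, r in compare_all(sent, received):
    print(f"word {index}: sent {s:016X}, received {r:016X}")
```

With this model every word is expected to mismatch by exactly one per byte, so a reported difference of any other shape points to a real transfer error.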

Regards, Pieter

PVanL · ‎03-02-2021

Dear KhaiChein_Y

I did not receive a reaction from you yet. Please respond.

Regards, pieter

PVanL · ‎03-05-2021

Dear KhaiChein_Y

Can you give me an update on where you stand, or on how I can help you, if at all, with getting the simulation running? Regards, Pieter (...I am trying not to sound impatient...)

KhaiChein_Y_Intel · ‎03-03-2021

Hi Pieter,

I am still working on this. Please allow me some time.

Best regards,

KhaiY

PVanL · ‎03-03-2021

Hi KhaiY,

Glad to hear you are still on it....

Of course, I understand, it is a F**ng difficult testbench / apps / bfm. But I was prompted by Intel to say whether the answer provided was sufficient to close the call. (I thought that, with all the smart AI of today, the bot would have seen there is just a question (yours) and an answer (mine)...)

Regards, Pieter

KhaiChein_Y_Intel · ‎03-06-2021

Hi Pieter,

I plan to generate an example design using the same PCIe settings as the pcie_ed.qsys that you provided, but I see the warning messages below when I open the pcie_ed.qsys file.

Component Instantiation Warning: pcie_ed.APPS: File not found: ip/pcie_ed/pcie_ed_APPS.ip

Component Instantiation Warning: pcie_ed.DK: File not found: ip/pcie_ed/pcie_ed_DK.ip

Component Instantiation Warning: pcie_ed.DUT: File not found: ip/pcie_ed/pcie_ed_DUT.ip

Component Instantiation Warning: pcie_ed.MEM: File not found: ip/pcie_ed/pcie_ed_MEM.ip

MWr:

DW0 60000002: length = 2, Fmt = 2'b11, Type = 5'b00000 => strange: Fmt should be 2'b10 for MWr!!!??? (this is made by INTEL IP)

>> Both 3'b010 and 3'b011 are valid for MWr. 3'b010 is a 3DW header with data and 3'b011 is a 4DW header with data. You are using a 4DW header, so 3'b011 is correct.

DW3 00010000 = ???? (should be data???!!!). But the data arrives correctly in the memory (Made by Intel IP) In reality it is data[31:0]

>> In a 4DW TLP header, the third and fourth header DWs (DW2 and DW3 in your numbering) hold the address, not data.

MRd

DW0 20000002: Length = 2, Type = 5'b00000, Fmt = 2'b01 => strange: Fmt should be 2'b00 for MRd (Made by Intel IP)

>> Both 3'b000 and 3'b001 are valid for MRd. 3'b000 is a 3DW header with no data and 3'b001 is a 4DW header with no data.
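The Fmt values discussed in this reply can be summarised as a small lookup, together with how the address is assembled from a 4DW header. This is an illustrative Python sketch with hypothetical names (`FMT`, `request_address`). It takes the logged 8-byte MWr header at face value, with DW2 as the upper 32 address bits and DW3 as the lower 32 bits, whose two lowest bits are not address bits.

```python
# Fmt value -> (header size, carries data)
FMT = {
    0b000: ("3DW", False),   # e.g. MRd with a 32-bit address
    0b001: ("4DW", False),   # e.g. MRd with a 64-bit address
    0b010: ("3DW", True),    # e.g. MWr with a 32-bit address, or CplD
    0b011: ("4DW", True),    # e.g. MWr with a 64-bit address
}


def request_address(header_dws):
    """Return the DW-aligned address of a memory request header."""
    fmt = (header_dws[0] >> 29) & 0x7
    size, _has_data = FMT[fmt]
    if size == "4DW":
        return (header_dws[2] << 32) | (header_dws[3] & 0xFFFFFFFC)
    return header_dws[2] & 0xFFFFFFFC


mwr = [0x60000002, 0x000000FF, 0x00000001, 0x00010000]
print(FMT[(mwr[0] >> 29) & 0x7])        # ('4DW', True)
print(hex(request_address(mwr)))        # 0x100010000
```

Read this way, the 00000001 in DW2 is the upper half of a 64-bit address rather than a set reserved bit, and the 00010000 in DW3 is the lower half rather than payload. The data DWORDs only start after the fourth header DWORD.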

Have you tried using the example design without modification? Do you see any unexpected behavior in the example design without modification?

Thanks

Best regards,

KhaiY

PVanL · ‎03-08-2021

Dear KhaiY. Yes, of course I tried it without modification. It runs fine. Note: I modified the data being sent, and being sent back, by 1/ making them different per loop, and 2/ adding 1 to each byte. Of course the comparator does not like that, so I changed it so that it reports a difference but does not stop as a result (unlike the design example, where it would stop).

For more input/comments, check the attached Word file, and check the transcript from line 3640 to the end for the output of the comparator when Bytes = 4.
I also attached transscript_08.txt to show the output for Bytes = 8.

regards, Pieter

KhaiChein_Y_Intel · ‎03-09-2021

Hi Pieter,

Please find attached.

Thanks

Best regards,

KhaiY

PVanL · ‎03-11-2021

Now I see: in your folder customer/ip/pcie_id/ there is only a small subset of files. Even stranger is that in my design those files do not exist at all!!!???

There should be the set of files as shown in the pci_202_mod/pcie_a10_hip_blabla/ip/pcie_ed/ directory. My assumption was that you would load the design example from the Intel website and then replace the changed files with my files (Quartus 20.2). Apparently the design example is not properly loaded onto your machine, hence it will definitely not run.

The problem is: the fully compressed design is 0.7 GB (with 762 files), which does not fit in the drag & drop field below, so I cannot send this design to you, except via a direct link to your email.

So, my suggestion is:

Please install the full design from the Intel website, and make sure it runs. I am afraid you loaded a different design, given those clock_in and reset_in files that do not exist in my design.

Then replace the files from the Intel example with the files in the attachments that I sent you earlier.

But maybe another approach is also possible:

What I really want is to connect my Ethernet PHY, via the Intel Low Latency MAC, to the PCIe Hard IP (Gen2 x4) and map it onto the Cyclone 10 GX development board.

(In the Intel design example for the Low Latency MAC, I have replaced the Intel PHY with my own PHY, and it runs in simulation, but I still have difficulties getting the transceiver properly mapped onto the Cyclone 10 GX.)

If you do have an example of this (but then of course with the Intel PHY), that would be most welcome!

Regards, Pieter

KhaiChein_Y_Intel · ‎03-13-2021

Hi Pieter,

I find this reference design relevant to your request.

Cyclone 10 GX PCIe Gen2 x4 Avl-ST

https://fpgacloud.intel.com/devstore/platform/18.0.0/Pro/cyclone-10-gx-pcie-gen2-x4-avl-st/

If it is not, you may find some other design examples in the Intel FPGA Design Store.

https://fpgacloud.intel.com/devstore/?page=4&search=pci

Do let me know if the above design is helpful.

Thanks

Best regards,

KhaiY

PVanL · ‎03-15-2021

Hi KhaiY,

Thanks for your response. Something in our communication seems not to be working.

The design example you suggest is the one I am working with:

Cyclone 10 GX PCIe Gen2 x4 Avl-ST

The design example I started with, and for which I ask your advice, is the one described in UG-01145_avst | 2020.06.02 (newer versions may exist), which you can download from the IP Catalog in Quartus: PCI Express / Intel Arria 10 / Cyclone 10 Hard IP for PCI Express => Platform Designer => Parameter settings: System Settings: Standalone, Avalon-ST, Gen2x4, 64 bit @ 250 MHz, Native Endpoint, Balanced, RX Buffer: Header: 112, Data: 440. All other settings: default. Example Design: PIO, Development Kit: Arria 10 GX Development Kit (I have the Cyclone 10 GX development kit, and I understand this should not be a problem).

I asked you: is it possible to change the testbench and application such that transfers with payloads of many bytes are possible? The design example only allows 4 bytes. I changed the testbench and the TLP parser such that, in my opinion, they should transfer 8 bytes, but so far without success. My assumption is: if I manage 8 bytes, I can also change to many bytes, as long as I stay within the maximum payload.
What I really want: connect Intel's Low Latency MAC 10GBASE-R design example to the PCIe Hard IP of the PCIe example design described above. Instead of reinventing the wheel myself, an example design doing exactly this would be most appreciated. The list of design examples you sent me does not contain such an example (unless I missed something; you can never exclude that).
So, I really hope somebody has done this exercise (connecting the LL MAC to PCIe) and is willing to share the code with me. Otherwise, I have to do it myself, and a step towards this is to make the PCIe design example suitable for transfers >4 bytes, for which I need your help.

I hope this clarifies my requests...

Regards, Pieter

KhaiChein_Y_Intel · ‎03-18-2021

Hi Pieter,

I would like to inform you that I am still working on this. Please allow me some time on this.

Thanks

Best regards,

KhaiY

PVanL · ‎03-19-2021

Hi KhaiY, thanks for updating me. It is really not an easy piece of code.

What I could not figure out, or do not understand, is this: if you load the board with the example code and connect it to the other PC, as described in document AN 855, paragraph 1.7 Hardware Installation, does this also use the very same configuration? Because it says (§1.8): Set the Transfer length to 100,000 bytes and the Sequence to Write only, ... transfer data from the FPGA to the system memory in chunks of 100,000 bytes. So you would expect that the very same application (APPS and the others) really does accept and send back packets of length 512 / 1024 / 2048, or even larger chunks... If so, then why not in the testbench?

(we have this software test working)

Looking forward to your progress / feedback

Regards, Pieter

KhaiChein_Y_Intel · ‎03-29-2021

Hi Pieter,

I am sorry for the delay in response. Thank you for your patience.

I discussed this with the team: the testbench does not have the flexibility and does not cover certain traffic profile stimuli. It would be a time-consuming process to edit and debug. Is there an ultimate goal that you would like to achieve with this?

Thanks

Best regards,

KhaiY

PVanL · ‎03-30-2021

Hi KhaiY,

Thanks for the response, but I don't understand it.

I thought I explained my goals (see earlier posts):
What I really want is to connect my Ethernet PHY, via the Intel Low Latency MAC, to the PCIe Hard IP (Gen2 x4) and map it onto the Cyclone 10 GX development board.
(In the Intel design example for the Low Latency MAC, I have replaced the Intel PHY with my own PHY, and it runs in simulation, but I still have some difficulties getting the transceiver properly mapped onto the Cyclone 10 GX.)
If you do have an example of this (but then of course with the Intel PHY), that would be most welcome!

For the PCIe testbench I am using the example of UG-01145_avst. The difficulty is not in the testbench but in the APPS component; in fact, I believe it is in the TLP parser, which is not built to send back anything but ONE DWORD of 4 bytes. I tried to modify the TLP parser to make it suitable for 2 DWORDs of data, but somehow the PCIe Hard IP block makes a mess of it.

The UG-01145 document says (page 157): "It can only handle received read requests that are less than or equal to the currently set Maximum payload size option specified under PCI Express/PCI Capabilities heading under the Device tab using the parameter editor. Many systems are capable of handling larger read requests that are then returned in multiple completions."
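The last sentence of that quote, about larger read requests being returned in multiple completions, can be illustrated with a small model. This is a simplified sketch, not the behaviour of the Intel APPS logic. The name `split_completions` is hypothetical, and it assumes the read starts on a max-payload-aligned address so that Read Completion Boundary rules are satisfied automatically. In each completion, Byte Count is the number of bytes still outstanding including the current completion, and Lower Address is the low 7 bits of the first byte returned.

```python
def split_completions(start_addr, total_bytes, max_payload=256):
    """Split one read request into a list of completion descriptors."""
    completions = []
    addr, remaining = start_addr, total_bytes
    while remaining > 0:
        chunk = min(remaining, max_payload)
        completions.append({
            "lower_addr": addr & 0x7F,     # low 7 bits of the first returned byte
            "byte_count": remaining,       # bytes left, including this completion
            "length_dw": (chunk + 3) // 4, # payload size of this completion in DW
        })
        addr += chunk
        remaining -= chunk
    return completions


for cpl in split_completions(start_addr=0x1000, total_bytes=600, max_payload=256):
    print(cpl)
```

For an 8-byte read within a 256-byte max payload this degenerates to a single completion with byte_count 8 and length_dw 2, which is the case being attempted in this thread.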

That would be good enough for me, but I don't get it working! It only seems to work for 4 bytes!
And if you look at the code (driver_downstream.v), lines 270 - 289, the length is cut off at 4 bytes, and there is a remark:
line 271: //TODO extend to more than 1 DW.
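What "extend to more than 1 DW" involves on the request side is mainly computing the Length field and the two byte-enable fields for an arbitrary byte count. The following is a Python sketch of that calculation, not the Intel driver code; `request_fields` is a hypothetical name. It follows the rule that a 1 DW request carries Last BE = 0, which matches the 0000000F (4 bytes) and 000000FF (8 bytes) values in the logs.

```python
def request_fields(addr, nbytes):
    """Return (length_dw, first_be, last_be) for a memory request."""
    if nbytes < 1:
        raise ValueError("nbytes must be at least 1")
    end = addr + nbytes                      # first byte address NOT transferred
    first_off = addr & 0x3                   # byte lane of the first byte
    last_off = (end - 1) & 0x3               # byte lane of the last byte
    length_dw = ((end + 3) >> 2) - (addr >> 2)
    if length_dw == 1:
        # All enabled lanes live in First BE; Last BE must be 0 for 1 DW
        first_be = 0
        for lane in range(first_off, last_off + 1):
            first_be |= 1 << lane
        last_be = 0
    else:
        first_be = (0xF << first_off) & 0xF  # lanes from first_off upward
        last_be = 0xF >> (3 - last_off)      # lanes from 0 up to last_off
    return length_dw, first_be, last_be


print(request_fields(0x0, 4))   # (1, 15, 0)
print(request_fields(0x0, 8))   # (2, 15, 15)
print(request_fields(0x2, 4))   # (2, 12, 3)
```

The same three values would have to be produced consistently by the driver and expected by the TLP parser for transfers beyond one DWORD.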

So, it seems there is still some work to be done ??????!!!!!

So, you can help me in several ways:

1/ provide me with a TLP parser that is able to reply correctly with packets of many bytes. I think I can manage to integrate that into the APPS entity and the MEM entity, and modify the testbench such that it maybe does not compare, but at least lets me verify manually that received = sent

OR:

2/ provide me with an example where your LL_MAC_10GBASE_R example is already integrated into a PCIE gen2x4 example

OR

3/ explain to me how, without using the testbench, to directly build the above (PCIE + LLMAC_10GBASER), synthesise/place/route it onto the Cyclone 10 GX development board, and run a software test similar to the one described in AN 855: PCI Express* High Performance Reference Design for Intel® Cyclone® 10 GX.

Thanks for your feedback!

regards, Pieter

KhaiChein_Y_Intel · ‎04-06-2021

Hi Pieter,

I checked with the team: we have a standalone 10G Ethernet example design that can be generated from the 10G MAC IP GUI, but we don't have a PCIe + 10G Ethernet design; the customer needs to do that integration themselves.

You may find the steps to generate the example design here: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-20162.pdf

Thanks

Best regards,

KhaiY

PVanL · ‎04-07-2021

Dear KhaiY,

As I told you, we have already worked with the example in UG-20162. I understand you don't have an example design with the two connected.

What I don't understand is that you write in the manual for the PCIe example:

The UG-01145 document says (page 157): "It can only handle received read requests that are less than or equal to the currently set Maximum payload size option specified under PCI Express/PCI Capabilities heading under the Device tab using the parameter editor. Many systems are capable of handling larger read requests that are then returned in multiple completions."

Meanwhile, I managed to send 8 bytes, but it is a hard struggle. In fact, I am disappointed that this testbench / example cannot handle anything but those 4 bytes. I reluctantly seem to have to accept that Intel does not intend to repair this shortcoming, although the manual says it does accept TLPs up to max payload. Regards, Pieter

KhaiChein_Y_Intel · ‎04-07-2021

Hi Pieter,

Yes. It does accept TLP up to max payload. The current testbench and Root Port BFM provide a simple method to do basic testing of the Application Layer logic that interfaces to the variation. This BFM allows you to create and run simple task stimuli with configurable parameters to exercise basic functionality of the Intel example design. The testbench and Root Port BFM are not intended to be a substitute for a full verification environment. Corner cases and certain traffic profile stimuli are not covered. To ensure the best verification coverage possible, Intel suggests strongly that you obtain commercially available PCI Express verification IP and tools, or do your own extensive hardware testing or both.

Could you share the modifications that you have made? I believe they would be beneficial for other users or customers.

Thanks

Best regards,

KhaiY