
SIV PCIe high performance ref design data can arrive out of order

Altera_Forum
Honored Contributor II

Hi, 

 

We've been using the Stratix IV PCIe high-performance reference design for a while and are seeing some odd behavior when performing a read DMA: intermittently, the data arrives at our logic out of order. 

 

Let me set the scene - we want to transfer a 4kB block of data from the RC's memory to our logic using the read DMA functionality in the CDMA reference design - for simplicity, assume our logic is just like the end point memory instantiated in the reference design, except it is 128-bit wide to match the datapath used for the x8, gen2 variant we're using. We set up the read descriptor table to transfer a single 4kB block of data from RC to EP. 

 

The CDMA/PCIe block splits the 4kB block of data into many 512Byte payloads, which I'll label P1, P2, P3, etc, where P1 has the first 512 bytes of the 4kB block, P2 has the next 512 bytes, etc. Most of the time the data arrives at the EP memory in exactly the order it is in the RC's memory, but sometimes we see that the first 48 bytes from P2 and P3 appear in P1!  

 

In terms of memory addresses, we would expect the EP memory addresses to appear in the following order as data is written to the EP memory: 

 

0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, etc 

 

But sometimes we see: 

 

0, 16, 32, 48, 512, 528, 544, 1024, 1040, 1056, 64, 80, 96, etc 

 

where 512, 528, 544 are addresses associated with payload P2, and addresses 1024, 1040, 1056 are associated with payload P3. Note that the 48-byte chunks are effectively cut-and-pasted from where they should appear into P1, i.e. they don't reappear later. It's as if entries in the PCIe/CDMA RX buffer have been re-ordered. 

 

We see this occurring about 10% of the time. 

 

All of the data arrives OK and gets written to the correct EP memory address, but the order is important to us because we actually want to write the data to a FIFO rather than a memory. 

 

Note also that this behavior is only seen in actual hardware - it doesn't occur in the reference design functional simulation. 

 

Has anybody else seen this behavior? If anyone can shed any light on this I would greatly appreciate it!
Altera_Forum
Honored Contributor II

If I remember correctly, the chaining DMA example can have multiple outstanding read requests at any one time. It keeps track of them based on the tag ID used to request memory, and stores the address requested for each tag ID in a memory block (tag_dpram, possibly). It is a little more complicated than simply storing the tag ID and address in memory, but that is the general gist of it. 

 

It does this to improve latency and throughput, helping to hide the round-trip time from data request to data arrival. When multiple requests are outstanding, there are no guarantees about the order in which they will arrive. If you are pushing the data into a FIFO rather than into memory, where you can't just use the stored address from the tag memory, then you will have to modify the example to allow only one outstanding tag at a time. I know there is a generic define that specifies the number of allowable tags, but I don't think it will be as simple as setting that to 1. I'm pretty sure it uses a fixed tag ID for the descriptors, and possibly for other things as well. 
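
To make that concrete, here is a minimal sketch of the tag-tracking scheme described above. This is not the actual reference design code; all the names here (req_tag, cpl_tag, etc.) are my own invention. The EP address is stored against the tag when a read request is issued, and looked back up when a completion carrying that tag arrives: 

-- Hypothetical sketch of tag-ID address tracking; names are assumptions,
-- not reference design signals. The array plays the role of tag_dpram.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tag_tracker is
  port (
    clk       : in  std_logic;
    -- request side: a read request leaves with this tag and EP address
    req_valid : in  std_logic;
    req_tag   : in  std_logic_vector(4 downto 0);   -- up to 32 tags
    req_addr  : in  std_logic_vector(31 downto 0);
    -- completion side: data returns carrying the same tag
    cpl_tag   : in  std_logic_vector(4 downto 0);
    cpl_addr  : out std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of tag_tracker is
  type addr_ram_t is array (0 to 31) of std_logic_vector(31 downto 0);
  signal tag_dpram : addr_ram_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if req_valid = '1' then
        -- remember where this tag's data belongs in EP memory
        tag_dpram(to_integer(unsigned(req_tag))) <= req_addr;
      end if;
      -- registered lookup: write address for the completion just received
      cpl_addr <= tag_dpram(to_integer(unsigned(cpl_tag)));
    end if;
  end process;
end architecture;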

 

Hope this helps. 

 

Kevin
Altera_Forum
Honored Contributor II

Hi Kevin, 

 

Thanks for the swift reply.  

 

I can understand that multiple outstanding requests could be serviced out of order to eke out the best performance, but I wonder if they could be handled atomically, i.e. once the core starts pumping out data for a particular tag ID, it only provides data for that tag until it completes, even if a few clock cycles go unused. 

 

Looks like I'll have to delve into the ref design more deeply rather than just using it! 

 

Cheers, 

 

Dwayne
Altera_Forum
Honored Contributor II

You could put more FIFOs in front of the one you want to use, one FIFO for each possible outstanding tag ID. Then push the data from the PCIe core into those FIFOs according to tag ID and pull it out on the other end, re-synchronized into the order in which you requested them, as sketched below. I don't know if you have the space for that, but it's possibly a simple solution. 
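
A minimal sketch of that per-tag buffering idea, assuming tags are issued round-robin 0, 1, ..., NUM_TAGS-1 so that draining the buffers in that same order restores request order, and assuming a tag is not reused until it has been drained. All names and the interface are my own, not the reference design's: 

-- Hypothetical per-tag reorder buffers; simplified sketch, not
-- production code. wr_last marks the final beat for a tag.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tag_resync is
  generic (
    NUM_TAGS : natural := 8;
    DEPTH    : natural := 32;   -- 512 B payload / 16 B per 128-bit beat
    WIDTH    : natural := 128
  );
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;
    -- write side: completion beats, in whatever order the RC returns them
    wr_valid : in  std_logic;
    wr_tag   : in  natural range 0 to NUM_TAGS - 1;
    wr_data  : in  std_logic_vector(WIDTH - 1 downto 0);
    wr_last  : in  std_logic;
    -- read side: beats re-serialised into tag-issue order
    rd_valid : out std_logic;
    rd_data  : out std_logic_vector(WIDTH - 1 downto 0)
  );
end entity;

architecture rtl of tag_resync is
  type mem_t is array (0 to NUM_TAGS - 1, 0 to DEPTH - 1)
                of std_logic_vector(WIDTH - 1 downto 0);
  type ptr_t is array (0 to NUM_TAGS - 1) of natural range 0 to DEPTH;
  signal mem     : mem_t;
  signal wr_ptr  : ptr_t := (others => 0);
  signal done    : std_logic_vector(NUM_TAGS - 1 downto 0) := (others => '0');
  signal cur_tag : natural range 0 to NUM_TAGS - 1 := 0;
  signal rd_ptr  : natural range 0 to DEPTH := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      rd_valid <= '0';
      if rst = '1' then
        wr_ptr  <= (others => 0);
        done    <= (others => '0');
        cur_tag <= 0;
        rd_ptr  <= 0;
      else
        -- fill the per-tag buffers as completions arrive, in any order
        if wr_valid = '1' then
          mem(wr_tag, wr_ptr(wr_tag)) <= wr_data;
          wr_ptr(wr_tag) <= wr_ptr(wr_tag) + 1;
          if wr_last = '1' then
            done(wr_tag) <= '1';
          end if;
        end if;
        -- drain the current tag only once it has fully completed
        if done(cur_tag) = '1' then
          if rd_ptr < wr_ptr(cur_tag) then
            rd_data  <= mem(cur_tag, rd_ptr);
            rd_valid <= '1';
            rd_ptr   <= rd_ptr + 1;
          else
            done(cur_tag)   <= '0';
            wr_ptr(cur_tag) <= 0;
            rd_ptr          <= 0;
            cur_tag         <= (cur_tag + 1) mod NUM_TAGS;
          end if;
        end if;
      end if;
    end if;
  end process;
end architecture;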

 

Good luck, 

Kevin
Altera_Forum
Honored Contributor II

Has this issue been solved? 

How did you solve it? I'm seeing the same issue. 

Thanks.
Altera_Forum
Honored Contributor II

Hi Nodec, 

 

No, we still haven't got to the bottom of this and are still investigating. We have noticed that we don't see the re-ordering occurring on a Xeon/X58-based platform, but we do see it on an nVidia GeForce 9300-ITX MCP7A-based Zotac platform. We haven't been able to get any details on the MCP7A chipset to see how it deals with relaxed ordering - note that we set the relaxed-ordering attribute in the read-request headers to request strict ordering rather than relaxed ordering. Maybe nVidia assumes it can use relaxed ordering to maximise throughput, on the assumption that the PCIe link is being used to transfer graphics data?! 
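
For reference, the ordering attributes sit in bits 13:12 (Attr[1:0]) of the first DWord of the TLP header, so a strict-ordering request leaves Attr[1] clear. A fragment, where tlp_hdr_dw0 is my own name for the header DWord being assembled, not a reference design signal: 

-- Hypothetical fragment: tlp_hdr_dw0 is the first header DWord of the
-- outgoing MRd TLP being built (assumed name, not from the ref design)
tlp_hdr_dw0(13) <= '0';  -- Attr[1] = 0: strict ordering ('1' permits relaxed)
tlp_hdr_dw0(12) <= '0';  -- Attr[0] = 0: No Snoop disabled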

 

So our current workaround is to use a Xeon/X58 platform, but that has thrown up another problem: we see extraneous writes to our EP memory map every 13ms, which gradually corrupt our whole memory map over time. We're using SignalTap today to work out what type of packets are coming across from RC to EP to cause this. 

 

Cheers, 

 

Dwayne
Altera_Forum
Honored Contributor II

Your 4k requests are split twice: first, in your CDMA wrapper, leading to multiple – say, 8 – read requests of 512 bytes each, respecting the max_read_request_size parameter and, hopefully, the 4k read-request boundary. Second, the responses are split by the completer into multiple packets for each request. It's impossible to tell how many completion packets the RC will send, as it only has to adhere to the Read Completion Boundary (RCB), so each of your read requests might end up being serviced in 64-byte chunks, i.e. 8 or 9 completion packets per read request. 
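
Putting numbers on that for the 4kB case (assuming a 64-byte RCB, which is what x86 root complexes typically report): 

4096 B transfer / 512 B max_read_request_size = 8 read requests 
512 B request / 64 B RCB = 8 completion TLPs per request (9 if a request does not start on an RCB boundary) 
=> roughly 64 to 72 completion TLPs for one 4 kB transfer, in order within each request, freely interleaved across requests 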

 

While the completion packets for a single read request must be transferred in order relative to each other, they can interleave in any possible way with the completions for the other read requests. This is independent of the relaxed-ordering attribute of your read request, and you cannot avoid this behavior of PCIe. Sure, a chipset is not required to chop completions up artificially, and it seems your Xeon chipset keeps them in order, but it is not required to do so, and your MCP7A platform seems to make use of that freedom. 

 

So, as soon as you issue multiple PCIe read requests, you have to be prepared for out-of-order arrival across the requests – remember, the completions within each single request do come in order. If you have to feed a FIFO internally, be prepared to instantiate some buffering in between: either one FIFO per parallel request, or a unified buffer memory block which you write out of order, as the completions arrive, and read in order, as sketched below. 
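
Here is a rough sketch of the unified-buffer option (my own names throughout, and simplified: it waits for the whole block before draining, and re-arming for the next block is omitted). Each completion beat is written at the offset its address implies; once all beats have landed, the buffer is read out linearly into the downstream FIFO: 

-- Hypothetical unified reorder buffer for one 4 kB block; a sketch
-- under the stated assumptions, not reference design code.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity reorder_buffer is
  generic (
    BEATS : natural := 256   -- 4 kB / 16 B per 128-bit beat
  );
  port (
    clk       : in  std_logic;
    rst       : in  std_logic;
    -- write side: each completion beat carries its offset in the block
    wr_valid  : in  std_logic;
    wr_offset : in  natural range 0 to BEATS - 1;  -- completion addr / 16
    wr_data   : in  std_logic_vector(127 downto 0);
    -- read side: in-order stream once the block is complete
    rd_valid  : out std_logic;
    rd_data   : out std_logic_vector(127 downto 0)
  );
end entity;

architecture rtl of reorder_buffer is
  type ram_t is array (0 to BEATS - 1) of std_logic_vector(127 downto 0);
  signal ram    : ram_t;
  signal count  : natural range 0 to BEATS := 0;  -- beats received so far
  signal rd_ptr : natural range 0 to BEATS := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      rd_valid <= '0';
      if rst = '1' then
        count  <= 0;
        rd_ptr <= 0;
      else
        if wr_valid = '1' then
          ram(wr_offset) <= wr_data;   -- land the beat where it belongs
          count <= count + 1;
        end if;
        -- drain in order only after the whole block has arrived
        if count = BEATS and rd_ptr < BEATS then
          rd_data  <= ram(rd_ptr);
          rd_valid <= '1';
          rd_ptr   <= rd_ptr + 1;
        end if;
      end if;
    end if;
  end process;
end architecture;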

 

Keep an eye on completion timeout functionality. If you don’t design it in from the start, you might end up with a rewrite later. 
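
For example, a per-transfer watchdog along these lines (the names and the 25 ms value are my own assumptions; the PCIe spec's default completion timeout range is 50 us to 50 ms). It restarts whenever a completion beat arrives and flags a timeout if the remaining completions never show up; the recovery action, e.g. re-issuing the request, is left out: 

-- Hypothetical completion-timeout watchdog; a sketch, not a drop-in.
library ieee;
use ieee.std_logic_1164.all;

entity cpl_timeout is
  generic (
    TIMEOUT_CYCLES : natural := 6_250_000  -- ~25 ms at 250 MHz
  );
  port (
    clk         : in  std_logic;
    rst         : in  std_logic;
    req_pending : in  std_logic;  -- at least one read request outstanding
    cpl_beat    : in  std_logic;  -- a completion beat just arrived
    timeout     : out std_logic
  );
end entity;

architecture rtl of cpl_timeout is
  signal count : natural range 0 to TIMEOUT_CYCLES := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' or cpl_beat = '1' or req_pending = '0' then
        count   <= 0;             -- progress observed: restart the clock
        timeout <= '0';
      elsif count = TIMEOUT_CYCLES then
        timeout <= '1';           -- completions stalled; recovery omitted
      else
        count <= count + 1;
      end if;
    end if;
  end process;
end architecture;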

 

Actually, there is a lot of magic in the CDMA design example; it is not exact where it should be, and it is sometimes obviously buggy. It is nothing I would spin a product off of. E.g., look into altpcierd_cpld_rx_buffer.vhd and search for »TODO corner case when simultaneous RX TX«. The example does get you somewhere, but I prefer to make my own mistakes instead of debugging missing support for corner cases. And not all of the missing/wrong code in the CDMA example documents itself as such …