I have x16 PCIe Gen2 data coming into the FPGA. Each lane is deserialized to 32 bits, so I have 512 bits of data every clock tick that I need to weed through to determine the address, data, and command. It gets complicated because a PCIe request sometimes spans multiple clocks, and sometimes a single clock tick contains multiple PCIe commands. This requires a pipelined design. I am using Verilog for this design. I would like to know if there is any documentation available that can guide me on how to tackle this problem systematically.
I don't know of any existing material, but I recommend a micro-core-based approach for this. That just means you use multiple small cores, each tuned for a specific task, and assemble them to reach the end goal. This is commonly done in packet processing when you need to strip header and command information out of a packet in a sensible and reusable way.

In your case, unless I'm missing something, I could see the 32-to-512-bit deserialization being one core with a streaming output. That output feeds more cores downstream that perform the address and command decode, which in turn feed what I presume would be a master if you need to pack datagrams into memory.
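
To make the idea of a micro core with a streaming interface more concrete, here is a minimal sketch of one decode stage in Verilog. Everything in it is an assumption for illustration: the module name `hdr_decode_stage`, the valid/ready handshake, the `s_sop` start-of-request flag, and above all the bit positions of the command and address fields, which are placeholders and not the real PCIe TLP header layout. It only covers the case where a request spans several clocks; splitting multiple commands that arrive in one clock would need a separate alignment core upstream.

```verilog
// Hypothetical micro core: one pipeline stage with a valid/ready streaming
// interface. Field positions below are placeholders, NOT the real TLP layout.
module hdr_decode_stage #(
    parameter DATA_W = 512,
    parameter CMD_W  = 8,
    parameter ADDR_W = 32
) (
    input  wire              clk,
    input  wire              rst,      // synchronous, active high

    // Upstream side (e.g. from the deserializer core)
    input  wire [DATA_W-1:0] s_data,
    input  wire              s_sop,    // marks the first word of a request
    input  wire              s_valid,
    output wire              s_ready,

    // Downstream side (to the next core in the chain)
    output reg  [DATA_W-1:0] m_data,
    output reg  [CMD_W-1:0]  m_cmd,
    output reg  [ADDR_W-1:0] m_addr,
    output reg               m_sop,
    output reg               m_valid,
    input  wire              m_ready
);

    // Accept a new word when the output register is empty or being drained.
    assign s_ready = !m_valid || m_ready;

    // Valid flag: the only state that needs a reset.
    always @(posedge clk) begin
        if (rst)
            m_valid <= 1'b0;
        else if (s_ready)
            m_valid <= s_valid;
    end

    // Data path: registered whenever a word is transferred.
    always @(posedge clk) begin
        if (s_ready && s_valid) begin
            m_data <= s_data;
            m_sop  <= s_sop;
            // Latch command and address only on the first word, so they stay
            // stable while a multi-clock request streams through.
            if (s_sop) begin
                m_cmd  <= s_data[0  +: CMD_W];   // placeholder field position
                m_addr <= s_data[64 +: ADDR_W];  // placeholder field position
            end
        end
    end

endmodule
```

Because every core exposes the same handshake on both sides, stages like this can be chained, reordered, or reused without touching their neighbours, which is the main point of the micro-core approach.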