dcfifo timing details

Altera_Forum · 08-27-2009

I am trying to find timing information for dcfifo, but can't find it anywhere. The "Single- and Dual-Clock FIFO Megafunction User Guide" discusses flag latency, but does not say which ports are flopped in or flopped out. For example, one would think that rdempty is flopped using rdclk, but it is NOT (my design couldn't meet timing until I explicitly flopped rdempty)!

Also, why would rdusedw[] have a 2-rdclk latency from rdreq (Table 3 on p. 8 of the user guide)? This signal is generated in the rdclk domain and thus should have a 1-rdclk latency from rdreq. The extra cycle of latency really complicates my design. Is it possible to tweak the megacore somehow?

Altera_Forum · 08-27-2009

The rdusedw has to have a 2-clock latency in a DC FIFO because it has to take the difference between the read and write counters, which are in two different clock domains, and it needs to synchronise them - hence the extra clock.

As for timing problems with rdempty - what are you doing with this signal? Are you doing some logic with it?

Altera_Forum · 08-27-2009

I have the same problem and don't know how to resolve it.

Altera_Forum · 08-27-2009

First off, the behavior of the flags and pointers is a complete pain to predict. It often does not seem logical. The reasons behind this are that a) crossing clock domains is very difficult and b) they are built to be very robust.

The read/write pointers are just free-running counters, so the only way of knowing where you are is to compare one to the other, and that means crossing clock domains. That also means encoding the count values as gray codes, de-metastabilizing the transfers into the other domain, and then comparing. Then you have to take into account that the FIFO handles some pretty mismatched clock rates. Let's say you write to the FIFO with a 4ns clock and read with a 10ns clock. You could get two writes without the read clock having a single edge to update you on any changes, so anything in that domain (rdempty, rdusedw) is going to be off by 2.
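The pointer crossing described above can be sketched roughly as follows. This is a minimal illustration and not the actual dcfifo internals: the module name, port names, and the width parameter `W` are all assumptions made for the example.

```verilog
// Sketch only: passes a binary write pointer into the rdclk domain.
// All names here are hypothetical, not taken from dcfifo.
module wrptr_to_rdclk #(
    parameter W = 9                      // pointer width (assumed)
) (
    input  wire         wrclk,
    input  wire         rdclk,
    input  wire [W-1:0] wr_ptr_bin,      // free-running binary write pointer
    output wire [W-1:0] wr_ptr_gray_rd   // gray-coded pointer, synchronized to rdclk
);
    // Register the gray code in the wrclk domain so that only one bit
    // changes per increment and no combinational glitches are sampled.
    reg [W-1:0] gray_wr = {W{1'b0}};
    always @(posedge wrclk)
        gray_wr <= wr_ptr_bin ^ (wr_ptr_bin >> 1);

    // Two-flop synchronizer: gives a metastable first stage a full
    // rdclk period to settle before the value is used.
    reg [W-1:0] sync1 = {W{1'b0}};
    reg [W-1:0] sync2 = {W{1'b0}};
    always @(posedge rdclk) begin
        sync1 <= gray_wr;
        sync2 <= sync1;
    end

    assign wr_ptr_gray_rd = sync2;
endmodule
```

Because of the two synchronizer stages, the read side always sees a write pointer that is a couple of rdclk cycles old, which is one reason the read-side flags lag behind the true fill level.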

The common complaint (although I haven't seen it recently) is that a flag goes high before the FIFO is empty. For example, it might say it's empty when you know you haven't read everything out, but then a clock or two later, even though nothing has happened, the empty flag goes low and you can read the rest out (this might occur on the full side and not the empty side, I can't remember). The reason for this is that the FIFO has to assume that, under worst-case conditions with different clock rates, it could be empty.

(I've seen people design their own FIFOs, and it's easy to create something that works nicely under a few tests, but over time they always run into weird corner cases where data gets dropped, the FIFO rolls over, or something else like that.)

So I basically write logic that follows the rdempty and wrfull flags, i.e. it won't read or write when told not to. The usedw values are useful as a general measure of fullness, i.e. am I 1/4 full or 3/4 full, but do not rely on them at the full/empty boundaries.

One other case I've seen is the "I know I fill the FIFO with 256 consecutive writes, and then read it out with 256 consecutive reads, so if the full/empty flag goes early, I have no way to back-pressure my system." The solution for that is to disable the protection circuitry in the MegaWizard and just let it go, i.e. you may write when the FIFO thinks it's full, but if you know it's not, it works out. Of course, if your assumptions about the design's behavior are wrong, it could cause problems.

Altera_Forum · 08-27-2009

First of all, crossing clock domains is a well-known problem and it's no longer difficult. There are canned solutions and numerous proven designs, for example. I myself have done many async FIFO designs in ASICs and full-custom microprocessors and never had any issue. The reason I didn't use my own FIFO design here is that I see that dcfifo uses fewer resources (probably because it's better optimized using some internal "secret sauce"). I also want to avoid some extra work with set_false_path etc. in TimeQuest. Last but not least, I hope that through this discussion Altera can improve dcfifo to benefit more designers. I have been doing ASIC and full-custom IC design for 15 years, but I am very new to FPGA design.

As for the robustness, it is understood that flags could be "conservative". For example, rdempty may go high when the FIFO is not empty. However, the flags should never be "aggressive" (e.g. rdempty becomes 0 when the FIFO is empty) as it may cause FIFO underflow or overflow. As long as all flags are conservative, the design will be robust.

However, some flags in dcfifo are NOT conservative. For example, rdusedw has a 2-rdclk latency from rdreq, which means that in the first cycle after rdreq, rdusedw will not decrease (but it can increase due to a previous write). The result is that rdusedw falsely indicates that the FIFO has more items than it does -- and this can cause FIFO underflow. In fact, since rdusedw is not flopped, it would require an extra cycle to flop it if timing is critical, and this makes the problem even worse.

Can some Altera guru suggest any way to tweak dcfifo for the issues mentioned above?

P.S. For readers who are interested in the issues in async FIFOs as well as solutions, Clifford Cummings' SNUG02 paper "Simulation and Synthesis Techniques for Asynchronous FIFO Design" provides a good summary.

Altera_Forum · 08-27-2009

--- Quote Start ---

The rdusedw has to have a 2-clock latency in a DC FIFO because it has to take the difference between the read and write counters, which are in two different clock domains, and it needs to synchronise them - hence the extra clock.

--- Quote End ---

This doesn't explain it. I am asking about the latency from rdreq to rdusedw, and this can be done in one cycle no matter how long it takes to synchronize the write pointer into the read domain. Even worse, rdusedw is NOT flopped, so closing timing would require an extra cycle, making it a 3-rdclk latency from rdreq.

--- Quote Start ---

As for timing problems with rdempty - what are you doing with this signal? Are you doing some logic with it?

--- Quote End ---

I am using rdempty (along with rdusedw and q) to generate the next rdreq. My FIFO stores different kinds of elements. For certain elements I must maintain atomicity, i.e. either don't send out anything or burst out the entire element back-to-back. So I use "read ahead" mode to check the data. If it's the header for an atomic transfer, then I check rdusedw to make sure the FIFO has the entire element. Otherwise I can read whenever rdempty is 0. So the logic is something like this (parentheses added to make the operator precedence explicit):

rdreq = (rdusedw > `ATOM_LEN) || ((q[`TYPE] != `ATOM_TYPE) && !rdempty);

Altera_Forum · 08-27-2009

I'm not saying it's not understood, I'm saying most users who complain don't understand the complexities (i.e. a decent number of people on this forum are probably in college). You obviously do...

I'm not sure how you want to remove the two-cycle latency (which obviously makes the full path slower, if it doesn't meet timing with those cycles in there already). Speed, area and latency are further trade-offs this has to deal with.

The latencies are pretty well documented:

http://www.altera.com/literature/ug/ug_fifo.pdf#page=14

My point in the previous post was not to rely on rdusedw to determine if the FIFO is empty, as rdusedw is not intended for that. But since you know it has a latency of two, it's probably easier to decode that (double-register the rdreq in parallel, and if rdusedw is at 2 and there were two rdreqs on the last two cycles, assume it is empty). It's more complicated than that, but it's just an idea.
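A rough sketch of that idea is below. It assumes that the reads requested on the previous two rdclk cycles are exactly the ones not yet reflected in rdusedw; the real window should be checked against Table 3 of the user guide and in simulation. The module name, the outputs `rdusedw_est` and `assume_empty`, and the parameter `W` are hypothetical.

```verilog
// Sketch only: compensates rdusedw for reads that are still "in flight".
// Assumption: the last two registered rdreq values are not yet visible
// in rdusedw. All names are illustrative.
module rdusedw_comp #(
    parameter W = 9                     // width of rdusedw (assumed, >= 2)
) (
    input  wire         rdclk,
    input  wire         rdreq,
    input  wire [W-1:0] rdusedw,        // from dcfifo
    output wire [W-1:0] rdusedw_est,    // pessimistic word count
    output wire         assume_empty    // treat the FIFO as empty
);
    // Double-register rdreq in parallel with the FIFO.
    reg rdreq_d1 = 1'b0;
    reg rdreq_d2 = 1'b0;
    always @(posedge rdclk) begin
        rdreq_d1 <= rdreq;
        rdreq_d2 <= rdreq_d1;
    end

    // Number of recent reads assumed not yet counted (0, 1 or 2).
    wire [1:0] pending = rdreq_d1 + rdreq_d2;

    // If the pending reads could drain everything rdusedw reports,
    // assume the FIFO is empty; otherwise subtract them.
    assign assume_empty = (rdusedw <= pending);
    assign rdusedw_est  = assume_empty ? {W{1'b0}} : (rdusedw - pending);
endmodule
```

Subtracting too many pending reads only makes the estimate more pessimistic, so erring in that direction costs throughput rather than causing underflow.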

Altera_Forum · 08-28-2009

I know there is a trade-off between latency and cycle time. All I want is to make the latency from rdreq to rdusedw one clock cycle instead of 3 (since rdusedw is not flopped, you need an extra cycle). This is very feasible, because you just need to do something like this:

wire [] write_delta = write_ptr_synced_in_rdclk - write_ptr_last;

always @(posedge rdclk)
begin
    write_ptr_last <= write_ptr_synced_in_rdclk;
    rdusedw <= rdusedw + write_delta - rdreq;
end

I omitted reset logic but that's simple.
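For reference, a self-contained version of the snippet above with reset added might look like the following. It is only a sketch: the module name, the active-low reset `rst_n`, and the width parameter `W` are assumptions, and `write_ptr_synced_in_rdclk` is taken to be a registered binary value that is already in the rdclk domain.

```verilog
// Sketch only: read-side word counter with one rdclk of latency from rdreq.
// Assumes write_ptr_synced_in_rdclk is binary and already synchronized.
module rdusedw_1cycle #(
    parameter W = 9                              // pointer/count width (assumed)
) (
    input  wire         rdclk,
    input  wire         rst_n,                   // hypothetical active-low reset
    input  wire         rdreq,                   // must only be high when rdusedw != 0
    input  wire [W-1:0] write_ptr_synced_in_rdclk,
    output reg  [W-1:0] rdusedw
);
    reg  [W-1:0] write_ptr_last;

    // Words written since the previous rdclk edge, as seen from rdclk.
    // Modulo-2^W subtraction handles pointer wrap-around.
    wire [W-1:0] write_delta = write_ptr_synced_in_rdclk - write_ptr_last;

    always @(posedge rdclk or negedge rst_n) begin
        if (!rst_n) begin
            write_ptr_last <= {W{1'b0}};
            rdusedw        <= {W{1'b0}};
        end else begin
            write_ptr_last <= write_ptr_synced_in_rdclk;
            // Writes arrive late (after synchronization), reads are
            // counted immediately, so the count never overstates.
            rdusedw        <= rdusedw + write_delta - rdreq;
        end
    end
endmodule
```

The counter only stays correct if the surrounding logic never asserts rdreq while rdusedw is zero, and if `W` is wide enough to represent a completely full FIFO.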

I have read the user guide in your link (as pointed out in my original post), but the document doesn't mention anything about which ports are flopped in or flopped out. In order to meet timing in a high-speed design, you need to add one more cycle to the latencies listed in Table 3.

Your suggestion is exactly what I had to do to work around the long latency issue, but the actual design is much more complicated because the long latency makes it impossible to pipeline (such that the FIFO can be read every cycle when possible). Therefore I had to implement another small FIFO to do the look-ahead logic. Had dcfifo implemented rdusedw with a 1-rdclk latency as I suggested, it would have made my life, as well as many other engineers' lives, much easier. I hope Altera will consider improving dcfifo in future releases.

Altera_Forum · 08-28-2009

I've actually never seen that request before, but I can see how it would help your application (and someone else chimed in that they've seen this). I'll see what I can do to get someone to look at it. My concern is that, in your application, I don't think what you have would meet timing anyway. Note that "write_ptr_synced_in_rdclk" is a gray-coded value. You can't do simple math with a gray code (at least not carry-chain math), so you need to either convert it back to binary or go through a large decode just to do that first line. So even if it were made to work in 1 clock (and at first glance it looks all right, but I haven't really dissected it), I doubt it would make timing.
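To illustrate the conversion cost being discussed, a generic gray-to-binary converter can be sketched as below. It is a textbook formulation, not the dcfifo implementation, and the module name and parameter `W` are assumptions.

```verilog
// Sketch only: generic gray-to-binary conversion.
// Binary bit i is the XOR of gray bits W-1 down to i.
module gray2bin_sketch #(
    parameter W = 9                  // code width (assumed)
) (
    input  wire [W-1:0] gray,
    output reg  [W-1:0] bin
);
    integer i;
    always @* begin
        for (i = 0; i < W; i = i + 1)
            bin[i] = ^(gray >> i);   // reduction XOR of gray[W-1:i]
    end
endmodule
```

The least significant binary bit depends on every gray bit, so the logic depth grows with `W` and does not map onto the carry chain. Registering the converted value before doing arithmetic on it trades one more cycle of write-side lag for a shorter path.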

To get a better idea of what's going on, go to the RTL viewer. The behaviour is pretty understandable. Note that the combinatorial output for rdusedw is just a single adder, which should be able to use the carry chain. If that can't meet timing in your design, I ... What are your read and write clock frequencies? I'm trying to think of another way, and that might help.

Altera_Forum · 08-28-2009

My "write_ptr_synced_in_rdclk" is actually the synced gray code converted back into binary, and I've used this logic in many other designs, so it's guaranteed to work. The key point here is that no matter how long it takes to sync the write pointer into the rdclk domain, rdusedw only has a one-rdclk latency from rdreq. And more importantly, rdusedw is conservative, which means that the FIFO would never underflow if read requests are made using rdusedw. For details on my read logic, you can refer to my post #6 above.

Why would you think rdusedw would not meet timing? write_ptr_synced_in_rdclk and write_ptr_last are both flopped, and rdreq is a fresh input. Unless rdreq is a late arrival, the logic should have no problem meeting timing.

Is there a simulation model or source code for dcfifo? That would help me understand the timing. Also, is it possible to pull out some internal signals (e.g. the write pointer synced into the rdclk domain) and use them in my RTL?

FYI, my FIFO has wrclk=133MHz and rdclk=200MHz.

Altera_Forum · 08-28-2009

Synthesize the FIFO by itself and launch the RTL viewer (or just dive down to it from your full design). The actual synthesized nodes are a pain, so the combinatorial names in TimeQuest won't help, but the RTL viewer should be good enough to give a sense of what's going on, specifically when you see sub-hierarchies with names like gray2bin. Just push down into the FIFO, and use right-click to filter from/to things you've highlighted. It's not going to give exact details, but it's enough to know where the pipeline stages are, what's being compared, etc.

Altera_Forum · 08-28-2009

Instead of raising this on the forum, it may be better to raise it via mySupport on the Altera website.