Re: Arria 10 EMIF waitrequest_n, operating conditions for it to suddenly go low and stay low forever

RHIER1 · 11-18-2021

I am using altera_emif to control DDR4 for video frame buffering. The design was done in Quartus 16.0 and has been in production for some time. A customer is now reporting that, after working properly for some time, the video output screen occasionally goes black. This may be related to changes in some of the video input rasters, but we have no hard evidence or correlations yet because it happens so infrequently.

On at least one occasion (all we have been able to see so far), while we were monitoring the amm_ready (a.k.a. waitrequest_n) signal from the EMIF, that signal appeared to go low (and stick low) around the time the screen went black, which makes sense if nothing can go through the frame buffers. At that time, emif_user_reset_n did not go low (that we know of), ddr4_pll_locked and local_cal_success stayed high, and local_cal_fail stayed low. Asserting the global_reset_n signal seems to have gotten things going again, at least in a couple of cases so far.

Can someone tell me, from a design perspective, what conditions (as seen by the EMIF) might cause the EMIF to suddenly decide that it is not (ever) ready to accept data? From that, I might have a shot at tracing those conditions back to whatever might be initiating the problem (unless, of course, it is a spontaneously occurring bug). Thanks...

AdzimZM_Intel · 11-24-2021

Hi Sir,

May I know which operation was being performed when the issue occurred?

Were you executing a burst operation?

Do you see any pattern in when this issue happens?

Or does it just occur randomly?

Thanks,

Adzim

RHIER1 · 12-03-2021

Sorry, I didn't see this earlier; I had thought that I might get some notification if there were a reply. Unfortunately, I'm not sure that I know much about burst operation. As to a pattern, as I said, it may possibly be related to changes in some of the video input rasters, but there is no hard evidence or correlation yet because it happens so infrequently.

AdzimZM_Intel · 12-08-2021

Hi Sir,

Maybe the KDB article at the link below can explain the waitrequest behaviour.

https://www.intel.com/content/www/us/en/support/programmable/articles/000084591.html

I'm not sure how to trace the root cause, as it does not happen frequently.

Has the device passed calibration?

Regards,

Adzim

RHIER1 · 12-15-2021

Actually, another Knowledge Article (entitled "Local ready signal issues with Altera external memory controller IP") which was sent to me seems to fit our situation somewhat more closely, although again we have fairly few details to go on. To answer the earlier questions as well as I can: yes, we are apparently using burst mode, and yes, it appears that the DDR4 devices have successfully passed calibration (the local_cal_success signal remains high, and the local_cal_fail signal remains low). The aforementioned article mentions one common potential cause of the "local_ready" signal (in our Quartus 16.0 case "amm_ready" / "waitrequest_n") going low and staying that way forever ("effectively locking up the controller and preventing any further accesses...at this point nothing can start the controller again other than a reset"): beginning a burst of writes but not providing all beats of that write before requesting other commands/actions.
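To make the incomplete-write-burst condition concrete, here is a minimal offline sketch in Python that walks a per-clock trace of the Avalon-MM signals and flags points where a write burst still owes beats. It is only an illustration: the `Cycle` record, the `check_write_bursts` function, and the `max_idle` threshold are hypothetical names, the trace format is an assumption (nothing like it exists in our logs as-is), and it assumes a write beat is accepted on any clock where amm_write and amm_ready are both high.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    """One clock of sampled Avalon-MM signals (hypothetical trace format)."""
    ready: bool       # amm_ready (waitrequest_n)
    write: bool       # amm_write
    read: bool        # amm_read
    burstcount: int   # amm_burstcount, only meaningful on the first beat


def check_write_bursts(trace, max_idle=16):
    """Return (cycle_index, message) pairs for suspicious write-burst behavior."""
    problems = []
    remaining = 0  # beats still owed for the write burst in flight
    idle = 0       # consecutive clocks without amm_write while beats are owed
    for i, c in enumerate(trace):
        # A read request in the middle of a write burst is the classic
        # "other command before all beats were provided" case.
        if remaining > 0 and c.read:
            problems.append(
                (i, f"read asserted with {remaining} write beat(s) outstanding"))
        if c.write and c.ready:
            # Accepted beat: either the first beat of a new burst or
            # one more beat of the burst already in flight.
            if remaining == 0:
                remaining = c.burstcount - 1
            else:
                remaining -= 1
            idle = 0
        elif remaining > 0 and not c.write:
            # The master stopped supplying data part-way through a burst.
            idle += 1
            if idle == max_idle:
                problems.append(
                    (i, f"write idle for {max_idle} cycles with "
                        f"{remaining} beat(s) outstanding"))
    if remaining > 0:
        problems.append(
            (len(trace), f"trace ended with {remaining} write beat(s) outstanding"))
    return problems
```

In hardware, the same bookkeeping would be a small beat counter next to each write port of the arbiter; the Python form is only meant to show what "all beats provided" means in terms of the signals we can sample.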

Unfortunately, this example seems to assume that a user is trying to bring up and debug a system which never works, presumably having been designed incorrectly from the start. It also mentions that "determining if a write burst is incomplete at one particular point in a design can be difficult", apparently even when this behavior is always present, as it presumably is in the example. They suggest using either simulation or Signal Tap to monitor various signals, particularly including "enough_data_to_write" and "proper_beats_in_fifo", and then identifying and fixing the precise location in the design where the error exists based on that investigation. Neither of those signals seems to appear anywhere in our project, perhaps because the article was based on a much earlier version of Quartus than we are using.

In our situation, the design has been working quite well in the field (as well as in all our internal testing) for some time and over a wide range of applications, but now a particular customer is reporting very occasional instances of the system locking up (apparently in this manner). Our application is a video display monitor which accepts several independent video sources concurrently. Each of those video sources is associated with a write and a read port, all accessing the block of DDR4 via an arbiter circuit, which I believe may have come originally from an Altera reference design. The lockup may occur at random times over hours, days, or weeks of otherwise normal operation. It appears to be associated with occasional disruptions in the video signal(s) from a particular (and VERY expensive) IR camera, which is connected to one or two of our video inputs and which we cannot access directly.

Try as we might, we have not been able to replicate such failures at our facility, so neither simulation nor Signal Tap seems to be viable here, since basically all we would be able to see is normal operation. We have provided the customer with customized FPGA content, firmware, and software which samples as many signals as we could find to be potentially useful and logs them in a file, which the customer sends to us in the event of a failure. So far we have received just two such sets of event logs, obtained over the last couple of months or so. In these cases, we do see the amm_ready signal going low and staying low at around the time their screen goes black (corresponding to the DDR4 system "locking up" as described), and those events occurred roughly when the video from the IR camera had apparently been disrupted in some fashion. However, we have seen no other indications in those logs as to how or why burst operations might have been disturbed, or of whatever else may be causing the problem.
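For what it is worth, the check applied to those logs amounts to separating ordinary short backpressure on amm_ready from a final low run that never recovers. A minimal Python sketch of that check follows; it assumes the log can be reduced to a flat list of 0/1 samples of amm_ready, and both the function name `find_stuck_low` and the `min_run` threshold are hypothetical, not part of our actual logging tools.

```python
def find_stuck_low(samples, min_run):
    """Return the index where the signal goes low and stays low until the
    end of the trace, provided that final low run is at least `min_run`
    samples long. Return None if the trace ends high or the run is shorter.
    """
    run_start = None
    for i, level in enumerate(samples):
        if level:
            # Signal recovered, so any earlier low run was only backpressure.
            run_start = None
        elif run_start is None:
            # First low sample of a new run.
            run_start = i
    if run_start is not None and len(samples) - run_start >= min_run:
        return run_start
    return None
```

The index returned marks the start of the lockup in the log, which is the point around which the other sampled signals (video input status, arbiter state, and so on) are worth lining up.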

As you can see, this seems to be quite a difficult debug situation; any further ideas (or even just guiding questions) you might have would be greatly appreciated.