FPGA instability after reset

Altera_Forum · ‎02-24-2013

Hi all,

I am working with custom Cyclone IV E based HW that suffers from a strange startup instability.

The HW runs fine under various stress tests and temperature ranges when programmed through JTAG, but if the FPGA configures from EPCS flash, the main CPU (Nios II) is likely to hang inside the bootloader (which runs from internal RAM). If the bootloader manages to complete, the system runs fine. The hang traces to SDRAM access, specifically copying firmware code from EPCS to SDRAM. Interestingly, if an SDRAM test program is downloaded through JTAG after the hang (with or without FPGA reconfiguration), it runs without errors. Similarly, the bootloader runs fine when programmed through JTAG.

The hang is more likely to occur when using the reset switch than at power-on (the reset switch is wired to nCONFIG). The internal reset (CPU, Ethernet PHY, PLLs, ...) is similar to the 'global reset generator' used in many Altera examples, but instead of one reset signal I use two: they assert simultaneously and deassert in sequence. The first reset to deassert goes to the PLLs and the second to the CPU, ... What I've found is that the instability is somehow related to the PLL reset. If I leave areset tied to 0, the system is less likely to hang, and if I increase the second reset time to ~100 ms (the first reset, to the PLLs, deasserts after 1 ms), the system doesn't hang. I am not comfortable with this workaround, since according to the datasheet the maximum PLL resync time is 1 ms. Besides, the PLL locked signals are among my reset sources and I didn't see any self-reset during boot.
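
To make the two-stage sequence described above concrete, here is a small behavioural model in Python. It is only a sketch of the timing, not the actual reset logic: the function name `reset_outputs` and its parameters are made up for illustration, and the 1 ms and 100 ms values are the ones mentioned in the post.

```python
def reset_outputs(t_ms, pll_hold_ms=1.0, sys_hold_ms=100.0):
    """Return (pll_reset, sys_reset) at t_ms after configuration completes.

    True means the reset is asserted. Both resets start asserted; the PLL
    reset releases first and the CPU/peripheral reset releases later.
    """
    if sys_hold_ms < pll_hold_ms:
        raise ValueError("system reset must not release before the PLL reset")
    pll_reset = t_ms < pll_hold_ms
    sys_reset = t_ms < sys_hold_ms
    return pll_reset, sys_reset


# Print the state of both resets at a few points in time.
for t in (0.0, 0.5, 1.0, 50.0, 100.0):
    pll, sys_rst = reset_outputs(t)
    print(f"t={t:6.1f} ms  pll_reset={int(pll)}  sys_reset={int(sys_rst)}")
```

Between 1 ms and 100 ms the PLLs are already out of reset while the CPU and everything downstream are still held, which is the window the workaround stretches.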

I could use some pointers. I didn't design the custom board I am using and it's not really my field, so unfortunately I cannot give more details about the HW.

Thanks.

Altera_Forum · ‎02-24-2013

Insufficient decoupling?

Altera_Forum · ‎02-24-2013

I wish I knew. My development kit (which boots fine) certainly has more decoupling components. I guess some were omitted on my custom HW due to PCB constraints. As I understand from the documents and posts I've read, the critical supply for the PLLs is VCCA. I understand what the recommendations are, but what is sufficient...

Altera_Forum · ‎02-24-2013

Altera has documents on decoupling.

Altera_Forum · ‎02-24-2013

--- Quote Start ---

... instead of one reset signal I use two: they assert simultaneously and deassert in sequence. The first reset to deassert goes to the PLLs and the second to the CPU, ... What I've found is that the instability is somehow related to the PLL reset. If I leave areset tied to 0, the system is less likely to hang, and if I increase the second reset time to ~100 ms (the first reset, to the PLLs, deasserts after 1 ms), the system doesn't hang. I am not comfortable with this workaround, since according to the datasheet the maximum PLL resync time is 1 ms. Besides, the PLL locked signals are among my reset sources and I didn't see any self-reset during boot.

--- Quote End ---

Try using SignalTap II to trace your reset signals and any other signals you think are worth checking. SignalTap II has a power-on feature where you can get it to trigger after power-on, and then you can download the traces. I've used this feature to capture PCIe power-on reset signals.

Are you using the PLL locked output as a reset for your downstream logic? If so, look at it with SignalTap using an oscillator clock source, e.g., the input clock to the PLL. The locked output should usually be filtered before using it as a reset signal.

There's code for a PLL reset debounce/deglitch in the source code associated with this PCIe thread:

http://www.alteraforum.com/forum/showthread.php?t=35678
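
For readers who cannot reach the linked source, the general idea of filtering a locked signal can be modelled in a few lines of Python. This is not the code from that thread, just a sketch under stated assumptions: the signal is sampled on a free-running clock, and the function name `deglitch_locked` and the `stable_count` parameter are invented here.

```python
def deglitch_locked(samples, stable_count=4):
    """Filter a sampled PLL 'locked' signal.

    The output goes high only after the input has been high for
    `stable_count` consecutive samples, and drops as soon as it goes low.
    """
    out = []
    run = 0  # number of consecutive high samples seen so far
    for s in samples:
        run = run + 1 if s else 0
        out.append(1 if run >= stable_count else 0)
    return out


# A locked signal that chatters before settling, then drops out once.
raw = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
filtered = deglitch_locked(raw, stable_count=4)
print(filtered)
```

The filtered output ignores the early chatter and only reports lock after four stable samples. In hardware the count would be chosen to cover the expected settling time.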

Another thing - is your design fully constrained with TimeQuest constraints?

Cheers,

Dave

Altera_Forum · ‎02-24-2013

Thanks for the SignalTap tip, but I am not sure what to look for: the system wakes up, reads the firmware header in EPCS, validates the checksum, starts copying to RAM and then boom, it freezes. Reset is not asserted. I didn't observe any freezes before execution of the bootloader code and none after it. In fact, it doesn't matter what code I run just after reset; it will likely freeze, even a simple fill of RAM with zeroes. I've tried many variations, but a long pause after PLL reset fixes it, or rather masks the actual problem.

I also built a minimal test system (just Nios II, SDRAM and an LED); it exhibits the same behaviour.

Yes, I use the PLL locked outputs as reset sources, but the reset code already compensates for possible toggling of the locked signal during the resynchronization period after the PLL reset is deasserted. The design is fully constrained and no violations are reported.

Altera_Forum · ‎02-24-2013

--- Quote Start ---

Yes, I use the PLL locked outputs as reset sources, but the reset code already compensates for possible toggling of the locked signal during the resynchronization period after the PLL reset is deasserted. The design is fully constrained and no violations are reported.

--- Quote End ---

Great, that eliminates transient timing related issues.

--- Quote Start ---

the system wakes up, reads the firmware header in EPCS, validates the checksum, starts copying to RAM and then boom, it freezes. Reset is not asserted.

--- Quote End ---

What is doing the copy to RAM? The bootloader?

--- Quote Start ---

I didn't observe any freezes before execution of the bootloader code and none after it. In fact, it doesn't matter what code I run just after reset; it will likely freeze, even a simple fill of RAM with zeroes.

--- Quote End ---

Ok, so once your code has jumped into main() and you fill RAM with zeros, you get a lock-up?

If you've got to the point that your processor is running, try using the NIOS debugger.

Cheers,

Dave

Altera_Forum · ‎02-24-2013

The bootloader, which runs from internal RAM, copies the actual firmware from EPCS flash to external SDRAM and jumps to its start address. The problem is not in the bootloader or firmware code; they run just fine when downloaded through JTAG, and most of the time after power-on or reset. Just sometimes it happens that shortly after reset (toggling nCONFIG with a switch in my case) the system freezes. This happens on the custom HW I work on; I didn't observe such behaviour on my development kit.

Altera_Forum · ‎02-25-2013

--- Quote Start ---

Just sometimes it happens that shortly after reset (toggling nCONFIG with a switch in my case) the system freezes.

--- Quote End ---

The switch connects directly to nCONFIG?? That is not a good design. The nCONFIG signal has a timing requirement that needs to be met, e.g., see tCFG in:

http://www.ovro.caltech.edu/~dwh/carma_board/fpga_configuration.pdf

How can you ensure tCFG is met with a bouncing switch signal? Unless you have switch debouncing logic, all bets are off, since you are potentially violating the device power-on-reset requirements.
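
To illustrate why a bouncing switch is a problem here, the following Python sketch measures the low pulses in a list of signal edges and flags any that are shorter than a minimum nCONFIG low time. The edge lists are invented data, the helper names are hypothetical, and `T_CFG_MIN_NS` is a placeholder that should be replaced with the tCFG value from the device datasheet.

```python
def low_pulse_widths(edges):
    """Return the width of each low pulse in a list of (time_ns, level) edges.

    `edges` must be sorted by time; level is 0 or 1.
    """
    widths = []
    fall_time = None  # time of the falling edge that opened the current pulse
    for time_ns, level in edges:
        if level == 0 and fall_time is None:
            fall_time = time_ns
        elif level == 1 and fall_time is not None:
            widths.append(time_ns - fall_time)
            fall_time = None
    return widths


def violates_min_low(edges, t_cfg_min_ns):
    """True if any low pulse is shorter than the minimum nCONFIG low width."""
    return any(w < t_cfg_min_ns for w in low_pulse_widths(edges))


T_CFG_MIN_NS = 500  # assumed placeholder; check the device datasheet

# Made-up example: a switch that bounces twice before settling low.
bouncing = [(0, 0), (200, 1), (350, 0), (400, 1), (1_000, 0), (5_000_000, 1)]
# The same press after ideal debouncing: one clean low pulse.
clean = [(0, 0), (5_000_000, 1)]

print(low_pulse_widths(bouncing), violates_min_low(bouncing, T_CFG_MIN_NS))
print(low_pulse_widths(clean), violates_min_low(clean, T_CFG_MIN_NS))
```

The bouncing trace contains short runt pulses ahead of the real one, while the debounced trace has a single pulse that comfortably exceeds the minimum.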

Cheers,

Dave

Altera_Forum · ‎02-25-2013

I see, thanks! I'll verify how this is implemented on my board.

Altera_Forum · ‎02-28-2013

I've traced the system hangs to a DDR calibration failure. When the boot hangs, the local_init_done signal from the DDR controller stays 0 and ctl_cal_fail and ctl_init_fail are 1. The DDR calibration fails only when reset is deasserted too soon (< ~100 ms) after FPGA configuration. DDR testing showed no errors under various conditions and temperature ranges, so for now it looks like a reset issue only.

The really strange behaviour I've noticed is that once the boot hangs, it continues to hang if the FPGA is reset by toggling nCONFIG. If the power is turned off and on quickly, it still hangs; turning the power off and waiting a minute gives the system a chance to boot normally. Because of this I don't see any way out of a boot hang even if I monitor local_init_done. How is local_init_done meant to be used? In the example designs I've seen, it's left unconnected.

Altera_Forum · ‎03-01-2013

--- Quote Start ---

I've traced the system hangs to DDR calibration failure.

--- Quote End ---

Congrats! It's a pain to track subtle bugs down ...

--- Quote Start ---

The local_init_done signal from DDR stays 0 and ctl_cal_fail, ctl_init_fail are 1 when the boot hangs. The DDR calibration fails only when deasserting reset too fast (< ~100 ms) after FPGA configuration.

--- Quote End ---

So why not take the easy way out then; add a reset component that is enabled at power-on after the external reset is deasserted, and it holds an internal reset asserted for at least 100ms (or resets the DDR for that long anyway).
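
As a quick sizing aid for such a hold-off counter, this Python sketch computes how many clock cycles and how many counter bits a given hold time needs. The 50 MHz clock is only an assumed example value (the thread does not state the board's clock), and `hold_counter` is a hypothetical helper name.

```python
def hold_counter(clock_hz, hold_ms):
    """Return (cycles, counter_bits) needed to hold reset for hold_ms.

    Uses integer arithmetic and rounds the cycle count up, so the hold
    time is never shorter than requested.
    """
    cycles = -(-clock_hz * hold_ms // 1000)  # ceiling division
    counter_bits = max(1, cycles.bit_length())
    return cycles, counter_bits


# Example: 100 ms hold on an assumed 50 MHz reference clock.
cycles, bits = hold_counter(50_000_000, 100)
print(cycles, bits)
```

Clocking the counter from a free-running source rather than a PLL output keeps the delay independent of the PLL state.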

--- Quote Start ---

The really strange behaviour I've noticed is that when the boot hangs it will continue to hang if FPGA is reset by toggling nCONFIG, if the power is turned on/off quickly it will still hang, turning power off and waiting a minute gets the system a chance to boot normally.

--- Quote End ---

An FPGA starts out 'from scratch' if nCONFIG is pulsed (although it won't go through its power-on-reset sequence). It's unlikely to be the FPGA that is the problem (unless you are violating an I/O voltage or power supply rise time requirement).

I'd guess that you have something external that changes, e.g., your DDR or some other device gets into a weird state that is stalling your boot.

Keep in mind that when you configure an FPGA there is a large inrush current (on some of the supplies). If your power supplies are marginal (with respect to the current sourcing ability), then they might work sometimes, and not others.

--- Quote Start ---

How is the local_init_done meant to be used, in example designs I've seen it's left unconnected?

--- Quote End ---

Example designs are not always the best references. Read the DDR controller User Guide, trace the hardware with SignalTap II, try a few things, break things, fix them, ... that is the better way to understand how the IP cores need to be used. If you find something that does not work according to the IP User Guide, then file a Service Request with Altera. You'll get a much better response if you can show you've looked at the problem in detail and identified something that is not documented or is incorrectly documented.
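
One possible way to use local_init_done, sketched here in Python as a behavioural model rather than real boot code, is to poll it with a bound and retry the controller reset a limited number of times when the fail flags come up. Everything in this sketch is an assumption: the function and class names are invented, and it presumes the three status signals are readable by software (for example through a PIO) and that the controller reset can be pulsed independently. Given the report above that the hang survives nCONFIG toggling, a retry may not clear the fault on this board, but it at least makes the failure detectable instead of a silent hang.

```python
def wait_for_ddr_init(read_status, pulse_ddr_reset, max_polls=1000, max_retries=3):
    """Poll the DDR status flags and retry initialisation a bounded number of times.

    read_status() returns a dict with the keys 'local_init_done',
    'ctl_cal_fail' and 'ctl_init_fail'. pulse_ddr_reset() re-triggers
    controller initialisation. Returns True once local_init_done is seen,
    False if every attempt fails.
    """
    for attempt in range(max_retries + 1):
        for _ in range(max_polls):
            status = read_status()
            if status["local_init_done"]:
                return True
            if status["ctl_cal_fail"] or status["ctl_init_fail"]:
                break  # calibration gave up; stop polling this attempt
        if attempt < max_retries:
            pulse_ddr_reset()
    return False


class FakeDdr:
    """Stand-in for the hardware: fails until it has been reset `fails` times."""

    def __init__(self, fails):
        self.fails = fails
        self.resets = 0

    def read_status(self):
        ok = self.resets >= self.fails
        return {"local_init_done": ok, "ctl_cal_fail": not ok, "ctl_init_fail": not ok}

    def pulse_reset(self):
        self.resets += 1


ddr = FakeDdr(fails=2)
booted = wait_for_ddr_init(ddr.read_status, ddr.pulse_reset)
print(booted, ddr.resets)
```

If the function returns False, the bootloader can report the failure (an LED, a UART message) instead of attempting the copy to SDRAM.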

Cheers,

Dave