State machine crashes (Cyclone II) - no idea why. How do I debug?

migry_tech
Beginner
854 Views

I have implemented bit-mapped VGA graphics using a Cyclone II and fast 12ns external SRAM, connected to an MC68000. I am trying to get EMUTOS to run on this hardware, and it does boot to the desktop. However, I have a problem which I have now confirmed is the SRAM arbiter state machine crashing. When this happens I lose the picture.

 

When I make changes to the code, sometimes it works and sometimes it doesn't. It feels like the hardware equivalent of a wayward software pointer stomping over memory 😐. I have had this problem for months, ever since I first wrote the code. It happened on my mk1 144-pin QFP Cyclone II homebrew board, and now it happens on my mk2 208-pin Cyclone II board. I now have enough pins for 8 diagnostic LEDs, which helped confirm the suspicion (I bring out the state variable to the 8 LEDs). I have also now captured the event on my scope.

 

Timing is clean. The clock to the state machine is from the 50MHz Xtal. The video pixel clock is 25MHz which is simply the Xtal divided by 2. The 68k runs at 10MHz. I am very familiar with the issue of signals crossing clock domains. The state machine is an arbiter which allows access to either the video generator or the 68k.

 

Most of the time it simply sits in the idle state, but every 160ns there is one cycle to fetch the next pixel.

 

I have carefully written the state machine to avoid any missing next state values.

 

 

    reg [7:0] r_arb_state;

    localparam VRAM_ARB_STATE_IDLE      = 8'b00000000;
    localparam VRAM_ARB_STATE_READPIXEL = 8'b00000001;
    localparam VRAM_ARB_STATE_S1        = 8'b00000011;
    localparam VRAM_ARB_STATE_S2        = 8'b00000111;
    localparam VRAM_ARB_STATE_S3        = 8'b00001111;
    localparam VRAM_ARB_STATE_WRITEIDLE = 8'b00011111;
    localparam VRAM_ARB_STATE_S4        = 8'b11000011;
    localparam VRAM_ARB_STATE_S5        = 8'b11000111;
    localparam VRAM_ARB_STATE_S6        = 8'b11001111;
    localparam VRAM_ARB_STATE_ILLEGAL   = 8'b11111111;

In the picture below the state machine crash can be seen.

 

No - adding a picture does not seem to work with my browser and the Intel pop up window 😦

 

Instead of staying in the IDLE state (00000000) it changes to 11000000, which is an undefined state. From there the "default" case takes it to the ILLEGAL state (11111111), where it stays (and video is lost). The blue low-going signal is the SRAM's active-low chip select.

 

    always @ (posedge w_vram_clk) begin
      if (IO_RSTN==1'b0) begin
        r_arb_state      <= VRAM_ARB_STATE_IDLE;
        r_pixelread_done <= 1'b0;
        r_writeeven_done <= 1'b0;
        r_writeodd_done  <= 1'b0;
        VRAM_addr        <= vga_addr;
        VRAM_dataout     <= vga_dataout;
        VRAM_cs          <= vga_cs;
        VRAM_we          <= vga_we;
      end else begin
        case (r_arb_state)
          VRAM_ARB_STATE_IDLE: begin
            r_pixelread_done     <= 1'b0;
            r_vram_writeaddreven <= #3 {r_VideoRamOffset,1'b0}; // 14:0 = 32k bytes
            r_vram_writeaddrodd  <= #3 {r_VideoRamOffset,1'b1}; // 14:0 = 32k bytes
            // does the video generator need to read pixel data? (occurs at regular intervals)
            if (vga_cs==1'b0) begin
              VRAM_addr    <= vga_addr;
              VRAM_dataout <= vga_dataout;
              VRAM_cs      <= vga_cs;
              VRAM_we      <= 1'b1; // was vga_we - but not used outside of reset
              r_arb_state  <= VRAM_ARB_STATE_READPIXEL;
            end else begin
              VRAM_cs      <= 1'b1;
              VRAM_we      <= 1'b1;
              VRAM_addr    <= VRAM_addr;
              VRAM_dataout <= VRAM_dataout;
              if ((r_writeeven_done==1'b0)&&(r_write_uds==1'b1)) begin
                r_arb_state <= VRAM_ARB_STATE_S1;
              end else begin
                if ((r_writeodd_done==1'b0)&&(r_write_lds==1'b1)) begin
                  r_arb_state <= VRAM_ARB_STATE_S4;
                end else begin
                  r_arb_state <= VRAM_ARB_STATE_IDLE;
                  if (r_write_uds==1'b0) begin
                    r_writeeven_done <= 1'b0; // acked
                  end else begin
                    r_writeeven_done <= r_writeeven_done;
                  end
                  if (r_write_lds==1'b0) begin
                    r_writeodd_done <= 1'b0; // acked
                  end else begin
                    r_writeodd_done <= r_writeodd_done;
                  end
                end
              end
            end
          end

 

I can't understand why the state machine is crashing. Timing is good. 50MHz is not excessive for this chip. I have pondered whether it might be a power supply glitch, but the +5V is from a good Rigol PSU and this is regulated to 1.2V by a regulator on the PCB. I have plenty of decoupling caps.

 

I am desperate for ideas as to how to diagnose the problem. I am pulling out my hair, and so I am reaching out to this forum out of desperation!

1 Solution
ak6dn
Valued Contributor III
637 Views

State machines usually crash because they entered into an illegal, undocumented state.

This can happen when an input signal that is sampled is either asynchronous, or poorly synchronized.

The signal goes to two separate parts of the state machine transition logic, and is interpreted as a H in one part, and a L in the other.

This can then cause a transition to an illegal state.

 

You don't show enough of your code to know how this might apply in your case. The module header, and how all input signals are generated, is necessary to know.

 

Quartus will on occasion re-encode the state machine in another form (usually one-hot) where each defined state is implemented as being encoded with just one state bit set. You need to look in your report files to see if this was done (or not). Doing this can be disabled by user control.

 

As you mention synchronous clock timing can also be a cause, but Quartus should be able to tell you which paths did not meet your 50MHz timing (if any).
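
For the re-encoding point above, a minimal sketch of pinning the encoding from the RTL itself. It assumes the Quartus version in use honours the `syn_encoding` synthesis attribute; the module, its ports and the two-state machine are hypothetical and only there to make the fragment complete. The State Machine section of the synthesis report is still the place to confirm what was actually built.

```verilog
// Hypothetical two-state example; not the arbiter from the post.
module arb_state_encoding_sketch (
    input  wire clk,
    input  wire rst_n,
    input  wire req,   // assumed to be already synchronous to clk
    output wire busy
);
    localparam [1:0] ST_IDLE = 2'b00,
                     ST_BUSY = 2'b01;

    // Ask Quartus to keep the encoding written in the localparams
    // instead of re-encoding the machine (for example as one-hot).
    (* syn_encoding = "user" *) reg [1:0] r_state;

    always @(posedge clk) begin
        if (!rst_n) begin
            r_state <= ST_IDLE;
        end else begin
            case (r_state)
                ST_IDLE: r_state <= req ? ST_BUSY : ST_IDLE;
                ST_BUSY: r_state <= ST_IDLE;
                default: r_state <= ST_IDLE; // recover from unused codes
            endcase
        end
    end

    assign busy = (r_state == ST_BUSY);
endmodule
```

Fixing the encoding only makes the state bits seen on a scope match the source. It does not protect the machine from an input that changes close to the clock edge.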

8 Replies
migry_tech
Beginner
637 Views

Here is a photo of the scope screen, showing the state machine crash.

D0 is at the top with D7 at the bottom, so reads upside-down.

Each blip of D0 is where the state machine goes to state 00000001 to read a byte from the SRAM, and you can see the resulting chip select going to the SRAM.

You can see the unexpected 11000000 state, which is caught by the Verilog default case and sent to a special "ILLEGAL" value of 11111111, where it stays. This was added to allow debugging, and shouldn't be necessary if everything were working correctly.

There are some states which start 1100... No idea if this is a clue.

 

IMG_1149.JPG

migry_tech
Beginner
637 Views

Dear @ak6dn, thank you. I think that a penny might have dropped.

 

> The signal goes to two separate parts of the state machine transition logic, and is interpreted as a H in one part, and a L in the other.

 

Bingo!

 

There are indeed inputs from other clock domains going into this state machine. Although in this case the state machine is only swapping between IDLE and READPIXEL, the state transition could still be affected (corrupted) by other inputs to the state machine. I forgot that the paths from the inputs to each of the 8 state flops go through different logic cones, so a transition (e.g. CPU write -> SRAM write request) near the 50MHz clock edge could reach some state flops in time, and not others! This also explains why the crashing behaviour changed when I changed the (sparsely coded) localparam STATE values, sometimes becoming more stable and sometimes not.
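
As a rough illustration of that mechanism (a simulation-only sketch, not code from the design): the captured value 11000000 is what results if the machine starts a transition from IDLE to S4 (11000011) and only the two upper state flops see the write request in time. Which flops saw the request is an assumption made for the example (the `saw_request` mask is hypothetical), and the posted code alone cannot confirm that this is the exact path taken on the hardware.

```verilog
// Simulation-only illustration; module and signal names are hypothetical.
module partial_transition_demo;
    localparam [7:0] VRAM_ARB_STATE_IDLE = 8'b00000000;
    localparam [7:0] VRAM_ARB_STATE_S4   = 8'b11000011;
    localparam [7:0] OBSERVED_ON_SCOPE   = 8'b11000000;

    // A 1 means "the logic cone feeding this state flop saw the
    // asynchronous request as asserted before the clock edge".
    reg [7:0] saw_request;
    reg [7:0] next_state;

    initial begin
        saw_request = 8'b11000000; // assumption chosen for illustration

        // Flops that saw the request load their S4 bit;
        // the others behave as if the machine stays in IDLE.
        next_state = (saw_request  & VRAM_ARB_STATE_S4)
                   | (~saw_request & VRAM_ARB_STATE_IDLE);

        $display("next_state = %b", next_state);
        $display("matches scope capture: %b", next_state == OBSERVED_ON_SCOPE);
        $finish;
    end
endmodule
```

Run in a simulator, this prints `next_state = 11000000` and `matches scope capture: 1`. Neither IDLE nor S4 is ever registered as a whole; the illegal value is a per-bit mix of the two.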

 

I am certainly aware of *some* of the issues that can arise from signals crossing clock domains, but clearly I have much more to learn. I need to carefully review all asynchronous inputs to the state machine and try to understand how I can make their re-sampling safe.

 

Thank you! Many thanks for taking the time to reply! You have given me hope! 😀

 

BTW I was really concerned about glitches on the core power supply, but when I scoped (and used a short gnd connection) the trace was very clean where the state machine crashed.

 

--migry

ak6dn
Valued Contributor III
637 Views

Typically any inputs that are not fully synchronous to the state machine clock must go thru a dual-rank synchronizer (google) so that the output signal will be stable in the state machine clock domain. Just requires a couple of registers per async signal.
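
A minimal sketch of such a dual-rank synchronizer for a single-bit signal. The module and port names are hypothetical, and it assumes the input changes slowly compared with the destination clock (for example a 10MHz-domain strobe sampled at 50MHz).

```verilog
// Two-flop (dual-rank) synchronizer for ONE single-bit asynchronous input.
// Names are illustrative; this is not the poster's actual fix.
module sync_2ff (
    input  wire clk,       // destination clock, e.g. the 50MHz arbiter clock
    input  wire async_in,  // signal generated in another clock domain
    output wire sync_out   // version that is safe to use in clk-domain logic
);
    reg r_meta;    // first rank: may briefly go metastable
    reg r_stable;  // second rank: samples r_meta one full cycle later

    always @(posedge clk) begin
        r_meta   <= async_in;
        r_stable <= r_meta;
    end

    assign sync_out = r_stable;
endmodule
```

The state machine should then use only `sync_out`, never `async_in`. Every state flop is now fed from the same registered value, so they all agree on whether the request was seen in a given cycle. The cost is a delay of one to two destination clock cycles.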

migry_tech
Beginner
637 Views

Thank you once again @ak6dn for replying to my original post, as you correctly identified the cause of my problem.

I have re-coded by re-synchronising the key inputs from the CPU clock domain onto the 50MHz clock domain, using the technique you describe above.

Ironically I am well aware of this solution to re-sync signals crossing clock domains due to the issue of metastability, but failed to understand that I needed to use this for the inputs from other clock domains for this particular state machine.

The new code is working perfectly and video is clean with no corruptions. I now see a perfect EMUTOS desktop!

This is another lesson hard learned, which hopefully means I won't make that mistake again.

It also explains why every code change (or even using SignalTap) sometimes caused the circuit to "work" better.

Reminds me about my 3rd year Uni project, where my tape reader TTL circuit was failing (that dates me!). My professor pointed to the long thin red and black power wires and pointed out the issue of voltage drop. Short thicker wires solved the problem. I have always had that lesson in the back of my mind when wiring up any power supply wires since then!

best regards...

--migry

migry_tech
Beginner
637 Views

BTW, a little searching revealed that others have "made the same mistake".

I was convinced that I had a power supply issue with the Cyclone II due to a lack of adequate power supply decoupling.

I hope that anyone else searching finds the above solution.

Rahul_S_Intel1
Employee
637 Views

Hi,

I think the problem above is not caused by your coding style but by power: the FPGA may be drawing more power than the power supply can provide.

migry_tech
Beginner
637 Views

Hello @RSree, thank you for your reply, but the problem has been solved. It was not caused by a bad power supply to the Cyclone II. It was caused by my bad RTL coding.

 

I just could not understand why the state machine was crashing, as I had reviewed *my* code again and again. As a consequence I started to NOT trust my circuit/PCB hardware implementation, even though I used plenty of decoupling capacitors and bulk caps too. But when I scoped the 1.2V core power supply, it was clean in the area where the crash was happening (I added the 1.2V trace to the conditions pictured above). I think that I was "clutching at straws" and bad power was the best idea I could come up with. Thankfully I was wrong!

 

Once the race condition caused by unsynchronised inputs from other clock domains was clearly explained, the penny dropped. I changed the RTL to re-sync all inputs using standard "shift register" techniques. My (fixed) RTL is now solid and works perfectly. I am very relieved, and very importantly I am also confident that my hardware design using the Cyclone II can be trusted!

 

We learn by our mistakes, and I am actually pleased to have better understood how to code robust state machines. I have gone through all my RTL and made sure that I have fixed the same problem elsewhere in my code.

 

regards...

--migry

 

 
