amazing error

Altera_Forum · ‎04-07-2005

Hi!

I reported a similar error some time ago, but now the "ghosts" have appeared again.

I have a program that calls a function many times. This function uses a static variable called "gState" (mapped in RAM), which is initialized at the beginning of the program. On the first calls the variable has the correct value, but after some calls the value has apparently changed and the function returns an error.

So I have debugged it, and the fact is that if I look at the RAM address where the variable is mapped, the value is correct, but it seems that the ldw instruction has not loaded the correct value!

So the address of the variable is 0x83c954. At the beginning of the program, this address holds 0. During initialization it changes to 1 (STATE_OPEN) and then to 2 (STATE_OPERATIVE). That is the correct value, and it should not change afterwards.

The check I have is:

if (gState != NVRAM_STATE_OPERATIVE)
    assert(0);

If I put the breakpoint in the "if", I see that at that point, the RAM address has a correct value, and the r4 register, which is assigned the value of the variable, is also correct.

(Load of the variable's value into r4)

0x00802450 <NvramGetData+56>: ldw r4,-32240(gp) // (gp is 0x844744 -> gp - 32240 = 0x83c954)
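The address arithmetic in that comment can be double-checked with a few lines of C. This is only a sketch: the helper name gp_relative_address is made up for illustration, and it assumes the usual Nios II rule that the effective address of ldw is the base register plus the sign-extended 16-bit immediate.

```c
#include <stdint.h>

/* Hypothetical helper: effective address of a gp-relative load,
   i.e. base register plus a sign-extended 16-bit immediate. */
static uint32_t gp_relative_address(uint32_t gp, int16_t offset)
{
    /* Widen the offset with sign extension, then add modulo 2^32. */
    return gp + (uint32_t)(int32_t)offset;
}
```

Calling gp_relative_address(0x844744u, -32240) gives 0x83c954, which matches the address where gState is mapped, so the instruction itself is addressing the right word.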

But if I put the breakpoint on the assert, in order to see when it fails, I find that when the program stops, the value at address 0x83c954 is still 2, but the value of r4 is 0!!!! Also, if I hover over the variable with the mouse in the debugger, it says that the value is NVRAM_STATE_OPERATIVE!!!!

Between the load into r4 and the assert there is no other assignment to r4

0x00802450 <NvramGetData+56>: ldw r4,-32240(gp) (LOAD)

0x00802454 <NvramGetData+60>: movi r5,6

0x00802458 <NvramGetData+64>: addi r6,fp,4

0x0080245c <NvramGetData+68>: cmpeqi r3,r4,2

0x00802460 <NvramGetData+72>: bne r3,zero,0x802490 <NvramGetData+120> (BREAKPOINT)

and the only IRQ present is the one from the timer. That interrupt handler should be saving and restoring r4 (according to vectors.S, it does), so r4 shouldn't be corrupted.

The fact is that if I change the software (add printfs or other instructions) the problem may disappear, but later it appears at another point.

After all that... any clues about what's happening? I don't think it's a bad pointer, because no other threads are active. And I don't think the stack is being corrupted either, as I have only this thread and the timer IRQ.

Any help or suggestions are welcome...

aLeX

Altera_Forum · ‎04-07-2005

Could it be that the value of the global pointer (gp) is being corrupted? This will cause your code to read from the wrong location, resulting in the symptoms you describe.

There is code within the eCos kernel that updates the value of gp. This is intended to allow execution to switch back and forth between a monitor (i.e. RedBoot) and an eCos application. Assuming you are running a monolithic system (i.e. a ROM or ROMRAM configuration), the value of gp should always be restored to the same value.

This may not be the case if there's a bug in the kernel, or, as is more likely, the stored value of gp is being overwritten due to memory corruption. Look at the value of the global _gp to see if this is the case.

Altera_Forum · ‎04-07-2005

The value of gp is correct when the error occurs. I've checked it.

aLeX

Altera_Forum · ‎04-07-2005

To rule out the possibility that the timer interrupt is somehow changing the contents of registers/memory, it would be worth disabling interrupts around the section of code in question.
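A minimal sketch of that experiment is below. Several things in it are assumptions: USE_ECOS_HAL is a hypothetical build flag, the IRQ_DISABLE/IRQ_RESTORE wrappers and the function name are invented for illustration, and the host stand-ins exist only so the code compiles off-target. On the target the wrappers map to the eCos HAL macros HAL_DISABLE_INTERRUPTS and HAL_RESTORE_INTERRUPTS from cyg/hal/hal_intr.h. gState is declared volatile here so the compiler performs exactly one load inside the protected region.

```c
#include <stdint.h>

#ifdef USE_ECOS_HAL                  /* hypothetical flag: define it when building for the target */
#include <cyg/hal/hal_intr.h>        /* eCos HAL interrupt macros */
typedef CYG_INTERRUPT_STATE irq_state_t;
#define IRQ_DISABLE(old) HAL_DISABLE_INTERRUPTS(old)
#define IRQ_RESTORE(old) HAL_RESTORE_INTERRUPTS(old)
#else                                /* host stand-ins, only so the sketch builds off-target */
typedef uint32_t irq_state_t;
static int irq_enabled = 1;
#define IRQ_DISABLE(old) do { (old) = (uint32_t)irq_enabled; irq_enabled = 0; } while (0)
#define IRQ_RESTORE(old) do { irq_enabled = (int)(old); } while (0)
#endif

#define NVRAM_STATE_OPERATIVE 2      /* value 2, as described in the first post */

static volatile int gState = NVRAM_STATE_OPERATIVE;

/* Returns 1 if the value read with interrupts masked is the expected one. */
static int check_state_with_irqs_off(void)
{
    irq_state_t old;
    int snapshot;

    IRQ_DISABLE(old);
    snapshot = gState;               /* single load; no ISR can run between load and compare */
    IRQ_RESTORE(old);

    return snapshot == NVRAM_STATE_OPERATIVE;
}
```

If the check still fails with interrupts masked around the load, the timer ISR can be ruled out as the cause of the bad r4 value.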

Assuming that doesn't make any difference, then it could well be something like an SDRAM timing problem.

Altera_Forum · ‎04-07-2005

The problem I've had until now is that whenever I was close to isolating the code that was causing the bug, the error changed. So now, if I add code to inhibit the timer, the error does not occur.

Anyway, I have also had problems with the timer interrupt disabled, so I don't think that the timer is the problem.

So the current situation is that I've spoken with the hardware designer of the board, and we have changed a parameter in SOPC Builder that was related to the wr_enabled of the flash, and it seems to work.

I am not sure if this was the problem, as other times we have modified something which seemed to be the problem, and a few days later the "ghosts" have come back.

If the problems appear again, I will come back here. Thank you for the help.

aLeX

Altera_Forum · ‎04-07-2005

Here I am again!

As I suspected, the error wasn't solved. Only the place where it occurs has changed.

As you said, I've checked the timing requirements, and I have found that they can't be modified. We are using the 71V416 RAM chips, which are the ones used by the SOPC. I've read that the class for this RAM assigns the timing requirements automatically, so I can't see any way to improve that...

Also I've done a small RAM test, writing values and checking them, and it doesn't seem to fail...

I'm completely lost :-(

aLeX

Altera_Forum · ‎04-08-2005

I've seen similar problems with SDRAM, where the kind of random accesses that 'real' code makes can show up problems that walking-ones style memory tests don't. However, I'd expect SRAM to be much better behaved than SDRAM.

I'm assuming that you run your memory tests without a data cache (otherwise you may not be making any memory accesses at all).
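A memory test closer to the access pattern of real code might look like the sketch below. It is not the test used in this thread: scatter_test and pattern are hypothetical names, the word count must be a power of two, and the sketch assumes the caller has already made the region uncached (or flushes the data cache between the two passes), since how to do that depends on the hardware and HAL. It writes every word in one scattered order and reads it back in a different scattered order, so consecutive accesses land on unrelated addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Expected value for a word: a cheap hash of its index and a seed. */
static uint32_t pattern(uint32_t index, uint32_t seed)
{
    uint32_t x = index ^ seed;
    x *= 0x9E3779B1u;
    x ^= x >> 15;
    return x;
}

/* Returns the number of mismatching words, or -1 if `words` is not a
   power of two. Multiplying by an odd stride modulo a power of two is a
   permutation, so each pass touches every word exactly once. */
static long scatter_test(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t mask = words - 1;
    size_t i;
    long errors = 0;

    if (words == 0 || (words & mask) != 0)
        return -1;

    /* Pass 1: write in one scattered order. */
    for (i = 0; i < words; i++) {
        size_t idx = (i * (size_t)2654435761u) & mask;
        base[idx] = pattern((uint32_t)idx, seed);
    }

    /* Pass 2: read back in a different scattered order and compare. */
    for (i = 0; i < words; i++) {
        size_t idx = (i * (size_t)40503u) & mask;
        if (base[idx] != pattern((uint32_t)idx, seed))
            errors++;
    }
    return errors;
}
```

Running it repeatedly with different seeds, and with the timer interrupt both enabled and disabled, gives a better chance of catching an intermittent read fault than a single sequential pass.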

In your particular case, I'd say that the first thing you need to be certain of is that this isn't a problem with the exception handling - but it does sound like a hardware problem.