Adding two 64 bit variables

Altera_Forum · 07-26-2006

Hi

I experienced a problem with optimized code, adding two 64 bit (alt_64) variables:

Sometimes the upper word of the result gets an extra increment or decrement by 1, although no overflow should have occurred when adding the lower words.

The following are actual (negative) input numbers and the expected sum:

0xFFFFFFFFD75C5BDE

+ 0xFFFFFFFFFFFFF242

= 0xFFFFFFFFD75C4E20

But what I get is a zero upper word and therefore a result that is off by +2^32:

= 0x00000000D75C4E20

In other cases I got an upper word of 0xFFFFFFFE when I expected 0xFFFFFFFF, and AFAIR even 0x00000001 when I expected zero. Actually it happens in a control loop where the small value as above is added to a large sum, positive or negative in a rather random order.
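For reference, a minimal sketch of how a 32-bit core typically composes a 64-bit addition from two word additions plus a carry. The function name `add64_by_words` is hypothetical, and the fixed-width types from `<stdint.h>` stand in for the HAL types `alt_64`/`alt_32` so the code can be checked on a host machine.

```c
#include <stdint.h>

/* Add two 64-bit values given as 32-bit halves, the way a 32-bit core does:
 * add the low words, derive the carry with an unsigned compare, then add the
 * high words plus that carry. */
uint64_t add64_by_words(uint32_t a_lo, uint32_t a_hi,
                        uint32_t b_lo, uint32_t b_hi)
{
    uint32_t lo    = a_lo + b_lo;            /* wraps modulo 2^32 */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;  /* a wrapped sum means carry out */
    uint32_t hi    = a_hi + b_hi + carry;    /* high words plus carry */
    return ((uint64_t)hi << 32) | lo;
}
```

Fed with the two operands quoted above, split into halves, this produces the expected sum 0xFFFFFFFFD75C4E20. The upper word only comes out right if the low-word operand and the high-word operand belong to the same value.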

The problem occurs with interrupts disabled, and there is no other hardware/logic/coprocessor that could cause data corruption during the computation. The core is a NIOS II/s with nothing else, running from onchip memory. With optimization off (-O0), the sum is always computed correctly. But the code in question is time critical, so that is not an option.

I'm still trying to write an example that reproduces the problem outside my specific application (without success so far), and maybe I should try to reproduce it in the ISS. But first I'd like to ask here: is there a known problem with optimization of 64-bit arithmetic? And maybe a known workaround?

Kolja

Altera_Forum · 07-26-2006

In my experience, this does work. It would be helpful if you posted a snippet of code, as these types of problems are almost always type/typecast sort of issues.

Cheers,

- slacker

Altera_Forum · 07-27-2006

In the snippet below, I get the errors if alt_64 is used for t2, but no errors if alt_32 is used or optimization is turned off.

For explanation:

The problematic original code simply does VELOCITY+=ACCELERATION. The code below is a temporary workaround.

The ACCELERATION is always small enough (positive or negative, less than 0x1000 and controlled externally) so that VELOCITY never should exceed +- (2^31)-1, therefore the manual computation of VELOCITY_HI is valid.

Actually, when the breakpoint at the end is hit, it is t1 (held in registers) where the upper word is wrong, not VELOCITY (in DPRAM).

You might be tempted to comment on the "DPRAM" definition; yes, data is in a dual-ported RAM, and another CPU accesses the same area while this code runs in a loop. The other CPU regularly writes ACCELERATION and only reads VELOCITY. Both CPUs have 32-bit-access to the DPRAM.

#define VELOCITY     (*(volatile alt_64 *)((void *)DPRAM_BASE+8))
#define VELOCITY_LO  (*(volatile alt_32 *)((void *)DPRAM_BASE+8))
#define VELOCITY_HI  (*(volatile alt_32 *)((void *)DPRAM_BASE+12))
#define ACCELERATION (*(volatile alt_32 *)((void *)DPRAM_BASE+16))
...
while(1) {
...
alt_64 t2; /* it works if you use alt_32 here! */
alt_64 t1;
...
t2 = ACCELERATION;
t1 = VELOCITY + t2;
VELOCITY_LO += t2;
VELOCITY_HI = (VELOCITY_LO >= 0) ? 0 : -1;
/* Don't continue if an error occurred: good place for a breakpoint */
while(VELOCITY != t1) 
  asm volatile("nop");
...
} /* while(1) */

The following is the resulting code for the working version with "alt_32 t2" on the left and the erratic code with "alt_64 t2" on the right (only the part that matches the above snippet). In every other place the resulting binaries are exactly the same.

DPRAM_BASE is 0x80180, so

- VELOCITY(_LO) is 0x80188 (0x80000+392) and

- VELOCITY_HI is 0x8018C (0x80000+396) and

- ACCELERATION is 0x80190 (0x80000+400)

  e4:  movhi  r4,8       |    e4:  movhi  r6,8
  e8:  addi  r4,r4,400   |    e8:  addi  r6,r6,400
  ec:  ldw  r9,0(r4)     |    ec:  ldw  r8,0(r6)
  f0:  ldw  r2,0(r17)    |    f0:  ldw  r4,0(r6)
  f4:  ldw  r3,4(r17)    |    f4:  ldw  r2,0(r17)
  f8:  ldw  r8,0(r17)    |    f8:  srai  r5,r8,31
  fc:  mov  r6,r9        |    fc:  ldw  r3,4(r17)
 100:  srai  r7,r9,31    |   100:  ldw  r8,0(r17)
 104:  add  r8,r8,r9     |   104:  add  r6,r2,r4
 108:  stw  r8,0(r17)    |   108:  movhi  r10,8
 10c:  ldw  r9,0(r17)    |   10c:  addi  r10,r10,392
 110:  add  r4,r2,r6     |   110:  add  r8,r8,r4
 114:  cmpltu  r8,r4,r2  |   114:  stw  r8,0(r17)
 118:  cmplt  r9,r9,zero |   118:  ldw  r9,0(r17)
 11c:  movhi  r2,8       |   11c:  cmpltu  r8,r6,r2
 120:  addi  r2,r2,396   |   120:  movhi  r2,8
 124:  sub  r9,zero,r9   |   124:  addi  r2,r2,396
 128:  movhi  r10,8      |   128:  cmplt  r9,r9,zero
 12c:  addi  r10,r10,392 |   12c:  sub  r9,zero,r9
 130:  stw  r9,0(r2)         130:  stw  r9,0(r2)
 134:  ldw  r2,0(r10)        134:  ldw  r2,0(r10)
 138:  add  r5,r3,r7     |   138:  add  r7,r3,r5
 13c:  add  r8,r8,r5     |   13c:  add  r8,r8,r7
 140:  mov  r6,r4        |   140:  mov  r3,r6
 144:  mov  r7,r8        |   144:  mov  r4,r8
 148:  beq  r2,r4,248    |   148:  beq  r2,r6,248 <alt_main+0x248>
 14c:  mov  r3,r10       |   14c:  mov  r5,r10
 150:  nop                   150:  nop
 154:  ldw  r2,0(r3)     |   154:  ldw  r2,0(r5)
 158:  bne  r2,r6,150    |   158:  bne  r2,r3,150 <alt_main+0x150>
 15c:  ldw  r2,4(r3)     |   15c:  ldw  r2,4(r5)
 160:  bne  r2,r7,150    |   160:  bne  r2,r4,150 <alt_main+0x150>

Thanks for looking at the problem!

Kolja

Altera_Forum · 07-27-2006

Hi,

I think I found the cause for my problem.

Looking just at the code that computes the high word of variable t1, I see only one major difference (register numbers refer to the alt_64 t2 version, which is listed first).

The version where the error happens loads ACCELERATION twice from Dual Port memory, first to r8, then to r4. The other version loads it once only and then merely copies the register content around.

If the other CPU in my system changed ACCELERATION in DPRAM between these two accesses, it would be actually +0xDBE in R8 and -0xDBE in R4 (or vice versa).

The sign extension in R5 matches the value loaded into R8 but the overflow bit at PC=0x11C is computed from the value loaded into R4.
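The scenario described above can be replayed on a host machine. The following is a sketch only: `add_with_two_samples` is a hypothetical function that mirrors the instruction sequence of the alt_64 listing, taking the two loads of ACCELERATION as separate parameters, with `<stdint.h>` types standing in for `alt_64`/`alt_32`.

```c
#include <stdint.h>

/* Model of the alt_64 listing: the sign extension comes from one load of
 * ACCELERATION, the low-word addition and carry from another load. */
uint64_t add_with_two_samples(uint64_t velocity,
                              int32_t sample_for_sign,  /* first load, r8  */
                              int32_t sample_for_low)   /* second load, r4 */
{
    uint32_t v_lo  = (uint32_t)velocity;
    uint32_t v_hi  = (uint32_t)(velocity >> 32);
    uint32_t sign  = (sample_for_sign < 0) ? 0xFFFFFFFFu : 0u; /* srai r5,r8,31  */
    uint32_t lo    = v_lo + (uint32_t)sample_for_low;          /* add r6,r2,r4   */
    uint32_t carry = (lo < v_lo) ? 1u : 0u;                    /* cmpltu r8,r6,r2 */
    uint32_t hi    = v_hi + sign + carry;                      /* add r7,r3,r5; add r8,r8,r7 */
    return ((uint64_t)hi << 32) | lo;
}
```

With VELOCITY = 0xFFFFFFFFD75C5BDE, a sign sample of +0xDBE and a low-word sample of -0xDBE, this yields 0x00000000D75C4E20, the wrong result reported in the first post. Swapping the two samples gives an upper word of 0xFFFFFFFE, and passing the same sample twice gives the correct sum.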

Do you agree that this might be causing my problems? Then at least I know the cause, can implement proper workarounds, and do not have to worry about wrong 64-bit results in situations where no other CPU accesses the operands. I didn't expect gcc to produce code that fetches the same volatile operand twice for a single computation.
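One possible workaround, sketched under assumptions: read the shared 32-bit word exactly once into a 32-bit local and widen that local afterwards, which corresponds to the "alt_32 t2" variant that was reported to work. The function name `accumulate` is hypothetical and `<stdint.h>` types replace `alt_64`/`alt_32`. Whether a given compiler version emits a single load should still be confirmed in the disassembly.

```c
#include <stdint.h>

/* Take one snapshot of the shared word, then do all 64-bit arithmetic on the
 * snapshot so that sign extension and low-word addition use the same value. */
int64_t accumulate(int64_t velocity, const volatile int32_t *acceleration)
{
    int32_t sample = *acceleration;     /* the only access to the shared word */
    return velocity + (int64_t)sample;  /* widen the local copy, not the volatile */
}
```

In the loop from the earlier post this would correspond to something like `t1 = accumulate(VELOCITY, &ACCELERATION);`.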

Erratic alt_64 t2 version

 ec:  ldw  r8,0(r6)  /* r6 = &ACCELERATION */
 f0:  ldw  r4,0(r6)
 f4:  ldw  r2,0(r17) /* r17 = &VELOCITY */
 f8:  srai  r5,r8,31
 fc:  ldw  r3,4(r17)
104:  add  r6,r2,r4
11c:  cmpltu  r8,r6,r2
138:  add  r7,r3,r5
13c:  add  r8,r8,r7
144:  mov  r4,r8 /* => new VELOCITY_HI, sometimes wrong */

Working alt_32 t2 version

 ec:  ldw  r9,0(r4)  /* r4 = &ACCELERATION */
 f0:  ldw  r2,0(r17) /* r17= &VELOCITY */
 f4:  ldw  r3,4(r17)
 fc:  mov  r6,r9
100:  srai  r7,r9,31
110:  add  r4,r2,r6
114:  cmpltu  r8,r4,r2
138:  add  r5,r3,r7
13c:  add  r8,r8,r5
144:  mov  r7,r8 /* => new VELOCITY_HI, always correct */

Or did I put the "volatile" at the wrong place and should've defined something like

#define VELOCITY (*(alt_64 *volatile)((void *)DPRAM_BASE+8))

instead of

#define VELOCITY (*(volatile alt_64 *)((void *)DPRAM_BASE+8))
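As a reference for the two spellings in the question above, here is a small sketch of what each declarator qualifies in C. The variable and function names are hypothetical, and `int64_t` stands in for `alt_64`. In `volatile T *` the pointed-to object is volatile, so every dereference is a volatile access. In `T *volatile` only the pointer itself is volatile and the pointed-to object is accessed as an ordinary `T`.

```c
#include <stdint.h>

static int64_t shared_word = 42;

/* Pointer to volatile data: each *p is a volatile access to the data. */
int64_t read_via_pointer_to_volatile(void)
{
    volatile int64_t *p = &shared_word;
    return *p;
}

/* Volatile pointer to plain data: only reads and writes of p itself are
 * volatile; *p is an ordinary access the compiler is free to optimize. */
int64_t read_via_volatile_pointer(void)
{
    int64_t *volatile p = &shared_word;
    return *p;
}
```

Both functions return the same value. The difference lies only in which accesses the compiler must perform exactly as written, so with the second spelling the compiler is no longer obliged to re-read the data on every use.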

Altera_Forum · 07-27-2006

Ah! Moving the "volatile" specifier behind the '*' did the trick!

Kolja