I'm experiencing some rather strange timing anomalies when running my Nios II/e based system, and I was hoping that someone could help me shed some light on the problem. In the code example below, I create an array of structs of type foo_t, each containing two member variables a and b of type int. Note that the third member variable c is commented out for now. I then iterate over the array, setting member variable a of each struct to zero. All three loop iterations are timed using the Altera performance counter and reported at the end of the program. I compile the program with no code optimisation using the Nios II 14.1 Software Build Tools for Eclipse/GCC. The modules used for the HW platform on which I'm running the program can be seen here (http://imgur.com/1xczgti). I have two interval timers present, but as you can see from the code below, they are never initialized.
#include <stdio.h>
#include "system.h"
#include "altera_avalon_performance_counter.h"

#define ITERATIONS 3

typedef struct foo
{
    int a;
    int b;
    //int c;
} foo_t;

foo_t foo_arr[ITERATIONS];

int main()
{
    int i;

    PERF_RESET(PERFORMANCE_COUNTER_0_BASE);
    PERF_START_MEASURING(PERFORMANCE_COUNTER_0_BASE);

    for (i = 0; i < ITERATIONS; i++)
    {
        PERF_BEGIN(PERFORMANCE_COUNTER_0_BASE, 1 + i);
        foo_arr[i].a = 0;
        PERF_END(PERFORMANCE_COUNTER_0_BASE, 1 + i);
    }

    PERF_STOP_MEASURING(PERFORMANCE_COUNTER_0_BASE);
    perf_print_formatted_report((void *)PERFORMANCE_COUNTER_0_BASE, alt_get_cpu_freq(), 3,
                                "Iteration 0", "Iteration 1", "Iteration 2");

    return 0;
}
Ok, so we know that the Nios II/e has no cache memories and no branch prediction, so we would expect all three iterations of the loop to require the same number of cycles. This is confirmed by the timing report. Let's refer to this as Case A.

Iteration 0: 124 clock cycles
Iteration 1: 124 clock cycles
Iteration 2: 124 clock cycles

Now comes the part that I'm struggling to understand: if we add the third member variable c to the foo_t struct, but leave the rest of the code as it is, the loop iterations no longer execute in the same number of clock cycles. Let's refer to this as Case B.

Iteration 0: 155 clock cycles
Iteration 1: 199 clock cycles
Iteration 2: 236 clock cycles

Here is the disassembly of the line foo_arr[i].a = 0; in the two cases:

Case A:
000402c8: movhi r3,5
000402cc: addi r3,r3,12872
000402d0: ldw r2,-4(fp)
000402d4: slli r2,r2,3
000402d8: add r2,r3,r2
000402dc: stw zero,0(r2)

Case B:
000402cc: movhi r16,5
000402d0: addi r16,r16,12888
000402d4: ldw r2,-8(fp)
000402d8: mov r4,r2
000402dc: movi r5,12
000402e0: call 0x4038c <__mulsi3>
000402e4: add r2,r16,r2
000402e8: stw zero,0(r2)

In Case B, a call to the multiplication routine __mulsi3 is being made. Ok, we know that the Nios II/e has no hardware support for multiplication, fair enough. __mulsi3 is implemented in lib2-mul.c:
SItype
__mulsi3 (SItype a, SItype b)
{
  SItype res = 0;
  USItype cnt = a;

  while (cnt)
    {
      if (cnt & 1)
        res += b;
      b <<= 1;
      cnt >>= 1;
    }

  return res;
}
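For reference, here is a small host-side sketch of the same shift-and-add scheme (my own illustration, with int32_t/uint32_t standing in for GCC's SItype/USItype, so it is not the library source itself). It just shows that the while loop executes once per bit position, up to the highest set bit of 'a':

#include <stdint.h>
#include <stdio.h>

/* Host-side sketch of the shift-and-add multiply; int32_t/uint32_t are
   stand-ins for GCC's SItype/USItype. Not the library source itself. */
static int32_t mulsi3_sketch(int32_t a, int32_t b, int *passes)
{
    int32_t res = 0;
    uint32_t cnt = (uint32_t)a;

    *passes = 0;
    while (cnt)
    {
        if (cnt & 1)
            res += b;   /* add the shifted 'b' whenever the current bit of 'a' is set */
        b <<= 1;
        cnt >>= 1;
        (*passes)++;    /* one pass per bit position up to the highest set bit of 'a' */
    }
    return res;
}

int main(void)
{
    int passes;
    int32_t r = mulsi3_sketch(5, 12, &passes);
    printf("5 * 12 = %d after %d loop passes\n", (int)r, passes);  /* 60 after 3 passes */
    return 0;
}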
So, in Case B we perform a multiplication. I understand that this adds to the total execution time of each iteration compared to Case A, but I would still expect each iteration to require the same number of clock cycles. At least Case B is deterministic in the sense that it keeps reporting these same numbers for each run. Could anyone please give me an explanation of what is happening here? If you require more information, just let me know! /J
2 Replies
__mulsi3 does not run in constant time: the larger the input 'a' into that routine, the more times its while loop has to iterate. I suspect the multiply is needed because the struct is now 12 bytes instead of the 8 bytes it was before 'c' was added. As a result, when calculating the offset into the array the compiler has to multiply by 12 (i*12) instead of scaling by 8, which is just a shift (i<<3). So as 'i' increases, the input into that multiply routine increases, which makes the while loop iterate more times.
I'm not sure why the compiler isn't just using an adder. If you are using -O0 optimization, try -O2; maybe that will remove the multiplier library call and replace it with an adder instead.
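To make that concrete, here is a rough host-side sketch that only counts how many passes a __mulsi3-style shift-and-add loop makes for each index value; it is an illustration, not a cycle-accurate model of the Nios II/e:

#include <stdint.h>
#include <stdio.h>

/* Count how many passes a __mulsi3-style shift-and-add loop makes for a given
   multiplicand 'a'. This is only a loop-pass count, not a cycle count. */
static int mul_passes(uint32_t a)
{
    int passes = 0;
    while (a)
    {
        a >>= 1;
        passes++;
    }
    return passes;
}

int main(void)
{
    int i;
    /* With member 'c' added, sizeof(foo_t) is 12, so indexing needs i * 12. */
    for (i = 0; i < 3; i++)
        printf("i = %d -> %d pass(es) through the software multiply loop\n",
               i, mul_passes((uint32_t)i));
    /* Prints 0, 1 and 2 passes: iteration 0 does the least work in __mulsi3
       and iteration 2 the most, consistent with the growing cycle counts
       measured above (155, 199, 236). */
    return 0;
}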
--- Quote Start --- __mulsi3 does not run in constant time: the larger the input 'a' into that routine, the more times its while loop has to iterate. I suspect the multiply is needed because the struct is now 12 bytes instead of the 8 bytes it was before 'c' was added. As a result, when calculating the offset into the array the compiler has to multiply by 12 (i*12) instead of scaling by 8, which is just a shift (i<<3). So as 'i' increases, the input into that multiply routine increases, which makes the while loop iterate more times. I'm not sure why the compiler isn't just using an adder. If you are using -O0 optimization, try -O2; maybe that will remove the multiplier library call and replace it with an adder instead. --- Quote End --- This makes perfect sense. Thank you so much for clarifying! :)
