Re: how many cycles does each command take in nios?

Altera_Forum · ‎01-05-2016

In my project, I wrote a loop to blink an LED:

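#include "system.h"                   /* QD_PIO_0_BASE */
#include "alt_types.h"                /* alt_u32 */
#include "altera_avalon_pio_regs.h"   /* IOWR_ALTERA_AVALON_PIO_DATA */

void delay(void);   /* declared before its first use in main() */

int main(void)
{
  int i = 0;
  for (i = 0; i < 100; i++)
  {
    IOWR_ALTERA_AVALON_PIO_DATA(QD_PIO_0_BASE, 0xff);  /* drive the PIO outputs high */
    delay();
    IOWR_ALTERA_AVALON_PIO_DATA(QD_PIO_0_BASE, 0x0);   /* drive the PIO outputs low */
    delay();
    // printf("Hello NIOS II! %d\n",i);
  }
  return 0;
}

/* Busy-wait: 100000 iterations of an empty counting loop */
void delay(void)
{
  alt_u32 i = 0;
  while (i < 100000)
  {
    i++;
  }
}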

The output waveform has a 30 ms period, and my clock frequency is 100 MHz, so it looks like each pass through the loop (the "i++" plus the compare and branch) takes about 15 clock cycles. Is that right?

How can I find out how many cycles each instruction takes in my project?
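
A quick sanity check of that estimate, as a minimal sketch: one output period contains two delay() calls of 100000 iterations each, so dividing the clock cycles in one period by 200000 gives the average cycles per iteration. The helper name and its parameters are made up for illustration, and the few cycles spent on the PIO writes and the function calls are ignored.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the HAL): average clock cycles per
 * delay-loop iteration, derived from the measured output period.
 * period_us * clk_mhz is the number of clock cycles in one period.
 * Integer division, so the result is truncated to whole cycles. */
uint64_t cycles_per_iteration(uint64_t period_us, uint64_t clk_mhz,
                              uint64_t delays_per_period,
                              uint64_t iterations_per_delay)
{
    uint64_t cycles_per_period = period_us * clk_mhz;
    return cycles_per_period / (delays_per_period * iterations_per_delay);
}
```

With the numbers from this post, cycles_per_iteration(30000, 100, 2, 100000) gives 15, so the 15 cycles are for one whole loop iteration rather than for the increment alone.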

Altera_Forum · ‎01-05-2016

If you are worried about how long NIOS instructions take, be aware that NIOS is a very slow processor. The free one is staggeringly slow and inefficient. If this is a concern, use one of the SoC chips with a built-in ARM processor, or write your algorithm in Verilog or VHDL. Almost any external micro will be faster than NIOS as well.

Altera_Forum · ‎01-05-2016

Thanks for your reply.

I am not worried about it; I just want to know how long NIOS instructions take, since that may be helpful.

I read the objdump file, which looks like assembly language. For the i++ loop, it shows:

void delay(void)
{
alt_u32 i =0;
while(i < 100000)
  80031c: e0ffff17 ldw r3,-4(fp)
  800320: 008000b4 movhi r2,2
  800324: 10a1a7c4 addi r2,r2,-31073
  800328: 10fff92e bgeu r2,r3,800310 <__reset+0xff7f8310>
{
i++;
}
  80032c: e037883a mov sp,fp
  800330: df000017 ldw fp,0(sp)
  800334: dec00104 addi sp,sp,4
  800338: f800283a ret

Does each line cost one clock cycle?

Altera_Forum · ‎01-05-2016

https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/literature/hb/nios2/n2cpu_nii5v1.pdf

See "Instruction Performance" on page 5-11, 5-19, or 5-21 depending on what core you're using.

Your question was about instruction performance, but if you really just care about higher-level C function/loop execution times, AN391 is a good read: https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/literature/an/an391.pdf The Performance Counter IP block in particular is very useful.

Many things can be done in a single cycle. But getting the compiler to emit the best code, and constructing optimized hardware, can all become a small research project by themselves.

For example, if you rewrote your delay() in a form that GCC likes a little better, it looks like it would average 3 cycles per loop iteration on an "F" core:


void delay(void)
{
  register int i =0;
  const register int limit = 100000;
  for(i=0; i < limit; i++) {
  }
}

And the assembly (gcc -S foo.c), where .L3 is the loop iterator increment, followed by the .L2 "blt" compare against 100000:


delay:
        addi    sp, sp, -12
        stw     fp, 8(sp)
        stw     r17, 4(sp)
        stw     r16, 0(sp)
        addi    fp, sp, 8
        mov     r17, zero
        movhi   r16, 2
        addi    r16, r16, -31072
        mov     r17, zero
        br      .L2
.L3:
        addi    r17, r17, 1
.L2:
        blt     r17, r16, .L3
        addi    sp, fp, -8
        ldw     fp, 8(sp)
        ldw     r17, 4(sp)
        ldw     r16, 0(sp)
        addi    sp, sp, 12
        ret
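
In case the constants in the two listings look odd: Nios II instructions carry only 16-bit immediates, so the compiler builds the 32-bit loop limit with movhi (sets the upper 16 bits) followed by addi (adds a sign-extended 16-bit value). A minimal sketch of that arithmetic, with a made-up helper name:

```c
#include <stdint.h>

/* Hypothetical helper: the value left in a register by
 *   movhi rX, hi        ; rX = hi << 16
 *   addi  rX, rX, lo    ; rX = rX + sign-extended 16-bit lo
 * The addition wraps modulo 2^32, as it does in the 32-bit register. */
uint32_t movhi_addi_value(uint16_t hi, int16_t lo)
{
    return ((uint32_t)hi << 16) + (uint32_t)(int32_t)lo;
}
```

movhi_addi_value(2, -31073) is 99999, which the objdump listing compares with bgeu (unsigned 99999 >= i, which is the same test as i < 100000). movhi_addi_value(2, -31072) is 100000, which the listing above compares directly with blt (i < 100000).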