Multiplying a register by four

Altera_Forum · ‎08-24-2004

Hi all,

I have just started using NIOS II, and I have a small question:

What is the best way to multiply a register by four?

So far I have found the following alternatives (suppose r16 contains the value to be multiplied):

1) Generated by gcc when accessing an array of 32-bit integers. Does the compiler also use this instruction on the small and on the economic version of Nios II?

muli r16, r16, 4

2)

addi r16, r16, r16

3)

slli r16, r16, 2
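
To illustrate where the scaling in alternative 1 comes from, here is a minimal C sketch of indexing an array of 32-bit integers. The function names are hypothetical and chosen only for this example; which instruction the compiler actually emits for the scaling depends on the target and the optimization settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Ordinary indexing: the compiler scales i by sizeof(int32_t) == 4. */
static int32_t load_elem(const int32_t *base, size_t i)
{
    return base[i];
}

/* The same access with the scaling written out by hand:
   byte offset = i * 4, expressed here as a shift left by 2. */
static int32_t load_elem_manual(const int32_t *base, size_t i)
{
    const char *bytes = (const char *)base;
    return *(const int32_t *)(bytes + (i << 2));
}
```

Both functions read the same element; the second only makes the multiply by four visible.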

Best Regards,

Paolo

Altera_Forum · ‎08-24-2004

Ehm... of course it was:

2)

add r16,r16,r16

Paolo

Altera_Forum · ‎08-24-2004

On the Economic version you don't get multiplier hardware.

On the other two versions you have a multiplier that also performs shift operations (if you shift left by 2, i.e. multiply by 4, I wouldn't doubt that you end up just multiplying anyway).

Your solutions 1 and 3 probably take exactly the same amount of time, whereas number 2 would take longer, I would assume, since I don't see them being able to add all three in parallel (maybe they can).

Either way you're probably talking 1 clock cycle versus 2 or 3 cycles anyway. Hopefully you don't need better performance than that.
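
For reference, the three alternatives written out as a small C sketch (function names are made up for this example). Note that a single add of a register to itself only doubles the value, so the add variant needs two dependent adds to reach a factor of four. Unsigned arithmetic is used so that wraparound is well defined in C.

```c
#include <stdint.h>

/* Alternative 1: multiply by the constant 4 (muli). */
static uint32_t times4_mul(uint32_t x)
{
    return x * 4u;
}

/* Alternative 2: two dependent adds; one add alone gives only 2x. */
static uint32_t times4_add(uint32_t x)
{
    x = x + x;   /* 2x */
    x = x + x;   /* 4x */
    return x;
}

/* Alternative 3: shift left by two bit positions. */
static uint32_t times4_shift(uint32_t x)
{
    return x << 2;
}
```

All three return the same value modulo 2^32; they differ only in how many instructions and cycles they cost on a given core.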

Altera_Forum · ‎08-25-2004

OK, I think I will use option 3, which even on the economic version of the Nios II is able to shift in 2 cycles...

Thanks!!!

Paolo

Altera_Forum · ‎08-25-2004

Be careful when shifting signed numbers (you don't want to modify the sign bit).
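
A minimal C sketch of that caution, with hypothetical helper names. In two's complement, a left shift by two matches a multiply by four as long as the product still fits in 32 bits; the sign bit is only corrupted when the value is too large in magnitude. Because left-shifting a negative signed value is undefined in C, the sketch multiplies instead and checks the range first.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when x * 4 fits in int32_t, i.e. scaling by four would not
   push significant bits into or past the sign bit. */
static bool times4_is_safe(int32_t x)
{
    return x >= INT32_MIN / 4 && x <= INT32_MAX / 4;
}

/* Multiply by four; call only when times4_is_safe(x) is true. */
static int32_t times4(int32_t x)
{
    return x * 4;
}
```

For small positive array indexes the check always passes, so the concern mostly matters for general-purpose arithmetic.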

Altera_Forum · ‎08-26-2004

The best option depends on how soon the result of the multiply is used by other instructions, which FPGA family you are using, and which Nios II you are using.

In general, option 2 is the best since it has a throughput of 0.5 cycles and a latency of 2 cycles in all combinations.

BTW, a throughput of 0.5 cycles means you get a multiply result every 1/0.5 = 2 cycles and a latency of 2 cycles means the result isn't ready for 2 cycles.

Let me explain more. On Stratix I and Stratix II devices, the Nios II/s and Nios II/f use the hardware multipliers to perform multiplies. The throughput is one multiply per cycle but with a 3 cycle latency. If you try to use the result of a multiply within one or two cycles, the dependent instruction is stalled, which results in a throughput of 0.33 cycles and a latency of 3 cycles.

For example, this code:

muli r16, r16, 4

xor r4, r5, r16

will take 4 cycles to execute because the xor is stalled for 2 cycles since it uses the muli result.

However, this code:

muli r16, r16, 4

muli r17, r17, 4

muli r18, r18, 4

xor r4, r5, r16

will also take 4 cycles to execute because the non-dependent muli to r17 and r18 (or any other non-dependent instructions) don't stall and the xor that uses r16 is far enough away from the muli to r16 to not stall.

So, this code achieves multiplies with a throughput of 1 cycle and the latency of 3 cycles is hidden by the non-dependent instructions.
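
The cycle counts in the two examples above can be reproduced with a toy model in C. This is only a sketch under simplifying assumptions: an in-order pipeline that issues at most one instruction per cycle and stalls a dependent instruction until its source registers are ready. The struct and function names are hypothetical, and the latencies (3 for muli, 1 for xor) are taken from the description above rather than from a real Nios II core model.

```c
#include <stddef.h>

#define NUM_REGS 32

/* Hypothetical instruction record for this toy model. */
struct insn {
    int dst;      /* destination register number */
    int src_a;    /* first source register */
    int src_b;    /* second source register, or -1 if unused */
    int latency;  /* cycles until the result can be consumed */
};

/* Count the cycles needed to issue all n instructions in order. */
static int count_cycles(const struct insn *prog, size_t n)
{
    int ready[NUM_REGS] = {0};  /* first cycle each register may be read */
    int cycle = 0;              /* next free issue slot */

    for (size_t i = 0; i < n; i++) {
        int issue = cycle;
        /* Stall until both sources are available. */
        if (ready[prog[i].src_a] > issue)
            issue = ready[prog[i].src_a];
        if (prog[i].src_b >= 0 && ready[prog[i].src_b] > issue)
            issue = ready[prog[i].src_b];
        ready[prog[i].dst] = issue + prog[i].latency;
        cycle = issue + 1;
    }
    return cycle;
}
```

Feeding it the muli followed by the dependent xor gives 4 cycles, and so does the sequence of three muli followed by the xor, which matches the counts given above. Real hardware has further effects (cache misses, branches) that this model ignores.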

Option 3 (using a shift) has the same performance as the multiply on Nios II/f and Nios II/s on Stratix I and Stratix II because we actually use the hardware multiplier to perform shifts and rotates.

Altera_Forum · ‎08-26-2004

Hi,

Many thanks to everyone for the good information you gave me...

To answer the comment about being careful with signed integers: currently I'm using these instructions in assembler to address some small vectors of integers, so I expect the indexes to be small positive numbers :-)

Thanks again,

Paolo

Altera_Forum · ‎08-27-2004

Another question related to this topic...

Since stalls in the pipeline influence the performance of the code I'm writing, is it possible to know, for a given Nios II hardware configuration, whether and where a sequence of assembler instructions is stalling due to register dependencies?

Thanks again for all,

Paolo

Altera_Forum · ‎08-27-2004

If you can run on modelsim, use the w command in modelsim to display waves.

Then you can see the exact timing of your instructions.

Altera_Forum · ‎08-30-2004

Somewhere in the big Nios II doc they give you the timing for the assembly instructions, but like James said, there can be exceptions in many cases.

Sounds like you need/want every clock cycle you can get, so modelsim should be a lot of help to you (I have never used it, but it looked like it could give you a lot of info).