Nios® V/II Embedded Design Suite (EDS)
Support for Embedded Development Tools, Processors (SoCs and Nios® V/II processor), Embedded Development Suites (EDSs), Boot and Configuration, Operating Systems, C and C++

HW Multiplier

Altera_Forum
Honored Contributor II
1,522 Views

Who was the xyz! who took out the hardware multiplier in NIOS2 ???? 

 That person did not think. I know the plan it to use the HW Mult units in CycloneII BUT consider the following....  NIOS is fluid ! With help of its flexibility I have fixed faults and upgraded my clients systems with ease. So I am a small timer. In the last 3 months I have only sold 92 boards. 50 of those are in vehicles 500 -900km away from me. In the last few weeks we noticed a few small timing problems that will be eradicated with NIOS II. BUT I need the HW Mult. command otherwize the speed gain is wasted. I cannot quickly rip out 50 units and replace them with fresh Cyclone IIs. Plus the design is now stable(according to my client) and approved. I am awaiting a long term order for 2500 units. For many reasons I will stick with Cyclone I. So no HW multiplier in NIOS II means that I cannot use it.  I have systems out there running on Apex20K100 chips with NIOS I. One such system, with about 2 years of continuous upgrading, has run out of speed. This board clocks at about 20MHz. NIOS II will be like Viagra to this system, but no HW multiplier !  I know I am not the only one. So here is what I propose.... If this is already in action I appologize.   Somewhere in the header file there is a directive for a HW multiplier. It is currently set to 0. If I set it to 1 I believe my compiler will compile code with 'mul', 'imul',and 'mulxss/mulxuu'. I know the difference is that the new multiply command is 32x32 bit compared to NIOS I's 16x16. But I will rather live with a delay of 8 to 16 cycles, for a multicycle multiplier, than a software calculation. I remember that the old HW mul command used up about 400 LEs. That I can still live with that on all my designs.  so, is it possible that altera can publish the multiplier header/ instantiation so that those with time and brains can create a hdl solution for this problem?   Victor

0 Kudos
10 Replies
Altera_Forum
Honored Contributor II
575 Views

> Who was the idiot who took out the hardware multiplier in NIOS2 [for Cyclone] ???? 

 

That would be me! (Actually there were three of us who decided that, but hey, I'm here.) And you are absolutely right--about the need for a multiplier in Nios II for Cyclone. 

 

The good news is that it's in the next release of the dev. kit (enabled by some Quartus II 4.1 synthesis enhancements). And further good news is that the release is "near the end of the year," as Alan says in another thread. 

 

But since you need something now, elan describes his solution in the thread "hw multiplier for Cyclone". 

 

And more related info is in the thread "NIOS II DMIPS". 

 

An added bonus is that the 32-bit multiplier in the Nios II/s and Nios II/f cores in Cyclone should be faster than 32-bit multiplication in the original Nios core, because in Nios II it's a full hardware solution, while in the original Nios it's a software routine that uses the 16x16 multiply instruction three times. 

 

Kerry 

Altera
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

Kerry 

 

http://forum.niosforum.com/work2/style_emoticons/<#EMO_DIR#>/wink.gif Sorry for insulting you. 

 

 

Will have a look at those threads. 

 

 

 

Thanks 

 

Victor
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

HI all people who wants a fast multiply now, 

 

Even without adding a hardware multiplier, it is possible to gain some performance on the multiply. I disassembled the multiply routine (__mulsi3): 

 

__mulsi3:        addi    sp,sp,-8                   <- useless, can be removed        stw     fp,4(sp)                   <- useless, can be removed        mov     r3,zero        mov     fp,sp                       <- useless, can be removed        beq     r4,zero,mul_30 mul_14:        andi    r2,r4,1        cmpeq   r2,r2,zero        srli    r4,r4,1        bne     r2,zero,mul_28        add     r3,r3,r5 mul_28:        slli    r5,r5,1        bne     r4,zero,mul_14 mul_30:        mov     r2,r3        ldw     fp,4(sp)                <- useless, can be removed        addi    sp,sp,8                <- useless, can be removed        ret 

 

This routine can be optimized a lot. The most important optimization that can be done is removing the stack frame stuff. This removes 5 instructions without any problem. This together with two minor optimizations results in the following function: 

 

__mulsi3:        mov     r2,zero        beq     r4,zero,mul_30 mul_14:        andi    r3,r4,1        srli      r4,r4,1        beq     r3,zero,mul_28        add     r2,r2,r5 mul_28:        slli      r5,r5,1        bne     r4,zero,mul_14 mul_30:        ret 

 

Just put this function in an assember source file and it should overrule the default implementation from the library. 

For us this resulted in a performance gain of about 10%. 

 

When this is still not good enough, make a custom instruction and implement the the function like this: 

 

__mulsi3:         custom  0,r2,r4,r5         ret 

 

I used an opencore multiplier and changed it a bit to do a 32x32 multiply in 4 clock cycles (see below). Currently the upper 32-bits of the result are thrown away, so if you also want fast multiplies with &#39;long long&#39; results, some work still has to be done to make it possible to get the upper 32-bits and correctly implement the __muldi3 function. 

The multiplier takes 708 LEs in my Cyclone EP1C3 design. I think this is not the best implemenattion of a multiplier, but it works fast and it did only take a few hours to make. 

 

Have fun with it, 

 

Tim Brugman. 

 

------------------------------------------------------------------------------- --  File: mult_unit.vhd                                                      -- --                                                                           -- --  Copyright © Deversys, 2003                                             -- --                                                                           -- -- function : entity and architecture for multiplication algorithms testing  -- --                                                                           -- --  Author: Vladimir V. Erokhin, PhD,                                        -- --         e-mails: vladvas@deversys.com; vladvas@verilog.ru;                -- --                                                                           -- ---------------  Revision History      ---------------------------------------- --                                                                           -- --     Date  Engineer               Description                        -- --                                                                           -- --      041108   T. Brugman               Customized to make a fast multiply -- --                                        for Altera NiosII                  -- ------------------------------------------------------------------------------- library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity MULT_UNIT is     port (     clk: in STD_LOGIC;     clk_en: in STD_LOGIC;     reset: in STD_LOGIC;     start: in STD_LOGIC;  dataa: in STD_LOGIC_VECTOR (31  downto 0);  datab: in STD_LOGIC_VECTOR (31 downto 0);  done: out std_logic;  result: out STD_LOGIC_VECTOR (31 downto 0)  ); end MULT_UNIT; architecture RTL of MULT_UNIT is function MULT_HIER(MULTIPLICAND: STD_LOGIC_VECTOR; MULTIPLIER: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR is variable MUL_RESULT_1, MUL_RESULT_2, MUL_RESULT_3, MUL_RESULT_4: STD_LOGIC_VECTOR(MULTIPLICAND&#39;length - 1 downto 0); variable RESULT: STD_LOGIC_VECTOR(MULTIPLICAND&#39;length * 2 - 1 downto 0); variable HIGH_HALF_OF_MCD, LOW_HALF_OF_MCD, HIGH_HALF_OF_MER, LOW_HALF_OF_MER: STD_LOGIC_VECTOR((MULTIPLICAND&#39;length +1)/2 - 1 downto 0); variable MUL_RESULT_1_4: STD_LOGIC_VECTOR((MULTIPLICAND&#39;length+1)/2*3 - 1 downto 0); variable TEMP1, TEMP2: STD_LOGIC_VECTOR(1 downto 0); variable TEMP3: STD_LOGIC_VECTOR(2 downto 0); begin  if MULTIPLICAND&#39;length = 2 then      TEMP1 := (MULTIPLIER(1) and MULTIPLICAND(0)) & (MULTIPLIER(0) and MULTIPLICAND(0));      TEMP2 := (MULTIPLIER(1) and MULTIPLICAND(1)) & (MULTIPLIER(0) and MULTIPLICAND(1));      if TEMP1(1) = &#39;1&#39; then         TEMP3 := ((MULTIPLICAND(1) and MULTIPLIER(1) and MULTIPLIER(0)) & (TEMP2 + 1));      else         TEMP3 := (&#39;0&#39; & TEMP2);      end if;      return TEMP3 & TEMP1(0);   else      HIGH_HALF_OF_MCD := MULTIPLICAND(MULTIPLICAND&#39;high downto (MULTIPLICAND&#39;high + 1)/2);      LOW_HALF_OF_MCD := MULTIPLICAND((MULTIPLICAND&#39;high+1)/2 - 1 downto 0);      HIGH_HALF_OF_MER := MULTIPLIER(MULTIPLIER&#39;high downto (MULTIPLIER&#39;high + 1)/2);      LOW_HALF_OF_MER := MULTIPLIER((MULTIPLIER&#39;high+1)/2 - 1 downto 0);            MUL_RESULT_1 := MULT_HIER(LOW_HALF_OF_MCD, LOW_HALF_OF_MER);      MUL_RESULT_2 := MULT_HIER(LOW_HALF_OF_MCD, HIGH_HALF_OF_MER);      MUL_RESULT_3 := MULT_HIER(HIGH_HALF_OF_MCD, LOW_HALF_OF_MER);      MUL_RESULT_4 := MULT_HIER(HIGH_HALF_OF_MCD, HIGH_HALF_OF_MER);      MUL_RESULT_1_4 := MUL_RESULT_4 & MUL_RESULT_1(MULTIPLICAND&#39;high downto (MULTIPLICAND&#39;high+1)/2);      RESULT := (MUL_RESULT_1_4 + ((&#39;0&#39; & MUL_RESULT_2) + MUL_RESULT_3)) & MUL_RESULT_1((MULTIPLICAND&#39;high+1)/2-1 downto 0);      return(RESULT);   end if;     end MULT_HIER; TYPE TState IS (IDLE, MUL1, MUL2, MUL3); SIGNAL state : TState; SIGNAL result_1, result_2, result_3, mul16_result : STD_LOGIC_VECTOR(31 downto 0); SIGNAL mul_result: STD_LOGIC_VECTOR(63 downto 0); SIGNAL reg_a, reg_b: STD_LOGIC_VECTOR(15 downto 0); BEGIN process(reset,clk,mul16_result,state,dataa,datab,mul_result)  VARIABLE result_1_4: STD_LOGIC_VECTOR(47 downto 0);  VARIABLE arg_a, arg_b : STD_LOGIC_VECTOR(15 downto 0); begin  case state is    when IDLE =>      arg_a := dataa(15 downto 0);      arg_b := datab(15 downto 0);    when MUL1 =>      arg_a := dataa(15 downto 0);      arg_b := datab(31 downto 16);    when MUL2 =>      arg_a := dataa(31 downto 16);      arg_b := datab(15 downto 0);    when MUL3 =>      arg_a := dataa(31 downto 16);      arg_b := datab(31 downto 16);  end case;  mul16_result <= MULT_HIER(arg_a, arg_b);  if reset=&#39;1&#39; then    state <= IDLE;    done <= &#39;0&#39;;  elsif clk=&#39;1&#39; and clk&#39;event then    if clk_en = &#39;1&#39; then      case state is        when IDLE =>          done <= &#39;0&#39;;          result_1 <= mul16_result;          if start = &#39;1&#39; then            state <= MUL1;          end if;        when MUL1 =>          result_2 <= mul16_result;          state <= MUL2;        when MUL2 =>          result_3 <= mul16_result;          state <= MUL3;        when MUL3 =>          result_1_4 := mul16_result & result_1(31 downto 16);          mul_result <= (result_1_4 + ((&#39;0&#39; & result_2) + result_3)) & result_1(15 downto 0);          done <= &#39;1&#39;;          state <= IDLE;      end case;    end if;  end if;  result <= mul_result(31 downto 0); end process; END RTL;
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

Tim  

 

Excellent !! 

 

http://forum.niosforum.com/work2/style_emoticons/<#EMO_DIR#>/biggrin.gif  

 

 

Only one thing to remember... 

 

I have entered the &#39;overriding&#39; code in one of my SDK (I still using legacy code) libraries and compiled via &#39;make all&#39; in the lib directory. 

The thing to remember is to make the function global otherwise the old __mulsi3 will stick like mud on a bumper! 

 

.global __mulsi3 .type __mulsi3, @function __mulsi3:              custom 0,r2,r4,r5 ret 

 

 

I haven&#39;t done major code benchmarks but I did see serious improvements with the custom instruction. 

 

Be careful of the compiler! It is smart. Don&#39;t just multiply something and then check the speed. Make sure that your code can&#39;t be optimized out into add instructions. Look at your .objdump to see if the code really uses <__mulsi3>. 

 

 

"Now go forth and multiply..." 

 

 

Victor
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

Tim: 

 

What fmax on which speed grade can be achieved with this synchronous multiplier? 

We are getting around 70MHz on a Cyclone C12-8 with our asynchronous multicycle command. 

 

Chris
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

Chris, 

 

I got 58MHz on a Cyclone C3-8 instead of 70MHz without the multiplier. As I allready said, it is not the best implementation, just a quick hack based on an opencore. If you have a better solution, please share it with the community... 

 

Greetz, 

Tim Brugman.
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

 

--- Quote Start ---  

originally posted by timbr@Nov 8 2004, 09:27 AM 

hi all people who wants a fast multiply now, 

 

even without adding a hardware multiplier, it is possible to gain some performance on the multiply. i disassembled the multiply routine (__mulsi3): 

 

__mulsi3: â  â  â  â addi â  â sp,sp,-8 â  â  â  â  â  â  â  â  â  <- useless, can be removed â  â  â  â stw â  â  fp,4(sp) â  â  â  â  â  â  â  â  â  <- useless, can be removed â  â  â  â mov â  â  r3,zero â  â  â  â mov â  â  fp,sp â  â  â  â  â  â  â  â  â  â  â  <- useless, can be removed â  â  â  â beq â  â  r4,zero,mul_30 mul_14: â  â  â  â andi â  â r2,r4,1 â  â  â  â cmpeq â  r2,r2,zero â  â  â  â srli â  â r4,r4,1 â  â  â  â bne â  â  r2,zero,mul_28 â  â  â  â add â  â  r3,r3,r5 mul_28: â  â  â  â slli â  â r5,r5,1 â  â  â  â bne â  â  r4,zero,mul_14 mul_30: â  â  â  â mov â  â  r2,r3 â  â  â  â ldw â  â  fp,4(sp) â  â  â  â  â  â  â  â <- useless, can be removed â  â  â  â addi â  â sp,sp,8 â  â  â  â  â  â  â  â <- useless, can be removed â  â  â  â ret 

 

this routine can be optimized a lot. the most important optimization that can be done is removing the stack frame stuff. this removes 5 instructions without any problem. this together with two minor optimizations results in the following function: 

 

__mulsi3: â  â  â  â mov â  â  r2,zero â  â  â  â beq â  â  r4,zero,mul_30 mul_14: â  â  â  â andi â  â r3,r4,1 â  â  â  â srli â  â  â r4,r4,1 â  â  â  â beq â  â  r3,zero,mul_28 â  â  â  â add â  â  r2,r2,r5 mul_28: â  â  â  â slli â  â  â r5,r5,1 â  â  â  â bne â  â  r4,zero,mul_14 mul_30: â  â  â  â ret 

--- Quote End ---  

 

 

I think you get the same result if you should be able to compile the libraries with the -f-omit-frame-pointer compiler option.  

 

Has anyone done this already, is it hard to do for someone that has practically no experience in porting or making the gnu toolchain? 

 

Stefaan
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

Another try : 

 

When implemeting the simple sequence as above as a custom instruction, it saves already a lot of time, for only 280 LEs in a cyclone. 

 

I have measured the time for this 

 

int r; volatile int* pr = &r; //prevent optimalisations for (int a = -1000; a < 1000; a++)   for (int b = 1000; b >-1000; b--)  //count backwards : prevent optimalisations      *pr = a*b; 

 

Without cutom instruction : this takes about 22.4 seconds at 50Mc with the standard NiosII 

With the custominstruction in the mulsi function : 4.3 seconds (without frame pointer stuff) 

with the custominstruction inlined (without the additinal function call and ret instruction) : 2.4 seconds. 

 

This is a very big advantage for only 280 LE&#39;s I think. (22.4 seconds down to 4.3 seconds for 4.000.000 muls and overhead for the loops) 

 

But I think the compiler still thinks that the cost for a multiply is very high, so it wants to optimise to shifts and adds where possible. This can reduce the benefit of this code a lot. 

 

If anyone is interested, I&#39;ll post the verilog code for the custom instruction.
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

Stefaan  

 

Please feel free to post it 

 

 

 

Victor
0 Kudos
Altera_Forum
Honored Contributor II
575 Views

See post your own ip section of the forum. 

 

Stefaan
0 Kudos
Reply