Who was the xyz! who took out the hardware multiplier in NIOS2 ????
That person did not think. I know the plan it to use the HW Mult units in CycloneII BUT consider the following.... NIOS is fluid ! With help of its flexibility I have fixed faults and upgraded my clients systems with ease. So I am a small timer. In the last 3 months I have only sold 92 boards. 50 of those are in vehicles 500 -900km away from me. In the last few weeks we noticed a few small timing problems that will be eradicated with NIOS II. BUT I need the HW Mult. command otherwize the speed gain is wasted. I cannot quickly rip out 50 units and replace them with fresh Cyclone IIs. Plus the design is now stable(according to my client) and approved. I am awaiting a long term order for 2500 units. For many reasons I will stick with Cyclone I. So no HW multiplier in NIOS II means that I cannot use it. I have systems out there running on Apex20K100 chips with NIOS I. One such system, with about 2 years of continuous upgrading, has run out of speed. This board clocks at about 20MHz. NIOS II will be like Viagra to this system, but no HW multiplier ! I know I am not the only one. So here is what I propose.... If this is already in action I appologize. Somewhere in the header file there is a directive for a HW multiplier. It is currently set to 0. If I set it to 1 I believe my compiler will compile code with 'mul', 'imul',and 'mulxss/mulxuu'. I know the difference is that the new multiply command is 32x32 bit compared to NIOS I's 16x16. But I will rather live with a delay of 8 to 16 cycles, for a multicycle multiplier, than a software calculation. I remember that the old HW mul command used up about 400 LEs. That I can still live with that on all my designs. so, is it possible that altera can publish the multiplier header/ instantiation so that those with time and brains can create a hdl solution for this problem? Victor
> Who was the idiot who took out the hardware multiplier in NIOS2 [for Cyclone] ????That would be me! (Actually there were three of us who decided that, but hey, I'm here.) And you are absolutely right--about the need for a multiplier in Nios II for Cyclone. The good news is that it's in the next release of the dev. kit (enabled by some Quartus II 4.1 synthesis enhancements). And further good news is that the release is "near the end of the year," as Alan says in another thread. But since you need something now, elan describes his solution in the thread "hw multiplier for Cyclone". And more related info is in the thread "NIOS II DMIPS". An added bonus is that the 32-bit multiplier in the Nios II/s and Nios II/f cores in Cyclone should be faster than 32-bit multiplication in the original Nios core, because in Nios II it's a full hardware solution, while in the original Nios it's a software routine that uses the 16x16 multiply instruction three times. Kerry Altera
HI all people who wants a fast multiply now,Even without adding a hardware multiplier, it is possible to gain some performance on the multiply. I disassembled the multiply routine (__mulsi3):
__mulsi3: addi sp,sp,-8 <- useless, can be removed stw fp,4(sp) <- useless, can be removed mov r3,zero mov fp,sp <- useless, can be removed beq r4,zero,mul_30 mul_14: andi r2,r4,1 cmpeq r2,r2,zero srli r4,r4,1 bne r2,zero,mul_28 add r3,r3,r5 mul_28: slli r5,r5,1 bne r4,zero,mul_14 mul_30: mov r2,r3 ldw fp,4(sp) <- useless, can be removed addi sp,sp,8 <- useless, can be removed retThis routine can be optimized a lot. The most important optimization that can be done is removing the stack frame stuff. This removes 5 instructions without any problem. This together with two minor optimizations results in the following function:
__mulsi3: mov r2,zero beq r4,zero,mul_30 mul_14: andi r3,r4,1 srli r4,r4,1 beq r3,zero,mul_28 add r2,r2,r5 mul_28: slli r5,r5,1 bne r4,zero,mul_14 mul_30: retJust put this function in an assember source file and it should overrule the default implementation from the library. For us this resulted in a performance gain of about 10%. When this is still not good enough, make a custom instruction and implement the the function like this:
__mulsi3: custom 0,r2,r4,r5 retI used an opencore multiplier and changed it a bit to do a 32x32 multiply in 4 clock cycles (see below). Currently the upper 32-bits of the result are thrown away, so if you also want fast multiplies with 'long long' results, some work still has to be done to make it possible to get the upper 32-bits and correctly implement the __muldi3 function. The multiplier takes 708 LEs in my Cyclone EP1C3 design. I think this is not the best implemenattion of a multiplier, but it works fast and it did only take a few hours to make. Have fun with it, Tim Brugman.
------------------------------------------------------------------------------- -- File: mult_unit.vhd -- -- -- -- Copyright © Deversys, 2003 -- -- -- -- function : entity and architecture for multiplication algorithms testing -- -- -- -- Author: Vladimir V. Erokhin, PhD, -- -- e-mails: firstname.lastname@example.org; email@example.com; -- -- -- --------------- Revision History ---------------------------------------- -- -- -- Date Engineer Description -- -- -- -- 041108 T. Brugman Customized to make a fast multiply -- -- for Altera NiosII -- ------------------------------------------------------------------------------- library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity MULT_UNIT is port ( clk: in STD_LOGIC; clk_en: in STD_LOGIC; reset: in STD_LOGIC; start: in STD_LOGIC; dataa: in STD_LOGIC_VECTOR (31 downto 0); datab: in STD_LOGIC_VECTOR (31 downto 0); done: out std_logic; result: out STD_LOGIC_VECTOR (31 downto 0) ); end MULT_UNIT; architecture RTL of MULT_UNIT is function MULT_HIER(MULTIPLICAND: STD_LOGIC_VECTOR; MULTIPLIER: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR is variable MUL_RESULT_1, MUL_RESULT_2, MUL_RESULT_3, MUL_RESULT_4: STD_LOGIC_VECTOR(MULTIPLICAND'length - 1 downto 0); variable RESULT: STD_LOGIC_VECTOR(MULTIPLICAND'length * 2 - 1 downto 0); variable HIGH_HALF_OF_MCD, LOW_HALF_OF_MCD, HIGH_HALF_OF_MER, LOW_HALF_OF_MER: STD_LOGIC_VECTOR((MULTIPLICAND'length +1)/2 - 1 downto 0); variable MUL_RESULT_1_4: STD_LOGIC_VECTOR((MULTIPLICAND'length+1)/2*3 - 1 downto 0); variable TEMP1, TEMP2: STD_LOGIC_VECTOR(1 downto 0); variable TEMP3: STD_LOGIC_VECTOR(2 downto 0); begin if MULTIPLICAND'length = 2 then TEMP1 := (MULTIPLIER(1) and MULTIPLICAND(0)) & (MULTIPLIER(0) and MULTIPLICAND(0)); TEMP2 := (MULTIPLIER(1) and MULTIPLICAND(1)) & (MULTIPLIER(0) and MULTIPLICAND(1)); if TEMP1(1) = '1' then TEMP3 := ((MULTIPLICAND(1) and MULTIPLIER(1) and MULTIPLIER(0)) & (TEMP2 + 1)); else TEMP3 := ('0' & TEMP2); end if; return TEMP3 & TEMP1(0); else HIGH_HALF_OF_MCD := MULTIPLICAND(MULTIPLICAND'high downto (MULTIPLICAND'high + 1)/2); LOW_HALF_OF_MCD := MULTIPLICAND((MULTIPLICAND'high+1)/2 - 1 downto 0); HIGH_HALF_OF_MER := MULTIPLIER(MULTIPLIER'high downto (MULTIPLIER'high + 1)/2); LOW_HALF_OF_MER := MULTIPLIER((MULTIPLIER'high+1)/2 - 1 downto 0); MUL_RESULT_1 := MULT_HIER(LOW_HALF_OF_MCD, LOW_HALF_OF_MER); MUL_RESULT_2 := MULT_HIER(LOW_HALF_OF_MCD, HIGH_HALF_OF_MER); MUL_RESULT_3 := MULT_HIER(HIGH_HALF_OF_MCD, LOW_HALF_OF_MER); MUL_RESULT_4 := MULT_HIER(HIGH_HALF_OF_MCD, HIGH_HALF_OF_MER); MUL_RESULT_1_4 := MUL_RESULT_4 & MUL_RESULT_1(MULTIPLICAND'high downto (MULTIPLICAND'high+1)/2); RESULT := (MUL_RESULT_1_4 + (('0' & MUL_RESULT_2) + MUL_RESULT_3)) & MUL_RESULT_1((MULTIPLICAND'high+1)/2-1 downto 0); return(RESULT); end if; end MULT_HIER; TYPE TState IS (IDLE, MUL1, MUL2, MUL3); SIGNAL state : TState; SIGNAL result_1, result_2, result_3, mul16_result : STD_LOGIC_VECTOR(31 downto 0); SIGNAL mul_result: STD_LOGIC_VECTOR(63 downto 0); SIGNAL reg_a, reg_b: STD_LOGIC_VECTOR(15 downto 0); BEGIN process(reset,clk,mul16_result,state,dataa,datab,mul_result) VARIABLE result_1_4: STD_LOGIC_VECTOR(47 downto 0); VARIABLE arg_a, arg_b : STD_LOGIC_VECTOR(15 downto 0); begin case state is when IDLE => arg_a := dataa(15 downto 0); arg_b := datab(15 downto 0); when MUL1 => arg_a := dataa(15 downto 0); arg_b := datab(31 downto 16); when MUL2 => arg_a := dataa(31 downto 16); arg_b := datab(15 downto 0); when MUL3 => arg_a := dataa(31 downto 16); arg_b := datab(31 downto 16); end case; mul16_result <= MULT_HIER(arg_a, arg_b); if reset='1' then state <= IDLE; done <= '0'; elsif clk='1' and clk'event then if clk_en = '1' then case state is when IDLE => done <= '0'; result_1 <= mul16_result; if start = '1' then state <= MUL1; end if; when MUL1 => result_2 <= mul16_result; state <= MUL2; when MUL2 => result_3 <= mul16_result; state <= MUL3; when MUL3 => result_1_4 := mul16_result & result_1(31 downto 16); mul_result <= (result_1_4 + (('0' & result_2) + result_3)) & result_1(15 downto 0); done <= '1'; state <= IDLE; end case; end if; end if; result <= mul_result(31 downto 0); end process; END RTL;
TimExcellent !! http://forum.niosforum.com/work2/style_emoticons/<#EMO_DIR#>/biggrin.gif Only one thing to remember... I have entered the 'overriding' code in one of my SDK (I still using legacy code) libraries and compiled via 'make all' in the lib directory. The thing to remember is to make the function global otherwise the old __mulsi3 will stick like mud on a bumper!
.global __mulsi3 .type __mulsi3, @function __mulsi3: custom 0,r2,r4,r5 retI haven't done major code benchmarks but I did see serious improvements with the custom instruction. Be careful of the compiler! It is smart. Don't just multiply something and then check the speed. Make sure that your code can't be optimized out into add instructions. Look at your .objdump to see if the code really uses <__mulsi3>. "Now go forth and multiply..." Victor
Tim:What fmax on which speed grade can be achieved with this synchronous multiplier? We are getting around 70MHz on a Cyclone C12-8 with our asynchronous multicycle command. Chris
Chris,I got 58MHz on a Cyclone C3-8 instead of 70MHz without the multiplier. As I allready said, it is not the best implementation, just a quick hack based on an opencore. If you have a better solution, please share it with the community... Greetz, Tim Brugman.
--- Quote Start --- originally posted by timbr@Nov 8 2004, 09:27 AM hi all people who wants a fast multiply now,
even without adding a hardware multiplier, it is possible to gain some performance on the multiply. i disassembled the multiply routine (__mulsi3):
__mulsi3: â â â â addi â â sp,sp,-8 â â â â â â â â â <- useless, can be removed â â â â stw â â fp,4(sp) â â â â â â â â â <- useless, can be removed â â â â mov â â r3,zero â â â â mov â â fp,sp â â â â â â â â â â â <- useless, can be removed â â â â beq â â r4,zero,mul_30 mul_14: â â â â andi â â r2,r4,1 â â â â cmpeq â r2,r2,zero â â â â srli â â r4,r4,1 â â â â bne â â r2,zero,mul_28 â â â â add â â r3,r3,r5 mul_28: â â â â slli â â r5,r5,1 â â â â bne â â r4,zero,mul_14 mul_30: â â â â mov â â r2,r3 â â â â ldw â â fp,4(sp) â â â â â â â â <- useless, can be removed â â â â addi â â sp,sp,8 â â â â â â â â <- useless, can be removed â â â â ret
this routine can be optimized a lot. the most important optimization that can be done is removing the stack frame stuff. this removes 5 instructions without any problem. this together with two minor optimizations results in the following function:
__mulsi3: â â â â mov â â r2,zero â â â â beq â â r4,zero,mul_30 mul_14: â â â â andi â â r3,r4,1 â â â â srli â â â r4,r4,1 â â â â beq â â r3,zero,mul_28 â â â â add â â r2,r2,r5 mul_28: â â â â slli â â â r5,r5,1 â â â â bne â â r4,zero,mul_14 mul_30: â â â â ret--- Quote End --- I think you get the same result if you should be able to compile the libraries with the -f-omit-frame-pointer compiler option. Has anyone done this already, is it hard to do for someone that has practically no experience in porting or making the gnu toolchain? Stefaan
Another try :When implemeting the simple sequence as above as a custom instruction, it saves already a lot of time, for only 280 LEs in a cyclone. I have measured the time for this
int r; volatile int* pr = &r; //prevent optimalisations for (int a = -1000; a < 1000; a++) for (int b = 1000; b >-1000; b--) //count backwards : prevent optimalisations *pr = a*b;Without cutom instruction : this takes about 22.4 seconds at 50Mc with the standard NiosII With the custominstruction in the mulsi function : 4.3 seconds (without frame pointer stuff) with the custominstruction inlined (without the additinal function call and ret instruction) : 2.4 seconds. This is a very big advantage for only 280 LE's I think. (22.4 seconds down to 4.3 seconds for 4.000.000 muls and overhead for the loops) But I think the compiler still thinks that the cost for a multiply is very high, so it wants to optimise to shifts and adds where possible. This can reduce the benefit of this code a lot. If anyone is interested, I'll post the verilog code for the custom instruction.