Nios2 Performance

Altera_Forum · ‎01-19-2005

I just ran some benchmarks on an ARM922 and a NIOS2/f. Both ran the same code at 90MHz. I benchmarked integer multiply, integer divide, floating point multiply, and floating point divide. The NIOS2/f was configured for LE multiply (Cyclone) and hardware divide. It also had 16K of Icache and 8K of Dcache. The NIOS2 performed as well as, if not better than, the ARM on the integer math operations. However, the ARM smoked the NIOS by a factor of 7 on the floating point math. All optimizations were turned on to full and the NO_INSTRUCTION_EMULATION preprocessor define was used. I also did not use any timers or STDIO, so I would imagine that very few interrupts were happening. The ARM has no floating point processor, but I am told that its floating point library uses its on-board barrel shifter. My question is: can I get better performance? Is there an option that I didn't think of? Is the floating point library that inefficient? If so, is there another one out there? I couldn't find much in the software docs....

Rick

Altera_Forum · ‎01-19-2005

Rick,

I am no expert when it comes to floating point, but I think there is a tool at your disposal that can help answer this: the .objdump. I'm assuming you know how to create one for Nios II -- it will show you what happens in the GNU FP libs... is it possible to get the equivalent for the ARM processor and take a look?

Don't forget that several users have posted various bits of FP hardware to this forum that should integrate nicely with Nios II; perhaps another part of your analysis could include wiring one of these up, comparing performance again, and then factor in the cost (from additional LE usage) to make things fair?

Altera_Forum · ‎01-20-2005

Floating point calculations without floating point hardware are never fast. The fact that the ARM uses a barrel shifter for its floating point calculations would suggest it has a speed advantage. The hardware that you used with the 'f' core was intended for integer math. Tacking a floating point device onto a Nios II core would close the gap.
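
To make that cost concrete, here is a minimal sketch in C of what a software single-precision multiply has to do using integer instructions only. It is not the code from either vendor's library: the name `soft_fmul` is made up for illustration, and the sketch assumes normal, finite inputs, a normal result, and truncation instead of IEEE rounding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified software multiply for IEEE-754 single precision.
   Assumes normal, finite inputs and a normal result; truncates (no rounding). */
float soft_fmul(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);          /* reinterpret the float bits */
    memcpy(&ub, &b, sizeof ub);

    uint32_t ea = (ua >> 23) & 0xFFu;    /* biased exponents */
    uint32_t eb = (ub >> 23) & 0xFFu;
    assert(ea != 0 && ea != 0xFFu);      /* no zero, subnormal, inf or NaN */
    assert(eb != 0 && eb != 0xFFu);

    uint32_t sign = (ua ^ ub) & 0x80000000u;
    int32_t  exp  = (int32_t)ea + (int32_t)eb - 127;

    /* 24-bit significands with the hidden leading 1 restored */
    uint64_t ma = (ua & 0x7FFFFFu) | 0x800000u;
    uint64_t mb = (ub & 0x7FFFFFu) | 0x800000u;
    uint64_t prod = ma * mb;             /* up to 48 bits wide */

    /* Normalise: the product lies in [2^46, 2^48) */
    if (prod & (1ULL << 47)) {
        prod >>= 24;
        exp += 1;
    } else {
        prod >>= 23;
    }
    assert(exp > 0 && exp < 255);        /* result must stay normal */

    uint32_t ur = sign | ((uint32_t)exp << 23) | ((uint32_t)prod & 0x7FFFFFu);
    float r;
    memcpy(&r, &ur, sizeof r);
    return r;
}
```

Even this stripped-down version needs several shifts and masks, a wide multiply, and a data-dependent normalising shift. A real library also has to round and handle zeros, subnormals, infinities and NaNs, so the quality of the library code plausibly matters as much as the hardware underneath it.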

Altera_Forum · ‎01-20-2005

It is obvious that without an FP unit, FP math is an order of magnitude slower than integer math. What I saw is that one compiler/library is better than the other when it comes to FP math. With the NIOS integer performance on par with the ARM (even better in some cases), the FP math is just too lopsided. It seems to me that the ARM FP library takes better advantage of its hardware than the NIOS FP library does. Could we feed this info back up to Altera? Are you suggesting custom instructions (fmul, fdiv, etc.) via an FP component?

If I did, would my code change from:

z = x * y;

k = i / j;

To:

z = fmul (x, y);

k = fdiv (i, j);

Rick

Altera_Forum · ‎01-20-2005

The easy way is to create custom instructions and use them as you have shown. There are things you can do with the compiler to associate * and / with the new multiply and divide hardware (I don't know how, because I don't mind the extra typing of mul(a, b) and div(a, b)).

But when you say that the support library for the ARM is more efficient, it is still like comparing apples to oranges. The number of clock cycles you quoted isn't all that bad (I think, anyway). I have seen processors take much longer than that for floating point math with no hardware support. The amount of floating point support you add in hardware really depends on the application. If you just need to take two numbers and spit out an answer, then only simple math is needed. If you need to do control based on floating point values, then it becomes a question of whether or not to add hardware for that as well (logic size versus speed trade-off).
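
A minimal sketch of the "extra typing" approach, in case it helps. The wrapper names `fmul` and `fdiv` follow the earlier post. The opcode indices `FP_MUL_N` and `FP_DIV_N` and the switch `USE_FP_CUSTOM_INSTR` are hypothetical placeholders (the real indices depend on how the custom instruction is added to the system). `__builtin_custom_fnff` is the Nios II GCC built-in for a custom instruction taking two floats and returning a float; without the switch, the wrappers fall back to the ordinary software operators so the same source builds anywhere.

```c
/* Hypothetical opcode indices; the real values come from your own system. */
#define FP_MUL_N 0
#define FP_DIV_N 1

#ifdef USE_FP_CUSTOM_INSTR  /* hypothetical switch: define it only when building
                               for a Nios II system with the FP hardware attached */
static inline float fmul(float a, float b)
{
    return __builtin_custom_fnff(FP_MUL_N, a, b);
}

static inline float fdiv(float a, float b)
{
    return __builtin_custom_fnff(FP_DIV_N, a, b);
}
#else
/* Fallback: plain operators, i.e. the compiler's software FP library. */
static inline float fmul(float a, float b) { return a * b; }
static inline float fdiv(float a, float b) { return a / b; }
#endif
```

With these in place, `z = fmul(x, y);` and `k = fdiv(i, j);` compile either way. As far as I know, later GCC releases for Nios II also document options of the form `-mcustom-fmuls=N` and `-mcustom-fdivs=N` that map the plain operators onto custom instructions, but check your toolchain version before relying on that.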

Altera_Forum · ‎01-20-2005

rppolicy, do you have any Nios II/f measurements, in cycles per operation, from your comparison tests that you could post here?

I am facing a big bottleneck in an application that uses a double precision multiplication. Doing some analysis with the performance counter peripheral, I see that Nios II/f needs about 1100 cycles!! per DP multiplication. Is this anywhere close to the performance you are measuring? I have tried the same options you did. I also tried a third-party software FP library, but the number of cycles remains pretty much the same. Is this number real or am I missing something here?

BTW, BadOmen, which quoted cycle count are you referring to?

Jesse: I do not think that adding a hardware FP unit to Nios II and then comparing it with an ARM using software FP and a hardware barrel shifter (which, by the way, Nios II/f also has) would make things fair. I think that this is exactly what would make things unfair. It would then be like comparing apples to oranges, to use BadOmen's expression.

Altera_Forum · ‎01-20-2005

--- Quote Start ---

originally posted by anagnost@Jan 20 2005, 05:44 PM

Jesse: I do not think that adding a hardware FP unit to Nios II and then comparing it with an ARM using software FP and a hardware barrel shifter (which, by the way, Nios II/f also has) would make things fair. I think that this is exactly what would make things unfair. It would then be like comparing apples to oranges, to use BadOmen's expression.

--- Quote End ---

I am sorry if I was ambiguous; that isn't what I meant. I was suggesting that he take a quick look at the objdumps (or, for that matter, just find the libs and compare the C code) to see how the software emulation of floating point arithmetic differs between the two processors. From the descriptions above it sounds awfully like a software library difference.

Now, as far as improving things on the Nios II side, there are those FP units available for download and test that have helped a number of people out in their designs -- the trade-off is whether you can afford to spend the additional FPGA LEs... I guess I was just trying to highlight the availability of those as potential solutions. Hope this makes what I was trying to convey less ambiguous.

Altera_Forum · ‎01-21-2005

--- Quote Start ---

BTW, BadOmen, which quoted cycle count are you referring to?

--- Quote End ---

---> oops wrong post. Someone (maybe rppolicy, or maybe it was you) did some benchmarking on the software floating point times (I was thinking of that and forgot it wasn't in this post but somewhere else in the forum).

Altera_Forum · ‎01-21-2005

I didn't use any cycle count comparisons; I used a standard digital output connected to a logic analyzer. I set the output just before the function call and reset it right after. The function runs a massive loop, so the output set/reset timing is negligible. The only difference between the code on the two processors is how I do the set/reset of the digital line. Here is the FP code that was run on each processor:

In the main funct:

    pio_data |= set_mask[2];
    IOWR_ALTERA_AVALON_PIO_DATA (USER_PIO_BASE, pio_data);

    fsum = TestFloatMult();

    pio_data &= reset_mask[2];
    IOWR_ALTERA_AVALON_PIO_DATA (USER_PIO_BASE, pio_data);

Fmult benchmark funct (simple mult/accum):

    float TestFloatMult(void)
    {
        register float *ptr1, *ptr2, sum;
        int i;

        sum = 0;
        ptr1 = &gFloatArray1[0]; // gFloatArray1 is a randomly generated float array
        ptr2 = &gFloatArray2[0]; // gFloatArray2 is a randomly generated float array

        for(i=0; i < ARRAY_SIZE; i++) // ARRAY_SIZE = 1000
        {
            sum += *ptr1++ * *ptr2++;
        }

        return sum;
    }

What I found is that the ARM executed it in 940 µs and the NIOS in 7.0 ms. I went to great lengths (mixed mode) to make sure that the ARM compiler did not optimize the crap out of the loop. Everything looked normal.

Rick
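
For anyone who wants to repeat the measurement, here is a self-contained sketch of the same multiply/accumulate loop that builds on a host as well as on a target. The helper names `fill_arrays`, `run_benchmark` and `g_sink` are assumptions added for illustration, not part of the original test. The `volatile` sink is one common way to make sure the compiler cannot discard the result and, with it, the loop. The pin toggle and timing are left out because they are platform specific.

```c
#define ARRAY_SIZE 1000

float gFloatArray1[ARRAY_SIZE];
float gFloatArray2[ARRAY_SIZE];

/* Writes to a volatile object must be performed, so the sum stays live. */
volatile float g_sink;

/* Hypothetical helper: fill both arrays with pseudo-random values in [0, 1)
   using a simple linear congruential generator. */
void fill_arrays(unsigned seed)
{
    unsigned state = seed;
    for (int i = 0; i < ARRAY_SIZE; i++) {
        state = state * 1664525u + 1013904223u;
        gFloatArray1[i] = (float)((state >> 8) & 0xFFFFFFu) / 16777216.0f;
        state = state * 1664525u + 1013904223u;
        gFloatArray2[i] = (float)((state >> 8) & 0xFFFFFFu) / 16777216.0f;
    }
}

/* Same multiply/accumulate loop as in the post above. */
float TestFloatMult(void)
{
    float *ptr1 = &gFloatArray1[0];
    float *ptr2 = &gFloatArray2[0];
    float sum = 0;

    for (int i = 0; i < ARRAY_SIZE; i++) {
        sum += *ptr1++ * *ptr2++;
    }
    return sum;
}

/* Run the loop several times; set/reset the output pin (or start/stop a
   timer) around this call on the real hardware. */
float run_benchmark(unsigned passes)
{
    float last = 0.0f;
    for (unsigned p = 0; p < passes; p++) {
        last = TestFloatMult();
        g_sink = last;   /* keep the result observable */
    }
    return last;
}
```

Checking the disassembly, as was done here in mixed mode, is still the most direct way to confirm that the multiply actually runs once per element.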

Altera_Forum · ‎01-22-2005

Thank you, rppolicy, for posting those data. It seems that we are both getting pretty much the same results. If I have done the calculations correctly, Nios II/f spends about 630 cycles in each iteration of your loop (7.0 ms at 90 MHz over 1000 iterations), which contains a float multiply-accumulate and two pointer increments. So my measured 1100 cycles per double precision multiply, considering the double vs. float overhead, is more or less in agreement with your results.

On the other hand, according to your results, the ARM needs about 85 cycles per iteration of your loop, which is at least a 7x performance advantage over Nios II/f, as you said in your first post. Whether this performance gap results only from the ARM software floating point library exploiting the H/W barrel shifter is definitely a question seeking an answer. Probably the most appropriate person to answer it is someone from Altera who knows more than we do about the implementation of Altera's software floating point library.
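
The arithmetic behind those two figures, as a small C sketch. The function name `cycles_per_iteration` is made up for illustration; the inputs are the numbers posted in this thread (90 MHz clock, 1000 loop iterations, 7.0 ms on Nios II/f and 940 µs on the ARM).

```c
#include <assert.h>

/* Convert a measured wall-clock time into CPU cycles per loop iteration. */
double cycles_per_iteration(double elapsed_s, double clock_hz,
                            unsigned iterations)
{
    assert(iterations > 0);
    return elapsed_s * clock_hz / (double)iterations;
}
```

Calling `cycles_per_iteration(7.0e-3, 90.0e6, 1000)` gives about 630 cycles for Nios II/f, and `cycles_per_iteration(940.0e-6, 90.0e6, 1000)` gives about 84.6 cycles for the ARM, a ratio of roughly 7.4. These figures include the loop overhead and the accumulate, not only the multiply.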

If exploiting a H/W barrel shifter gives such a boost in FP operations, and since Nios II/f already has a H/W barrel shifter, I think it is a pity not to use it. But then again, it could be that it is not the H/W barrel shifter that makes the difference. Perhaps it is the instruction set differences (not the best candidate, in my opinion), or the software implementation of the Altera FP library, or even a "bug". In any case, I feel that this issue definitely needs, and is worth, further investigation.

Jesse: I fully agree with you that it sounds awfully like a software difference. Your alternative suggestion and analysis are now clear to me.

BadOmen: I think it was me. I had started another thread before this one about FP performance, and you had posted there, giving me the information about the available H/W FP IPs. Thank you once more.