Re: BeMicro SDK - Nios II/f - Cyclone IV - Performance - dhrystone

Altera_Forum · ‎10-27-2011

Hi All,

We are using the BeMicro SDK board (no link provided, as I need more than five posts).

I'm trying to measure the Dhrystone performance of this setup, i.e. to determine whether this core has enough performance for our application.

The BeMicro SDK was running precompiled uClinux using the example from the Altera wiki (no link provided, as I need more than five posts).

Summary as follows:

Nios II/f clocked at 100 MHz

Instruction cache: 4 Kbytes

Data cache: 2 Kbytes

uClinux running from 64 Mbytes of DDR RAM

(the application will also be running from DDR RAM)

uClinux included the Dhrystone benchmark, with results as follows:

Trying 500000 runs through Dhrystone

Microseconds for one run through Dhrystone: 45.5

Dhrystones per Second: 21997.4

Converting to DMIPS (dividing by 1757) gives 12.5, which seems very low.

The hardware guys tried to reproduce the reading as produced by Altera (no link provided, as I need more than five posts):

Microseconds for one run through Dhrystone: 26.8

Dhrystones per Second: 37379.7

DMIPS rating = 21.275
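For anyone checking the arithmetic, here is a minimal sketch in C of the conversion used for both readings. It assumes the usual reference of 1757 Dhrystones per second for the VAX 11/780 (defined as 1 DMIPS); the function names are made up for illustration.

```c
#include <assert.h>

/* Dhrystones/s scored by the VAX 11/780, the machine defined as 1 DMIPS. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystones-per-second score to DMIPS. */
double dmips_from_dhrystones(double dhrystones_per_sec)
{
    assert(dhrystones_per_sec >= 0.0);
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Normalise by clock frequency so cores at different speeds can be compared. */
double dmips_per_mhz(double dmips, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return dmips / clock_mhz;
}
```

With the figures in this post, 21997.4 Dhrystones/s comes out at about 12.5 DMIPS and 37379.7 at about 21.27 DMIPS, i.e. roughly 0.125 and 0.213 DMIPS/MHz at 100 MHz.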

Increasing the instruction / data cache to 8K improved the result (however, no SRAM blocks were left):

DMIPS rating = 34

However, we need a DMIPS score of at least 100 in the Cyclone range....

Can this be achieved?

We can increase the processor speed to 150 MHz, but that's not enough.
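A rough sketch of why 150 MHz falls short, assuming (optimistically) that DMIPS scales linearly with the clock. In practice the DDR latency does not shrink with the CPU clock, so the real gain is likely to be smaller. The function names are hypothetical.

```c
#include <assert.h>

/* Best case: DMIPS scales linearly with clock (ignores memory latency). */
double scaled_dmips(double dmips, double from_mhz, double to_mhz)
{
    assert(from_mhz > 0.0 && to_mhz > 0.0);
    return dmips * to_mhz / from_mhz;
}

/* Clock needed to reach a target DMIPS under the same linear assumption. */
double required_mhz(double target_dmips, double dmips, double at_mhz)
{
    assert(dmips > 0.0 && at_mhz > 0.0);
    return target_dmips * at_mhz / dmips;
}
```

Starting from the 34 DMIPS measured at 100 MHz with 8K caches, scaled_dmips(34, 100, 150) gives 51 DMIPS, and required_mhz(100, 34, 100) gives about 294 MHz.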

Thanks

Altera_Forum · ‎10-27-2011

I'm not sure exactly what the Dhrystone benchmark does...

Given the effect of cache sizes, you really need a test that is much the same as the code you need to run - not some random benchmark.

For maximum performance you may need to look carefully at the generated code:

1) Compile with -O2 or -O3.

2) Arrange that all memory accesses are either relative to %gp, or done relative to a global register variable (slightly better than %gp).

3) Avoid instruction stalls following memory reads.

4) Avoid mispredicted branches; arrange to use the 'branch not taken' path if at all possible.

5) Avoid the compiler doing register spills to stack.

6) Consider using custom instructions for some operations.

Some of the above are probably rather difficult if you are running any form of operating system!
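As a rough illustration of point 4, the sketch below uses the GCC/Clang `__builtin_expect` extension to mark a branch as rarely taken, so the compiler can try to keep the common case on the fall-through path. The function and the `UNLIKELY` macro are hypothetical examples, and whether the generated code really ends up with the hot path as 'branch not taken' depends on the compiler version and options, so check the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: hints which way a branch usually goes. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: count payload bytes, skipping a rare escape value. */
size_t count_payload(const uint8_t *buf, size_t len, uint8_t escape)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (UNLIKELY(buf[i] == escape))
            continue;   /* rare case, hinted as not expected */
        n++;            /* common case */
    }
    return n;
}
```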

I removed almost all the spare instructions, unnecessary memory accesses and pipeline stalls from some code that does HDLC in software, and got that down to 149 clocks (max) per byte for rx and tx.

Altera_Forum · ‎11-01-2011

Thanks dsl

We will be using an operating system. We can't use Linux, so we will probably use uC/OS-II.

For our application the Ethernet input could be up to 80% loaded, and we need to store some of this data (UDP / TCP plus a commercial-grade web server). The application has a lot of other things to do as well.

"i'm not sure exactly what the dhrystone benchmark does...

given the effect of cache sizes, you really need a test that is much the same as the code you need to run - not some random benchmark."

Agreed, however no code exists yet. This is just to get a warm feeling that the processor has enough power (anything with no caches will be too slow).

The application will be fairly big and multitasking (it can't fit into the internal SRAM blocks), so it is likely to run from DDR RAM. The caches are there to overcome the slow performance of DDR RAM.

Have not got that warm feeling yet; will attempt some performance tests using uC/OS-II and the InterNiche stack.

Altera_Forum · ‎11-01-2011

If you are doing a lot of TCP/IP then the checksum routine needs some TLC.

The Nios CPU doesn't have a 32-bit 'add with carry' instruction, so the checksum loop is horrid. A custom instruction to add the two 16-bit halves of one word onto another helps no end. The byteswap one will help elsewhere (if you can persuade the compiler to use it properly).
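To make the idea concrete, here is a minimal software model in C of what such a custom instruction could compute, and how a checksum loop would use it. `add_halves` and `csum_words` are hypothetical names; on real hardware the body of `add_halves` would be replaced by whatever custom-instruction macro or builtin your system generates. This is a sketch, not tuned code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a hypothetical custom instruction:
 * acc + upper 16 bits of word + lower 16 bits of word. */
uint32_t add_halves(uint32_t acc, uint32_t word)
{
    return acc + (word >> 16) + (word & 0xffffu);
}

/* Ones' complement sum over n 32-bit words, folded to 16 bits (not inverted).
 * Each step adds at most 0x1fffe, so the 32-bit accumulator cannot
 * overflow while n stays below 32768 words. */
uint16_t csum_words(const uint32_t *p, size_t n)
{
    uint32_t acc = 0;
    assert(n < 32768u);
    for (size_t i = 0; i < n; i++)
        acc = add_halves(acc, p[i]);
    while (acc >> 16)                       /* fold carries back in */
        acc = (acc >> 16) + (acc & 0xffffu);
    return (uint16_t)acc;
}
```

Because the carries accumulate in the upper half of the 32-bit register, no 'add with carry' is needed inside the loop; they are folded in once at the end.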

But I'm not at all sure any 100MHz cpu can 80% load a 100M ethernet with non-trivial traffic.
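A back-of-envelope version of that doubt, as a small C sketch. It ignores Ethernet framing overhead (preamble, inter-frame gap) and assumes the load figure applies to raw link bandwidth; the function name is made up for illustration.

```c
#include <assert.h>

/* CPU clock cycles available per received byte at a given link load. */
double cycles_per_byte(double cpu_hz, double link_bits_per_sec, double load)
{
    assert(cpu_hz > 0.0 && link_bits_per_sec > 0.0);
    assert(load > 0.0 && load <= 1.0);
    double bytes_per_sec = link_bits_per_sec * load / 8.0;
    return cpu_hz / bytes_per_sec;
}
```

For a 100 MHz CPU and a 100 Mbit/s link at 80% load this gives about 10 clocks per byte for everything: driver, stack, checksum and application. Compare that with the 149 clocks per byte for the HDLC code mentioned earlier.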

Altera_Forum · ‎11-02-2011

Thanks for the reply dsl, some useful info.

From the original post, the Dhrystone measurement was poor. I believe this is down to how the memory is set up on this board: the mobile DDR RAM is 16 bits wide and the software is running from DDR RAM. If it were 32 bits wide, the performance would be much improved.

Altera_Forum · ‎11-02-2011

The change in the Dhrystone figures when the cache sizes were increased certainly shows that the working set of the test was significantly larger than the cache size in the first test (and possibly still with 8K caches), so the memory access times dominate.

Changing the memory width to 32bit should reduce the access times. It is also worth checking there are no clock crossing bridges (I doubt running the DDR faster than the nios will gain more than the cost of these bridges).