
Nios II/f performance on Cyclone IV E

Honored Contributor I



I am trying to evaluate Nios II/f on a development board fitted with an EP4CE55F23C7. I've ported the Cyclone III Dhrystone example from Altera (Fast Nios II Hardware Design Example) to my board. The design specs are:


* Nios II/f, 4 Kbytes i-cache, 2 Kbytes d-cache
* System clock: 140 MHz
* JTAG debug module: Yes
* On-chip RAM: 64 Kbytes
* Timer: 1


I've kept the settings, clock, build flags, etc. the same as in the original example. When running the Dhrystone example I get:



Microseconds for one run through Dhrystone: 6.5
Dhrystones per Second: 153506.9
VAX MIPS rating = 87.369


This is way less than what the attached readme.txt states:

...This system achieves over 150 DMIPS on the Cyclone III FPGA Development Kit, and can achieve over 190 DMIPS when targeting the fastest speed grade of a CycloneIII device...
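
To put the measured figures and the readme claim on the same scale, here is a minimal sketch of the arithmetic. It relies on the standard Dhrystone convention that 1 DMIPS equals 1757 Dhrystones per second (the VAX 11/780 reference score), and it assumes the readme's "over 150 DMIPS" refers to the same 140 MHz clock. The function and variable names are illustrative only.

```python
# Reference score: a VAX 11/780 runs 1757 Dhrystones per second (= 1 DMIPS).
VAX_DHRYSTONES_PER_SECOND = 1757.0


def dmips(dhrystones_per_second):
    """Convert a raw Dhrystones/s score into a DMIPS (VAX MIPS) rating."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dhrystones_per_second, clock_mhz):
    """Normalize the DMIPS rating by the CPU clock frequency in MHz."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return dmips(dhrystones_per_second) / clock_mhz


measured = 153506.9  # Dhrystones per second from the benchmark output above
clock_mhz = 140.0    # system clock from the design specs above

rating = dmips(measured)
ratio = dmips_per_mhz(measured, clock_mhz)
# Assumption: the readme's "over 150 DMIPS" was achieved at 140 MHz.
readme_ratio = 150.0 / clock_mhz

print(f"measured: {rating:.3f} DMIPS, {ratio:.3f} DMIPS/MHz")
print(f"readme floor: {readme_ratio:.3f} DMIPS/MHz")
```

Under these assumptions the board runs at about 0.62 DMIPS/MHz, against roughly 1.07 DMIPS/MHz implied by the readme.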


According to ds_nios2_perf.pdf, the MIPS/MHz ratio of Cyclone IV GX is the same as that of Cyclone III LS and a bit higher than that of Cyclone III. Even allowing for MIPS != DMIPS, I can't imagine such a difference in DMIPS/MHz ratio between Cyclone IV and III. Or am I missing some difference in Nios II between Cyclone IV GX and E?


Can somebody shed some light on this? 


Thanks for your help.
11 Replies
Honored Contributor I

I'd guess at either code optimisation level or cache thrashing. 

The Altera figure might be for tightly coupled memory (no cache line reads at all). 

The other possibility is the type of multiplier used - Altera won't let you throw large numbers of logic gates into a fast multiplier!
Honored Contributor I

The SOPC system layout is exactly the same as in their (Altera's) Cyclone III design example, which (they say) achieves over 150 DMIPS clocked at 140 MHz. The Dhrystone app has a set of build scripts carefully specifying how to (or rather not to) optimize the code. The HW multiplier is built with embedded multipliers, and all code runs from internal RAM (dual port: one port to the instruction master and the other to the data master).


I've also tested running the same code from SDRAM: 41 DMIPS at 100 MHz. I get the same result on another design on which I run uClinux with the included dhrystone (different compiler, different program, but the same Dhrystone 2.1).



There is an old post about Nios II/f Dhrystone on a Cyclone EP1C20F400C7. The author got 35 DMIPS at 50 MHz, and Cyclone I doesn't even have embedded multipliers. I get 31 DMIPS at 50 MHz...
Honored Contributor I

You might have a clock crossing adapter in the memory path. Make sure the Nios II core and the memory are in the same clock domain; otherwise each memory read will take around 9 clock cycles, which will have a huge impact on the Dhrystone algorithm. Also make sure the timer resolution is the same as in the original design. If it's set too fine, the CPU will get blasted with interrupts and take longer to complete the benchmark.

Honored Contributor I

Thanks for the info. The problem seems to be in the compiler: the Dhrystone test compiled with gcc 4.1.2 reports 0.62 DMIPS/MHz, and with gcc 3.4.6 it reports 1.04 DMIPS/MHz. It doesn't matter which version was used to compile the BSP, just the Dhrystone code itself. I've used the same makefile and compiler flags, but changing e.g. the optimization level doesn't make much difference with 4.1.2. I didn't investigate this any further, but obviously it's something to keep in mind.
Honored Contributor I

I did try compiling my code with gcc4; my static analysis showed the code to be both larger and slower than that generated by gcc3. I didn't look to see why.

(That code is rather carefully crafted to avoid anything on-stack.) 

That would have been after I'd applied my changes to the gcc4 code generator and reversed some of Altera's changes. 


I'm also surprised that changing the optimisation level had little/no effect. The difference between -O0 and -O2 should be massive.
Honored Contributor I

Results in DMIPS/MHz:

gcc 4.1.2: -O0 0.37, -O1 0.59, -O2 0.62, -O3 0.664, -O6 0.665
gcc 3.4.6: -O0 0.41, -O1 0.83, -O2 0.91, -O3/-O6 1.04


gcc 4.1.2 from nios2-linux-20100621.tar produces results similar to the 4.1.2 version provided with Quartus II v11.
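
To make the gap between the two compilers easier to read, the following sketch computes the ratio of the gcc 3.4.6 score to the gcc 4.1.2 score at each optimization level. The DMIPS/MHz values are copied from the measurements in this post; the -O6 entries are left out because they match -O3 to within measurement noise. The dictionary and function names are illustrative.

```python
# DMIPS/MHz per optimization level, copied from the measurements in this post.
GCC_4_1_2 = {"-O0": 0.37, "-O1": 0.59, "-O2": 0.62, "-O3": 0.664}
GCC_3_4_6 = {"-O0": 0.41, "-O1": 0.83, "-O2": 0.91, "-O3": 1.04}


def gcc3_advantage(gcc3, gcc4):
    """Return {level: gcc3 score / gcc4 score} for levels present in both."""
    return {level: gcc3[level] / gcc4[level] for level in gcc3 if level in gcc4}


advantage = gcc3_advantage(GCC_3_4_6, GCC_4_1_2)
for level, ratio in advantage.items():
    print(f"{level}: gcc 3.4.6 scores {ratio:.2f}x gcc 4.1.2")
```

With these figures the two compilers are within about 1.1x of each other at -O0, and the gap widens to roughly 1.4x to 1.6x once optimization is enabled.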
Honored Contributor I

Try the following changes to the gcc sources; the second may be relevant.


gcc/config/nios2/nios2.h may need some changes undone/fixed:  


1) Change the value of JUMP_TABLES_IN_TEXT_SECTION from "1" to "flag_pic".  


2) Remove "!SYMBOL_REF_EXTERNAL_P(RTX) &&" from the definition of SYMBOL_REF_IN_NIOS2_SMALL_DATA_P(RTX).  


From see also which may improve the code for both versions of gcc.
Honored Contributor I

Thanks, I'll try to find time to test this. For now I'll stick with gcc3.

Honored Contributor I

Dear Izidor,

How do I compile a C program for Nios II with -O2? Could you show me some brief steps?
Honored Contributor I

The application makefile created by the Nios II software build tools should contain APP_CFLAGS_OPTIMIZATION; set it to -O2 or -O3..6. BSP optimization flags can be changed by passing --set hal.make.bsp_cflags_optimization -O2 to nios2-bsp. You should check the Nios II Software Build Tools documentation for more.


On a side note, you might want to check the newly released Sourcery gcc 4.7.3. I've found it generates much better (faster) code in some cases than the old gcc from Altera. But beware: it has some serious issues with custom floating point instructions.
Honored Contributor I

Thanks a lot Izidor,  

I'll try to find the doc you mentioned.