Hi, I am trying to evaluate Nios II/f on a dev board fitted with an EP4CE55F23C7. I've ported the Cyclone III Dhrystone example from Altera (Fast Nios II Hardware Design Example) to my board. Design specs are:
* Nios II/f, 4 Kbytes i-cache, 2 Kbytes d-cache
* System clock: 140 MHz
* JTAG debug module: Yes
* On-chip RAM: 64 Kbytes
* JTAG UART: 1
* Timer: 1
I've kept the settings, clock, build flags etc. the same as in the original example. When running the Dhrystone example I get:
...
Microseconds for one run through Dhrystone: 6.5
Dhrystones per Second: 153506.9
VAX MIPS rating = 87.369
This is way less than what the attached readme.txt states:
...This system achieves over 150 DMIPS on the Cyclone III FPGA Development Kit, and can achieve over 190 DMIPS when targeting the fastest speed grade of a Cyclone III device...
According to ds_nios2_perf.pdf, the MIPS/MHz ratio of Cyclone IV GX is the same as that of Cyclone III LS and a bit higher than that of Cyclone III. Even allowing that MIPS != DMIPS, I can't imagine such a difference in DMIPS/MHz ratio between Cyclone IV and III. Or am I missing some difference in Nios II between Cyclone IV GX and E? Can somebody shed some light on this? Thanks for your help.
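For readers checking the figures in the post above, here is a minimal sketch of the arithmetic. It assumes the usual Dhrystone convention that 1 DMIPS corresponds to 1757 Dhrystones per second (the VAX 11/780 reference score), which does reproduce the "VAX MIPS rating" printed by the benchmark. The function names are illustrative only and are not part of the Altera example.

```python
# Reference score of the VAX 11/780, conventionally defined as 1 DMIPS.
VAX_11_780_DHRYSTONES_PER_SEC = 1757.0


def dmips(dhrystones_per_sec):
    """Convert a raw Dhrystones/second score to DMIPS (VAX MIPS)."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


def dmips_per_mhz(dhrystones_per_sec, clock_mhz):
    """Normalise the DMIPS score by the system clock in MHz."""
    return dmips(dhrystones_per_sec) / clock_mhz


# Values quoted in the post: measured score and system clock.
measured = 153506.9
clock_mhz = 140.0

print(f"measured:       {dmips(measured):.3f} DMIPS")
print(f"measured ratio: {dmips_per_mhz(measured, clock_mhz):.3f} DMIPS/MHz")
# Ratio the readme's "over 150 DMIPS" would imply at the same clock.
print(f"readme ratio:   {150.0 / clock_mhz:.2f} DMIPS/MHz or better")
```

On these numbers the measured system sits at roughly 0.62 DMIPS/MHz, while the readme figure implies a bit over 1.0 DMIPS/MHz at 140 MHz.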
I'd guess at either the code optimisation level or cache thrashing. The Altera figure might be for tightly coupled memory (no cache line reads at all). The other possibility is the type of multiplier used - Altera won't let you throw large numbers of logic gates into a fast multiplier!
The SOPC system layout is exactly the same as in their (Altera's) Cyclone III design example, which (they say) achieves over 150 DMIPS clocked at 140 MHz. The Dhrystone app has a set of build scripts carefully specifying how to (or rather how not to) optimize the code. The HW multiplier is built with embedded multipliers, and all code runs from internal RAM (dual port - one port to the instruction master and the other to the data master). I've also tested running the same code from SDRAM - 41 DMIPS at 100 MHz. I get the same result on another design on which I run uClinux with the included Dhrystone (different compiler, different program - but the same Dhrystone 2.1 codebase). There is an old post about Nios II/f Dhrystone on a Cyclone EP1C20F400C7. The author got 35 DMIPS at 50 MHz, and Cyclone I doesn't even have embedded multipliers. I get 31 DMIPS at 50 MHz...
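To compare results taken at different clock rates, it can help to normalise each one to DMIPS/MHz. The sketch below does that for the figures quoted so far in the thread; the labels are my own shorthand, and the 87.369 DMIPS entry is the on-chip RAM result from the first post.

```python
# (label, DMIPS, clock in MHz) as quoted in the posts above.
# The labels are illustrative shorthand, not names used by the posters.
RESULTS = [
    ("on-chip RAM, 140 MHz", 87.369, 140.0),
    ("code in SDRAM, 100 MHz", 41.0, 100.0),
    ("old EP1C20F400C7 post, 50 MHz", 35.0, 50.0),
    ("poster's result at 50 MHz", 31.0, 50.0),
]


def per_mhz(results):
    """Return (label, DMIPS/MHz) pairs for a list of (label, DMIPS, MHz)."""
    return [(label, dmips / mhz) for label, dmips, mhz in results]


for label, ratio in per_mhz(RESULTS):
    print(f"{label:32s} {ratio:.2f} DMIPS/MHz")
```

Normalised this way, the poster's on-chip RAM results at 140 MHz and at 50 MHz both come out at about 0.62 DMIPS/MHz, against 0.70 for the older Cyclone I post.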
You might have a clock crossing adapter in the memory path. Make sure the Nios II core and the memory are in the same clock domain; otherwise each memory read will take around 9 clock cycles, which will have a huge impact on the Dhrystone benchmark. Also make sure the timer resolution is the same as in the original design: if it's set too fine, the CPU will get blasted with interrupts and take longer to complete the benchmark.
Thanks for the info. The problem seems to be in the compiler - the Dhrystone test compiled with gcc 4.1.2 yields 0.62 DMIPS/MHz, and with gcc 3.4.6 it yields 1.04 DMIPS/MHz. It doesn't matter which version was used to compile the BSP, only which one compiled Dhrystone itself. I've used the same makefile / compiler flags, but changing e.g. the optimization level doesn't make much difference with 4.1.2. I didn't investigate this any further, but obviously it's something to keep in mind.
I did try compiling my code with gcc4; my static analysis showed the code to be both larger and slower than that generated by gcc3. I didn't look to see why. (That code is rather carefully crafted to avoid anything on-stack.) That would have been after I'd applied my changes to the gcc4 code generator and reversed some of Altera's changes. I'm also surprised that changing the optimisation level had little/no effect. The difference between -O0 and -O2 should be massive.
gcc 4.1.2: -O0 gives 0.37 DMIPS/MHz, -O1 0.59, -O2 0.62, -O3 0.664 and -O6 0.665
gcc 3.4.6: -O0 0.41, -O1 0.83, -O2 0.91, -O3/6 1.04
gcc 4.1.2 from nios2-linux-20100621.tar produces results similar to those of the 4.1.2 version provided with Quartus II v11.
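The figures above can be put side by side with a short script. This is only a sketch: it assumes the 140 MHz system clock from the first post for the projected DMIPS column, and it leaves out the -O6 rows because they are within rounding of -O3 for both compilers.

```python
# System clock from the first post; used only to project absolute DMIPS.
CLOCK_MHZ = 140.0

# DMIPS/MHz per optimisation level, as reported in the post above.
DMIPS_PER_MHZ = {
    "gcc 4.1.2": {"-O0": 0.37, "-O1": 0.59, "-O2": 0.62, "-O3": 0.664},
    "gcc 3.4.6": {"-O0": 0.41, "-O1": 0.83, "-O2": 0.91, "-O3": 1.04},
}


def projected_dmips(table, clock_mhz):
    """Scale every DMIPS/MHz entry by the clock to get absolute DMIPS."""
    return {
        compiler: {level: ratio * clock_mhz for level, ratio in levels.items()}
        for compiler, levels in table.items()
    }


def gcc3_over_gcc4(table):
    """Ratio of the gcc 3.4.6 score to the gcc 4.1.2 score at each level."""
    old, new = table["gcc 3.4.6"], table["gcc 4.1.2"]
    return {level: old[level] / new[level] for level in new}


projection = projected_dmips(DMIPS_PER_MHZ, CLOCK_MHZ)
advantage = gcc3_over_gcc4(DMIPS_PER_MHZ)

print("level  gcc4 DMIPS  gcc3 DMIPS  gcc3/gcc4")
for level in ("-O0", "-O1", "-O2", "-O3"):
    print(
        f"{level:5s}  {projection['gcc 4.1.2'][level]:10.1f}  "
        f"{projection['gcc 3.4.6'][level]:10.1f}  {advantage[level]:9.2f}"
    )
```

At -O3 the gcc 3.4.6 figure projects to about 145.6 DMIPS at 140 MHz, close to (but still under) the readme's "over 150 DMIPS", while gcc 4.1.2 stays below 100 DMIPS at every level.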
Try the following changes to the gcc sources; the 2nd may be relevant. gcc/config/nios2/nios2.h may need some changes undone/fixed:
1) Change the value of JUMP_TABLES_IN_TEXT_SECTION from "1" to "flag_pic".
2) Remove "!SYMBOL_REF_EXTERNAL_P(RTX) &&" from the definition of SYMBOL_REF_IN_NIOS2_SMALL_DATA_P(RTX).
From http://www.alterawiki.com/wiki/crossgcc see also http://www.alterawiki.com/wiki/gcc_patches which may improve the code for both versions of gcc.
The application makefile created by the Nios II software build tools should contain APP_CFLAGS_OPTIMIZATION; set it to -O2 or -O3..6. Changing the BSP optimization flags can be done by passing --set hal.make.bsp_cflags_optimization -O2 to nios2-bsp. You should check the Nios II Software Build Tools documentation for more. On a side note, you might want to check the newly released Sourcery gcc 4.7.3. I've found it generates much better (faster) code in some cases than the old gcc from Altera. But beware: it has some serious issues with custom floating point instructions.