Porting PIO C-code from ARM-7 to Nios II (DE1-board) + performance issue.

Altera_Forum · ‎10-07-2010

Hi all,

I am currently trying to port my C code from an ARM-7 MCU (LPC-P2148) to Nios II on the DE1 board (Cyclone II, EP2C20F484C7). I used the arm-none-eabi-gcc compiler for the ARM-7 MCU and the Nios II Eclipse platform, version 9.1, for the Nios II/e CPU. (I chose the /e CPU for cost and licence reasons.) More details can be found on my homepage.

A) Performance issue:

ARM-7: 12 MHz clock, running at 48 MHz via PLL.

Nios II/e: 50 MHz clock, running at 50 MHz.

Code used (shifting LED demo):

#include <stdio.h>
#include <system.h>
#include "altera_avalon_pio_regs.h"

void startup_leds(void);
void delay(void);

/* Run a single lit LED up and down the PIO outputs, four times. */
void startup_leds(void)
{
    short is;
    short count;
    int laufled = 0x00000001;

    for (count = 0; count < 4; count++) {
        for (is = 0; is < 16; is++) {
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_0_BASE, laufled);
            // IOPIN1 = laufled<<16; // was ARM code
            delay(); // wait
            laufled = laufled << 1;
        }
        for (is = 0; is < 16; is++) {
            // IOPIN1 = laufled<<16; // was ARM code
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_0_BASE, laufled);
            delay(); // wait
            laufled = laufled >> 1;
        }
    }
}

/* Busy-wait delay. */
void delay(void)
{
    short int wait = 6000; // ARM-7: 50000 for the same speed
    while (wait) {
        wait = wait - 1;
    }
}

int main(void)
{
    printf(" Hello from Nios II \n\r");
    startup_leds();
    while (1) {
        startup_leds();
    }
    return 0;
}

Result: the ARM-7 runs about 7 times faster than the Nios II/e!

Is the Nios II/e CPU itself the reason?

Maybe something is configured wrongly in SOPC Builder?

Is my C code correct?

IOPIN1 = laufled<<16; is the ARM-7 code, with an additional left shift by 16.

It was converted to: IOWR_ALTERA_AVALON_PIO_DATA(PIO_0_BASE, laufled);

B) What is the recommended way to read data from a PIO, or to test the state of a single bit? For example, I want to check the status of bit 16. The ARM-7 C statement is

if (!( IOPIN0 & 0x00010000 )) ....

What is the equivalent Nios II statement?

I was playing around with IORD_ALTERA_AVALON_PIO_DATA(base), but without any success yet.

Is it a problem with SOPC Builder and the optional PIO settings?

Any answer is welcome. Regards, Reinhard

Altera_Forum · ‎10-07-2010

Check the compiler options - especially for -O2 or -O3.

And check that your delay loop doesn't get optimised away!

An asm volatile("\n":::"memory") inside the loop should persuade gcc to generate all the code.

Oh - and, in general, don't use 'short int' unless you are trying to pack data into a structure or require the compiler to generate code that explicitly wraps on the small boundary.
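To make the two suggestions above concrete, here is a minimal sketch of a delay loop that uses a plain unsigned int counter and an empty asm statement with a "memory" clobber, so that gcc keeps the loop at -O2 or -O3. It assumes gcc (the asm syntax is a gcc extension). delay_loops is a hypothetical name, and the return value exists only so the loop can be checked on a host PC.

```c
/* Busy-wait delay that gcc should not delete at -O2/-O3.
 * delay_loops() is a hypothetical helper, not part of the Altera HAL. */
unsigned int delay_loops(unsigned int loops)
{
    unsigned int done = 0;

    while (loops != 0) {
        /* Empty asm with a "memory" clobber: a compiler barrier that
         * emits no instruction but has to stay in every iteration. */
        __asm__ volatile("" ::: "memory");
        loops--;
        done++;
    }
    return done;  /* iterations executed, only used for checking */
}
```

On the board you would call delay_loops(6000) and ignore the result. The loop count will need retuning, because the extra counter and the optimisation level both change the time per iteration.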

Altera_Forum · ‎10-07-2010

Nios II/e has a MIPS/MHz ratio from 0.105 to 0.15, depending on the FPGA device family. An ARM7 core has a MIPS/MHz ratio of nearly 1, which is about 7 times the performance of a Nios II/e. Nios II/e does not have a pipeline, which makes instruction execution on this core slow.
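As a rough cross-check of these figures, the sketch below multiplies the clock rate by the MIPS/MHz ratio. It assumes the ratios quoted in this thread (1.0 for the ARM7 as an approximation of "nearly 1", 0.105 to 0.15 for the Nios II/e) and the clock rates from the first post. mips and speedup are hypothetical helper names.

```c
/* Rough throughput model: MIPS = clock in MHz * MIPS-per-MHz ratio. */
double mips(double clock_mhz, double mips_per_mhz)
{
    return clock_mhz * mips_per_mhz;
}

/* How many times faster CPU A is than CPU B under that model. */
double speedup(double mips_a, double mips_b)
{
    return mips_a / mips_b;
}
```

With the thread's numbers, mips(48.0, 1.0) is 48 and mips(50.0, 0.15) is 7.5, a ratio of 6.4. At 0.105 the Nios II/e drops to 5.25 MIPS and the ratio is about 9.1. The delay constants in the first post (50000 on the ARM-7 versus 6000 on the Nios II/e, tuned for the same visible speed according to the code comment) give a ratio of about 8.3, which falls inside that range.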

Altera_Forum · ‎10-07-2010

Many thanks for the information, although the answer is not good news for me, because I need at least the performance of an ARM-7. Do you think a Nios II/s or Nios II/f CPU can perform like an ARM-7 CPU? Also, as far as I know, these Nios II CPUs are only usable with a special licence? Too bad for me; it looks like I will have to redesign my project completely.

Concerning question B): this "problem" is fixed. In SOPC Builder I configured a PIO with an extra input and output channel. The ported statement:

if (!( IORD_ALTERA_AVALON_PIO_DATA(PIO_0_BASE) & 0x00010000 ))

is now working fine (only too slow :confused: )
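For anyone landing here with the same question, this is a minimal sketch of the bit test wrapped in a function. On the target, IORD_ALTERA_AVALON_PIO_DATA and PIO_0_BASE come from altera_avalon_pio_regs.h and system.h, which must be included before this code. The #ifndef fallback, fake_pio_data, BIT16_MASK and pio_bit16_is_low are hypothetical stand-ins, added only so the sketch also compiles on a host PC without the BSP.

```c
#include <stdint.h>

/* Host-side fallback (an assumption for illustration): used only when the
 * Altera headers have not been included and the real macro is missing. */
#ifndef IORD_ALTERA_AVALON_PIO_DATA
static uint32_t fake_pio_data;   /* simulated PIO data register */
#define PIO_0_BASE 0
#define IORD_ALTERA_AVALON_PIO_DATA(base) (fake_pio_data)
#endif

#define BIT16_MASK 0x00010000u   /* same mask as the ARM-7 statement */

/* Returns 1 when bit 16 of the PIO data register is low, otherwise 0. */
int pio_bit16_is_low(void)
{
    return (IORD_ALTERA_AVALON_PIO_DATA(PIO_0_BASE) & BIT16_MASK) == 0;
}
```

The PIO has to be at least 17 bits wide and readable for this to see bit 16 at all, which is presumably what the changed SOPC Builder setting took care of.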

Many thanks and regards.

Altera_Forum · ‎10-08-2010

The other Nios cores are definitely faster, but I don't think they reach a 1 MIPS/MHz ratio though.

You can evaluate them without paying, the only thing is that you won't be able to write the design to flash, and will need to keep the USB connection on while you run your application.

Altera_Forum · ‎10-08-2010

If you are careful, and are executing code from tightly coupled memory, it is possible to avoid almost all pipeline stalls. This will give you one instruction per clock.

You will need to inspect the generated code though, and modify the source in places (mainly to avoid register spills, stalls due to 'late result' on memory reads and mispredicted branches).

Altera_Forum · ‎10-08-2010

Why not add a cache between the Nios II/e and the external memory?

Reinhard, your Nios II/e runs at 50 MHz, as does your SDRAM.

Placing a cache between the Nios II/e and the SDRAM could speed up execution.

Such a cache (with a Wishbone interface) is used by the Zet project and is available as Verilog source. Maybe it is time to write a wrapper around it, so that it acts as a slave to the Nios II and as a master to the SDRAM controller IP ...

The Verilog source can be parameterized in depth and size.

Altera_Forum · ‎10-08-2010

IMHO a cache does not significantly improve the Nios II/e performance. I think 0.15 MIPS/MHz is the best case achievable. A slow memory or memory interface worsens this value. The Nios II/e does not have a pipeline, so instructions are executed one after another. An instruction has several stages, so the execution of one instruction takes several clock cycles even if the attached memory can be accessed in one clock cycle. A fast memory or cache can reduce the length of the code fetch stage or the write back stage to a minimum, but it will still take several clock cycles before the next instruction is fetched.

Altera_Forum · ‎10-08-2010

Correct.

But the cache could help: 512 bytes is one M4K memory block, so a cache of a few KB can help.

Also, the Nios, the cache and the SDRAM could run at a higher clock rate, while the timer and other peripherals are placed in a slower clock domain.

But what helps most is to move the timing-critical parts into pure hardware.

Altera_Forum · ‎10-08-2010

--- Quote Start ---

IMHO a cache does not significantly improve the NIOS II/e performance.

--- Quote End ---

Run a real program with instruction/data caches on and off and see for yourself that it makes a substantial difference. I tried to turn off caches (code and data) in 3 combinations to free up onchip RAM and the resulting speed was unusable for us.

Bill A

Altera_Forum · ‎10-08-2010

Disabling the cache is only possible for the Nios II/s or Nios II/f.

Disabling the cache on one of them doesn't turn it into the equivalent of a Nios II/e.

Instruction execution is still faster than on a Nios II/e.

Altera_Forum · ‎10-08-2010

Ahhh, I didn't know that - thanks. We are using the "/f" NIOS II.

Bill A

Altera_Forum · ‎10-08-2010

I talked about the NIOSII/e core. Using a cache for the /s and /f core can improve the performance significantly.

Altera_Forum · ‎10-08-2010

Since you want maximum performance, why use /e?

Bill A

Altera_Forum · ‎10-10-2010

Hi all,

this was a very interesting discussion for me, with a lot of useful information. Thank you very much.

I am still surprised by the big performance difference, though. I had expected to be able to port the ARM-7 data transfer C code directly to Nios II (in my case, a transfer rate of about 250 kHz at 16 bits).

Now I will redesign my project to run most of the design in pure hardware.

I am currently trying the SOPC DMA Controller core.

Regards, Reinhard

Altera_Forum · ‎10-11-2010

I'm doing a lot of 64k HDLC channels entirely in software (including bit-stuffing and CRC). The problem I have at the moment is getting the data onto and off the FPGA. For performance (and minimising the worst case):

- Use the /f core.

- Put code and data into tightly coupled memory.

- Disable the data cache.

- Disable the instruction cache unless you need it for download.

- Don't call anything in libc, and don't use the Altera supplied startup code. (you only need to set sp, gp and et).

- Use gcc's __builtin_expect() to control which branches are taken.

- Adjust the C source to avoid stalls following memory reads.

- Ask your Altera FAE how to disable the dynamic branch predictor.

Then, apart from the Avalon MM transfers to your target device, you'll have consistent instruction timings that match those documented.
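As an illustration of the __builtin_expect() item in the list above, here is a minimal sketch that marks a rare branch in a bit-stuffing style loop. It assumes gcc. The likely/unlikely wrappers are a common convention rather than something gcc defines, and count_stuff_bits is a hypothetical example, not code from the HDLC implementation described in this post.

```c
#include <stdint.h>

/* Conventional wrappers around the gcc builtin (names are an assumption). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Counts how many stuffing zeros HDLC would insert into the low `nbits`
 * bits of `data` (LSB first, nbits <= 32): one after every run of five
 * consecutive 1 bits. */
unsigned int count_stuff_bits(uint32_t data, unsigned int nbits)
{
    unsigned int ones = 0, stuffed = 0, i;

    for (i = 0; i < nbits; i++) {
        if ((data >> i) & 1u) {
            ones++;
            if (unlikely(ones == 5)) {  /* hint: stuffing is the rare path */
                stuffed++;
                ones = 0;
            }
        } else {
            ones = 0;
        }
    }
    return stuffed;
}
```

The hint only tells gcc which path to lay out as the straight-line case. Whether that actually removes a taken branch has to be confirmed in the generated assembler.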

Altera_Forum · ‎10-11-2010

Each call to the PIO function runs two for loops, each of which calls the delay function 16 times, and each delay call loops 6000 times. Each call therefore delays 32 times, with 6000 decrements per delay.

Actually, I think they are nested inside a for loop that runs 4 times.

The delay, which simply decrements wait, is really the only thing consuming CPU time. So the comparison is really showing how fast the different CPUs/programs can:
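The counting above can be written out as a small sketch. The loop bounds are taken from the startup_leds() and delay() listing in the first post; the enum and function names are hypothetical.

```c
/* Loop bounds from the listing in the first post. */
enum {
    OUTER_PASSES    = 4,    /* for (count = 0; count < 4; count++) */
    SWEEPS          = 2,    /* one left-shift loop, one right-shift loop */
    STEPS_PER_SWEEP = 16,   /* for (is = 0; is < 16; is++) */
    WAIT_START      = 6000  /* initial value of wait in delay() */
};

/* Number of delay() calls in one call to startup_leds(). */
unsigned long delay_calls_per_run(void)
{
    return (unsigned long)OUTER_PASSES * SWEEPS * STEPS_PER_SWEEP;
}

/* Number of times wait is decremented in one call to startup_leds(). */
unsigned long decrements_per_run(void)
{
    return delay_calls_per_run() * WAIT_START;
}
```

That is 128 delay calls and 768000 decrements per call to startup_leds(), so the 128 PIO writes are negligible next to the delay loop.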

load 1;

load wait;

subtract;

store wait - 1;

load 0;

compare wait - 1, 0;

if (true) return;

else

jump to load 1;

Question 1: Did the compiler allocate registers or memory for the operands?

Question 2: Is this loop sufficient to measure CPU performance? In other words, is it a good benchmark of CPU performance?

Altera_Forum · ‎10-11-2010

The delay() loop should end up being something like:

delay:
        ori r2,r0,<loop count>     #  for a 16bit constant
loop:
        add r2,r2,-1
        bne r0,r2,loop
        ret

With the /f cpu, the add instruction takes 1 clock, the conditional branch (as coded) 2 clocks when going round the loop, and 4 when the loop exits.

It is possible to get the branch to be 1 clock in the loop exit path - by jumping forwards to an unconditional branch and disabling the dynamic branch predictor. In this case the loop would be 4 + 2 clocks.
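Using the clock counts given above for the loop as coded (1 for the add, 2 for a taken branch, 4 for the exiting branch), the total cost can be estimated with the sketch below. It ignores the ori, the ret and the call, and the function names are hypothetical.

```c
/* Per-instruction clock counts for the /f core, as stated in this post. */
enum {
    ADD_CLOCKS          = 1,
    BRANCH_TAKEN_CLOCKS = 2,  /* branch back round the loop */
    BRANCH_EXIT_CLOCKS  = 4   /* final, not-taken branch */
};

/* Clocks spent in the add/bne loop for n iterations.
 * n must be >= 1: with a count of 0 the assembler loop would wrap. */
unsigned long delay_loop_cycles(unsigned long n)
{
    return (n - 1UL) * (ADD_CLOCKS + BRANCH_TAKEN_CLOCKS)
         + (ADD_CLOCKS + BRANCH_EXIT_CLOCKS);
}

/* Converts a clock count to microseconds at the given clock rate. */
double cycles_to_us(unsigned long cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}
```

For a loop count of 6000 this gives 18002 clocks, or about 360 microseconds if the /f core were clocked at 50 MHz like the /e in the first post (an assumption).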

Altera_Forum · ‎10-11-2010

--- Quote Start ---

The delay() loop should end up being something like:

delay:
        ori r2,r0,<loop count>     #  for a 16bit constant
loop:
        add r2,r2,-1
        bne r0,r2,loop
        ret

With the /f cpu, the add instruction takes 1 clock, the conditional branch (as coded) 2 clocks when going round the loop, and 4 when the loop exits.

It is possible to get the branch to be 1 clock in the loop exit path - by jumping forwards to an unconditional branch and disabling the dynamic branch predictor. In this case the loop would be 4 + 2 clocks.

--- Quote End ---

Thanks. This is a nice assembler loop, I agree. How do we know that the given C compiler would generate this optimized loop? Also you did not include cycles to fetch the instructions and progress thru the pipe to do the add register in 1 cycle. Also the objective was to compare two different cpu architectures by running C code generated by two different compilers.

Also, the ori must be fetched and completed before the add can be done. That is something like 2-3 cycles memory access, plus 5 cycles thru the pipe. Assuming the add fetch was started a cycle after the ori fetch it is probably ready to execute, OK. Now the add result may be written to the register then compared, then the result used to determine the next instruction to fetch, then after the memory access it will start thru the pipe.

Altera_Forum · ‎10-11-2010

Hello, and thanks for all the additional information. Using the Nios II/f core results, in my case, in the message: "Megafunction that supports opencore features will stop functioning in 1 hour after device is programmed." Reason: I don't have the correct licence for the /f core. My network licence supports only the /e core. I also don't know how much such a licence costs. If the CPU speed really is the bottleneck, I would look into buying this licence. Alternatively, if the licence is too expensive, I would prefer to install an ARM-7 based header board (e.g. LPC-H21-03/-06/-24 from OLIMEX) on the DE1 header connector.

Altera_Forum · ‎10-12-2010

--- Quote Start ---

Thanks. This is a nice assembler loop, I agree. How do we know that the given C compiler would generate this optimized loop? Also you did not include cycles to fetch the instructions and progress thru the pipe to do the add register in 1 cycle. Also the objective was to compare two different cpu architectures by running C code generated by two different compilers.

--- Quote End ---

gcc will generate that code - probably from the given source - but it might need the variable changed to be 32 bits (i.e. not a short).

--- Quote Start ---

Also, the ori must be fetched and completed before the add can be done. That is something like 2-3 cycles memory access, plus 5 cycles thru the pipe. Assuming the add fetch was started a cycle after the ori fetch it is probably ready to execute, OK. Now the add result may be written to the register then compared, then the result used to determine the next instruction to fetch, then after the memory access it will start thru the pipe.

--- Quote End ---

For the /f core, and executing from tightly coupled memory (or the instruction cache), the instruction fetches can be ignored. The pipeline loss for the call to the delay function would be attributed to the call instruction (and is 2 clocks). The 'ori', 'add' and 'bne' execute in adjacent clocks; the backwards conditional branch will be (statically) predicted as taken, so it takes 2 clocks.

Actually such a small function is probably best marked 'inline' (or written as a #define) in order to make more registers available to the calling code.
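A minimal sketch of the 'inline' variant, assuming gcc. delay_inline and shift_with_delay are hypothetical names, and the caller only mimics the shape of the LED loop from the first post without touching the PIO.

```c
/* `static inline` lets gcc expand the loop at each call site, so the caller
 * does not have to treat caller-saved registers as clobbered by a call. */
static inline void delay_inline(unsigned int loops)
{
    while (loops != 0) {
        __asm__ volatile("" ::: "memory");  /* keep the loop at -O2/-O3 */
        loops--;
    }
}

/* Hypothetical caller: shifts a one-bit pattern left `steps` times with a
 * delay between steps and returns the final pattern. */
unsigned int shift_with_delay(unsigned int pattern, unsigned int steps)
{
    while (steps != 0) {
        delay_inline(6000);
        pattern <<= 1;
        steps--;
    }
    return pattern;
}
```

gcc is still free to ignore the inline hint, so it is worth checking the disassembly to confirm that no call instruction remains.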

I've actually removed all subroutine calls from my code in order to give the compiler the best chance of not running out of registers. The only accesses to %sp are the pointless saving of the caller-saved registers on entry to a function that doesn't ever return!