Nios® V/II Embedded Design Suite (EDS)

Porting PIO C-code from ARM-7 to Nios II (SE1-board) + performance issue.

Altera_Forum

Hi all, 

 

I'm currently trying to port my C code from an ARM-7 MCU (LPC-P2148) to a Nios II on the SE1-board (Cyclone II, EP2C20F484C7). I used the arm-none-eabi-gcc compiler for the ARM-7 MCU and the Nios II Eclipse platform, version 9.1, for the Nios II/e CPU. (I chose the /e CPU for cost and licence reasons.) More details can be obtained from my homepage. 

A) Performance issue:  

ARM-7: 12 MHz clock, running at 48 MHz via PLL. 

Nios II/e: 50 MHz clock, running at 50 MHz. 

Code used (shifting LED demo): 

 

#include <stdio.h>
#include <system.h>
#include "altera_avalon_pio_regs.h"

void startup_leds(void);
void delay(void);

//
void startup_leds(void)
{
    short is;
    short count = 4;
    int laufled = 0x00000001;

    for (count = 0; count < 4; count++) {
        for (is = 0; is < 16; is++) {
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_0_BASE, laufled);
            // IOPIN1 = laufled<<16; // was ARM-Code
            delay(); // wait
            laufled = laufled << 1;
        }
        for (is = 0; is < 16; is++) {
            // IOPIN1 = laufled<<16; // was ARM-Code
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_0_BASE, laufled);
            delay(); // wait
            laufled = laufled >> 1;
        }
    }
}

//
void delay(void)
{
    short int wait = 6000; // arm-7: 50000 for same speed.

    while (wait) {
        wait = wait - 1;
    }
}

//
int main(void)
{
    short count = 0;
    int delay;

    printf(" Hello from Nios II \n\r");
    startup_leds();
    while (1) {
        startup_leds();
    }
    return 0;
}

 

Result: the ARM-7 runs about 7 times faster than the Nios II/e! 

Is the Nios II/e CPU itself the reason? 

Is something perhaps configured wrongly in SOPC Builder? 

Is my C code correct? 

ARM-7 code, with an additional shift by 16: IOPIN1 = laufled<<16; 

Converted to: IOWR_ALTERA_AVALON_PIO_DATA(PIO_0_BASE, laufled); 

 

B) What is the recommended way to read data from a PIO, or to test the state of a single bit? For example, I want to check the status of bit 16. The ARM-7 C statement is 

if (!( IOPIN0 & 0x00010000 )) .... 

What is the equivalent Nios II statement? 

I have been playing around with IORD_ALTERA_AVALON_PIO_DATA(base), but without any success so far. 

Could this be a problem with SOPC Builder and the optional PIO settings? 
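
For reference, a minimal sketch of how the bit-16 test might look on the Nios II side. It assumes the PIO at PIO_0_BASE was generated in SOPC Builder with an input (or bidirectional) direction and a width of at least 17 bits; as far as I know, reading the data register of an output-only PIO does not return a defined value. The helper name pio_bit_is_low is hypothetical, and the #ifndef stand-ins exist only so the sketch also builds on a PC. On the target, include <system.h> and "altera_avalon_pio_regs.h" first, as in the code above.

```c
#include <assert.h>
#include <stdint.h>

/* On a Nios II BSP, IORD_ALTERA_AVALON_PIO_DATA comes from
 * "altera_avalon_pio_regs.h" and PIO_0_BASE from <system.h>.
 * The stand-ins below are hypothetical and only let the sketch build on a PC. */
#ifndef IORD_ALTERA_AVALON_PIO_DATA
static volatile uint32_t fake_pio_data;            /* pretend PIO data register */
#define PIO_0_BASE (&fake_pio_data)
#define IORD_ALTERA_AVALON_PIO_DATA(base) (*(base))
#endif

/* Returns 1 when the given bit of the PIO data register reads as 0,
 * which mirrors the ARM test  if (!(IOPIN0 & mask)). */
int pio_bit_is_low(unsigned bit)
{
    assert(bit < 32);                              /* a PIO is at most 32 bits wide */
    uint32_t value = IORD_ALTERA_AVALON_PIO_DATA(PIO_0_BASE);
    return (value & (UINT32_C(1) << bit)) == 0;
}
```

With this, the ARM statement if (!( IOPIN0 & 0x00010000 )) would become if (pio_bit_is_low(16)). Writing the mask out directly, as in if (!(IORD_ALTERA_AVALON_PIO_DATA(PIO_0_BASE) & 0x00010000)), is equivalent.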

 

Any answer is welcome. Regards, Reinhard
29 Replies
Altera_Forum

Let me see ... Nios will do the loop in 6 clocks ... assuming the timing measurement by PDPGY is accurate, then ARM must be doing the loop in 1 clock to get the 7 to 1 timing. 

 

This fails the sanity test.
Altera_Forum

Once again, the 7 to 1 timing applies to a nios ii/e core (this is the core PDP11GY used).  

Have a look at the NIOS II performance benchmarks  

www.altera.com/literature/ds/ds_nios2_perf.pdf 

 

NIOS II/e 0.15 MIPS/MHz 

NIOS II/s 0.64 MIPS/MHz 

NIOS II/f 1.13 MIPS/MHz
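
To put those figures next to the 50 MHz clock from the first post, here is a small sketch that multiplies them out. The three constants are the numbers quoted above; the function names are made up for illustration, and the result is only a Dhrystone-based estimate, not a prediction for any particular loop.

```c
#include <assert.h>
#include <stdio.h>

/* DMIPS/MHz figures as quoted in this post from ds_nios2_perf.pdf. */
#define NIOS2_E_DMIPS_PER_MHZ 0.15
#define NIOS2_S_DMIPS_PER_MHZ 0.64
#define NIOS2_F_DMIPS_PER_MHZ 1.13

/* Estimated Dhrystone MIPS for a core running at clock_mhz. */
double estimated_dmips(double dmips_per_mhz, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return dmips_per_mhz * clock_mhz;
}

/* Prints the estimate for each Nios II core at the given clock. */
void print_nios2_estimates(double clock_mhz)
{
    printf("Nios II/e: %.1f DMIPS\n", estimated_dmips(NIOS2_E_DMIPS_PER_MHZ, clock_mhz));
    printf("Nios II/s: %.1f DMIPS\n", estimated_dmips(NIOS2_S_DMIPS_PER_MHZ, clock_mhz));
    printf("Nios II/f: %.1f DMIPS\n", estimated_dmips(NIOS2_F_DMIPS_PER_MHZ, clock_mhz));
    printf("/f to /e ratio: %.1f\n", NIOS2_F_DMIPS_PER_MHZ / NIOS2_E_DMIPS_PER_MHZ);
}
```

Calling print_nios2_estimates(50.0) gives roughly 7.5 DMIPS for the /e, 32 for the /s and 56.5 for the /f, so by these figures the /f is about 7.5 times faster than the /e at the same clock.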
Altera_Forum

 

--- Quote Start ---  

Let me see ... Nios will do the loop in 6 clocks ... assuming the timing measurement by PDPGY is accurate, then ARM must be doing the loop in 1 clock to get the 7 to 1 timing. 

 

This fails the sanity test. 

--- Quote End ---  

 

 

Actually for the /e it is more likely to be 12 clocks, and 18 if the 'short' causes an additional 'andi rn,rn,0xffff' instruction. 

Not to mention the Avalon-MM delays when reading the instructions: I suspect the fastest you'll see (from an M9K memory block) is one wait state. 

 

I can't remember the ARM instruction set that well, and don't know the exact timings - the branch cost (particularly mispredicted) will depend very much on the ARM architecture. 

 

So one wait state and 6 clocks for 3 instructions is about a 7:1 timing difference against even a nios /f core.
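
As a rough cross-check, the clock counts can be turned into times with a sketch like the one below. It uses the 6000 iterations at 50 MHz and the 50000 iterations at 48 MHz from the first post, plus the 12 and 18 clocks per iteration guessed above. It deliberately ignores wait states, the call overhead of delay() and the PIO write, so treat the results as order-of-magnitude only. Both function names are hypothetical.

```c
#include <assert.h>

/* Time in microseconds for `iterations` passes of a loop that costs
 * `clocks_per_iteration` clocks on a CPU running at `clock_mhz`. */
double loop_time_us(unsigned long iterations, double clocks_per_iteration,
                    double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)iterations * clocks_per_iteration / clock_mhz;
}

/* Clocks per iteration another CPU would need in order to finish
 * `iterations` passes in the same `time_us`. */
double implied_clocks_per_iteration(double time_us, unsigned long iterations,
                                    double clock_mhz)
{
    assert(iterations > 0);
    return time_us * clock_mhz / (double)iterations;
}
```

loop_time_us(6000, 12.0, 50.0) comes to 1440 us and loop_time_us(6000, 18.0, 50.0) to 2160 us. If the ARM really covers 50000 iterations in the same time at 48 MHz, implied_clocks_per_iteration(1440.0, 50000, 48.0) works out to about 1.4 clocks per iteration, and about 2.1 for the 18-clock case.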
Altera_Forum

 

--- Quote Start ---  

Once again, the 7 to 1 timing applies to a nios ii/e core (this is the core PDP11GY used).  

Have a look at the NIOS II performance benchmarks  

www.altera.com/literature/ds/ds_nios2_perf.pdf 

 

NIOS II/e 0.15 MIPS/MHz 

NIOS II/s 0.64 MIPS/MHz 

NIOS II/f 1.13 MIPS/MHz 

--- Quote End ---  

 

 

So all of those things used to get the cycles/loop reduced to 6 do not apply. And knowing the MIPS without knowing the benchmark is meaningless. And the cost budget only covered the NIOSII/e, so PDP11GY could not afford NIOSII/f. 

 

I sure would like to understand how a single core can achieve 1.13 MIPS/MHz. It must complete more than one instruction per cycle, i.e. 2 per cycle about 13% of the time?
Altera_Forum

As far as the benchmark goes, it's the standard Dhrystone MIPS (http://en.wikipedia.org/wiki/dhrystone) that almost everyone uses, including ARM.

Altera_Forum

A couple of suggestions. 

 

1) Don't force a short int for NIOS. Compilers are free to choose a number of different int sizes based on the 'natural' size of the int for the targeted processor. Short int on the ARM is probably selecting the ARM register size, whereas the NIOS is having to do extra work to mask each access to 16 bits rather than the 32-bit register size. Just use a plain int and you will get each processor's 'natural' register size. 

 

2) Try forcing a 32-bit int for each processor and see what happens. uint32_t for NIOS, and the same for ARM (or maybe long int if stdint is not available for the ARM compiler). The results might be interesting. 

 

3) for fastest counting, try 

while(wait--); // yes, this is valid C code. Test is for NON-zero, not true/false 

instead of all that extra code. It's easier for a compiler to optimize.
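
A sketch for trying the three suggestions side by side, so the generated code for each counter type can be compared on both CPUs. The function names and the delay_ticks variable are hypothetical. The volatile sink is an assumption added here so that an optimizing compiler cannot delete the otherwise empty loops; it costs the same in all three variants, so the comparison between counter types should still hold.

```c
#include <assert.h>
#include <stdint.h>

/* Volatile sink so the compiler cannot remove the loops entirely.
 * This is an assumption of the sketch; the original delay() has no such guard. */
volatile uint32_t delay_ticks;

/* Suggestion 1: plain int, the natural register size of the CPU. */
void delay_int(int wait)
{
    assert(wait >= 0);      /* a negative start never reaches zero cleanly */
    while (wait--) {
        delay_ticks++;
    }
}

/* Suggestion 2: an explicit 32-bit counter, identical on both CPUs. */
void delay_u32(uint32_t wait)
{
    while (wait--) {
        delay_ticks++;
    }
}

/* The original 16-bit short counter, kept for comparison. */
void delay_short(short wait)
{
    assert(wait >= 0);
    while (wait--) {
        delay_ticks++;
    }
}
```

All three use the while(wait--) form from suggestion 3. Timing delay_short(6000) against delay_int(6000) on the Nios II/e would show how much the 16-bit counter actually costs.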
Altera_Forum

Actually the compiler will transform the loop you gave into: 

if (wait != 0) { do wait = wait - 1; while (wait != 0); } 

Depending on the code layout the first conditional might get optimised for the 'wait == 0' case. 

You get better code from a do ... while () loop. 

So you want while (--wait); 

 

However, for the example code, the compiler knew the value of the constant, so won't have compiled the initial test.
Altera_Forum

 

--- Quote Start ---  

 

I sure would like to understand how a single core can achieve 1.13 MIPS/MHz. It must complete more than one instruction per cycle, i.e. 2 per cycle about 13% of the time? 

--- Quote End ---  

 

 

The 1.13 MIPS/MHz figure is DMIPS/MHz. DMIPS is a calculated value based on the benchmark test. A value >1 does not necessarily mean that the CPU can execute more than one instruction per cycle. The MIPS/MHz value for the Nios II/f core, in the sense of instructions actually executed per clock, should be less than 1; otherwise the Nios II/f would be a superscalar architecture.
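
To make "calculated value" concrete: DMIPS is the Dhrystone score divided by 1757, the Dhrystones per second of the VAX 11/780 reference machine, so it measures speed relative to that machine rather than counting executed instructions. A minimal sketch, with hypothetical function names:

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780, which defines 1 DMIPS. */
#define VAX_11_780_DHRYSTONES_PER_SECOND 1757.0

/* Converts a raw Dhrystone score into DMIPS. */
double dmips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND;
}

/* Normalizes the DMIPS figure by the CPU clock in MHz. */
double dmips_per_mhz(double dhrystones_per_second, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return dmips(dhrystones_per_second) / clock_mhz;
}
```

A core that needs fewer instructions per Dhrystone pass than the VAX did can therefore score above 1 DMIPS/MHz while still executing at most one instruction per clock.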
Altera_Forum

 

--- Quote Start ---  

So you want while (--wait); 

--- Quote End ---  

 

That one does the really long wait for a value of zero, which might be what is wanted. 
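
The difference between the two forms is easy to check by counting passes. The sketch below uses a 16-bit unsigned counter as an assumption, so that the wrap-around at zero is well defined in C; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of times the body of `while (wait--)` runs for a 16-bit counter. */
uint32_t passes_post_decrement(uint16_t wait)
{
    uint32_t passes = 0;
    while (wait--) {
        passes++;
    }
    return passes;
}

/* Number of times the body of `while (--wait)` runs for a 16-bit counter. */
uint32_t passes_pre_decrement(uint16_t wait)
{
    uint32_t passes = 0;
    while (--wait) {
        passes++;
    }
    return passes;
}

/* Self-check of the behaviour discussed in this thread. */
void check_loop_forms(void)
{
    assert(passes_post_decrement(6000) == 6000);  /* one pass per count */
    assert(passes_post_decrement(0) == 0);        /* exits immediately */
    assert(passes_pre_decrement(6000) == 5999);   /* one pass fewer */
    assert(passes_pre_decrement(0) == 65535);     /* wraps: the really long wait */
}
```

With a signed counter the zero case would instead run into negative values, and eventually signed overflow, so the unsigned type matters for the last check.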

 

I'll get around to looking at NIOS code for these eventually. I tested this on several CISC processors and also HiTech C on several varieties of PIC RISC. For short int, I know the CISCs had a one instruction 'decrement, branch if non-zero'. I think the PIC did too (long time ago now, so I may be mixing my memories), but for whatever reason, the while(var--) did a 1 CPU cycle per count, + overhead. The while(--var) was probably the same except for no short-cut exit. Other variations took significantly longer. NIOS might be different. 

 

Either way, the biggest reduction was selecting the processor's register size as the variation of the 'int' to use for the counter. Equal variable sizes on processors with different-sized registers isn't a fair test.