Nios® V/II Embedded Design Suite (EDS)

Can I upload Nios2 SMP system?

Altera_Forum
Honored Contributor II

Hi, guys & Altera corp. 

 

I'm building a Nios2 SMP system for research. It's still a little buggy and slow, but I have succeeded in booting the Linux kernel and running bash.

 

Linux version 2.6.30 (hamada@Messiah2) (gcc version 4.1.2 (Wind River Linux Sourcery G++ 4.1-176)) #1915 SMP Tue Sep 4 18:16:32 JST 2012
console enabled
Early printk initialized
Linux/Nios II-MMU
Altera Nios II-MMU support (C) 2004 Wind River Systems.
init_bootmem_node(?,0x3d0, 0x0, 0x8000)
free_bootmem(0x3d0000, 0x7c30000)
reserve_bootmem(0x3d0000, 0x1000)
Detected 1 available secondary CPU(s)
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32512
Kernel command line: kgdboc=ttyS0,115200 kgdbwait
NR_IRQS:32
PID hash table entries: 512 (order: 9, 2048 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
We have 32768 pages of RAM
Memory available: 125824k/3902k RAM, 0k/0k ROM (1457k kernel code, 2445k data)
Calibrating delay loop... 19.55 BogoMIPS (lpj=97792)
Mount-cache hash table entries: 512
CPU1: Booted secondary processor
Calibrating delay loop... 19.96 BogoMIPS (lpj=99840)
Brought up 2 CPUs
SMP: Total of 2 processors activated (39.52 BogoMIPS).
init_BSP(): registering device resources
bio: create slab <bio-0> at 0
msgmni has been set to 246
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
ttyJ0 at MMIO 0xa60a440 (irq = 2) is a Altera JTAG UART
console handover: boot -> real
ttyS0 at MMIO 0x8000060 (irq = 3) is a Altera UART
ifconfig: socket: Function not implemented
ifconfig: socket: Function not implemented
Welcome to
          ____ _  _
         /  __| ||_|
    _   _| |  | | _ ____  _   _  _  _
   | | | | |  | || |  _ \| | | |\ \/ /
   | |_| | |__| || | | | | |_| |/    \
   |  ___\____|_||_|_| |_|\____|\_/\_/
   | |
   |_|

For further information check:
http://www.uclinux.org/

Why came here? CPU0, task inetd pte c71f4c40, entry 07a0704b, address 2ab10000

BusyBox v1.14.2 (2012-06-26 16:39:29 JST) hush - the humble shell
Enter 'help' for a list of built-in commands.

/# ls
bin   etc   init  mnt   root  sys   usr
dev   home  lib   proc  sbin  tmp   var
/# bash
# ls -lp
drwxr-xr-x 2 root root 0 Sep 4 2012 bin/
drwxr-xr-x 6 root root 0 Sep 4 2012 dev/
drwxr-xr-x 5 root root 0 Sep 4 2012 etc/
drwxr-xr-x 3 root root 0 Sep 4 2012 home/
lrwxrwxrwx 1 root root 10 Sep 4 2012 init -> /sbin/init
drwxr-xr-x 3 root root 0 Sep 4 2012 lib/
drwxr-xr-x 2 root root 0 Sep 4 2012 mnt/
dr-xr-xr-x 34 root root 0 Nov 30 00:00 proc/
drwxr-xr-x 2 root root 0 Sep 4 2012 root/
lrwxrwxrwx 1 root root 3 Sep 4 2012 sbin -> bin/
drwxr-xr-x 11 root root 0 Nov 30 00:00 sys/
drwxr-xr-x 2 root root 0 Nov 30 00:01 tmp/
drwxr-xr-x 5 root root 0 Sep 4 2012 usr/
drwxr-xr-x 7 root root 0 Nov 30 00:01 var/
# cat /proc/cpuinfo
CPU: NIOS2 MultiCore
MMU: ways:16 entries:512
FPU: none
Clocking: <not supported>
BogoMips: 19.96
Calibration: 9984000 loops
CPU: NIOS2 MultiCore
MMU: ways:16 entries:512
FPU: none
Clocking: <not supported>
BogoMips: 19.96
Calibration: 9984000 loops
# cat /proc/interrupts
     CPU0
  0: 13931 NIOS2-INTC timer
  2: 133 NIOS2-INTC JTAGUART
  3: 0 NIOS2-INTC UART
 30: 4875 NIOS2-INTC IPI 0
 31: 17375 NIOS2-INTC IPI 1
# cat /proc/stat
cpu 124 0 27259 1939 0 0 2 0 0
cpu0 54 0 13448 1183 0 0 2 0 0
cpu1 70 0 13811 756 0 0 0 0 0
intr 38284 14687 0 163 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 515 9 18275
ctxt 33554
btime 943920000
processes 674
procs_running 2
procs_blocked 0
#

The CPU core is a clone of the genuine Nios2/f core, and almost all of its features are implemented except for some details of the 1st-level data cache.

If anyone is interested, I would like to upload these to the 'Altera Wiki'. The problem is that the CPU is a clone, and Altera corp. holds the copyright for the Nios2 instruction set and architecture. If Altera corp. kindly permits me to upload everything, including the hardware source code, that would be the best way. If not, how can we share these results?

 

Kazu
15 Replies
Altera_Forum

How are you getting around the problem that the nios has no locked bus cycles - so you can't implement any mutex or other atomic operations into normal memory? 

Typically these need a minimum of a locked 'compare and exchange' instruction - which the avalon bus doesn't support. 

 

If you've re-implemented nios, you might notice that nios is basically a reimplementation of MIPS :-)
Altera_Forum

It's not possible to do an SMP Linux system on Altera NIOS (see several discussions on this). Altera would need to implement SMP-safe atomic instructions first. (See e.g. ARM's load-exclusive / store-exclusive instructions, the ARM form of "load locked" / "store conditional", for a decent way to do this.)

 

I understand that you created a NIOS compatible CPU yourself. Of course here you can in fact implement such instructions, but I feel that a NIOS clone (done in Verilog or whatever) will be much slower than an Altera branded thingy that uses low-level optimizations that Verilog and friends don't provide.  

 

I agree that implementing a MIPS clone seems more appropriate than implementing a NIOS clone. There are some free 32-bit CPUs in Verilog available on the Net.

 

Of course cache synchronization is a huge task to do.  

 

-Michael
Altera_Forum

Hi, 

 

 

--- Quote Start ---  

How are you getting around the problem that the nios has no locked bus cycles - so you can't implement any mutex or other atomic operations into normal memory? 

Typically these need a minimum of a locked 'compare and exchange' instruction - which the avalon bus doesn't support. 

 

--- Quote End ---  

 

 

I implemented an instruction that atomically swaps the value of a register with a memory operand, like

    swap ra, imm16

Spinlocks use this instruction, and the 'compare and exchange' operation is implemented on top of the spinlocks, like

    static inline int atomic_cmpxchg(atomic_t *v, int old, int new)
    {
            int ret;
            unsigned long flags;

            _atomic_spin_lock_irqsave(v, flags);
            ret = v->counter;
            if (likely(ret == old))
                    v->counter = new;
            _atomic_spin_unlock_irqrestore(v, flags);
            return ret;
    }

in the Linux kernel.
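As a rough illustration of the idea above (a spinlock built from an atomic swap, and compare-and-exchange built from that spinlock), here is a minimal sketch in portable C. It is not the code from this project: C11 `atomic_exchange` merely stands in for the custom `swap` instruction, the names `swap_lock_t`, `swap_lock`, `swap_unlock` and `emulated_cmpxchg` are hypothetical, and the interrupt masking done by the kernel helper is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef struct { atomic_int held; } swap_lock_t;

/* Acquire: atomically swap 1 into the lock word and spin until the
 * value swapped out was 0. atomic_exchange() stands in for 'swap'. */
void swap_lock(swap_lock_t *l)
{
    assert(l != NULL);
    while (atomic_exchange(&l->held, 1) != 0)
        ;   /* another CPU holds the lock: try again */
}

/* Release: store 0 back into the lock word. */
void swap_unlock(swap_lock_t *l)
{
    assert(l != NULL);
    atomic_store(&l->held, 0);
}

/* Compare-and-exchange emulated under the spinlock. It returns the
 * value seen in *counter, and stores new_val only if that value
 * equals old (same shape as the kernel helper, minus irqsave). */
int emulated_cmpxchg(swap_lock_t *l, int *counter, int old, int new_val)
{
    swap_lock(l);
    int ret = *counter;
    if (ret == old)
        *counter = new_val;
    swap_unlock(l);
    return ret;
}
```

With a counter holding 5, `emulated_cmpxchg(&lock, &counter, 5, 9)` returns 5 and leaves 9 in the counter; repeating the same call then returns 9 and changes nothing, because the expected old value no longer matches.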

 

 

--- Quote Start ---  

 

If you've re-implemented nios, you might notice that nios is basically a reimplementation of MIPS :-) 

--- Quote End ---  

 

 

Of course, I know that Nios2 is a copy of ....:D 

 

 

--- Quote Start ---  

 

I understand that you created a NIOS compatible CPU yourself. Of course here you can in fact implement such instructions, but I feel that a NIOS clone (done in Verilog or whatever) will be much slower than an Altera branded thingy that uses low-level optimizations that Verilog and friends don't provide.  

 

--- Quote End ---  

 

 

Of course, the clone core's fmax is a big problem if you want better performance than the single-core case. I compiled the source for my DE2-115 with the 'Less optimizations' and 'Fast fit' switches and got an fmax of around 60 MHz. By optimizing the details and the compilation switches, maybe I can reach 75-80 MHz, but over 100 MHz is impossible.

 

 

--- Quote Start ---  

 

I agree that implementing a MIPS clone seems more appropriate than implementing a NIOS clone. There are some free 32-bit CPUs in Verilog available on the Net.

 

--- Quote End ---  

 

 

Why Nios2? Because I love Nios2:) and Altera:D. 

 

 

--- Quote Start ---  

 

Of course cache synchronization is a huge task to do.  

 

--- Quote End ---  

 

 

To achieve cache coherency, I implemented the 1st-level data cache as a write-through cache, and on every write a cache flush signal is sent to the other CPU's data cache. If the target data is cached there, the line is simply flushed from that cache; the new data is filled in on the next memory access.
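The write-through-plus-flush scheme described above can be modelled in a few lines of C. This is only a toy software model under stated assumptions, not the actual hardware: the cache is direct-mapped with one word per line, the sizes and all names (`l1_cache_t`, `cpu_read`, `cpu_write`, `line_cached`) are made up for illustration, and updating the writer's own cached copy on a write hit is my assumption, since the post does not say.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS  2
#define NUM_LINES 4    /* tiny direct-mapped cache, one word per line */
#define MEM_WORDS 16

typedef struct {
    bool     valid[NUM_LINES];
    uint32_t addr[NUM_LINES];   /* word address held by each line */
    uint32_t data[NUM_LINES];
} l1_cache_t;

static uint32_t   memory[MEM_WORDS];   /* shared memory (or 2nd cache) */
static l1_cache_t l1[NUM_CPUS];        /* one 1st-level cache per CPU  */

/* True if 'cpu' currently holds a valid copy of word 'addr'. */
bool line_cached(int cpu, uint32_t addr)
{
    const l1_cache_t *c = &l1[cpu];
    uint32_t idx = addr % NUM_LINES;
    return c->valid[idx] && c->addr[idx] == addr;
}

/* Read: on a miss, refill the line from shared memory. */
uint32_t cpu_read(int cpu, uint32_t addr)
{
    assert(cpu >= 0 && cpu < NUM_CPUS && addr < MEM_WORDS);
    l1_cache_t *c = &l1[cpu];
    uint32_t idx = addr % NUM_LINES;
    if (!line_cached(cpu, addr)) {
        c->valid[idx] = true;
        c->addr[idx]  = addr;
        c->data[idx]  = memory[addr];
    }
    return c->data[idx];
}

/* Write: always write through to memory, then send a "flush" to every
 * other cache, which drops the line if it holds it. */
void cpu_write(int cpu, uint32_t addr, uint32_t value)
{
    assert(cpu >= 0 && cpu < NUM_CPUS && addr < MEM_WORDS);
    uint32_t idx = addr % NUM_LINES;
    memory[addr] = value;
    if (line_cached(cpu, addr))
        l1[cpu].data[idx] = value;          /* assumption: update own copy */
    for (int other = 0; other < NUM_CPUS; other++)
        if (other != cpu && line_cached(other, addr))
            l1[other].valid[idx] = false;   /* flushed, refilled on next read */
}
```

After both CPUs have read word 3, a `cpu_write(0, 3, 42)` leaves CPU1 without a valid copy, so its next `cpu_read(1, 3)` misses and picks up 42 from memory.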

 

Kazu
Altera_Forum

Sounds nice.  

 

It would be great if a way could be found (maybe with Altera's help) to use the original NIOS implementation and add what's necessary.

 

-Michael
Altera_Forum

Hi,  

 

As we discussed before, making a clone CPU with atomic instructions is 'one solution' for building a Nios2 SMP system.

 

 

--- Quote Start ---  

 

It would be great if a way could be found (maybe with Altera's help) to use the original NIOS implementation and add what's necessary.

 

--- Quote End ---  

 

 

Through this research I learned how to build an SMP system from both the hardware and the software side, so I'm now looking for a way to build one using the original Nios2 processors.

 

Kazu
Altera_Forum

If you can do the locked read-write avalon bus cycle you should be able to generate one from a custom instruction - except it would have to use a separate avalon master and so bypass the data cache. 

 

Actually you'll have no cache coherency either - very grim!

You'd have to use an external data cache. 

 

I've thought that the nios cpu isn't much more than a great heap of mux. 

My guess is that RA and RB are always read, the pipeline stalls (re-executes) if a write to RA is pending, and likewise for RB if the low two instruction bits differ (NFI why the instructions aren't organised so this is a single bit!). This gives the instruction three 32-bit words to play with; write-back to RB or RC is dependent on the decoded instruction.

That makes me think that the readra/readrb bits of the custom instruction are ignored - but I've not done any experiments.
Altera_Forum

Hi,  

 

 

--- Quote Start ---  

If you can do the locked read-write avalon bus cycle you should be able to generate one from a custom instruction - except it would have to use a separate avalon master and so bypass the data cache. 

 

Actually you'll have no cache coherency either - very grim!

You'd have to use an external data cache. 

 

--- Quote End ---  

 

 

Of course, we must add some custom instructions to do locked read-write memory operations, but we must not forget the existence of the MMU. As for cache coherency, we can remove the original data cache and add a special one.

 

 

--- Quote Start ---  

 

I've thought that the nios cpu isn't much more than a great heap of mux. 

My guess is that RA and RB are always read, the pipeline stalls (re-executes) if a write to RA is pending, and likewise for RB if the low two instruction bits differ (NFI why the instructions aren't organised so this is a single bit!). This gives the instruction three 32-bit words to play with; write-back to RB or RC is dependent on the decoded instruction.

That makes me think that the readra/readrb bits of the custom instruction are ignored - but I've not done any experiments. 

--- Quote End ---  

 

 

I think that the readra/readrb bits are used in the forwarding and pipeline-interlock mechanism.

 

Kazu
Altera_Forum

My point about readra/readrb is that the logic that handles the pipeline stall doesn't really want to look that far into the instruction decode.

Altera_Forum

Hi, 

 

 

--- Quote Start ---  

My point about readra/readrb is that the logic that handles the pipeline stall doesn't really want to look that far into the instruction decode. 

--- Quote End ---  

 

 

Maybe Altera uses embedded memories for the Nios register file and places them in the same clock phase as the 'Decode Stage'. Reading the embedded memories takes one clock, so the RA and RB fields must be sent directly from the 'Fetch Stage' to the register file. The register file contents are always passed on to the following stage even when they are not used. But to avoid the 'RAW' hazard, the 'Decode Stage' must decide whether RA and RB are really used. I don't know whether Altera uses the readra and readrb bits for this purpose, because the 'R0' register can be used instead.
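The decision the 'Decode Stage' has to make can be sketched as a small predicate. This is a guess at the logic being discussed, not Altera's implementation: the struct `decoded_t`, its `uses_ra`/`uses_rb` flags (which could come from readra/readrb or from full instruction decode) and the function `raw_hazard` are hypothetical names. The one fact relied on is that r0 always reads as zero, so a source field naming r0 can never depend on a pending write.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of a decoded instruction's source operands. */
typedef struct {
    unsigned ra, rb;        /* source register numbers, 0..31          */
    bool uses_ra, uses_rb;  /* does the instruction really consume them */
} decoded_t;

/* Returns true if the instruction must stall (or take a forwarded
 * value) because an older instruction still has to write the
 * register 'pending_dest'. */
bool raw_hazard(decoded_t d, unsigned pending_dest)
{
    assert(d.ra < 32 && d.rb < 32 && pending_dest < 32);
    if (pending_dest == 0)
        return false;   /* r0 is constant zero: writes to it never matter */
    return (d.uses_ra && d.ra == pending_dest) ||
           (d.uses_rb && d.rb == pending_dest);
}
```

If the decoder ignored the "really used" flags and compared the raw RA/RB fields, an instruction that happens to carry a stale register number in an unused field would stall needlessly; encoding unused fields as r0 avoids that without any extra bits, which is the alternative raised in the post above.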

 

Kazu
Altera_Forum

Hi, 

 

I succeeded in building a Nios2 SMP system using normal Nios2/f cores. It's still a little buggy, but I'm going to upload the sources and documents to the Altera wiki in the near future.

 

Linux version 2.6.30-00471-g2e1b9d6-dirty (hamada@Messiah2) (gcc version 4.1.2 (Wind River Linux Sourcery G++ 4.1-176)) #2077 SMP PREEMPT Sun Oct 21 16:34:00 JST 2012
console enabled
Early printk initialized
Linux/Nios II-MMU
Altera Nios II-MMU support (C) 2004 Wind River Systems.
init_bootmem_node(?,0x3d9, 0x0, 0x8000)
free_bootmem(0x3d9000, 0x7c27000)
reserve_bootmem(0x3d9000, 0x1000)
Detected 1 available secondary CPU(s)
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32512
Kernel command line: kgdboc=ttyS0,115200 kgdbwait
NR_IRQS:32
PID hash table entries: 512 (order: 9, 2048 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
We have 32768 pages of RAM
Memory available: 125808k/3938k RAM, 0k/0k ROM (1485k kernel code, 2452k data)
Calibrating delay loop... 24.21 BogoMIPS (lpj=121088)
Mount-cache hash table entries: 512
CPU1: Booted secondary processor
Calibrating delay loop... 24.98 BogoMIPS (lpj=124928)
Brought up 2 CPUs
SMP: Total of 2 processors activated (49.20 BogoMIPS).
init_BSP(): registering device resources
bio: create slab <bio-0> at 0
msgmni has been set to 246
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
ttyJ0 at MMIO 0xa60a500 (irq = 2) is a Altera JTAG UART
console handover: boot -> real
ttyS0 at MMIO 0x8000060 (irq = 3) is a Altera UART
init.rc
Welcome to
          ____ _  _
         /  __| ||_|
    _   _| |  | | _ ____  _   _  _  _
   | | | | |  | || |  _ \| | | |\ \/ /
   | |_| | |__| || | | | | |_| |/    \
   |  ___\____|_||_|_| |_|\____|\_/\_/
   | |
   |_|

For further information check:
http://www.uclinux.org/

BusyBox v1.14.2 (2012-09-17 02:11:26 JST) hush - the humble shell
Enter 'help' for a list of built-in commands.

/# bash
# cat /proc/cpuinfo
CPU: NIOS2 MultiCore
MMU: ways:16 entries:512
FPU: none
Clocking: <not supported>
BogoMips: 24.98
Calibration: 12492800 loops
CPU: NIOS2 MultiCore
MMU: ways:16 entries:512
FPU: none
Clocking: <not supported>
BogoMips: 24.98
Calibration: 12492800 loops
# cat /proc/interrupts
     CPU0
  0: 25375 NIOS2-INTC timer
  2: 130 NIOS2-INTC JTAGUART
  3: 0 NIOS2-INTC UART
 30: 1993 NIOS2-INTC IPI 0
 31: 30282 NIOS2-INTC IPI 1
# cat /proc/stat
cpu 102 0 41939 10290 0 0 1 0 0
cpu0 67 0 19033 7103 0 0 0 0 0
cpu1 35 0 22906 3187 0 0 1 0 0
intr 59916 26203 0 166 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 220 5 31342
ctxt 15553
btime 943920000
processes 670
procs_running 2
procs_blocked 0
#

Kazu
Altera_Forum

 

--- Quote Start ---  

I succeeded to make a Nios2 SMP system by using normal Nios2/f cores.  

--- Quote End ---  

 

Sounds great.  

But how did you handle cache synchronization and the inter-CPU atomic operations that are necessary to do the Mutex API and the multiple Kernel-internal synchronization issues ?  

 

(I understand that this is close to impossible without modifying the CPU design.) 

 

-Michael
Altera_Forum

Hi, 

 

 

--- Quote Start ---  

 

But how did you handle cache synchronization and the inter-CPU atomic operations that are necessary to do the Mutex API and the multiple Kernel-internal synchronization issues ?  

 

(I understand that this is close to impossible without modifying the CPU design.) 

 

 

--- Quote End ---  

 

 

First, I removed the normal data cache from the Nios2/f core (by selecting the data cache <none> option in SOPC Builder) and added my own 1st-level (write-through) data cache. The cache synchronization method is the same one used for the clone. For atomic memory operations, I implemented 'swap' as a custom instruction. Unfortunately, the cacheable/non-cacheable information is not available outside the Nios2 core, so I changed the kernel memory mapping like

    0xc0000000-0xcfffffff : cacheable
    0xd0000000-0xdfffffff : non-cacheable
    0xe0000000-0xefffffff : cacheable
    0xf0000000-0xffffffff : non-cacheable
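Read as a rule, the map above says that within the kernel range the external cache only needs to look at address bit 28. A minimal sketch of that check follows, assuming the four ranges listed are the whole story; the function name `kernel_addr_is_cacheable` is hypothetical and is not taken from the actual sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* For kernel addresses 0xc0000000..0xffffffff, the map above makes
 * every 256 MB block with A28 = 0 cacheable and every block with
 * A28 = 1 non-cacheable. */
bool kernel_addr_is_cacheable(uint32_t addr)
{
    assert(addr >= 0xC0000000u);     /* only the kernel range is covered */
    return ((addr >> 28) & 1u) == 0;
}
```

So `0xc0001000` and `0xe0000000` come out cacheable, while `0xd0001000` and `0xffffffff` do not. Because the decision depends on one address bit, hardware outside the core can make it from the address on the bus alone.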

 

Kazu
Altera_Forum

 

--- Quote Start ---  

I implemented the 'swap' that is controlled as a custom instruction.  

--- Quote End ---  

 

AFAIK, that would include doing an additional memory interface for this instruction, as the infrastructure of the NIOS design does not allow using the processor's memory interface in a custom instruction. This of course prevents allowing for a cache within the processor. I suppose doing an external cache (aka 2nd level cache) instead of using the 1st level cache provided by Altera will slow down the CPU a lot.

 

 

--- Quote Start ---  

Unfortunately, we can't use cache non-cache information outside of Nios2 core, so I changed the kernel memory mapping  

--- Quote End ---  

 

 

Maybe you could use the old A31-trick (A31=1 -> cache bypassed). With that you could define non-cacheable regions using the MMU target address.  

 

But I don't think the problem with inter-CPU atomic instructions is solvable :(.  

 

-Michael
Altera_Forum

Hi,  

 

 

--- Quote Start ---  

AFAIK, that would include doing an additional memory interface for this instruction, as the infrastructure of the NIOS design does not allow using the processor's memory interface in a custom instruction. This of course prevents allowing for a cache within the processor. I suppose doing an external cache (aka 2nd level cache) instead of using the 1st level cache provided by Altera will slow down the CPU a lot.

 

--- Quote End ---  

 

 

To build an SMP system with normal Nios2 cores, we must achieve the following two points.

 

1) Atomic read-write memory instruction. 

2) Coherency of 1st data caches  

 

For atomic memory instructions, it is like the game 'Beach Flags' (in this case there is only one flag, which corresponds to a lock variable). So the flag must be placed in the 2nd-level cache or main memory, not in the 1st-level caches. This means that the bus lock for atomic instructions is required between the 1st-level cache and the 2nd-level cache, not between the CPU and the 1st-level cache, so we can achieve 1) without tampering with Altera's data cache. But for 2), there is no way for external hardware to flush a given line, so it is impossible to achieve without removing the normal data cache.

Of course, we must accept the disadvantage of adding an external 1st-level data cache. It makes the CPU slower, but not by a lot. Currently a read or write between the CPU and the external 1st-level cache takes 3 clocks on a cache hit. But code does not consist entirely of 'load' and 'store' instructions, so the impact is limited. (Fewer memory accesses is the major premise of RISC processors, though it is sometimes broken :D.)

There are also some advantages to adopting the external 1st-level cache. We can make all the caches physically indexed and physically tagged, so the 1st-level data cache can be enlarged beyond 4 KBytes without synonym problems. Moreover, the bus between the 1st- and 2nd-level caches can be custom, e.g. wider, or readable and writable simultaneously. I adopted a 128-bit bus, and the peak data rate reaches 1.6 GBytes/sec (@100 MHz).
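The 1.6 GBytes/sec figure follows directly from the bus width and clock, assuming one full-width transfer per clock cycle in one direction (that assumption is mine; the post only gives the width, the clock and the result). A small helper with the hypothetical name `peak_bytes_per_sec` makes the arithmetic explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Peak bandwidth of a bus that moves 'bus_bits' bits on every clock:
 * (bus_bits / 8) bytes per cycle times clock_hz cycles per second. */
uint64_t peak_bytes_per_sec(uint32_t bus_bits, uint64_t clock_hz)
{
    assert(bus_bits % 8 == 0);   /* whole bytes per transfer */
    return (uint64_t)(bus_bits / 8) * clock_hz;
}
```

For a 128-bit bus at 100 MHz this gives 16 bytes x 100,000,000 = 1,600,000,000 bytes/sec, i.e. the 1.6 GBytes/sec quoted above; a 32-bit bus at the same clock would peak at 400 MBytes/sec.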

 

 

--- Quote Start ---  

 

Maybe you could use the old A31-trick (A31=1 -> cache bypassed). With that you could define non-cacheable regions using the MMU target address.  

 

--- Quote End ---  

 

 

Yes, I used an A28 trick.

 

 

--- Quote Start ---  

 

But I don't think the problem with inter-CPU atomic instructions is solvable :(.  

 

--- Quote End ---  

 

 

If it were unsolvable, Linux would never boot ;).

 

Kazu
Altera_Forum

Hi, 

 

I uploaded the example files to the 'Altera Wiki'. Please see

http://www.alterawiki.com/wiki/nios2_smp_system.  

Later I will upload sources and explanations.  

Merry X'mas (and this new editor is stupid too.) 

 

Kazu