Hi guys and Altera Corp.,
I'm building a Nios2 SMP system for research purposes. It's still a little buggy and slow, but I have succeeded in booting the Linux kernel and running bash.
Linux version 2.6.30 (hamada@Messiah2) (gcc version 4.1.2 (Wind River Linux Sourcery G++ 4.1-176)) #1915 SMP Tue Sep 4 18:16:32 JST 2012
console enabled
Early printk initialized
Linux/Nios II-MMU
Altera Nios II-MMU support (C) 2004 Wind River Systems.
init_bootmem_node(?,0x3d0, 0x0, 0x8000)
free_bootmem(0x3d0000, 0x7c30000)
reserve_bootmem(0x3d0000, 0x1000)
Detected 1 available secondary CPU(s)
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32512
Kernel command line: kgdboc=ttyS0, 115200 kgdbwait
NR_IRQS:32
PID hash table entries: 512 (order: 9, 2048 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
We have 32768 pages of RAM
Memory available: 125824k/3902k RAM, 0k/0k ROM (1457k kernel code, 2445k data)
Calibrating delay loop... 19.55 BogoMIPS (lpj=97792)
Mount-cache hash table entries: 512
CPU1: Booted secondary processor
Calibrating delay loop... 19.96 BogoMIPS (lpj=99840)
Brought up 2 CPUs
SMP: Total of 2 processors activated (39.52 BogoMIPS).
init_BSP(): registering device resources
bio: create slab <bio-0> at 0
msgmni has been set to 246
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
ttyJ0 at MMIO 0xa60a440 (irq = 2) is a Altera JTAG UART
console handover: boot -> real
ttyS0 at MMIO 0x8000060 (irq = 3) is a Altera UART
ifconfig: socket: Function not implemented
ifconfig: socket: Function not implemented
Welcome to
____ _ _
/ __| ||_|
_ _| | | | _ ____ _ _ _ _
| | | | | | || | _ \| | | |\ \/ /
| |_| | |__| || | | | | |_| |/ \
| ___\____|_||_|_| |_|\____|\_/\_/
| |
|_|
For further information check:
http://www.uclinux.org/
Why came here? CPU0, task inetd pte c71f4c40, entry 07a0704b, address 2ab10000
BusyBox v1.14.2 (2012-06-26 16:39:29 JST) hush - the humble shell
Enter 'help' for a list of built-in commands.
/# ls
bin etc init mnt root sys usr
dev home lib proc sbin tmp var
/# bash
# ls -lp
drwxr-xr-x 2 root root 0 Sep 4 2012 bin/
drwxr-xr-x 6 root root 0 Sep 4 2012 dev/
drwxr-xr-x 5 root root 0 Sep 4 2012 etc/
drwxr-xr-x 3 root root 0 Sep 4 2012 home/
lrwxrwxrwx 1 root root 10 Sep 4 2012 init -> /sbin/init
drwxr-xr-x 3 root root 0 Sep 4 2012 lib/
drwxr-xr-x 2 root root 0 Sep 4 2012 mnt/
dr-xr-xr-x 34 root root 0 Nov 30 00:00 proc/
drwxr-xr-x 2 root root 0 Sep 4 2012 root/
lrwxrwxrwx 1 root root 3 Sep 4 2012 sbin -> bin/
drwxr-xr-x 11 root root 0 Nov 30 00:00 sys/
drwxr-xr-x 2 root root 0 Nov 30 00:01 tmp/
drwxr-xr-x 5 root root 0 Sep 4 2012 usr/
drwxr-xr-x 7 root root 0 Nov 30 00:01 var/
# cat /proc/cpuinfo
CPU: NIOS2 MultiCore
MMU: ways:16 entries:512
FPU: none
Clocking: <not supported>
BogoMips: 19.96
Calibration: 9984000 loops
CPU: NIOS2 MultiCore
MMU: ways:16 entries:512
FPU: none
Clocking: <not supported>
BogoMips: 19.96
Calibration: 9984000 loops
# cat /proc/interrupts
CPU0
0: 13931 NIOS2-INTC timer
2: 133 NIOS2-INTC JTAGUART
3: 0 NIOS2-INTC UART
30: 4875 NIOS2-INTC IPI 0
31: 17375 NIOS2-INTC IPI 1
# cat /proc/stat
cpu 124 0 27259 1939 0 0 2 0 0
cpu0 54 0 13448 1183 0 0 2 0 0
cpu1 70 0 13811 756 0 0 0 0 0
intr 38284 14687 0 163 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 515
9 18275
ctxt 33554
btime 943920000
processes 674
procs_running 2
procs_blocked 0
#
The CPU core is a clone of the genuine Nios2/f core, and almost all features are implemented except for some details of the 1st-level data cache.
If anyone is interested, I would like to upload everything to the 'Altera Wiki'. The problem is that the CPU is a clone, and Altera Corp. holds the copyright on the Nios2 instruction set and architecture. If Altera Corp. kindly permits me to upload everything, including the hardware source code, that would be the best way. If not, how can we share these results?
Kazu
How are you getting around the problem that the Nios has no locked bus cycles, so you can't implement a mutex or any other atomic operation in normal memory?
Typically these need at least a locked 'compare and exchange' instruction, which the Avalon bus doesn't support. If you've re-implemented Nios, you might notice that Nios is basically a reimplementation of MIPS :-)
It's not possible to build an SMP Linux system on Altera NIOS (see several discussions on this). Altera would need to implement SMP-safe atomic instructions first. (See e.g. ARM's "load locked" and "store conditional" instructions for a decent way to do this.)
I understand that you created a NIOS-compatible CPU yourself. There you can of course implement such instructions, but I feel that a NIOS clone (done in Verilog or whatever) will be much slower than the Altera-branded one, which uses low-level optimizations that Verilog and friends don't provide. I agree that implementing a MIPS clone seems more appropriate than implementing a NIOS clone. There are some free 32-bit CPUs available as Verilog code on the net. Of course, cache synchronization is a huge task.
-Michael
Hi,
--- Quote Start --- How are you getting around the problem that the nios has no locked bus cycles - so you can't implement any mutex or other atomic operations into normal memory? Typically these need a minimum of a locked 'compare and exchange' instruction - which the avalon bus doesn't support. --- Quote End ---
I implemented an instruction that atomically swaps the values of a register and a memory operand, like
swap ra, imm16
Spinlocks use this instruction, and the 'compare and exchange' operation is implemented on top of those spinlocks, like
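static inline int atomic_cmpxchg(atomic_t *v, int old, int new)
{
        int ret;
        unsigned long flags;

        /* Take the spinlock guarding v with local interrupts disabled. */
        _atomic_spin_lock_irqsave(v, flags);
        ret = v->counter;
        /* Store the new value only if the current one matches 'old'. */
        if (likely(ret == old))
                v->counter = new;
        _atomic_spin_unlock_irqrestore(v, flags);
        /* The caller compares the returned value with 'old' to detect success. */
        return ret;
}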
in the Linux kernel.
--- Quote Start --- If you've re-implemented nios, you might notice that nios is basically a reimplementation of MIPS :-) --- Quote End ---
Of course, I know that Nios2 is a copy of ....:D
--- Quote Start --- I understand that you created a NIOS compatible CPU yourself. Of course here you can in fact implement such instructions, but I feel that a NIOS clone (done in Verilog or whatever) will be much slower than an Altera branded thingy that uses low-level optimizations that Verilog and friends don't provide. --- Quote End ---
Of course, the clone core's fmax is a big problem if you want better performance than in the single-core case. I compiled the source for my DE2-115 with the 'less Optimizations' and 'Fast fit' switches and got an fmax of around 60MHz. By optimizing the details and the compilation switches, maybe I can get 75~80MHz, but over 100MHz is impossible.
--- Quote Start --- I agree that implementing a MIPS clone seems more appropriate than implementing a NIOS clone There are some free 32 Bit CPUs in Verilog code available in the Net. --- Quote End ---
Why Nios2? Because I love Nios2:) and Altera:D.
--- Quote Start --- Of course cache synchronization is a huge task to do. --- Quote End ---
To achieve cache coherency, I implemented the 1st-level data cache as a write-through cache, and whenever a write is done, a cache flush signal is sent to the other CPU's data cache. If the target data is cached there, it is simply flushed from that cache. The new data is filled in at the next memory access.
Kazu
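As a rough illustration of how a spinlock can be layered on an atomic swap, here is a minimal sketch. It is not the kernel code from this thread: `hw_swap`, `swap_spinlock_t` and `locked_cmpxchg` are hypothetical names, and C11's `atomic_exchange` merely stands in for the custom 'swap' instruction. Interrupt masking, which the kernel version does, is left out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the custom 'swap' instruction: atomically
 * store new_val to *addr and return the previous contents.  C11's
 * atomic_exchange only models that behaviour here. */
static inline int hw_swap(atomic_int *addr, int new_val)
{
    return atomic_exchange(addr, new_val);
}

/* 0 = free, 1 = held */
typedef struct { atomic_int locked; } swap_spinlock_t;

static inline void swap_spin_init(swap_spinlock_t *l)
{
    atomic_init(&l->locked, 0);
}

/* One acquisition attempt: returns 1 if this caller took the lock. */
static inline int swap_spin_trylock(swap_spinlock_t *l)
{
    /* Whoever swaps in 1 and gets 0 back is the new owner. */
    return hw_swap(&l->locked, 1) == 0;
}

static inline void swap_spin_lock(swap_spinlock_t *l)
{
    while (!swap_spin_trylock(l))
        ;  /* spin until the previous owner releases the lock */
}

static inline void swap_spin_unlock(swap_spinlock_t *l)
{
    assert(atomic_load(&l->locked) == 1);  /* must be held when released */
    atomic_store(&l->locked, 0);
}

/* Compare-and-exchange built on the lock, in the same shape as the
 * atomic_cmpxchg() shown in the post (without the irqsave part). */
static inline int locked_cmpxchg(swap_spinlock_t *l, int *v, int old, int new_val)
{
    int ret;

    swap_spin_lock(l);
    ret = *v;
    if (ret == old)
        *v = new_val;
    swap_spin_unlock(l);
    return ret;
}
```

The lock word has to live where every CPU sees the same copy, so on real hardware the swap must bypass or be coherent with any private cache.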
Sounds nice.
It would be great if a way could be found (maybe with Altera's help) to use the original NIOS implementation and add what's necessary.
-Michael
Hi,
As we discussed before, making a clone CPU that has atomic instructions is 'one solution' for building a Nios2 SMP system.
--- Quote Start --- It would be great if a way could be found (maybe with Altera's help) to use the original NIOS implementation and add what's necessary. --- Quote End ---
Through this research, I now understand how to build an SMP system from both the hardware and the software side, so I'm now looking for a way to build one using original Nios2 processors.
Kazu
If you can do the locked read-write Avalon bus cycle, you should be able to generate one from a custom instruction, except that it would have to use a separate Avalon master and so bypass the data cache.
Actually you'll have no cache coherency either - very grim! You'd have to use an external data cache.
I've thought that the Nios CPU isn't much more than a great heap of mux. My guess is that RA and RB are always read, and that the pipeline stalls (re-executes) if a write to RA is pending, and likewise for RB if the low two instruction bits differ (NFI why the instructions aren't organised so this is a single bit!). This gives the instruction three 32-bit words to play with; write-back to RB or RC depends on the decoded instruction. That makes me think that the readra/readrb bits of the custom instruction are ignored, but I've not done any experiments.
Hi,
--- Quote Start --- If you can do the locked read-write avalon bus cycle you should be able to generate one from a custom instruction - except it would have to use a separate avalon master and so bypass the data cache. Actually you'll have no cache coherency either - very grim! You'd have to use an external data cache. --- Quote End ---
Of course, we must add some custom instructions to do locked read-write memory operations, but we can't forget the existence of the MMU. As for cache coherency, we can remove the original data cache and put in a special one.
--- Quote Start --- I've thought that the nios cpu isn't much more than a great heap of mux. My guess is that RA and RB are always read, pipeline stalls (re-execute) if a write to RA is pending, and for RB if the low two instruction bits differ (NFI why the instructions aren't organised so this is a single bit!). This gives the instruction three 32bit words to play with, write-back to RB or RC is dependent on the decoded instruction. That makes me think that the readra/readrb bits of the custom instruction are ignored - but I've not done any experiments. --- Quote End ---
I think that the readra/readrb bits are used in the forwarding and pipeline interlock mechanism.
Kazu
My point about readra/readrb is that the logic that handles the pipeline stall doesn't really want to look that far into the instruction decode.
Hi,
--- Quote Start --- My point about readra/readrb is that the logic that handles the pipeline stall doesn't really want to look that far into the instruction decode. --- Quote End ---
Maybe Altera uses embedded memories for the Nios register file and places them in the same clock phase as the 'Decode Stage'. Reading the contents of an embedded memory takes one clock, so the RA and RB fields must be sent directly from the 'Fetch Stage' to the register file. The contents of the register file are always passed on to the following stage, even when they are not used. But to avoid the 'RAW' hazard, the 'Decode Stage' must decide whether RA and RB are really used or not. I don't know whether Altera uses the readra and readrb bits for this purpose, because the 'R0' register can be used instead.
Kazu
Hi,
I succeeded in building a Nios2 SMP system using normal Nios2/f cores. It's still a little buggy, but I'm going to upload the sources and documents to the Altera Wiki in the near future.
Linux version 2.6.30-00471-g2e1b9d6-dirty (hamada@Messiah2) (gcc version 4.1.2 (Wind River Linux Sourcery G++ 4.1-176)) #2077 SMP PREEMPT Sun Oct 21 16:34:00 JST 2012
console enabled
Early printk initialized
Linux/Nios II-MMU
Altera Nios II-MMU support (C) 2004 Wind River Systems.
init_bootmem_node(?,0x3d9, 0x0, 0x8000)
free_bootmem(0x3d9000, 0x7c27000)
reserve_bootmem(0x3d9000, 0x1000)
Detected 1 available secondary CPU(s)
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32512
Kernel command line: kgdboc=ttyS0, 115200 kgdbwait
NR_IRQS:32
PID hash table entries: 512 (order: 9, 2048 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
We have 32768 pages of RAM
Memory available: 125808k/3938k RAM, 0k/0k ROM (1485k kernel code, 2452k data)
Calibrating delay loop... 24.21 BogoMIPS (lpj=121088)
Mount-cache hash table entries: 512
CPU1: Booted secondary processor
Calibrating delay loop... 24.98 BogoMIPS (lpj=124928)
Brought up 2 CPUs
SMP: Total of 2 processors activated (49.20 BogoMIPS).
init_BSP(): registering device resources
bio: create slab <bio-0> at 0
msgmni has been set to 246
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
ttyJ0 at MMIO 0xa60a500 (irq = 2) is a Altera JTAG UART
console handover: boot -> real
ttyS0 at MMIO 0x8000060 (irq = 3) is a Altera UART
init.rc
Welcome to
____ _ _
/ __| ||_|
_ _| | | | _ ____ _ _ _ _
| | | | | | || | _ \| | | |\ \/ /
| |_| | |__| || | | | | |_| |/
| ___\____|_||_|_| |_|\____|\_/\_/
| |
|_|
For further information check:
http://www.uclinux.org/
BusyBox v1.14.2 (2012-09-17 02:11:26 JST) hush - the humble shell
Enter 'help' for a list of built-in commands.
/# bash
# cat /proc/cpuinfo
CPU: NIOS2 MultiCore
MMU: ways:16 entries:512
FPU: none
Clocking: <not supported>
BogoMips: 24.98
Calibration: 12492800 loops
CPU: NIOS2 MultiCore
MMU: ways:16 entries:512
FPU: none
Clocking: <not supported>
BogoMips: 24.98
Calibration: 12492800 loops
# cat /proc/interrupts
CPU0
0: 25375 NIOS2-INTC timer
2: 130 NIOS2-INTC JTAGUART
3: 0 NIOS2-INTC UART
30: 1993 NIOS2-INTC IPI 0
31: 30282 NIOS2-INTC IPI 1
# cat /proc/stat
cpu 102 0 41939 10290 0 0 1 0 0
cpu0 67 0 19033 7103 0 0 0 0 0
cpu1 35 0 22906 3187 0 0 1 0 0
intr 59916 26203 0 166 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 220
5 31342
ctxt 15553
btime 943920000
processes 670
procs_running 2
procs_blocked 0
#
Kazu
--- Quote Start --- I succeeded to make a Nios2 SMP system by using normal Nios2/f cores. --- Quote End ---
Sounds great. But how did you handle cache synchronization and the inter-CPU atomic operations that are necessary for the mutex API and the many kernel-internal synchronization issues? (I understand that this is close to impossible without modifying the CPU design.)
-Michael
Hi,
--- Quote Start --- But how did you handle cache synchronization and the inter-CPU atomic operations that are necessary to do the Mutex API and the multiple Kernel-internal synchronization issues ? (I understand that this is close to impossible without modifying the CPU design.) --- Quote End ---
First, I removed the normal data cache from the Nios2/f core (by selecting the data cache <none> option in SOPC Builder) and added my own 1st-level (write-through) data cache. The cache synchronization method is the same as the one used for the clone. For atomic memory operations, I implemented 'swap' as a custom instruction. Unfortunately, the cacheable/non-cacheable information is not available outside the Nios2 core, so I changed the kernel memory mapping to
0xc0000000-0xcfffffff : cacheable
0xd0000000-0xdfffffff : non-cacheable
0xe0000000-0xefffffff : cacheable
0xf0000000-0xffffffff : non-cacheable
Kazu
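In the memory map above, the cacheable and non-cacheable windows alternate every 0x10000000 bytes, which is consistent with address bit 28 selecting the cache behaviour. A minimal sketch of that decode follows. The names `KERNEL_BASE`, `A28_MASK` and `kernel_addr_is_cacheable` are made up for illustration, and the real decode sits in hardware outside the core, not in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants derived from the map in the post. */
#define KERNEL_BASE UINT32_C(0xc0000000)
#define A28_MASK    (UINT32_C(1) << 28)

/* Returns true if a kernel-region address falls in a cacheable window:
 *   0xc... and 0xe... (A28 = 0) -> cacheable
 *   0xd... and 0xf... (A28 = 1) -> non-cacheable */
static inline bool kernel_addr_is_cacheable(uint32_t addr)
{
    assert(addr >= KERNEL_BASE);  /* the map is only given for this region */
    return (addr & A28_MASK) == 0;
}
```

Because the decision depends on a single address bit, external cache logic can make it without any cacheability hint from the core.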
--- Quote Start --- I implemented the 'swap' that is controlled as a custom instruction. --- Quote End ---
AFAIK, that would mean adding an additional memory interface for this instruction, as the infrastructure of the NIOS design does not allow a custom instruction to use the processor's memory interface. This of course rules out a cache within the processor. I suppose using an external cache (aka 2nd-level cache) instead of the 1st-level cache provided by Altera will slow down the CPU a lot.
--- Quote Start --- Unfortunately, we can't use cache non-cache information outside of Nios2 core, so I changed the kernel memory mapping --- Quote End ---
Maybe you could use the old A31 trick (A31=1 -> cache bypassed). With that you could define non-cacheable regions using the MMU target address. But I don't think the problem with inter-CPU atomic instructions is solvable :(.
-Michael
Hi,
--- Quote Start --- AFAIK, that would include doing an additional memory interface for this instruction, as the infrastructure of the NIOS design does not allow using the processor's memory interface in a custom instruction. This of course prevents allowing for a cache within the processor. I suppose doing an external cache (aka 2nd level cache) instead of using the 1st level cache provided by Altera will slow down the CPU a lot. --- Quote End ---
To make an SMP system with normal Nios2 cores, we must achieve the following two points:
1) An atomic read-write memory instruction.
2) Coherency of the 1st-level data caches.
Atomic memory instructions are a kind of 'Beach Flags' game (in this case there is only one flag, and it corresponds to a locking variable). So the flag must be placed in the 2nd-level cache or main memory, not in the 1st-level caches. This means that the bus lock for atomic instructions is required between the 1st-level and 2nd-level caches, not between the CPU and the 1st-level cache, so we can achieve 1) without tampering with Altera's data cache. For 2), however, external hardware has no way to flush a targeted line, so it cannot be achieved without removing the normal data cache.
Of course, we must accept the disadvantage of adding an external 1st-level data cache. It makes the CPU slower, but not by a lot. Currently a read or write between the CPU and the external 1st-level cache takes 3 clocks on a cache hit. But code does not consist entirely of 'load' and 'store' instructions, so the impact is limited. (Few memory accesses is the major premise of RISC processors, though it is sometimes broken:D.)
There are also some advantages to adopting an external 1st-level cache. We can make all the caches physically indexed and physically tagged, so the 1st-level data cache can be enlarged beyond 4Kbytes without synonym problems. Moreover, the bus between the 1st-level and 2nd-level caches can be a custom one, e.g. with a wider bus width or with simultaneous read and write. I adopted a 128-bit bus, and the peak data rate reaches 1.6GBytes/sec (@100MHz).
--- Quote Start --- Maybe you could use the old A31-trick (A31=1 -> cache bypassed). With that you could define non-cacheable regions using the MMU target address. --- Quote End ---
Yes, I used the A28 trick.
--- Quote Start --- But I don't think the problem with inter-CPU atomic instructions is solvable :(. --- Quote End ---
If it were unsolvable, Linux would never boot;).
Kazu
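The write-through scheme with flush signals described in this thread can be modelled in software to see why it keeps the caches coherent. The following is only a toy model under stated assumptions, not the actual hardware: the cache is direct-mapped with one word per line, a write updates the writer's own line only on a hit, and all names (`cpu_read`, `cpu_write`, `cpu_has_line`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS  2
#define NUM_LINES 8    /* toy direct-mapped cache, one word per line */
#define MEM_WORDS 64   /* shared memory, word addressed */

typedef struct { bool valid; uint32_t tag; uint32_t data; } line_t;

static uint32_t memory[MEM_WORDS];
static line_t   cache[NUM_CPUS][NUM_LINES];

static bool cpu_has_line(int cpu, uint32_t addr)
{
    const line_t *l = &cache[cpu][addr % NUM_LINES];
    return l->valid && l->tag == addr;
}

/* Read through the private cache, refilling from shared memory on a miss. */
static uint32_t cpu_read(int cpu, uint32_t addr)
{
    assert(cpu >= 0 && cpu < NUM_CPUS && addr < MEM_WORDS);
    line_t *l = &cache[cpu][addr % NUM_LINES];
    if (!cpu_has_line(cpu, addr)) {
        l->valid = true;
        l->tag   = addr;
        l->data  = memory[addr];
    }
    return l->data;
}

/* Write-through: memory is always updated, and every other CPU that
 * holds the line drops it, so its next read refills the new value. */
static void cpu_write(int cpu, uint32_t addr, uint32_t value)
{
    assert(cpu >= 0 && cpu < NUM_CPUS && addr < MEM_WORDS);
    memory[addr] = value;
    if (cpu_has_line(cpu, addr))
        cache[cpu][addr % NUM_LINES].data = value;  /* keep own copy current */
    for (int other = 0; other < NUM_CPUS; other++) {
        if (other != cpu && cpu_has_line(other, addr))
            cache[other][addr % NUM_LINES].valid = false;  /* flush signal */
    }
}
```

For example, after both CPUs have read the same word, a write by CPU0 leaves CPU1 without that line, and CPU1's next read returns the new value from memory.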
Hi,
I uploaded the example files to the 'Altera Wiki'. Please see http://www.alterawiki.com/wiki/nios2_smp_system. I will upload the sources and explanations later. Merry X'mas (and this new editor is stupid too.)
Kazu