Nios2 SMP System

What is an SMP system

Once processor clock frequencies reach their limit, a multiprocessor system is the natural way to increase system performance. Symmetric multiprocessing (SMP) is a kind of multiprocessor (or multi-core) system in which two or more identical CPUs are connected to a single shared main memory and are controlled by a single operating system. Today many processors and operating systems support SMP and achieve high performance, especially for multi-task or multi-threaded applications. (For more details, refer to http://en.wikipedia.org/wiki/Symmetric_multiprocessing, http://www.ibm.com/developerworks/library/l-linux-smp/, etc.)

Figure (DSC01564.JPG): Boot messages of the 'Quad core Nios2' (4 Tuxes = 4 Nios2 CPUs are working).

Nios2 and SOPC Builder support multi-processor systems, but that alone does not make an SMP system. On this page, we discuss how to construct a Nios2 SMP system and how to port SMP Linux to it.

What is required for a Linux SMP system and how it works

At boot, Linux starts one CPU (called the boot processor), which performs the fundamental initialization. The boot processor then kicks the other CPUs (called application processors) by de-asserting their reset signals. The boot and application CPUs each have their own data regions, e.g. the 'runqueue', and their first tasks ('swapper') are executed independently. To link each data region to its processor, each CPU must have a unique ID number.

When a timer interrupt occurs, the CPUs enter kernel mode and run the scheduling algorithm to pick a runnable task. Each CPU executes the same kernel code separately, as if it were the only CPU in the system, but some kernel data is shared and must not be accessed simultaneously. Linux therefore uses spin locks, built on lock variables, to prevent CPUs from entering the same critical region at the same time. A spin lock works like a 'beach flag game' (i.e. competitors dash to a flag standing on the beach, and the one who grasps it and picks it up is the winner). From the viewpoint of memory bus accesses, the grasp corresponds to a 'memory read' and the picking up to a 'memory write'. To make the outcome unambiguous, nobody else may grasp the flag while someone is grasping it and trying to pick it up. In other words, grasping and picking up must be 'atomic'. There are several ways to achieve this; one is to provide a 'SWAP' instruction that exchanges a register with a memory location.

Each CPU also has its own 1st data cache for efficiency. When one CPU writes a new value to a shared variable, the other CPUs do not see it, because they read their own cached copies, even if the 1st data caches are 'write-through'. The Linux kernel does not solve this kind of cache coherency problem (it must be solved by the hardware). There are, however, other synchronization problems, e.g. MMU TLB synchronization. For these the kernel has the IPI (inter-processor interrupt) mechanism, with which each CPU can force other CPUs to execute a specified function. This is a kind of 'Message Box', and the hardware must support it. The unique CPU ID is also used to send a message to a particular CPU.
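The 'beach flag game' can be sketched in C as a minimal test-and-set lock built on an atomic swap. This is only an illustration under stated assumptions: the hardware SWAP is emulated with C11 atomic_exchange so that the code runs on an ordinary host, and all demo_* names are hypothetical and are not taken from the kernel sources of this project.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type: 0 = flag still on the beach, 1 = flag taken. */
typedef struct { atomic_int flag; } demo_spinlock_t;

/* Stand-in for a hardware SWAP (assumption: emulated with C11 atomics):
 * atomically store 'val' and return the previous content. */
int demo_swap(atomic_int *p, int val)
{
    return atomic_exchange(p, val);
}

/* Returns 1 if we picked up the flag, 0 if another CPU already holds it. */
int demo_spin_trylock(demo_spinlock_t *l)
{
    return demo_swap(&l->flag, 1) == 0;
}

void demo_spin_lock(demo_spinlock_t *l)
{
    while (!demo_spin_trylock(l))
        ;   /* spin until the current holder puts the flag back */
}

void demo_spin_unlock(demo_spinlock_t *l)
{
    assert(atomic_load(&l->flag) == 1);   /* only a held lock may be released */
    atomic_store(&l->flag, 0);
}
```

Because the read of the old value and the write of the new one happen in a single swap, two CPUs can never both see 0 for the same lock variable.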

What is required for a Nios2 SMP system

As discussed above, we need the following hardware features to implement an SMP Linux system with Nios2 CPUs.

1) Reset signal for each CPU.
2) Unique ID for each CPU.
3) Atomic memory access for spin locks.
4) Cache coherency for each CPU's (1st) data cache.
5) IPI hardware.

1) and 2) are easily implemented through SOPC Builder options. 5) is just a special interrupt mechanism carrying some message data. So the remaining difficult problems are 3) and 4). As mentioned above, Linux spinlocks are only barriers at the entrances of critical regions and do not imply a real 'bus lock' mechanism. Moreover, the spinlock implementation is hardware dependent, so we could build it on the shared hardware mutexes provided as a Nios2 peripheral; here, however, we implement a 'SWAP' instruction instead. Unfortunately, the original Nios2 data cache cannot be flushed from outside the CPU. Fortunately, there is an option to omit it, so we design a new 1st data cache and solve the coherency problem there. All of these features except 1) are combined into a kind of adapter (Dual_Core_Adapter or Quad_Core_Adapter) that can easily be used in SOPC Builder. The adapter has four main components: the 1st data cache, the 2nd cache, the common bus and the multi-controller.

Figure: Dual_Core_Adapter.JPG

1st Data Cache

The newly designed 1st data cache has the following properties:

Direct-mapped Write-through cache.
Physically indexed and tagged.
Cache bypass mechanism included (the bit-28 trick is used).
Cache flush mechanism from outside.
16-byte line size.

The write-through policy was chosen for simplicity. Because write transactions must be serialized between the 1st caches and the 2nd cache anyway, the same arbitration mechanism can also be used for cache synchronization. A write-through cache also makes the flush instructions simple. Note that the addresses supplied to the cache have already been translated to physical addresses by the MMU. Unfortunately, the cached/uncached distinction cannot be exported from the CPU, so we apply the bit-31 technique of the no-MMU Nios2 to bit 28 instead. The kernel memory region is therefore divided into four parts:

Cached 0xC0000000 ~ 0xCFFFFFFF

Uncached 0xD0000000 ~ 0xDFFFFFFF

Cached 0xE0000000 ~ 0xEFFFFFFF

Uncached 0xF0000000 ~ 0xFFFFFFFF

We also need special instructions to flush this cache from software. For this we use a custom instructions component, which is discussed later. When one CPU writes to its 1st cache, the subsequent write-through transaction wins bus arbitration and the write address is delivered to the other CPUs. If another 1st data cache holds a line with the same address, that line is simply flushed (invalidated). The updated data is refilled from the 2nd cache or main memory when the corresponding CPU next accesses the line.
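The cached/uncached split of the kernel region listed above can be expressed with a few address helpers. This is a sketch only: the demo_* names are hypothetical, and it assumes that the cached and uncached windows alias the same memory, as the bit-31 technique does on a no-MMU Nios2.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_KERNEL_BASE   UINT32_C(0xC0000000)   /* start of the kernel region */
#define DEMO_UNCACHED_BIT  (UINT32_C(1) << 28)    /* bit 28 selects the cache bypass */

/* Non-zero if an access to 'addr' bypasses the 1st data cache. */
int demo_is_uncached(uint32_t addr)
{
    return (addr & DEMO_UNCACHED_BIT) != 0;
}

/* Uncached alias of a kernel address (assumed to map the same memory). */
uint32_t demo_to_uncached(uint32_t addr)
{
    assert(addr >= DEMO_KERNEL_BASE);
    return addr | DEMO_UNCACHED_BIT;
}

/* Cached alias of a kernel address. */
uint32_t demo_to_cached(uint32_t addr)
{
    assert(addr >= DEMO_KERNEL_BASE);
    return addr & ~DEMO_UNCACHED_BIT;
}
```

For example, demo_to_uncached(0xC0001000) yields 0xD0001000, which lies in the first uncached window.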

2nd Cache

The 2nd cache has the following properties:

2-way set-associative write-back cache.
Atomic 128-bit read and write access bus with SWAP transaction.
Cache bypass mechanism included.
16-byte line size.

The 2nd cache and the 1st (instruction and data) caches are connected by a special 128-bit bus through the 'Common Bus' component. An atomic read-and-write access is performed when a write access to the 1st data cache misses. The new write data (<= 32 bits) is sent to the 2nd cache, and the line data (= 128 bits), updated with the newly written data, is returned within one transaction. A swap transaction is performed in the same way, except that the old line data from before the update is returned.
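The combined write-and-return transaction can be modelled in a few lines of C. This is a behavioural sketch, not the adapter's HDL: the demo_* names are hypothetical, only full 32-bit word writes are modelled (byte enables are left out), and the word inside the line is assumed to be selected by address bits [3:2].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LINE_WORDS 4   /* 16-byte line = 4 x 32-bit words */

typedef struct { uint32_t w[DEMO_LINE_WORDS]; } demo_line_t;

/* One write transaction to the 2nd cache.
 * 'line' is the 2nd cache's copy of the line; it is always updated.
 * swap == 0: the updated line is returned (normal write miss).
 * swap != 0: the line as it was before the update is returned (SWAP). */
demo_line_t demo_l2_write(demo_line_t *line, uint32_t addr,
                          uint32_t data, int swap)
{
    demo_line_t old;

    assert(line != NULL);
    old = *line;                                        /* read ...          */
    line->w[(addr >> 2) & (DEMO_LINE_WORDS - 1)] = data; /* ... then write   */
    return swap ? old : *line;
}
```

Since the read and the write belong to one bus transaction, no other CPU can access the line between them, which is what makes the swap usable for spin locks.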

Common Bus

The common bus arbitrates between the 1st caches and serializes the bus transactions to the 2nd cache. The arbiter always gives instruction fetches higher priority than data accesses, and the priority is rotated among the CPUs. When a 1st data cache performs a write-through access, this component sends the address and a 'flush' signal to the other CPUs' 1st data caches.
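The effect of the broadcast 'flush' signal on the other 1st data caches can be illustrated with a small software model of direct-mapped caches. This is a sketch under assumptions: the number of lines per cache (DEMO_LINES) is made up because the real cache size is not stated here, only tags and valid bits are tracked, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NCPUS      4
#define DEMO_LINES      64   /* hypothetical number of lines per 1st cache */
#define DEMO_LINE_SHIFT 4    /* 16-byte lines */

typedef struct {
    uint8_t  valid[DEMO_LINES];
    uint32_t tag[DEMO_LINES];
} demo_l1_t;

static demo_l1_t demo_l1[DEMO_NCPUS];

static unsigned demo_index(uint32_t addr) { return (addr >> DEMO_LINE_SHIFT) % DEMO_LINES; }
static uint32_t demo_tag(uint32_t addr)   { return (addr >> DEMO_LINE_SHIFT) / DEMO_LINES; }

/* Non-zero if 'cpu' currently holds the line containing 'addr'. */
int demo_l1_hit(unsigned cpu, uint32_t addr)
{
    unsigned i = demo_index(addr);
    assert(cpu < DEMO_NCPUS);
    return demo_l1[cpu].valid[i] && demo_l1[cpu].tag[i] == demo_tag(addr);
}

/* Line fill after a read miss. */
void demo_l1_fill(unsigned cpu, uint32_t addr)
{
    unsigned i = demo_index(addr);
    assert(cpu < DEMO_NCPUS);
    demo_l1[cpu].valid[i] = 1;
    demo_l1[cpu].tag[i] = demo_tag(addr);
}

/* Write-through by 'cpu': the common bus delivers the address and the
 * flush signal to every other 1st data cache, which drops a matching line. */
void demo_write_through(unsigned cpu, uint32_t addr)
{
    unsigned other;
    assert(cpu < DEMO_NCPUS);
    for (other = 0; other < DEMO_NCPUS; other++)
        if (other != cpu && demo_l1_hit(other, addr))
            demo_l1[other].valid[demo_index(addr)] = 0;
}
```

A line that merely shares the index but has a different tag is left alone, so only true copies of the written line are dropped.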

Multi-Controller

The Multi-Controller provides several functions:

Reset register.
CPU information register.
Interrupts distribution registers.
IPI registers.

The reset register delivers the reset signal to each CPU except the boot CPU. For example, writing 1 to bit 1 of this register de-asserts the reset signal of CPU1. The CPU information register shows the number of CPUs. The interrupt distribution registers select the destination of each external hardware interrupt. In the quad-core case, writing 0 to bit 1 of the lower interrupt distribution register and 1 to the same bit of the higher register delivers IRQ 1 to CPU2. (Note that interrupt distribution shares the interrupt load, but it is not essential for an SMP system.) The IPI registers are used to send IPI messages between CPUs. In the quad-core case there are four IPI registers; IPI0, IPI1, IPI2 and IPI3 generate IRQ30, IRQ31, IRQ28 and IRQ29, respectively. Writing 1 to bit 0 of a register asserts the interrupt signal. The other bits of the register can be used for IPI messages.

Register Mappings

Offset                       +0                      +4                        +8    +C
+00 (Multi-Controller Base)  Reset Register          CPU information Register
+10                          Interrupts Dest. Reg.0  Interrupts Dest. Reg.1
+20                          IPI0                    IPI1                      IPI2  IPI3
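The register map can be turned into a small set of C helpers. The sketch below runs against a simulated register file so that it can be tried on a host; on the real system the accesses would go to the adapter's memory-mapped base address, which is not shown here. All names are hypothetical. Two points are assumptions rather than documented behaviour: that the reset register can be read back for a read-modify-write, and that an IPI message is packed into the bits above bit 0.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets from the Multi-Controller base, taken from the register map. */
enum {
    MC_RESET     = 0x00,
    MC_CPU_INFO  = 0x04,
    MC_IRQ_DEST0 = 0x10,   /* bit n = low bit of the destination CPU of IRQ n  */
    MC_IRQ_DEST1 = 0x14,   /* bit n = high bit of the destination CPU of IRQ n */
    MC_IPI0      = 0x20    /* IPI1..IPI3 follow at +4 each */
};

/* Simulated register file (stand-in for the real memory-mapped registers). */
static uint32_t demo_mc[0x30 / 4];

uint32_t demo_mc_read(unsigned off)              { return demo_mc[off / 4]; }
void     demo_mc_write(unsigned off, uint32_t v) { demo_mc[off / 4] = v; }

/* De-assert the reset of an application CPU (CPU0 is the boot CPU). */
void demo_release_cpu(unsigned cpu)
{
    assert(cpu >= 1 && cpu < 4);
    demo_mc_write(MC_RESET, demo_mc_read(MC_RESET) | (UINT32_C(1) << cpu));
}

/* Route an external IRQ to one of up to four CPUs. */
void demo_route_irq(unsigned irq, unsigned cpu)
{
    uint32_t bit, lo, hi;
    assert(irq < 32 && cpu < 4);
    bit = UINT32_C(1) << irq;
    lo = demo_mc_read(MC_IRQ_DEST0) & ~bit;
    hi = demo_mc_read(MC_IRQ_DEST1) & ~bit;
    if (cpu & 1) lo |= bit;
    if (cpu & 2) hi |= bit;
    demo_mc_write(MC_IRQ_DEST0, lo);
    demo_mc_write(MC_IRQ_DEST1, hi);
}

/* IRQ number raised by each IPI register in the quad-core case. */
unsigned demo_ipi_irq(unsigned cpu)
{
    static const unsigned irq[4] = { 30, 31, 28, 29 };
    assert(cpu < 4);
    return irq[cpu];
}

/* Bit 0 asserts the interrupt; the message layout above it is an assumption. */
void demo_send_ipi(unsigned cpu, uint32_t msg)
{
    assert(cpu < 4);
    demo_mc_write(MC_IPI0 + 4 * cpu, (msg << 1) | 1u);
}
```

With these helpers, demo_route_irq(1, 2) reproduces the example from the text: bit 1 of the lower register is cleared and bit 1 of the higher register is set.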

Custom Instructions Component

The Custom Instructions Component receives custom instructions and controls the data cache. It contains one control register and one swap value register, and supports one swap instruction and four cache flush instructions.

Bits                 31 ... 4     3      2 1 0
Control Reg. c0                   swap   flush ctrl
Swap Value Reg. c1   Swap Value (all 32 bits)

flush ctrl bit0: If this bit is set to 1, the line is flushed only on a cache hit.

flush ctrl bit1: If this bit is set to 1, the line is flushed unconditionally.

flush ctrl bit2: If this bit is set to 1, a write-back is performed when the line is dirty.

swap: This bit is set to 1 automatically when data is written to the Swap Value Reg.; the swap is then performed between the Swap Value Reg. and the memory location addressed by the following memory instruction.

Swap Value Reg.: The 32-bit data to swap.

The swap address is specified by the next normal memory read or write instruction, and the Control Reg. bits are cleared automatically after it executes. So the 'SWAP' operation is performed by

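/* Swap i <-> *p and return the old value of *p. */
int swap(int i, int *p)
{
    int tmp;

    __asm__ __volatile__ (
        "custom 0, c1, %1, r0\n"   /* write i to Swap Value Reg. c1; sets the swap bit */
        "ldw %0, (%2)\n"           /* the next memory access performs the swap at p */
        : "=r" (tmp) : "r" (i), "r" (p) : "memory"
    );
    return tmp;                    /* old content of *p */
}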

Unfortunately, the instruction "custom 0, c1, %1, r0\n" and the following "ldw %0, (%2)\n" are NOT atomic as a pair, so this function must run with interrupts disabled. Moreover, custom instructions cannot be made 'supervisor-only', so we must clear the Control Reg. bits at the top of the exception service routines for tamper resistance. Because the 1st data cache is write-through, flush ctrl bit2 has no meaning for it. The flush ctrl bit settings correspond to the normal Nios2 data cache flush instructions as follows.

Instruction   ctrl bit2   ctrl bit1   ctrl bit0
initd         0           0           1
initda        0           1           0
flushd        1           0           1
flushda       1           1           0

(Note: the 'flushd' setting has no effect on the 2nd cache in the current implementation.)
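The meaning of the three flush ctrl bits can be summarized as a small decision function. This is one reading of the bit descriptions above, not the component's actual logic: it assumes that a write-back is only considered when the line is also being flushed, and the names (DEMO_*, demo_flush_action) are hypothetical.

```c
#include <assert.h>

/* Bit positions in Control Reg. c0. */
enum {
    DEMO_FLUSH_ON_HIT    = 1 << 0,   /* flush the line only on a cache hit  */
    DEMO_FLUSH_ALWAYS    = 1 << 1,   /* flush the line unconditionally      */
    DEMO_FLUSH_WRITEBACK = 1 << 2    /* write the line back if it is dirty  */
};

/* flush ctrl settings equivalent to the normal Nios2 instructions. */
enum {
    DEMO_INITD   = DEMO_FLUSH_ON_HIT,
    DEMO_INITDA  = DEMO_FLUSH_ALWAYS,
    DEMO_FLUSHD  = DEMO_FLUSH_WRITEBACK | DEMO_FLUSH_ON_HIT,
    DEMO_FLUSHDA = DEMO_FLUSH_WRITEBACK | DEMO_FLUSH_ALWAYS
};

/* Result flags of demo_flush_action(). */
enum { DEMO_ACT_NONE = 0, DEMO_ACT_INVALIDATE = 1, DEMO_ACT_WRITEBACK = 2 };

/* What a cache would do for the given ctrl bits and line state. */
int demo_flush_action(unsigned ctrl, int hit, int dirty)
{
    int act = DEMO_ACT_NONE;

    assert((ctrl & ~0x7u) == 0);   /* only the three flush ctrl bits */
    if ((ctrl & DEMO_FLUSH_ALWAYS) || ((ctrl & DEMO_FLUSH_ON_HIT) && hit))
        act |= DEMO_ACT_INVALIDATE;
    if ((act & DEMO_ACT_INVALIDATE) && (ctrl & DEMO_FLUSH_WRITEBACK) && dirty)
        act |= DEMO_ACT_WRITEBACK;
    return act;
}
```

For the write-through 1st data cache a line is never dirty, so the write-back flag never appears there, which matches the remark that bit2 has no meaning for it.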

Sample Source File

Hardware

The following archives contain sample hardware sources for the Nios2 SMP system. The pins are assigned for the Terasic VEEK/t-Pad or DE2-115 boards.

File:NiosII Dual Core.qar (Dual core Nios2 = 2 CPUs example.)

File:NiosII Quad Core.qar (Quad core Nios2 = 4 CPUs example.)

These sources were developed with Quartus II and SOPC Builder version 9.1sp2.

Because of a (stupid) restriction of SOPC Builder, we must use some cheating techniques and slightly edit the generated files to create a new system.

Open the file /IP/Dual_Core_Adapter/Dual_Core_Adapter.bdf or /IP/Quad_Core_Adapter/Quad_Core_Adapter.bdf in Quartus II, and generate the Verilog source file dual_core_adapter.v or quad_core_adapter.v by selecting 'File -> Create/Update -> Create HDL Design File for Current File'.
Start SOPC Builder, select the suitable core adapter, dual_core_adapter or quad_core_adapter (located under 'Bridges and Adapters'), and add it to your system.
Add the necessary number of CPUs (Nios2/f core) without the original data caches, and set their parameters properly.
Connect tightly coupled on-chip memories for the exception routines.
Set the core adapter's base addresses: 0x00000000 for the instruction slaves and 0x80000000 for the data slaves.
Set the core adapter's IRQ numbers for the IPI interrupts: 30 for cpu0, 31 for cpu1, etc.
Connect all remaining components and generate your system.
After generation, copy the contents of the file dual_core_adapter.cheat.v or quad_core_adapter.cheat.v to dual_core_adapter.v or quad_core_adapter.v, respectively.
Edit the wrapper file, e.g. dual_core_adapter_0.v or quad_core_adapter_0.v (the name depends on your module name selection), and widen the instruction bus from 24 to 28 bits (search for [23: 0] and rewrite it to [27: 0]).
Edit the top Verilog file. For details, refer to the diff files of these samples.
Make a top bdf file, place the symbol of the generated system and connect the exported signals as in the examples.
Assign the necessary pins and compile.

File:Pl nniosii system.dual.diff.tar.gz

File:Pl nniosii system.quad.diff.tar.gz

The files in these examples are provided in their modified form.

Figure: Sopc_builder.JPG

The dull and low-quality 'JTAG Debugger' soon fails with 'Verify failed', so the old 'GERMS' monitor is implemented for downloading software. You can download your new kernel with

nios2-terminal < ***.srec

using your JTAG download cable. The GERMS monitor is implemented in the file 'memory_monitor.hex'; if you want to bypass it, you must delete this file or change the reset vector of the boot CPU.

Software

The following archive contains the architecture-dependent part (nios2-linux/linux-2.6/arch/nios2) of Nios2 for Linux version 2.6.30.

File:Nios2.tar.gz

The files 'smp.c', 'spinlock.h', 'entry.S', 'head.S', etc. have been heavily revised. To enable the adapter, define

#define CONFIG_MULTICORE_ADAPTER

somewhere (e.g. via menuconfig). Of course, you must also define CONFIG_SMP and CONFIG_NR_CPUS (2 or 4).

Download of Compiled Samples

These files are samples for Terasic VEEK/t-Pad boards. Use the 'Application Selector' to load them.

Hardware:

File:Nios2 dual HW.bin.tar.gz (Dual core Nios2 = 2 CPUs example.)

File:Nios2 quad HW.bin.tar.gz (Quad core Nios2 = 4 CPUs example.)

Software:

File:Linux.initramfs.smp SW.bin.tar.gz (This kernel supports up to 4 CPUs.)

The hardware peripherals are the same as in www.alterawiki.com/wiki/Linux_with_MMU_on_VEEK/t-Pad. You can run the X Window System or other applications on it.

Please note that this sample is still a little buggy when you press 'CTRL+ALT+DEL' to reboot the system.

For the pin locations, please refer to www.alterawiki.com/wiki/File:Nniosii.pin.tar.gz. This hardware also outputs the same image on the VGA connector, so you may use a normal DE2-115 instead of a VEEK/t-Pad board.

Performance of this SMP System

Using external 1st data caches has some disadvantages; in particular, it slows down memory access. The 'dhrystone' benchmark shows that each core's performance drops to 70% of that of a normal Nios2/f core. On the other hand, responsiveness under multi-task scheduling is greatly improved, e.g. when running the X Window System. The following screenshot shows two 'mpeg2dec' programs running simultaneously and smoothly on 4 CPU cores.

Figure: Mpeg.png

CPU Shielding

CPU shielding is a technique on multiprocessor or multi-core systems for protecting high-priority (for example, real-time) tasks from low-priority (for example, non-real-time) tasks. Linux can shield CPUs by setting the affinity of both processes and interrupts.

CPU affinity

NOTE: Before trying the following, you need an SD card, mounted as described on the page

http://www.alterawiki.com/wiki/Linux_with_MMU_on_VEEK/t-Pad

One convenient way to set the CPU affinity or 'cpuset' is to use the 'Cpuset Management Utility' (cset).

https://rt.wiki.kernel.org/index.php/Cpuset_Management_Utility/tutorial

To set up a shield for CPU3, issue the following command.

cset shield -c 3

If you want to move all movable kernel threads into the unshielded system, try

cset shield -k on

To execute a new process in the shield, try

cset shield -e mp3play <your music file>

for example.

Interrupt affinity

You can set the interrupt affinity of a particular IRQ number by changing the value stored in the associated /proc/irq/IRQ_NUMBER/smp_affinity file. Note that the hexadecimal value stored in this file is a bit mask over all CPU cores in the system. To shield CPU0 from the interrupt, clear bit 0. For example, the following commands shield CPU0 from the timer interrupt (normally this interrupt is assigned to CPU0 during boot).

root@nios2:/# cd /proc/irq/0/

root@nios2:/proc/irq/0# cat smp_affinity

1

root@nios2:/proc/irq/0# echo e > smp_affinity

root@nios2:/proc/irq/0# cat smp_affinity

e
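The value 'e' written above is simply the mask of all four CPUs with bit 0 cleared. As a sketch, the same mask can be computed for any CPU to be shielded; the demo_* names are hypothetical, and the helper only computes and formats the value, it does not write to /proc.

```c
#include <assert.h>
#include <stdio.h>

/* Bit mask of all 'ncpus' CPUs except 'cpu' (one bit per CPU, CPU0 = bit 0). */
unsigned demo_affinity_excluding(unsigned ncpus, unsigned cpu)
{
    unsigned all;

    assert(ncpus >= 1 && ncpus <= 32 && cpu < ncpus);
    all = (ncpus == 32) ? 0xFFFFFFFFu : ((1u << ncpus) - 1u);
    return all & ~(1u << cpu);
}

/* Format a mask the way it is written to smp_affinity: hex without "0x". */
int demo_format_mask(unsigned mask, char *buf, size_t len)
{
    return snprintf(buf, len, "%x", mask);
}
```

For the quad-core system, demo_affinity_excluding(4, 0) gives 0xe, the value used in the transcript above.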

Known Bugs

Set the current date and time with

date <date time year>

to avoid hr_timer bugs; otherwise the touch panel cursor will occasionally freeze for about 25 to 40 seconds.