Nios® V/II Embedded Design Suite (EDS)

Using custom instruction from uCLinux user app on nios2mmu

Altera_Forum
Honored Contributor II
1,804 Views

Hello, 

 

Is it possible to use a custom instruction from a user application running on NIOS2MMU - uCLinux? I'm trying to run the CRC design example on uCLinux. When I try to compile the software application I get "macros undefined" errors for macros which I thought would be part of the cross compiler's standard header files. Creating a BSP, which in turn generates the "system.h" file, fails because the BSP tool doesn't support NIOS2 with MMU.

 

Is there any way to achieve this?

 

Thanks, 

Chetan
16 Replies
Altera_Forum

Since a BSP is not used, you have to write the macros for the custom instructions yourself or use the builtins, which should be defined by the cross compiler. 

 

For example: 

/* Opcode for the byteswap custom instruction provided by Altera.
 * This may change if other custom instructions are present. */
#define ALT_CI_BYTESWAP_N 0x00
#define ALT_CI_BYTESWAP(x) __builtin_custom_ini(ALT_CI_BYTESWAP_N, (x))
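A slightly fuller sketch of the same idea, for a user application built without a BSP. The opcode index 0x00 and the names MY_CI_BYTESWAP_N and my_byteswap are assumptions for illustration; the real index depends on how the custom instruction was added to your system. The software fallback assumes the instruction performs a full 32-bit byte reversal, and is only there so the same source also builds with a host compiler:

```c
#include <stdint.h>

/* Hypothetical opcode index: check the value assigned in your own system. */
#define MY_CI_BYTESWAP_N 0x00

static inline uint32_t my_byteswap(uint32_t x)
{
#if defined(__nios2__)
    /* GCC Nios II builtin: returns int, takes one int operand. */
    return (uint32_t)__builtin_custom_ini(MY_CI_BYTESWAP_N, (int)x);
#else
    /* Software stand-in for non-Nios builds (assumed behaviour). */
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
#endif
}
```

The builtin suffix encodes the signature: in `__builtin_custom_ini` the first `i` is the int return value and `ni` means one int operand, so a two-operand instruction would use `__builtin_custom_inii` instead.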
Altera_Forum

Ykozlov, 

 

Thanks for the reply. The kernel compiled fine after manually defining the macros. 

 

Chetan
Altera_Forum

Do you really want to use the custom instructions in the kernel, not just in a userland application?

 

-Michael
Altera_Forum

Michael, 

 

I meant to say the user application compiled without any errors after defining the macros. I'm not using custom instruction in the kernel, but only in the user application.  

 

Thanks, 

Chetan
Altera_Forum

FWIW, if you are doing CRC16 (the usual one for hdlc comms) then a custom instruction for the following C can be used: 

static __inline__ uint32_t crc_step(uint32_t crc, uint32_t byte_val)
{
    uint32_t t = crc ^ (byte_val & 0xff);
    t = (t ^ t << 4) & 0xff;
    return crc >> 8 ^ t << 8 ^ t << 3 ^ t >> 4;
}

The 4 levels of xor easily execute in a single clock.

My notes suggest that the above C compiles to 11 instructions, and a lookup table version to 7 (with the table base in a global register).
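One way to sanity-check the step function before committing it to hardware is to compare it against a plain bit-serial CRC. The sketch below assumes the CRC in question is the reflected CCITT polynomial 0x8408 (the HDLC/X.25 FCS); crc_step_ref and crc16_buf are names made up here, and the initial value and final inversion are left to the caller:

```c
#include <stddef.h>
#include <stdint.h>

/* Same logic as the step function posted above. */
static inline uint32_t crc_step(uint32_t crc, uint32_t byte_val)
{
    uint32_t t = crc ^ (byte_val & 0xff);
    t = (t ^ t << 4) & 0xff;
    return crc >> 8 ^ t << 8 ^ t << 3 ^ t >> 4;
}

/* Bit-serial reference, LSB first, assumed polynomial 0x8408. */
static uint32_t crc_step_ref(uint32_t crc, uint32_t byte_val)
{
    crc ^= byte_val & 0xff;
    for (int i = 0; i < 8; i++)
        crc = (crc & 1u) ? (crc >> 1) ^ 0x8408u : crc >> 1;
    return crc;
}

/* Run a whole buffer through either step function. */
static uint32_t crc16_buf(uint32_t crc, const uint8_t *p, size_t n,
                          uint32_t (*step)(uint32_t, uint32_t))
{
    while (n--)
        crc = step(crc, *p++);
    return crc;
}
```

For HDLC-style framing the caller would typically start from 0xffff and invert the final value, for example `crc16_buf(0xffff, buf, len, crc_step) ^ 0xffff`.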
Altera_Forum

Why not do a real hardware CRC generator in a custom instruction?

 

I would not do this just with bytes but have the custom instruction take a 32-bit value and (optionally) do four bytes at a time. This could be done with 32 single shift steps or, in a more optimized way, with cascaded XORs. Calculation time should not be relevant, as it could run in the background, with the processor only stalling when the next value is inserted or the result is extracted while the logic is still busy.

 

So I would implement these custom instructions:

 

1) define polynomial  

2) reset / set start value 

3) insert value (one Register = value, one Register = bit count 1..32) (blocks when busy) 

4) get current result (blocks when busy) 

 

-Michael
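The four operations listed above can be prototyped in software before writing any HDL. This is only a behavioural sketch: the names (crc_engine, crc_set_poly and so on) are hypothetical, it assumes a reflected (LSB-first) CRC, and it has no notion of "busy", so nothing ever blocks:

```c
#include <stdint.h>

/* Software model of the proposed engine; all names are hypothetical. */
struct crc_engine {
    uint32_t poly;   /* reflected polynomial, e.g. 0x8408 */
    uint32_t state;  /* running CRC value */
};

/* 1) define polynomial */
static void crc_set_poly(struct crc_engine *e, uint32_t poly)
{
    e->poly = poly;
}

/* 2) reset / set start value */
static void crc_reset(struct crc_engine *e, uint32_t start)
{
    e->state = start;
}

/* 3) insert value: feed 'bits' (1..32) bits of 'value', LSB first.
 * This is the "32 single shift steps" variant. */
static void crc_insert(struct crc_engine *e, uint32_t value, unsigned bits)
{
    for (unsigned i = 0; i < bits; i++) {
        uint32_t feedback = (e->state ^ (value >> i)) & 1u;
        e->state >>= 1;
        if (feedback)
            e->state ^= e->poly;
    }
}

/* 4) get current result */
static uint32_t crc_result(const struct crc_engine *e)
{
    return e->state;
}
```

With this bit ordering, inserting one 32-bit word is equivalent to inserting its four bytes in little-endian order, which is what makes the "four bytes at a time" mode usable on a byte buffer.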
Altera_Forum

If you are going to do that, I'd require the software to do the waiting itself - then you can use a combinatorial custom instruction and avoid the register file delays on the read value.

You could also use the 'rB' field as a sub-opcode (set poly, new value etc). 

 

In my case I was doing 64 channels of hdlc from a TDM stream - so had individual bytes to process. 

 

If you have a buffer, then you want a dma based engine....
Altera_Forum

 

--- Quote Start ---  

If you have a buffer, then you want a dma based engine.... 

--- Quote End ---  

 

+1 

-Michael
Altera_Forum

To correct myself ... (if the whole thing works the way I suspect it does) 

You can't let a combinatorial custom instruction set state (it can return state), as the logic is 'executed' for every instruction fetched (there is no 'enable' line to the instruction). The custom opcode (etc) only affects the value selected and written back to the register file.

 

Every time I think about the custom instructions, I get more and more convinced that the 'rA' and 'rB' bits are ignored by the nios cpu core.
Altera_Forum

??? 

The complete custom instruction is "ignored" by the NIOS CPU core. Your hardware is simply supplied with the rA, rB and rC information: the three register numbers as well as the values of two of those registers. It is (optionally) supposed to output a value, which the NIOS implementation then writes into the third register.

 

-Michael
Altera_Forum

I meant to say that the 'ra' and 'rb' bits (readra and readrb) are ignored. 

 

Think of what happens during the 'Decode' pipeline phase: 

- opcode bits 31-27 read register file (M9K) port 'a'. 

- opcode bits 26-22 read register file port 'b' (dual ported reads) 

- D phase stall is detected (write pending to either register [1]). 

 

Now we have three 32bit values which are fed into all the ALU functions during the 'Execute' pipeline phase (including the combinatorial custom instructions), all will generate their result based on the 96 input bits. 

The opcode bits 5-0 (opcode) and bits 13-6 (custom code) act as a big mux on the results of all the instruction logic. The selected result and a 'write-back' flag (bit 14 for custom) are latched for writing to the register file on the next clock [2].

 

[1] Careful inspection of the opcode table shows that a stall on the A read is needed for everything except 'call' and 'jmpi' [3], and on the B read if the opcode bits 0 and 1 differ (bit 2 set would be less logic!). I really can't believe there is also a check dependent on the custom opcode value.

 

[2] A write at that point would miss the next instructions, so I suspect there is a two-entry fifo with a fast path into the decode phase of the next instructions.

(The write can be done in the same clock as two reads.) 

 

[3] Quite a few instructions will actually read register 0 - hopefully there isn't a write pending! I've not tried writing to R0!
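To make the bit positions discussed above concrete, here is a small encode/decode sketch for a custom instruction word. The A, B, custom code, write-back and opcode positions are the ones given in the post; the C field (bits 21-17), the readra/readrb flags (bits 16 and 15) and the 0x32 major opcode are assumptions based on my reading of the instruction set documentation, and the struct and function names are invented for this example:

```c
#include <stdint.h>

#define NIOS2_OP_CUSTOM 0x32u  /* assumed major opcode for 'custom' */

struct custom_insn {
    unsigned a, b, c;         /* register numbers, 5 bits each */
    unsigned readra, readrb;  /* operand-comes-from-register flags */
    unsigned writerc;         /* write-back flag, bit 14 */
    unsigned n;               /* custom code, bits 13-6 */
    unsigned op;              /* major opcode, bits 5-0 */
};

/* Split a 32-bit instruction word into its fields. */
static struct custom_insn custom_decode(uint32_t w)
{
    struct custom_insn i;
    i.a       = (w >> 27) & 0x1fu;
    i.b       = (w >> 22) & 0x1fu;
    i.c       = (w >> 17) & 0x1fu;
    i.readra  = (w >> 16) & 1u;
    i.readrb  = (w >> 15) & 1u;
    i.writerc = (w >> 14) & 1u;
    i.n       = (w >> 6) & 0xffu;
    i.op      = w & 0x3fu;
    return i;
}

/* Pack the fields back into an instruction word. */
static uint32_t custom_encode(struct custom_insn i)
{
    return ((uint32_t)(i.a & 0x1fu) << 27) |
           ((uint32_t)(i.b & 0x1fu) << 22) |
           ((uint32_t)(i.c & 0x1fu) << 17) |
           ((uint32_t)(i.readra & 1u) << 16) |
           ((uint32_t)(i.readrb & 1u) << 15) |
           ((uint32_t)(i.writerc & 1u) << 14) |
           ((uint32_t)(i.n & 0xffu) << 6) |
           (uint32_t)(i.op & 0x3fu);
}
```

Under this layout, when readrb is 0 the five B bits are free to carry a small constant, which is the "sub-opcode" idea mentioned earlier in the thread.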
Altera_Forum

Hi, 

 

Did you hack the Nios2 core? :D

 

 

--- Quote Start ---  

I meant to say that the 'ra' and 'rb' bits (readra and readrb) are ignored. 

 

Think of what happens during the 'Decode' pipeline phase: 

- opcode bits 31-27 read register file (M9K) port 'a'. 

- opcode bits 26-22 read register file port 'b' (dual ported reads) 

- D phase stall is detected (write pending to either register [1]). 

 

Now we have three 32bit values which are fed into all the ALU functions during the 'Execute' pipeline phase (including the combinatorial custom instructions), all will generate their result based on the 96 input bits. 

The opcode bits 5-0 (opcode) and bits 13-6 (custom code) act as a big 'mux on the result of all the instruction logic and a 'write-back' flag (bit 14 for custom) these are latched for writing to the register file next clock [2]. 

 

 

--- Quote End ---  

 

 

The Nios2/f CPU pipeline is composed of 6 stages:

 

Fetch -- Decode -- Execute -- Memory -- Align -- Write Back. 

 

Each instruction must get operands before it enters the 'Execute' stage. But the register files are made from 'Embedded Memories', so I think that the register files are always read from the 'Fetch' stage even for the instructions which do NOT need operand values, because the Embedded Memory needs 1 clock for its read (and write) access. 

 

Kazu
Altera_Forum

The 'Fetch' cycle reads the opcode word from memory; the register values must be read in the following cycle - the 'Decode' cycle - in order to be available during 'Execute'.

This read will be unconditional. The only question is the actual condition(s) for a 'D' phase stall (i.e. a re-execute of the same opcode word).
Altera_Forum

AFAIK, an instruction can use registers that are modified by the previous one (at least for calculations, maybe when used as an address in a load or store instruction this might cause a hazard and thus stall the pipeline).  

 

So there seems to be a shortcut for the register values and they don't need to be physically saved before being read by the next instruction.  

 

-Michael
Altera_Forum

Yes, I think the results of the combinatorial ALU block are written into a 2? entry fifo along with the register number as well as being written to the register file itself. 

The values from this fifo take precedence over the values read from the register file itself.

This makes the values from single cycle ALU instructions available in the following instruction. 

The results of load and potentially multi-cycle instructions are not fed into this fifo - so force pipeline stalls (it is possible that the results aren't ready early enough in the clock cycle to do this without significantly reducing fmax). 

Load and store instructions are always fully synchronous - they both wait for the Avalon bus transfer to complete. I'm sure the bus interface could trivially do a single async write (which would give an async fault on error). Async read is somewhat harder - a pipeline stall would be needed to do the delayed write to the register file.

Possibly they could have done non-delayed reads from tightly coupled data memory - after all the memory read of 'rA + imm16' can be scheduled unconditionally for all tightly coupled data blocks.
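The forwarding fifo guessed at earlier in this post can be modelled in a few lines. This is a toy model of that guess, not a description of the real core: the depth of 2 is the "2?" from the post, the register file write is modelled as happening when an entry leaves the fifo, and every name here is invented:

```c
#include <stdint.h>

#define FWD_DEPTH 2  /* the "2?" from the post; a guess */

struct fwd_entry {
    unsigned reg;   /* destination register number */
    uint32_t val;   /* ALU result */
    int valid;
};

struct cpu_model {
    uint32_t regfile[32];
    struct fwd_entry fwd[FWD_DEPTH];  /* fwd[0] is the newest result */
};

/* Record a single-cycle ALU result. The oldest fifo entry drops out
 * and is assumed to have reached the register file by then. */
static void alu_writeback(struct cpu_model *m, unsigned reg, uint32_t val)
{
    struct fwd_entry oldest = m->fwd[FWD_DEPTH - 1];
    if (oldest.valid)
        m->regfile[oldest.reg] = oldest.val;
    for (int i = FWD_DEPTH - 1; i > 0; i--)
        m->fwd[i] = m->fwd[i - 1];
    m->fwd[0].reg = reg;
    m->fwd[0].val = val;
    m->fwd[0].valid = 1;
}

/* Operand read in the decode phase: the fifo wins over the register file,
 * newest entry first. */
static uint32_t read_operand(const struct cpu_model *m, unsigned reg)
{
    for (int i = 0; i < FWD_DEPTH; i++)
        if (m->fwd[i].valid && m->fwd[i].reg == reg)
            return m->fwd[i].val;
    return m->regfile[reg];
}
```

In this model a load result would simply not be pushed into the fifo, so a dependent instruction would have to stall until the value is in the register file, which matches the behaviour described for loads and multi-cycle instructions.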
Altera_Forum

Hi, 

 

 

--- Quote Start ---  

 

This read will be unconditional, the only question is the actual condition(s) for a 'D' phase stall (ie a re-execute for the same opcode word). 

--- Quote End ---  

 

 

Of course, the 'D' phase stall is triggered in the case of a 'Data Hazard'.

 

 

--- Quote Start ---  

AFAIK, an instruction can use registers that are modified by the previous one (at least for calculations, maybe when used as an address in a load or store instruction this might cause a hazard and thus stall the pipeline).  

-Michael 

--- Quote End ---  

 

 

And of course, the Nios2/f core has a 'forwarding mechanism'. Maybe those paths come from the outputs of the 'Execute', 'Align' and 'Write Back' stages. I think the 'Memory' stage may not have one, for the sake of simplicity, because a load instruction needs at least 2 clocks when the core uses the data cache, and only the 'Align' stage can stall after the 'Execute' stage (this means that the 'Memory' stage is a dummy stage for simple instructions such as add or sub). Maybe the Nios2/f uses a 'Score Board' algorithm, with the latest value supplied from the forwarding paths when the target operand is still in one of these stages. So I think the 'D' phase stall is triggered when the next instruction needs the result of a memory read or the result of the 'Memory' stage.

 

Kazu