Nios® V/II Embedded Design Suite (EDS)

Help in CRC custom instructions please.

Altera_Forum

Hello,

I am a newbie in custom instruction development for Altera and was looking at how the CRC method is implemented in the example at altera.com/support/examples/nios2/exm-custom-instruction.html

The main C file, crc_main.c, calls the custom instruction through the function crcCI, which is defined in ci_crc.c. That function invokes the macro CRC_CI_MACRO with its arguments. The macro is defined in the same file as:

#define CRC_CI_MACRO(n, A) __builtin_custom_ini(ALT_CI_CRC_INST_N + (n & 0x7), (A))

So the macro maps onto a built-in GNU gcc custom function (if my understanding is correct).

My questions are as follows:

Where can I see the GNU built-in custom function (I mean, in which file)? I guess it should expand to something like a driver-level function.

And where is the macro ALT_CI_CRC_INST_N defined?

If all the custom GNU gcc functions are built in, then how is it really a custom instruction?

I am stuck at this point, so please give me some pointers. That would be a huge help.
Altera_Forum

The definition of __builtin_custom_ini() is in the gcc sources. There isn't anything 'special' about it; you could write it yourself. It expands to a single nios instruction.

I've a very similar one:

    __attribute__((always_inline)) static __inline__ unsigned int
    custom_inic(const unsigned int op, unsigned int a, const unsigned int b)
    {
        uint32_t result;

        /* custom N, rC, rA, cB: 'b' is placed in the B field as a constant */
        __asm__ ( "custom\t%1, %0, %2, c%3"
            : "=r" (result)
            : "n" (op), "r" (a), "n" (b));
        return result;
    }

for custom instructions where I use the 'regb' field as a sub-opcode (in particular for supporting all the byte and bit swaps in a single instruction).

The custom instruction number is allocated by sopc builder/qsys when you add the instruction; unfortunately you don't seem to be able to choose your own number [1].

The instruction number is then written to one of the generated .h files.

[1] This is a problem if you are trying to define multiple nios cpus with different (overlapping) sets of custom instructions, or want to be able to compile code without the sopc/qsys generated header file.
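
To make the macro expansion concrete, here is a minimal host-side sketch in C. It is not the Altera example code: the value given to ALT_CI_CRC_INST_N is made up (the real one comes from the generated header), and __builtin_custom_ini() is replaced by a hypothetical stub, fake_custom_ini(), that only records which instruction number was requested. That way it compiles with any C compiler rather than only with the Nios II gcc.

```c
/* Assumption: in a real project this value comes from the generated
 * header; 8 is made up for illustration. */
#define ALT_CI_CRC_INST_N 8

/* Hypothetical stand-in for __builtin_custom_ini(n, a). On a Nios II
 * target the builtin becomes one "custom" instruction; this stub only
 * records the selected instruction number and echoes the operand. */
static int last_ci_number = -1;

static int fake_custom_ini(int n, int a)
{
    last_ci_number = n;
    return a;
}

/* Same shape as the macro in ci_crc.c, with (n) parenthesised and the
 * builtin swapped for the stub. The mask keeps only the low three bits
 * of n, so the macro can address eight consecutive instruction numbers
 * starting at ALT_CI_CRC_INST_N. */
#define CRC_CI_MACRO(n, A) \
    fake_custom_ini(ALT_CI_CRC_INST_N + ((n) & 0x7), (A))
```

With these assumptions, CRC_CI_MACRO(3, x) selects instruction number 11, and nothing in the C code says "CRC"; what number 11 actually computes is decided entirely by the logic in the FPGA image.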
Altera_Forum

--- Quote Start ---
[previous reply quoted in full]
--- Quote End ---

Thanks a lot for your reply. So you mean that the __builtin_custom_ini() function is built in and we cannot modify it.

One question I have is: how does the application programmer know that the macro CRC_CI_MACRO implements the CRC functionality? From his point of view, 'ini' can be any function which takes an integer as input and returns another integer.

So does he look into the generated .h file, take the instruction number and use it for the macro ALT_CI_CRC_INST_N? Even if he does that, how does he know that the built-in custom 'ini' function implements the CRC?

Any comments along these lines are really appreciated.

Thank You,
Akhil
Altera_Forum

There is no point quoting the entire previous message!

__builtin_custom_ini() is just a way of getting a 'custom' opcode into your program's object code. There is nothing special about it, and it doesn't generate anything different by being 'builtin'. There is no requirement to use it to generate a 'custom' instruction; you can use a gcc asm statement of your own with exactly the same effect.

Knowing what a particular custom instruction actually does is something that you (as a system designer) need to know. If you build with the altera IDE scripts (etc) then the relevant constants are propagated through the system for you. Otherwise you have to make sure that the values match.

For instance you might request that the hardware engineer set custom instruction 0 to calculate a crc and instruction 1 to 'frob' (whatever that might be) two values. This is not really any different from defining the register layout for an avalon slave (etc).

FWIW, if you actually want to calculate CRC16 (used on most hdlc) there is a simple piece of combinatorial logic to add in a byte. There might be a similar reduction for CRC32 (ethernet).
Altera_Forum

Hello,

Since I am a beginner in Altera and FPGAs, I did not understand some of the points you made here. I quote what I did not understand below:

--- Quote Start ---
Knowing what a particular custom instruction actually does is something that you (as a system designer) need to know. If you build with the altera IDE scripts (etc) then the relevant constants are propagated through the system for you.
Otherwise you have to make sure that the values match.
--- Quote End ---

What do you mean by the relevant constants being propagated by the altera IDE build scripts?

My requirement here is to implement arbitrary precision integer logic on an Altera DE2 FPGA. So I was wondering whether, with the help of Nios II, I could add some custom instructions which operate over multiple clock cycles.

Example usage: an RSA algorithm implementation, where the key size can go up to 1024 bits. Since the register size is 32 bits, we can split the data into 32 chunks of 32 bits.

So, is it the hardware engineer who implements the custom instructions on a Nios II? If so, at which layer can I see those instructions (in C, or in hardware Verilog/VHDL)? Can I be the one who writes a custom instruction for the Nios II core?

Kindly excuse me if I am asking dumb questions; I am still a learner.

Thank You,
Akhil
Altera_Forum

I don't use the IDE... but the ALT_CI_CRC_INST_N constant is defined with the value qsys/sopc assigned to the custom instruction when it was added to the cpu definition.

So if you #include the correct header, the custom instruction opcode is generated with that constant, and you've downloaded the correct fpga image, it will work.

I'm not sure how you'll get any help with RSA though. The custom instructions can't access memory (not strictly true since their logic could contain an Avalon master! - but you can't interwork with the data cache). The only inputs (from the cpu) are the 32bit opcode word, and the values of two 32bit registers indexed by the 'A' and 'B' fields.

A combinatorial custom instruction could use the value of other vhdl registers (etc) but it can't have any side effects because it executes every clock (or rather the logic is fed the relevant inputs on every clock, and the output value is fed into a big mux and used if the opcode actually selects the relevant custom instruction).

Possibly you could use a clocked 'multi-cycle' custom instruction (one that completes in a single cycle) to latch values into some logic: use the 'C' field (with readrc unset) to determine where to write (you can write both the A and B register values).

With 'readrc' set you return a value; use the 6 bit 'A' and 'B' register numbers to determine what to read (or use a 32bit value and the 'B' reg number).

Using a separate combinatorial custom instruction for the reads would save the 2 clock 'late result' penalty on the result, making coding easier.
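
As a rough illustration of the fields mentioned above, the following C sketch packs and unpacks a 32-bit 'custom' opcode word. The bit positions are an assumption taken from my reading of the Nios II instruction set reference (A, B, C, then the readra/readrb/readrc flags, the 8-bit N, and the 0x32 opcode) and should be checked against the handbook; struct custom_fields, encode_custom() and decode_custom() are hypothetical names made up for this example.

```c
#include <stdint.h>

/* Assumed layout (verify against the Nios II reference handbook):
 *   [31:27] A   [26:22] B   [21:17] C
 *   [16] readra [15] readrb [14] readrc [13:6] N [5:0] OP = 0x32
 */
#define NIOS2_OP_CUSTOM 0x32u

struct custom_fields {
    unsigned a, b, c;                /* 5-bit register numbers            */
    unsigned readra, readrb, readrc; /* 1 = field names a CPU register    */
    unsigned n;                      /* 8-bit custom instruction number   */
};

/* Pack the fields into one opcode word, masking each to its width. */
static uint32_t encode_custom(struct custom_fields f)
{
    return ((uint32_t)(f.a & 0x1fu) << 27) |
           ((uint32_t)(f.b & 0x1fu) << 22) |
           ((uint32_t)(f.c & 0x1fu) << 17) |
           ((uint32_t)(f.readra & 1u) << 16) |
           ((uint32_t)(f.readrb & 1u) << 15) |
           ((uint32_t)(f.readrc & 1u) << 14) |
           ((uint32_t)(f.n & 0xffu) << 6) |
           NIOS2_OP_CUSTOM;
}

/* Inverse of encode_custom(); the OP field is not returned. */
static struct custom_fields decode_custom(uint32_t w)
{
    struct custom_fields f;
    f.a = (w >> 27) & 0x1fu;
    f.b = (w >> 22) & 0x1fu;
    f.c = (w >> 17) & 0x1fu;
    f.readra = (w >> 16) & 1u;
    f.readrb = (w >> 15) & 1u;
    f.readrc = (w >> 14) & 1u;
    f.n = (w >> 6) & 0xffu;
    return f;
}
```

Under this layout, a 5-bit register number together with its read flag is what gives the custom logic six bits per operand field to play with when the flag is left clear.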
Altera_Forum

Thanks a lot for the valuable input. I was looking into the Nios II custom instruction user guide, which is at the link:

altera.com/literature/ug/ug_nios2_custom_instruction.pdf

There I see the section on External Interface Custom Instructions (Figure 1–9, Custom Instructions Allow the Addition of an External Interface). The page states:

"Custom instruction logic can perform various tasks such as storing intermediate results or reading memory to control the custom instruction operation. The conduit interface also provides a dedicated path for data to flow into or out of the processor. For example, custom instruction logic with an external interface can feed data directly from the processor’s register file to an external first-in first-out (FIFO) memory buffer."

So if we are able to interface a memory buffer with the custom instruction logic, we will be able to store the intermediate values.

Say we would like to add two 1024-bit values. Can't we use the above concept to do a 32 + 32 bit addition 32 times, storing the intermediate results in a memory buffer (of course the carry bit has to propagate), and later stitch all the bits together to form the final result? It looks like the above External Interface could be implemented as a multi-cycle (32 clock cycles) custom instruction.

If my above assumption is correct, please read on.

Is it possible to come up with a custom instruction, like ADD1024 (which looks like the normal ADD), that implements the above functionality? If so, what is the method to create a custom instruction? From an upper level the application programmer can use the built-in macros. However, I guess the SOPC builder has to create the custom instruction at a lower level? If so, what is the process? Where can I start?

I would really appreciate a response.

- Akhil
Altera_Forum

I've not done it for a while, but the sopc builder has a 'wizard' that will create the template for a custom instruction. You then have to write the vhdl to do the actual operation.

I'd start with a simple combinatorial custom instruction. It is not a bad environment to learn how to write small bits of vhdl, since the inputs can be easily set and the outputs printed.

It is a shame that the only Altera docs/examples I found concentrated on ticking the GUI boxes to add the existing instructions (and mostly the FP ones).

It is certainly easy to write a 32bit 'add with carry' custom instruction (add r0 to r0 to clear the saved carry). Ideally you'd need to save/restore the carry bit during interrupt entry/exit, but for simple code avoiding using the instruction during interrupts would be enough.

Whether it is worth having an internal wide accumulator (1024 bits in your case) depends on whether you can reuse the previous result, and that depends on exactly what you are trying to achieve.

I've only written combinatorial custom instructions (all I needed).
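
The 'add with carry' idea can be tried out in software first. Below is a minimal C model, assuming the instruction returns a + b + saved carry and latches the new carry for the next call; ci_addc(), bignum_add() and saved_carry are hypothetical names for this sketch, not anything generated by the tools. On real hardware ci_addc() would be one custom instruction and saved_carry a flip-flop inside its logic.

```c
#include <stddef.h>
#include <stdint.h>

/* Models the single bit of state held inside the custom logic. */
static uint32_t saved_carry = 0;

/* One execution of the hypothetical instruction: returns the low 32 bits
 * of a + b + saved carry and latches the carry out for the next call. */
static uint32_t ci_addc(uint32_t a, uint32_t b)
{
    uint64_t sum = (uint64_t)a + b + saved_carry;
    saved_carry = (uint32_t)(sum >> 32);
    return (uint32_t)sum;
}

/* Adds two multi-word integers stored least significant word first.
 * Returns the carry out of the most significant word. */
static uint32_t bignum_add(uint32_t *r, const uint32_t *x,
                           const uint32_t *y, size_t nwords)
{
    size_t i;

    /* Adding 0 to 0 cannot overflow, so this clears any stale carry. */
    (void)ci_addc(0, 0);

    for (i = 0; i < nwords; i++)
        r[i] = ci_addc(x[i], y[i]);
    return saved_carry;
}
```

For a 128 bit add, nwords is 4; for 1024 bits it is 32. The model also shows why interrupts matter: anything that executes ci_addc() in the middle of the loop would corrupt the saved carry.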
Altera_Forum

Hi dsl,

I really appreciate the inputs. Now I understand the SOPC build process, and I have learned that custom instructions can only be implemented in an HDL (in my project I will use Verilog; my professor demands that). I have decided to go ahead and try playing around with these concepts. Initially I thought custom instructions could be implemented in a high-level language like C or C++. However, now I understand that all those macros we discussed above are just an interface for the application programmer (who can program in C), and their usage is entirely decided by the SOPC builder.

Now what I plan to do is implement arbitrary precision custom instructions where the user can define the number of blocks. Say the user asks for 32 blocks; the ALU is then supposed to operate on 32 blocks of 32-bit data.

An ADD example: ADDCUSTOM with block size 4. This input asks the custom logic to add two operands of 128 bits, where 32 bits are added at a time. As usual I came up with a couple of questions here as well:

1) Inside the custom instruction, is it possible to call a normal Nios II instruction, like ADD (which adds two operands, rC ← rA + rB), together with the carry detection instruction (CMPLTU)? After a block operation we should be able to store the carry bit for the ADD of the next block.

2) Should I come up with a separate ALU for this custom logic, or is it possible to make the Nios II ALU do this, since the normal ADD instruction is executed in the Nios II ALU itself?

3) The third question is about how to store the carry bit and the intermediate results. Is it possible to have a custom ALU with an ACCUMULATOR to hold these values?

Thank You,
Akhil
Altera_Forum

The custom instruction has to contain all of its own logic.

A multicycle custom instruction has clock and enable inputs, so it can save values between cycles (but an OS context switch would have to save the internal state, or you have to disable context switches over instruction sequences that need the saved state).

It is worth realising that the nios cpu ALU works (effectively) by feeding the rA and rB register values into separate combinatorial logic for every opcode, and then uses a great big mux to select the required result.

A simple combinational custom instruction would be one that replicates one of the standard ALU instructions (or even simpler, just returns one of the input registers). For multicycle, something that returns one of the inputs from the previous time the instruction was executed.

But I don't remember seeing such examples!
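
In the absence of such examples, here is what the two starter instructions would compute, written as a plain C model rather than HDL. The names ci_pass_a(), ci_prev_a() and prev_dataa are made up for this sketch; the static variable stands in for a clocked register inside the custom logic, assumed to reset to 0.

```c
#include <stdint.h>

/* Combinational case: the result depends only on the current operands,
 * so there is no state and no clock involved. */
static uint32_t ci_pass_a(uint32_t dataa, uint32_t datab)
{
    (void)datab;            /* second operand deliberately ignored */
    return dataa;
}

/* Models a register inside the multicycle instruction (reset value 0). */
static uint32_t prev_dataa = 0;

/* Multicycle case: returns the operand seen on the previous execution,
 * then stores the current one for next time. */
static uint32_t ci_prev_a(uint32_t dataa, uint32_t datab)
{
    uint32_t result = prev_dataa;

    (void)datab;
    prev_dataa = dataa;
    return result;
}
```

Calling ci_prev_a() with 1, 2, 3 in turn returns 0, 1, 2, which is an easy pattern to check from a test program once the real instruction is in the FPGA.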

 

I didn't have any difficulty with the combinatorial ones, and I hadn't written any vhdl before (I have soldered TTL chips together).

This is the vhdl of my crc16 instruction:

    -- crc16.vhd
    -- Parallel CRC generator for CRC16 (most hdlc).
    -- Nios2 Custom instruction
    -- Implements following C:
    --   t1 = crc ^ data;
    --   t2 = (t1 ^ t1 << 4) & 0xff;
    --   return crc >> 8 ^ t2 << 8 ^ t2 << 3 ^ t2 >> 4;

    library IEEE;
    use IEEE.std_logic_1164.all;
    use IEEE.numeric_std.all;

    entity crc16 is
        port (
            data    : in  std_logic_vector(31 downto 0); -- 8 bit character value
            crc_in  : in  std_logic_vector(31 downto 0); -- Old 16bit crc
            crc_out : out std_logic_vector(31 downto 0)  -- Updated crc value
        );
    end entity crc16;

    architecture rtl of crc16 is
        signal t1, t2: std_logic_vector(7 downto 0);
    begin
        t1 <= crc_in(7 downto 0) xor data(7 downto 0);
        t2 <= t1 xor t1(3 downto 0) & B"0000";
        -- The four 16 bit terms are xored first, then zero extended to 32 bits.
        crc_out <= X"0000" & (
            (X"00" & crc_in(15 downto 8)) xor (t2 & X"00")
            xor (B"00000" & t2 & B"000") xor (X"000" & t2(7 downto 4)));
    end architecture rtl;

The relevant part of crc16_hw.tcl is:

    set_interface_property crc_16 clockCycle 0
    set_interface_property crc_16 operands 2
    set_interface_property crc_16 ENABLED true
    add_interface_port crc_16 data dataa Input 32
    add_interface_port crc_16 crc_in datab Input 32
    add_interface_port crc_16 crc_out result Output 32
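
For checking the hardware against software, the C expression in the VHDL header comment can be written out as a function and compared with a bit-at-a-time CRC. This is only a sketch: crc16_update() and crc16_bitwise() are names invented here, and the bitwise reference assumes the usual reflected HDLC polynomial 0x8408, which the thread does not state explicitly.

```c
#include <stdint.h>

/* Byte-at-a-time update, transcribed from the comment in crc16.vhd. */
static uint16_t crc16_update(uint16_t crc, uint8_t data)
{
    uint8_t t1 = (uint8_t)(crc ^ data);
    uint8_t t2 = (uint8_t)(t1 ^ (t1 << 4));

    return (uint16_t)((crc >> 8) ^ (t2 << 8) ^ (t2 << 3) ^ (t2 >> 4));
}

/* Bit-at-a-time reference. Assumption: reflected polynomial 0x8408
 * (x^16 + x^12 + x^5 + 1), least significant bit processed first. */
static uint16_t crc16_bitwise(uint16_t crc, uint8_t data)
{
    int i;

    crc ^= data;
    for (i = 0; i < 8; i++) {
        if (crc & 1u)
            crc = (uint16_t)((crc >> 1) ^ 0x8408u);
        else
            crc = (uint16_t)(crc >> 1);
    }
    return crc;
}
```

Running a buffer through crc16_update() one byte at a time, and through the custom instruction with the same starting value, should give identical results if the hardware matches the comment.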
Altera_Forum

Hello dsl,

I was wondering which would be a good development environment for the Altera DE1 board? (I have the DE1 board with me, on which I plan to bring up the Nios II processor and experiment with custom instructions.)

I mean, which versions of Quartus, the Nios II EDS etc. will suit the board, or can I use the latest versions?

Thank You,
Akhil
Altera_Forum

It probably doesn't really matter at all. Just start from a working version of one of the simple Altera configs.

Unfortunately none of those seem to pass the timing constraints, so you can't (easily) tell if your logic is too slow! I remember having to drop the clock from 100MHz to 50MHz to get rid of some very strange errors!

Although I wrote the custom instruction vhdl (the only vhdl I've actually written) and tested it on one of the cyclone III boards, it got built into our main image, which runs on an arria II (I think), by the hw team.

Writing custom instructions is probably a good way for reasonably experienced software engineers to get to understand VHDL. If there were better examples it would help!
Altera_Forum

Hello,

I have experimented with the altera CRC example and was successful in bringing up the custom instruction on the DE1 board.

Now I have questions on how I should implement the custom instruction for arbitrary precision integer addition (say 128 bit). What I plan is to do a 32 bit + 32 bit addition in one clock cycle (since the dataa and datab signals are 32 bit, I don't intend to change the base unit size; let it be 32 bits) and do the same for 4 clock cycles, so that 32 * 4 = 128 bits in total. Hence I can try to implement a multicycle custom instruction which runs for four clock cycles and then returns the 'result' signal.

Here, I would have to save the 'carry out' bit and the 'result' signals for each and every stage. I think since 'result' is an output port, that data can be saved somehow. However, the 'carry out' bit has to propagate to the next stage of the adder as the 'carry in' bit. I was wondering how to do this?

Do I have to use External Interface Custom Instructions? I read that multicycle custom instructions allow the addition of an external interface (Figure 1–9 in the Nios II custom instruction user guide). A sentence which caught my attention was: "Custom instruction logic can perform various tasks such as storing intermediate results or reading memory to control the custom instruction operation."

Please give me some pointers here; that would be a great help.

Thank You,
Akhil
Altera_Forum

You could generate logic that has a single accumulator, with separate instructions to add/subtract (etc) the input A:B (or B:A !) value from it. Use the rC field (with writerc unset) to decide what to do.

These instructions could return 'done' immediately, and carry on any processing in the subsequent clocks. You might want to add a 'stall' if the previous instruction hasn't finished.

To read the result use the A or B field (with readra/readrb unset) to select which 32bit result to return, and set writerc so that the value is actually written to the register file.
Altera_Forum

Hello dsl,

Can you please explain a little bit more? I am still trying to understand the ways in which it can be implemented. I believe you intend me to use the 'Internal Register File Custom Instructions' rather than the 'External Interface Custom Instructions'.

And by pulling the 'writerc' signal low, we can save the SUM output value to an internal register which can be addressed by c[4:0]. Please note that my full adder module has a 'Carry Out' bit as well, which has to be propagated from one adder module to the next. I was wondering how to store this, since there is only one output port on a custom instruction ('result') and that port is already used for the SUM.

The design I planned to use was something like this: if I need a 128 bit adder, then I will cascade four 32 bit full adders. Please see the attached Full Adder Design text file.

I have a lot of questions here. Is it possible to cascade the four custom instruction blocks? Or is a better approach to have a counter inside the custom design and run the instruction for four clock cycles, reading and updating the cout bit in each clock? In this case I guess I will have space inside my accumulator where I can store the intermediate carry out and result values, and I would raise the 'done' signal after counting four clock cycles and doing the operations.

Please advise me here; that would be a huge help.

Thank You,
Akhil
Altera_Forum

If your instruction has writerc low, then the cpu fabric doesn't do anything with any value on the 'data out' lines at the end of the instruction. What the c[4:0] bits are then used for is entirely up to the implementation of your custom instruction. You could choose to use them to index some local register file; OTOH you could use them for anything else you want to, maybe as an internal opcode.

I'd consider using the c[4:0] bits (with writerc low) to determine what to do with the A and B values, and the b[4:0] bits (with writerc high, readrb might as well be low) to determine which value to return. Or some similar scheme.

You might want to use a second combinatorial custom instruction for the reads; that would avoid the 'late result' penalty.
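
One possible shape for such a scheme, written as a C behavioural model rather than HDL. Everything here is an assumption for illustration: the sub-opcode values, the 128 bit accumulator size, the choice of A as the high word of the A:B pair, and the names ci_acc_write() and ci_acc_read(). The real thing would be Verilog/VHDL with the same behaviour.

```c
#include <stdint.h>

#define ACC_WORDS 4   /* 4 x 32 = 128 bits, illustrative size only */

/* Hypothetical sub-opcodes carried in c[4:0] when writerc is low. */
enum acc_op { ACC_CLEAR = 0, ACC_ADD64 = 1 };

/* Models the accumulator register inside the custom logic. */
static uint32_t acc[ACC_WORDS];

/* 'Write' form (writerc low): nothing goes back to the register file.
 * A:B is treated as one 64 bit value with A as the high word. */
static void ci_acc_write(unsigned c_field, uint32_t a, uint32_t b)
{
    uint64_t carry;
    int i;

    switch (c_field) {
    case ACC_CLEAR:
        for (i = 0; i < ACC_WORDS; i++)
            acc[i] = 0;
        break;
    case ACC_ADD64:
        carry = (uint64_t)acc[0] + b;
        acc[0] = (uint32_t)carry;
        carry >>= 32;
        carry += (uint64_t)acc[1] + a;
        acc[1] = (uint32_t)carry;
        carry >>= 32;
        for (i = 2; i < ACC_WORDS; i++) {   /* ripple the carry upwards */
            carry += acc[i];
            acc[i] = (uint32_t)carry;
            carry >>= 32;
        }
        break;
    default:
        break;                              /* unknown sub-opcode: no-op */
    }
}

/* 'Read' form (writerc high, readrb low): b[4:0] selects the word. */
static uint32_t ci_acc_read(unsigned b_field)
{
    return acc[b_field % ACC_WORDS];
}
```

The carry never leaves the logic in this model, which sidesteps the question of how to get a carry bit out through the single 'result' port.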
Altera_Forum

Hello dsl,

Thank you for the reply! I think I understand what you are saying here. I believe I have to come up with some sort of accumulator design to store the intermediate values between the clocks.

I have a couple of questions here as well.

1) How do I give the input port c[4:0] a value when I call the corresponding custom instruction macro from the application C code in the Nios II IDE?

I designed a small Verilog module with c[4:0] and writerc ports in it. However, after a build in Nios II I looked into the 'system.h' file and see my custom instruction macro as:

#define ALT_CI_CUSTOM_COMPONENT_ADD_INST(n,A,B) __builtin_custom_inii(ALT_CI_CUSTOM_COMPONENT_ADD_INST_N+(n&ALT_CI_CUSTOM_COMPONENT_ADD_INST_N_MASK),(A),(B))

I was wondering how the processor, acting as the master, gives signals like clk, clk_en, reset, start, writerc, c[4:0] etc. to the Verilog modules. Is there some way in which I can set those values while running the code from Nios II, or even from the Quartus SOPC builder? The custom instruction manual does not give any clue regarding these.

2) Is there a pin trace (like the gtkwave utility) in Nios II with which I can see the port values while running the code from the Nios II IDE? That would really help me understand the signals.

I hope the questions are clear.

Thanks in advance for your response.

Thank You,
Akhil Kalathungal
Altera_Forum

Hello,

I think I have to stop looking into the custom instruction implementation for the arbitrary precision logic, since custom instructions are not really flexible enough. In my case, if the custom instruction lasts for four clock cycles, I need to sample new values for dataa and datab every clock cycle.

However, the definition of a multicycle custom instruction requires the operands dataa and datab to remain constant for as many clock cycles as the custom instruction runs. So in this case, I might have to wait four clock cycles to get the result of the first 32 bit + 32 bit addition, which defeats the purpose of accelerating the operation.

Another approach is to make the custom instruction hardware modules combinatorial, writing the sum and the carry to memory every time and reading the earlier carry back from memory for the next 32 bit addition. This has its own bottleneck: the Verilog hardware modules have to talk to the memory on each and every clock cycle.

So I guess I have to look into some other mode of hardware acceleration. I think I shall try to implement some hardware accelerators (IP cores) from the SOPC builder.

Please correct me if I am going in the wrong direction here. Any inputs are appreciated.

Thank you,
Akhil
Altera_Forum

Hello,

Can someone please reply in this thread? I would really appreciate some ideas. Since I am a beginner (dumb :() in this area, any comments would be really helpful.

Thanks.
Altera_Forum

I agree with your assessment: using custom instructions can be very restrictive. You will have a lot more flexibility if you design your own SOPC component with an Avalon Memory Mapped slave interface. You can have as many registers as you want and use as many cycles as you want to perform your operation.

If you have big transfers between your core and the main memory you could also implement a Memory Mapped master interface that can directly read/write the memory, but it is a bit more tricky to implement. I think the SOPC user manual has an example with a checksum hardware core that does exactly this.
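
From the software side, a memory mapped slave is just a block of registers at some base address. The sketch below shows the usual write operands / start / poll / read pattern in C. The register map, the bit definitions and all names are hypothetical, and fake_adder_hw() is a stand-in for the hardware so the example can run on a PC; on a real system the base address would come from the generated header and the accesses would presumably go through the HAL's IORD/IOWR macros so that they bypass the data cache.

```c
#include <stdint.h>

/* Hypothetical register map (word offsets) of an adder peripheral. */
enum {
    REG_OPERAND_A = 0,
    REG_OPERAND_B = 1,
    REG_CONTROL   = 2,
    REG_STATUS    = 3,
    REG_RESULT    = 4,
    REG_COUNT     = 5
};
#define CTRL_START  0x1u
#define STATUS_DONE 0x1u

static void reg_write(volatile uint32_t *base, unsigned reg, uint32_t v)
{
    base[reg] = v;
}

static uint32_t reg_read(volatile uint32_t *base, unsigned reg)
{
    return base[reg];
}

/* Stand-in for the peripheral: when START is set it computes the sum,
 * raises DONE and clears START. Real hardware does this on its own. */
static void fake_adder_hw(volatile uint32_t *base)
{
    if (base[REG_CONTROL] & CTRL_START) {
        base[REG_RESULT] = base[REG_OPERAND_A] + base[REG_OPERAND_B];
        base[REG_STATUS] = STATUS_DONE;
        base[REG_CONTROL] = 0;
    }
}

/* Driver-style helper: load operands, start, poll for DONE, read back. */
static uint32_t adder_add(volatile uint32_t *base, uint32_t a, uint32_t b)
{
    reg_write(base, REG_OPERAND_A, a);
    reg_write(base, REG_OPERAND_B, b);
    reg_write(base, REG_STATUS, 0);
    reg_write(base, REG_CONTROL, CTRL_START);
    while (!(reg_read(base, REG_STATUS) & STATUS_DONE))
        fake_adder_hw(base);    /* only needed in this host-side model */
    return reg_read(base, REG_RESULT);
}
```

On a PC the 'peripheral' can simply be an array, e.g. uint32_t regs[REG_COUNT] = {0}; followed by adder_add(regs, 2, 3). A wider adder would add more operand and result registers to the map, which is where this approach is more flexible than the two 32 bit inputs of a custom instruction.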
Altera_Forum

Hello Daixiwen,

Thank you for the reply. I think I have seen that checksum hardware accelerator example in the SOPC user manual. The issue is that it is at too high a level for a beginner to start with.

It does not explain how to create a sw.tcl file and the *.c and *.h driver files (I think all those files have to be hand-coded). Also, the build process it explains is a Nios II SBT build, and it is specific to a Cyclone IV device.

I use the DE1 board here, which has a Cyclone II device.

So is there any simple example of an SOPC builder component which I can refer to? That would be a huge help.

Thank You,
Akhil