Re: New IP development

Altera_Forum · ‎04-09-2013

Hello,

I am interested in developing some instructions that can do bignum operations (like adding two 1024-bit operands), and hence I plan to implement a simple full adder like this: {carry,sum} = OP_A + OP_B;

I am a beginner in Altera and realized that I can develop an IP component with Avalon MM slave interface which can talk with the NIOS II processor.

I was wondering how to give the bignum values as the operand from the NIOS II processor (master) to the IP component (slave) from the application code?

I see only these macros in the generated 'io.h' file:

#define IOWR_32DIRECT(BASE, OFFSET, DATA) \
    io_write((BASE_APB_ADDR) + __AVL_TO_APB((alt_u32)((BASE) + (OFFSET))), DATA, (BASE) + (OFFSET))

#define IORD_32DIRECT(BASE, OFFSET) \
    io_read((BASE_APB_ADDR) + __AVL_TO_APB((alt_u32)((BASE) + (OFFSET))), (BASE) + (OFFSET))

I guess these are the 32-bit write and read macros. So, can I clock in the 1024 bits with a single instruction?

Or do I have to wait for 32 clock cycles (sampling 32 bits per clock cycle, for 32 clocks)? Please say no to this!

Hope that question is clear and someone can respond. Really appreciate it.

Thank You,

Akhil
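
For reference, the {carry,sum} = OP_A + OP_B operation from the post above can be modelled in software as a chain of 32-bit additions with carry propagation. This is only a minimal sketch for checking a hardware adder against; the names bignum_t, BIGNUM_WORDS and bignum_add are assumptions made for illustration and do not come from any Altera library.

```c
#include <stdint.h>

#define BIGNUM_WORDS 32  /* 32 x 32-bit words = 1024 bits (assumed layout) */

/* Hypothetical operand type: w[0] is the least significant word. */
typedef struct {
    uint32_t w[BIGNUM_WORDS];
} bignum_t;

/* Computes sum = a + b over 1024 bits and returns the carry out (0 or 1).
 * 'sum' may alias 'a' or 'b' because each word is read before it is written. */
uint32_t bignum_add(bignum_t *sum, const bignum_t *a, const bignum_t *b)
{
    uint64_t carry = 0;
    for (int i = 0; i < BIGNUM_WORDS; i++) {
        uint64_t t = (uint64_t)a->w[i] + b->w[i] + carry;
        sum->w[i] = (uint32_t)t;  /* keep the low 32 bits */
        carry = t >> 32;          /* carry into the next word: 0 or 1 */
    }
    return (uint32_t)carry;
}
```

A hardware adder can do the same work in one wide addition; the loop only shows what result the 1024-bit {carry,sum} pair should hold.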

Altera_Forum · ‎04-09-2013

Hello,

Can someone be of assistance to the above question please? That would be a big help.

If the question is not clear I can explain it again.

Any inputs are appreciated.

Thank You,

Akhil

Altera_Forum · ‎04-10-2013

Hello,

A gentle bump here. Could someone please help me?

Thank You,

Akhil

Altera_Forum · ‎04-10-2013

The Nios II CPU uses a 32 bit data bus, and a 32-bit data path internally, so it won't be able to process a 1024-bit word with a single instruction. You will have to divide your 1024-bit word in 32*32-bit words when accessing your IP component.
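
A minimal sketch of what "32 words of 32 bits" could look like from the application side, assuming the IP component exposes each operand as a window of 32 consecutive 32-bit registers (that register layout and the function names are assumptions, not something defined by Avalon or the HAL). On a real Nios II system each access would normally go through the IOWR_32DIRECT / IORD_32DIRECT macros quoted earlier; plain volatile pointers are used here so the sketch also compiles on a PC.

```c
#include <stdint.h>

#define BIGNUM_WORDS 32  /* 1024 bits split into 32-bit words */

/* Writes a 1024-bit operand, least significant word first, into a window of
 * BIGNUM_WORDS consecutive 32-bit registers starting at 'regs'. */
void bignum_write_operand(volatile uint32_t *regs,
                          const uint32_t value[BIGNUM_WORDS])
{
    for (int i = 0; i < BIGNUM_WORDS; i++) {
        regs[i] = value[i];  /* one 32-bit bus write per word */
    }
}

/* Reads a 1024-bit result back from the same kind of register window. */
void bignum_read_result(const volatile uint32_t *regs,
                        uint32_t value[BIGNUM_WORDS])
{
    for (int i = 0; i < BIGNUM_WORDS; i++) {
        value[i] = regs[i];  /* one 32-bit bus read per word */
    }
}
```

Each loop iteration is a separate bus transfer, so a single operand costs at least 32 transfers in each direction.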

Altera_Forum · ‎04-10-2013

Hello Daixiwen,

In the Altera user manual for the Avalon interface (SOPC Builder, not Qsys), I have seen the MM interface section. There the 'readdata' and 'writedata' ports can have a data width of up to 1024 bits. So does that mean the NIOS II processor (master) can write or read 1024-bit data to the slave component in 32 clock cycles, one 32-bit word every clock cycle? Earlier I thought that since the data width is 1024, it might be possible to clock in/out that much data every clock! (I am dumb :( ). And does this also mean that I can have an accumulator (kind of register) in my hardware IP module which keeps sampling the write data for the entire 32 clock cycles? And for read data, one which reads the 1024-bit read register, 32 bits every clock cycle, for 32 clock cycles?

I sense a low throughput for performance here (but, with a proper functionality). Please correct me if my assertion is wrong.

Thank You,

Akhil

Altera_Forum · ‎04-10-2013

You are right that you can have an Avalon MM bus with a 1024-bit data bus size, and if you connect a master and a slave that both have a 1024-bit data bus size, they will be able to transfer 1024 bits on each clock cycle (see, you aren't dumb ;) ). You need to be careful when using such a bus size, as you increase considerably the resources used on the FPGA.

The problem in your case is that the CPU itself is 32 bit. So even if you connect it to a component with a 1024 bit data bus, the CPU will only be able to read or write 32 bits at a time (using byte enables).

If you have a lot of operations to perform on those big words, an idea could be to implement your IP as a full ALU with a bank of 1024-bit registers. Then the CPU would only need to transfer the actual values at the beginning and end of the algorithm, but for all the intermediary steps it would only need to send instructions, and the values themselves would stay in your IP component. Of course it depends entirely on what you want to do and how many intermediary values you need.
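
To make the "only send instructions" idea concrete, here is one possible way to encode such an instruction as a single 32-bit word that the CPU writes to a command register of the IP. The field layout, the opcode values and the function names are all invented for illustration; a real design is free to choose something else.

```c
#include <stdint.h>

/* Hypothetical opcodes for the bignum ALU. */
enum bn_op { BN_OP_ADD = 0, BN_OP_SUB = 1, BN_OP_MUL = 2, BN_OP_MOD = 3 };

/* Assumed command layout:
 *   bits [31:24] opcode
 *   bits [23:16] destination register index
 *   bits [15:8]  source A register index
 *   bits [7:0]   source B register index */
uint32_t bn_encode(enum bn_op op, uint8_t dst, uint8_t src_a, uint8_t src_b)
{
    return ((uint32_t)op << 24) | ((uint32_t)dst << 16) |
           ((uint32_t)src_a << 8) | (uint32_t)src_b;
}

/* Field extractors, mirroring what the command decoder in the IP would do. */
uint32_t bn_cmd_op(uint32_t cmd)    { return cmd >> 24; }
uint32_t bn_cmd_dst(uint32_t cmd)   { return (cmd >> 16) & 0xFFu; }
uint32_t bn_cmd_src_a(uint32_t cmd) { return (cmd >> 8) & 0xFFu; }
uint32_t bn_cmd_src_b(uint32_t cmd) { return cmd & 0xFFu; }
```

With this layout, one 32-bit write such as bn_encode(BN_OP_MUL, 2, 0, 1) asks the IP to compute R2 = R0 * R1 without any 1024-bit value crossing the CPU bus.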

Altera_Forum · ‎04-10-2013

Hello,

Appreciate the response :). I will explain my problem description.

I was planning to make the NIOS II processor do some bigdigit instructions so that it can support the encryption schemes (like an RSA) from an application code.

An RSA needs to handle some bigdigit operations like arbitrary-precision integer multiplication, modulus operation, etc., and there are FPGA hardware modules which implement that. However, there is no work done in the area where one can execute the RSA from a NIOS II application code (from the Eclipse IDE).

So I thought it will be nice if we can develop some instructions the NIOS II processor can call and execute for doing some bigdigits math. I tried to look into more custom instructions but those were very much constrained. Then I came across the IP development concept, however the base issue is the same as you pointed out. It is not possible to give a data type which is more than 32 bit wide from NIOS II to any component since the processor itself is 32 bits wide (the data path). One approach is to give 32 * 32 data bits and have logic inside my custom IP to handle those. However waiting for 32 clock cycles for a 1024 bit data can affect the throughput of the custom IP instruction that I am planning to implement.

Will discuss this matter with my professor and I can update you as well. Hope my problem description is clear.

Thank You,

Akhil

Altera_Forum · ‎04-10-2013

It is also worth remembering that the Nios cpu stalls during Avalon MM transfers - IIRC this is at least 2 extra clocks.

In addition any read value isn't available for the next two clocks.

So if you actually want to do high performance wide arithmetic you may not get the performance you expect.

If you are just doing an academic exercise then it doesn't matter.

Altera_Forum · ‎04-11-2013

You probably have your work cut out for you just getting started, and toward that end going simple with e.g. an Avalon-MM Slave interface which your NIOS software will use to write the 32-bit registers one at a time is probably simplest. As dsl said, if it's academic you can be satisfied in knowing that you could always make it faster if you chose.

Once you get it working, the performance you will achieve will depend on how much work (complication) you want to invest. As Daixiwen already noted, the 32-bit nature of the NIOS is a significant bottleneck. Although SGDMA is a bit better, you are probably on the right track thinking about Avalon-MM Master interfaces with larger bus width. For example, you could implement bursting master with 64/128/256-bit width to stream operands (and opcodes, if you like) from SDRAM. Ideally, limit the NIOS to control activity only.

If you only have a handful of operands you want to use (but frequently), then you could possibly look into using dual port onchip memory, with the NIOS/SGDMA on one port for reading/writing results, and the other port dedicated for your Avalon-MM Master to use. If it will fit in your device, a RAM width of 1024-bits might provide the highest throughput.

As far as how to control your new IP, you could do something like add custom instructions which take 32-bit addresses (pointers) to the operands which your new logic would independently fetch/store results.

Altera_Forum · ‎04-11-2013

Hello,

I really appreciate the response.

--- Quote Start ---

If you only have a handful of operands you want to use (but frequently), then you could possibly look into using dual port onchip memory, with the NIOS/SGDMA on one port for reading/writing results, and the other port dedicated for your Avalon-MM Master to use. If it will fit in your device, a RAM width of 1024-bits might provide the highest throughput.

--- Quote End ---

Akhil>> I did not clearly understand the above concept. Is it possible for you to explain it, please?

--- Quote Start ---

As far as how to control your new IP, you could do something like add custom instructions which take 32-bit addresses (pointers) to the operands which your new logic would independently fetch/store results.

--- Quote End ---

Akhil>> I did not clearly understand the above either. Is it possible for you to explain it, please?

Best,

Akhil

Altera_Forum · ‎04-11-2013

Attached below is a quick/simplistic diagram that may explain it better.

The NIOS/SGDMA store operand data into the onchip RAM, and the "BIGNUM" block implements custom instructions for add/subtract/multiply/divide operations on those operands.

The purpose of the dual port memory is to keep all the x32 masters on one port, and x1024 masters on the other.

Use of the custom instructions in C code might end up looking like:

bignum_t dst, srcA, srcB;
...
...
BIGNUM_ADD(&dst, &srcA, &srcB);
BIGNUM_MPY(&dst, &srcA, &srcB);
...
...

This is similar to what Daixiwen mentioned, with a bank of 1024-bit registers; except you're just using RAM (with associated latencies) instead.

The simplistic diagram is the same if you use a bursting master and external memory (SDRAM, DDR) except you would not have the BIGNUM component implement a 1024-bit master width (x64 or x128 is more manageable, but depends on your memories).

Altera_Forum · ‎04-11-2013

Hello Ted,

Thank you for explaining the design! I think I understand what you had been doing with the dual port on chip memory and the way of connecting it to the NIOS II processor. Your design almost works the same way as the one explained by Daixiwen. However I see a gain in the 'read' performance.

In your scenario, the two operands (those are 32 bits) have to be clocked-in for 32 clock cycles into the on chip memory so that the BIGNUM module can read the operands from it. So there will be some latency to get the operands value.

However I see a small issue with the BIGNUM module being a custom instruction. The custom instruction is an instruction which we implement for the NIOS II processor, right ? So is it possible for us to modify the data path for a custom instruction like this? i.e, reading from the On chip memory than from the NIOS II data bus? Also the output from a custom instruction goes to the NIOS II (which is again 32 bit data path, I am not sure it can return 1024 data bits and we might have to wait for 32 clocks), and unfortunately a custom instruction has only one O/P signal port. For an adder there will be two signal outputs, which is the SUM and the CARRY. For the above reasons I had thought of coming up with an IP core like design which is more flexible.

I am really interested in the SGDMA module (thanks for pointing out the ST interface!), using which I think I can clock-in upto 256 data bits in a clock cycle. So if I need to clock in a text (that has to be encrypted) or an operand that is 1024 bits, I think I can do it in just 4 clock cycles. (more efficient than using 32 clocks). Please correct me if I am going wrong.

Thank You,

Akhil

Altera_Forum · ‎04-11-2013

A (non-combinatorial) custom instruction can read/write internal fpga memory just the same as any other logic.

What you could do is dual port some internal memory to the custom instruction logic and a tightly coupled data port.

Your custom instruction would then take memory addresses as its inputs and perform the operation on the shared memory block.

The cpu reads/writes of the memory would be as fast as any other way of getting the data into your logic.

Altera_Forum · ‎04-11-2013

--- Quote Start ---

In your scenario, the two operands (those are 32 bits) have to be clocked-in for 32 clock cycles into the on chip memory so that the BIGNUM module can read the operands from it. So there will be some latency to get the operands value.

--- Quote End ---

Correct. It's kind of like a cache, and the "cache miss" penalty is quite large. It only makes sense to use the memory and keep the operands around if you think you will be using them more than once.

--- Quote Start ---

However I see a small issue with the BIGNUM module being a custom instruction. The custom instruction is an instruction which we implement for the NIOS II processor, right ? So is it possible for us to modify the data path for a custom instruction like this? i.e, reading from the On chip memory than from the NIOS II data bus?

--- Quote End ---

My suggestion is to implement a single new IP component which has two interfaces: an Avalon-MM master, and a custom instruction interface. The Avalon-MM is for the data path (1024-bit operands), and the custom instruction is for the control path (opcodes).

--- Quote Start ---

Also the output from a custom instruction goes to the NIOS II (which is again 32 bit data path, I am not sure it can return 1024 data bits and we might have to wait for 32 clocks), and unfortunately a custom instruction has only one O/P signal port. For an adder there will be two signal outputs, which is the SUM and the CARRY. For the above reasons I had thought of coming up with an IP core like design which is more flexible.

--- Quote End ---

It's your component and you can do whatever you like to meet your needs, but in my diagram I had been thinking that the output would have been written by BIGNUM back to the memory, and not traverse the instruction interface. In other words, the NIOS tells the BIGNUM where to put the result.

--- Quote Start ---

I am really interested in the SGDMA module (thanks for pointing out the ST interface!), using which I think I can clock-in upto 256 data bits in a clock cycle. So if I need to clock in a text (that has to be encrypted) or an operand that is 1024 bits, I think I can do it in just 4 clock cycles. (more efficient than using 32 clocks). Please correct me if I am going wrong.

--- Quote End ---

It's only going to theoretically go as fast as the interfaces it is connected to. If you're DMA'ing from a 32-bit SDRAM, the 256-bit DMA will only emit a word on 1/8th of the clocks since it has to buffer them up in 32-bit increments.
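
The point about the narrowest interface can be put into a small back-of-the-envelope calculation. The function below is only an idealised estimate (one beat per clock, no latency, refresh, arbitration or burst overhead), and its name and parameters are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum number of clocks needed to move 'bits' of data between two
 * interfaces, assuming the narrower one delivers one beat per clock.
 * Widths are in bits and must be non-zero. */
uint32_t min_transfer_clocks(uint32_t bits, uint32_t width_a, uint32_t width_b)
{
    assert(width_a != 0 && width_b != 0);
    uint32_t narrow = (width_a < width_b) ? width_a : width_b;
    return (bits + narrow - 1) / narrow;  /* round up to whole beats */
}
```

For a 1024-bit operand, min_transfer_clocks(1024, 256, 32) gives 32 clocks for a 256-bit DMA fed from a 32-bit SDRAM, which is no better than the CPU's own 32-bit path; only when both sides are 256 bits wide does the figure drop to 4.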

Altera_Forum · ‎04-11-2013

Hello,

I am thankful for the reply!

--- Quote Start ---

Correct. It's kind of like a cache, and the "cache miss" penalty is quite large. It only makes sense to use the memory and keep the operands around if you think you will be using them more than once.

--- Quote End ---

Akhil>> Okay that makes sense. I have to think about the design (sometimes it might be needed to pass the operands just once to the IP module and then let the IP do the rest). I will take care of it once I have a clear idea about what to do with the core.

--- Quote Start ---

My suggestion is to implement a single new IP component which has two interfaces: an Avalon-MM master, and a custom instruction interface. The Avalon-MM is for the data path (1024-bit operands), and the custom instruction is for the control path (opcodes).

--- Quote End ---

Akhil>> What do you mean by a custom instruction interface? Is it an interface to fetch the corresponding operator (+, -, ..) from the NIOS II? If so, how do I do that from my IP module? (Please note that passing those 1024 bits may not be a good idea if we have to take 32 clock cycles to do that.)

--- Quote Start ---

It's your component and you can do whatever you like to meet your needs, but in my diagram I had been thinking that the output would have been written by BIGNUM back to the memory, and not traverse the instruction interface. In other words, the NIOS tells the BIGNUM where to put the result.

--- Quote End ---

Akhil>>Here you mean to say like the way BIGNUM module accepts 1024 bits data from the on chip memory, there should be a way to write 1024 result bits back to the on chip memory and the NIOS II gets to decide the address.

--- Quote Start ---

It's only going to theoretically go as fast as the interfaces it is connected to. If you're DMA'ing from an 32-bit SDRAM, the 256-bit DMA will only emit a word on 1/8th of the clocks since it has to buffer them up in 32-bit increments.

--- Quote End ---

Akhil>> I think I understand the above concept. All my experimentation will be on a DE1 board which has a 16-bit SDRAM. Hence a 256-bit DMA transfer will still have to wait for 16 clock cycles (since it has to buffer all those 256 bits in 16-bit increments), and hence the whole 1024 bits will take 64 clocks. This gives me no edge over a normal MM interface transfer. The best case scenario is a 64-bit SDRAM (which is the highest data width for an SDRAM in SOPC Builder) used with a 64-bit DMA transfer. In that case I guess it will push 64 bits in a single clock and hence it should take only 16 clocks to clock in the entire 1024 bits. I will specify it as a future enhancement for the time being.

Altera_Forum · ‎04-11-2013

--- Quote Start ---

Akhil>> What do you mean by a custom instruction interface? Is it an interface to fetch the corresponding operator (+, -, ..) from the NIOS II? If so, how do I do that from my IP module? (Please note that passing those 1024 bits may not be a good idea if we have to take 32 clock cycles to do that.)

--- Quote End ---

See this document: http://www.altera.com/literature/ug/ug_nios2_custom_instruction.pdf

and this example: http://www.altera.com/support/examples/nios2/exm-custom-instruction.html

You do not pass the 1024-bit operands through this interface. You supply register references and those 32-bit registers would contain the address locations in RAM of where the 1024-bit operands are located. See Figure 2-1 of the .pdf.

Altera_Forum · ‎04-12-2013

Hello,

Thank you for pointing out the CRC example! I understand what you are trying to say. The dataa and datab register references will go to the custom instructions, and those point to the RAM where 1024 bit operands are stored sequentially.

I was looking into the modules CRC_Custom_Instruction.v and CRC_Component.v. I have some questions on the implementation that I might have to do for my design.

I think that the dataa signal of the module CRC_Custom_Instruction.v has to remain as a signal of width 32 (input [31:0] dataa;) since the NIOS II data width is 32 and the register dataa is in its data path. However the writedata signal in the CRC_Component.v module can be a signal of 1024 bit width so that I can perform operations on the entire 1024 bit. Is this what you are trying to say? There is no other way to refer memory directly (like pointers) from a hardware description language, right?

If this was a language like C/C++ (I know a bit of those) I would have had pointers pointing to the memory locations and would have read sequentially from the memory and done the operations.

I am not an expert in the HDLs , so I have to ask these questions. I am sorry for the trouble.

Altera_Forum · ‎04-12-2013

Hello Ted,

Now that I have given it more thought, I think I understand what you are trying to say. The NIOS II just gives a control signal to the IP to say which operation to perform (like ADD, SUB, etc.).

Based on that, the logic to do the corresponding operations can be built in the IP core itself. So the macros like BIGNUM_ADD(&dst, &srcA, &srcB); will be just used to specify the addresses for the IP cores and the corresponding custom instruction essentially does nothing. It just passes those addresses to the IP core and the IP core does the processing (the corresponding operation inside the IP core can be selected with the help of a control signal from the custom instruction, like 'n').

After the processing of data, the IP core can write values to the &dst which is already provided by the custom instruction.

Is this what you were telling or am I going too much? If this is not what you were explaining, please see my above post.

Thank You,

Akhil

Altera_Forum · ‎04-12-2013

If you restrict the operand addresses to a single internal memory block then your IP can directly access the block (rather than having to wait for any avalon transfers to complete).

Also you only have two 32-bit inputs to your custom instruction - so you can't specify full addresses for all three operands.

Probably worth using the rC field as a sub-opcode.
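
One way to live with only two 32-bit inputs, assuming all operands sit in a single internal memory block as suggested above, is to pass small slot indices instead of full addresses. The sketch below packs the destination and first source slot into dataa and puts the second source slot in datab; the packing, the 128-byte slot size and all function names are assumptions chosen for illustration. The operation itself would still be selected separately (for example through the 'n' or rC field mentioned in this thread).

```c
#include <stdint.h>

#define BN_SLOT_BYTES 128u  /* one 1024-bit operand = 128 bytes */

/* dataa: bits [31:16] destination slot, bits [15:0] source A slot. */
uint32_t bn_pack_dataa(uint32_t dst_slot, uint32_t src_a_slot)
{
    return ((dst_slot & 0xFFFFu) << 16) | (src_a_slot & 0xFFFFu);
}

/* datab: bits [15:0] source B slot, upper bits unused in this sketch. */
uint32_t bn_pack_datab(uint32_t src_b_slot)
{
    return src_b_slot & 0xFFFFu;
}

/* Unpacking, as the custom instruction logic would do it. */
uint32_t bn_dst_slot(uint32_t dataa)   { return dataa >> 16; }
uint32_t bn_src_a_slot(uint32_t dataa) { return dataa & 0xFFFFu; }
uint32_t bn_src_b_slot(uint32_t datab) { return datab & 0xFFFFu; }

/* Byte offset of a slot from the start of the shared memory block. */
uint32_t bn_slot_offset(uint32_t slot)
{
    return slot * BN_SLOT_BYTES;
}
```

Because the indices are relative to the shared block, the IP never needs a full 32-bit address for any of the three operands.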

Altera_Forum · ‎04-12-2013

--- Quote Start ---

Is this what you were telling or am I going too much?

--- Quote End ---

Yes, I think you got it.

Good luck!

Altera_Forum · ‎04-12-2013

Hello dsl and Ted,

I really appreciate you helping me here. I think I will give NIOS II the option to pass the register addresses to the IP core. So I guess the custom instruction will look like this: pass the addresses of operand_A, operand_B and the result register to the IP core, one per clock cycle (the custom instruction will be like O/P = I/P), and let the IP core do the rest. And for selecting the specific operator, I can use the 'n' signal (which can take up to 256 values; in my case I need only a few instructions) from the custom instruction to the IP core.

Please correct me if I am going wrong here.

Thank You,

Akhil