New IP development - Page 3

Altera_Forum · ‎04-09-2013

Hello,

I am interested in developing some instructions which can do the bignum operations (like adding two operands of 1024 bit)

and hence by plan to implement a simple Full Adder like this: {carry,sum} = OP_A + OP_B;

I am a beginner in Altera and realized that I can develop an IP component with Avalon MM slave interface which can talk with the NIOS II processor.

I was wondering how to give the bignum values as the operand from the NIOS II processor (master) to the IP component (slave) from the application code?

I see only these macros in the generated 'io.h' file: # define IOWR_32DIRECT(BASE, OFFSET, DATA)

io_write((BASE_APB_ADDR) + __AVL_TO_APB((alt_u32)((BASE) + (OFFSET))), DATA, (BASE) + (OFFSET))

# define IORD_32DIRECT(BASE, OFFSET)

io_read((BASE_APB_ADDR) + __AVL_TO_APB((alt_u32)((BASE) + (OFFSET))), (BASE) + (OFFSET))

I guess these are the 32 bits write and read instructions. So, do I get to clock in the 1024 bits as a single instruction?

Or do I have to wait for 32 clock cycles (sampling the 32 bits in a clock cycle, do for 32 clocks. Please say a no to this..) !

Hope that question is clear and someone can respond. Really appreciate it.

Thank You,

Akhil

Altera_Forum · ‎04-30-2013

This thread is almost ( but not quite ) gone full circle.:p

I think you can solve this problem ("I need good bignum performance") any number of ways, including software optimization using traditional techniques like what dsl has described.

But backing up to the original post in this thread: you already know that this is a 32-bit processor and to operate on (1024-bit) data you need [at least] 32x of each operation.

With pure software optimization, I guess you would be looking at an ideal goal that boiled down to (64) loads for the two operands, (32) add/subtract/whatever, and (32) store of the result. Let's call it 128 instructions.

If you are not a HDL person, and this ideal performance is more than adequate of where you need to be, you can attack your problem with just software and at least have a chance of achieving your desired performance.

The selling point for C2H is that if you are not a HDL person, and this performance is inadequate, then MAYBE you can point and click (aka lower cost and faster time to market) and use FPGA resources to achieve your performance needed. The shortcoming of C2H in this specific example is that I believe it is structured as an optimization tool for 32-bit operations, and there is no way to communicate to it that you want to specify 1024-bit operations.

Of course if you are an HDL person, you will think the software person is nuts for doing any of this: you know up front that you want a 1024-bit wide ALU, so go ahead and create one.

Back to the current topic of software optimization: I think what you want to do is use the profiler to identify your high execution count / execution time functions, and then techniques like dsl has written to zero in on the big bottlenecks. As far as acquiring timing data, I like the Performance Counter IP block.

Altera_Forum · ‎04-30-2013

My problem with the performance counter block is that it is a 16bit slave - so it takes a lot of accesses to do anything.

A 32bit version would be much more friendly.

The lack of 'add with carry' could well be a problem, but one of the cmp instructions would do the job.

So code like:

        sum = a + b + carry;
        carry = sum < a;

is amost right (needs an extra check for b[i] + carry == 0).

So 5 instructions (+ memory access) including 1 branch that can be staticly predicted 'not taken'.

With care and a combinatorial custom instruction (or 2) could be 3 instruction (I think).

Altera_Forum · ‎04-30-2013

--- Quote Start ---

My problem with the performance counter block is that it is a 16bit slave - so it takes a lot of accesses to do anything.

--- Quote End ---

I'm not sure about the history of the block, but at least in the files sitting on my hard drive right now, it is a 32-bit slave.

--- Quote Start ---

With care and a combinatorial custom instruction (or 2) could be 3 instruction (I think).

--- Quote End ---

I'm not following you: I see two loads, one store, and one math function (4 instructions, or 128 total).

I'm not intimately familiar with the instruction set: does NIOS have a double-load or add-and-store instruction? (how did you compress it down to 3?)

Altera_Forum · ‎05-01-2013

I wasn't counting the load and store instructiions....

Altera_Forum · ‎05-01-2013

--- Quote Start ---

This thread is almost ( but not quite ) gone full circle.:p

With pure software optimization, I guess you would be looking at an ideal goal that boiled down to (64) loads for the two operands, (32) add/subtract/whatever, and (32) store of the result. Let's call it 128 instructions.

If you are not a HDL person, and this ideal performance is more than adequate of where you need to be, you can attack your problem with just software and at least have a chance of achieving your desired performance.

--- Quote End ---

Akhil>> Please note that this is an academic thesis project and hence by I plan to implement hardware modules for RSA and BIGNUM which could assist an application developer (who may program in a NIOS II SBT, in C/C++). It is not the other way around, which can make a circle as you pointed out :P. I have an example code with me which does the software implementation for RSA. I was wondering any means of converting the software implementation to an HDL RTL design and comparing the performances. And please remember my tentative design diagram, that is not a finalized one. As of now I plan to generate the primes for RSA inside my RSA IP core and hence by I may not have to load the NIOS II CPU for that. Also the communication between the RSA <-> BIGNUM should be okay (I guess?) since I plan to make both the data buses as 1024 bits (even the ALU s).

--- Quote Start ---

The shortcoming of C2H in this specific example is that I believe it is structured as an optimization tool for 32-bit operations, and there is no way to communicate to it that you want to specify 1024-bit operations.

--- Quote End ---

Akhil>> The C2H has been discontinued for QSYS and as you pointed out, that is specific for NIOS II, which supports a 32-bit architecture. So I may not be using this to convert my C to a hardware accelerator.

--- Quote End ---

--- Quote Start ---

Back to the current topic of software optimization: I think what you want to do is use the profiler to identify your high execution count / execution time functions, and then techniques like dsl has written to zero in on the big bottlenecks. As far as acquiring timing data, I like the Performance Counter IP block.

--- Quote End ---

Akhil >> I am interested to profile my code for finding out the bottlenecks. However the challenge here is to convert my C code to a hardware efficiently after identifying the bottlenecks in the code. Please advice me if you know an efficient tool or if I have to hand-code everything.

Best,

Akhil Kalathungal

Altera_Forum · ‎05-01-2013

--- Quote Start ---

I was wondering any means of converting the software implementation to an HDL RTL design and comparing the performances.

--- Quote End ---

http://en.wikipedia.org/wiki/c_to_hdl

It looks like today there is some free options (NISC, c-to-verilog.com, ROCCC, ...) as well as big name commercial (Mentor) to choose from. As far as Altera's offering goes, I think the situation is that it "works" it is just not being further developed. If you are willing to jump through some hoops with current releases in order to use it, you can.

I understand this is not where you are aiming, but once you wrap your hands around using tools like this, it seems like a useful topic of for a paper/wiki might be a brief summary with 2013 view point of the output you get from variety of tools using your standard well-known RSA math as the test case. I think the challenge would be making any of them understand that you are working with 1024-bit operands and not arrays of 32-bit words, for example.

Here is one relatively current paper (obviously, biased toward one tool):

http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5474060

You may get additional / more useful feedback from this forum with a new thread with different topic soliciting opinions, since there may be people with relevant first hand knowledge of current offerings.

Altera_Forum · ‎05-13-2013

First of all I'm guilty of not reading this entire thread but are you essentially trying to perform a vector operation (same operating across multiple data in memory)? If so and assuming the algorithm isn't particularly complicated a DMA engine with your transform hardware in the middle should suffice. For example if you wanted to perform a vector addition of two blocks in memory you just need this:

read master --> adder transform block --> write master

If you don't want to use a DMA but need memory masters to do this you could take a look at the read and write masters in the Qsys tutorial and just shove a custom controller on them. That's an overly simple algorithm and chances are it doesn't offer much in terms of speedup over a processor since you are performing two reads and one write for ever computation in the vector (i.e. memory limited). You realistically get a speedup when the algorithm is data independent or feed forward so that you can perform a bunch of calculations in a wide or pipelined data path.

A lot of the C to HDL products out there have gone the way of the dinosaur since it's pretty difficult for compilers to find parallelism in sequential C code. As far as I know the only commercial player left in that space is ImpulseC.