Re: From Custom Instruction to Custom Peripheral

Altera_Forum · ‎05-16-2011

I'm trying to simulate how a camera works by generating lines of data and then applying some kind of spatial 3x3 convolution filter to the data. I have done this one way using a Custom Instruction, where I assume a 2-D image array is already available in NIOS and then use two for-loops to send 8 pixels x 8 bits each time to be processed in a combinatorial hardware module. The module was basically this code (with the inputs combined into 2x32-bit inputs):

http://edge.kitiyo.com/2009/codes/sobel-core-verilog-module.html

Now instead of doing two for-loops with a Hardware Instruction for each pixel, I want to be able to send an entire row to emulate a camera, and then wait for 3 entire rows before doing the processing in hardware.

I'm trying to do something like the Figure 2 in this paper.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.8743&rep=rep1&type=pdf

So I think my NIOS C code will look something like this.

#define NUM_ROWS 4
#define NUM_COLS 8
alt_u8 array_image[NUM_ROWS][NUM_COLS] = { /* fill in values */ };

alt_u32* pointer32bit = 0;
int num_32bit_values = NUM_COLS / 4;   // Num 32bit values in one row.
int row, n;

for (row = 0; row < NUM_ROWS; row++)
{
   // Point at the start of the current row and interpret it in chunks of 32 bits
   pointer32bit = (alt_u32*) &array_image[row][0];
   for (n = 0; n < num_32bit_values; n++)
   {
       HARDWARE_CUSTOM_INSTRUCTION(pointer32bit[n]);  // Moves through the row in 32bit steps
   }
}

But unlike with the Custom Instruction I did before, I need the HARDWARE_CUSTOM_INSTRUCTION to give a return value only when the transfer of a row has finished. The hardware module should also 'store' two rows of data at any time while it waits for the third one to come, and then do the processing.

For example, with my 4x8 array above, 6 32-bit data transfers (2 x 32-bit transfers per row, for 3 rows) have to occur before I can do the Sobel operation in hardware. My questions are:

- How do I store the 6 words in hardware and call part of each to do calculation?

- How do I modify the original sobel code to achieve this? I need to have like a for-loop in hardware to go across the three rows.

- After the initial 3 rows have arrived and the calculations are done, I need to drop the 'oldest' row and bring in a new one. Any advice on how to do that?
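
The buffering asked about in these three questions can be modelled in C before any Verilog is written. The sketch below is only a software model under stated assumptions: the names `line_buffer_t`, `lb_push_row` and `lb_row` are hypothetical, and the row width is fixed at the 8 pixels of the example. It keeps three rows in a ring, so dropping the oldest row amounts to overwriting one slot and advancing an index.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LB_COLS 8   /* pixels per row, matching NUM_COLS in the example */
#define LB_ROWS 3   /* a 3x3 window needs three rows at a time */

/* Hypothetical model of the hardware row store: three rows kept in a ring. */
typedef struct {
    uint8_t rows[LB_ROWS][LB_COLS];
    int     next;    /* slot the next incoming row will overwrite */
    int     count;   /* rows received so far, saturating at LB_ROWS */
} line_buffer_t;

static void lb_init(line_buffer_t *lb)
{
    memset(lb, 0, sizeof *lb);
}

/* Store one incoming row, overwriting the oldest once three are held.
 * Returns 1 when three rows are available for processing, else 0. */
static int lb_push_row(line_buffer_t *lb, const uint8_t *row)
{
    assert(lb != NULL && row != NULL);
    memcpy(lb->rows[lb->next], row, LB_COLS);
    lb->next = (lb->next + 1) % LB_ROWS;
    if (lb->count < LB_ROWS)
        lb->count++;
    return lb->count == LB_ROWS;
}

/* Return row `age` of the window: 0 = oldest (top), 2 = newest (bottom).
 * Only meaningful once lb_push_row() has returned 1. */
static const uint8_t *lb_row(const line_buffer_t *lb, int age)
{
    assert(age >= 0 && age < LB_ROWS && lb->count == LB_ROWS);
    /* When the buffer is full, `next` points at the oldest row. */
    return lb->rows[(lb->next + age) % LB_ROWS];
}
```

In hardware the same idea would typically map onto two or three row memories plus a write pointer, with the return value of `lb_push_row` playing the role of a "window valid" flag.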

I don't think I can use Custom Instructions now, right? Do I have to use a Custom Peripheral with Avalon-MM, for example? I have tried sketching some Verilog code for this (see below), but I don't know how to do the buffers in hardware. I am not sure how to use the data_en signal either. I only added it because I was trying to make an Avalon-MM component and it asked for a write_n. Can somebody please give some guidance on this?


module my_sobel_test_mm (
 // Inputs
 clk,
 reset,
 data_in, //writedata
 data_en, //write_n
 // Outputs
 data_out //readdata
);
 
// Inputs
input clk;
input reset;
input  data_in;
input data_en;
// Outputs
output  data_out;
/*****************************************************************************
 *                 Internal wires and registers Declarations                 *
 *****************************************************************************/
 
 wire  data_in_buffer_1;
 wire  data_in_buffer_2;
 wire  data_in_buffer_3;
 
// Internal Registers
reg [7:0] line_1 [2:0];
reg [7:0] line_2 [2:0];
reg [7:0] line_3 [2:0];
 
//11 bits because max value of gx and gy is 255*4 and last bit for sign
reg signed [10:0] gx,gy;
//Find the absolute value of gx and gy
reg signed [10:0] abs_gx,abs_gy;
//Max value is 255*8. here no sign bit needed.
reg [10:0] sum;
reg [7:0] result;
 
// Integers
integer    i;
 
/*****************************************************************************
 *                             Sequential logic                              *
 *****************************************************************************/
// Sobel Operator
//
//      [ -1  0  1 ]          [ -1 -2 -1 ]
// Gx = [ -2  0  2 ]     Gy = [  0  0  0 ]
//      [ -1  0  1 ]          [  1  2  1 ]
//
// |G| = |Gx| + |Gy|
always @(posedge clk)
begin
 if (reset == 1'b1)
 begin
  for (i = 2; i >= 0; i = i-1)
  begin
   line_1[i] <= 8'h00;
   line_2[i] <= 8'h00;
   line_3[i] <= 8'h00;
  end
  gx <= 11'h000;
  gy <= 11'h000;
  abs_gx <= 11'h000;
  abs_gy <= 11'h000;
 
  result    <= 8'h000;
 end
 else if (data_en == 1'b1)
 begin 
 
 ////// Dont know how to do this section //////////////
 line_1 <= data_in_buffer_1;line_1 <= data_in_buffer_1;line_1 <= data_in_buffer_1;
 line_2 <= data_in_buffer_2;line_2 <= data_in_buffer_2;line_2 <= data_in_buffer_2;
 line_3 <= data_in_buffer_3;line_3 <= data_in_buffer_3;line_3 <= data_in_buffer_3;
        //////////////////////////////////////////////////////
 
 //sobel mask for gradient in horizontal direction
 gx <=((line_1[2]-line_1[0])+((line_2[2]-line_2[0])<<1)+(line_3[2]-line_3[0]));
 //sobel mask for gradient in vertical direction
 gy <=((line_3[0]-line_1[0])+((line_3[1]-line_1[1])<<1)+(line_3[2]-line_1[2]));
 // Absolute value of gx (bit 10 is the sign bit)
 abs_gx <= (gx[10]? ~gx+1 : gx);
 // Absolute value of gy
 abs_gy <= (gy[10]? ~gy+1 : gy);
 // Sum
 sum <= (abs_gx+abs_gy);
 // Clamp to max value 255 if any of the upper bits are set
 result <= (|sum[10:8])?8'hff : sum[7:0];
 end
end
/*****************************************************************************
 *                            Combinational logic                            *
 *****************************************************************************/
assign data_out = result; 
endmodule
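
For checking hardware results against software, the arithmetic described in the module's comments (|G| = |Gx| + |Gy|, clamped to 255) can be written as a small C reference function. This is a sketch rather than the original Sobel core: the name `sobel_pixel` and the row-pointer arguments are assumptions, and the sign convention of Gx and Gy is arbitrary because only the absolute values are used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sobel magnitude for the pixel at column `x` of the middle row.
 * top/mid/bot point at three consecutive image rows; the caller must keep
 * 1 <= x <= width-2 so the 3x3 window stays inside the rows. */
static uint8_t sobel_pixel(const uint8_t *top, const uint8_t *mid,
                           const uint8_t *bot, int x)
{
    assert(x >= 1);
    /* Horizontal gradient: right column minus left column, middle row x2. */
    int gx = (top[x + 1] - top[x - 1])
           + 2 * (mid[x + 1] - mid[x - 1])
           + (bot[x + 1] - bot[x - 1]);
    /* Vertical gradient: bottom row minus top row, middle column x2. */
    int gy = (bot[x - 1] - top[x - 1])
           + 2 * (bot[x] - top[x])
           + (bot[x + 1] - top[x + 1]);
    /* |gx| and |gy| are each at most 4*255, so the sum fits in 11 bits. */
    int sum = abs(gx) + abs(gy);
    /* Saturate to 8 bits, as the hardware does when any upper bit is set. */
    return (uint8_t)(sum > 255 ? 255 : sum);
}
```

Feeding the same three rows to this function and to the hardware gives a value to compare the module's `result` against.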

Altera_Forum · ‎05-16-2011

Some people use multiple calls to the custom instruction to do what you are trying to do. Before going down that road, will you be doing this same calculation over an entire frame of data? If so, I recommend that you implement this as a hardware accelerator that can master the memory, since it should be much more efficient (and the CPU can be doing something else in the meantime). You can also build your hardware to perform just the transform and use DMAs to shove data in and out of your hardware.

Here are some examples of what I'm talking about:

http://www.altera.com/support/examples/nios2/exm-accelerated-fir.html

http://www.altera.com/support/examples/nios2/exm-checksum-acc.html

http://www.altera.com/support/examples/nios2/exm-crc-acceleration.html

Altera_Forum · ‎05-17-2011

Thanks for the recommendation. Yes, I will have to do the same calculation over the entire frame eventually. But for now I want to work on 3 lines of data at a time because I want to emulate how a line-scan camera works. I am currently doing multiple calls (one for each loop iteration in software) on data generated from NIOS, but I want to do this by looping within hardware itself. Or probably I didn't understand what you meant by multiple calls.. could you please explain?

When this is done, I will move to whole-frame processing with DMA. Actually, I had seen this Accelerating FIR with DMA example before, and my plan was to substitute the transform_block.v with my own. But because I didn't know exactly how to write this block, due to my inexperience with hardware programming, I got stuck! At first I tried a ready-made hardware block that I got from the University Program (this is when I used your SGDMA suggestion), but I could not get it working. I have been able to do a DMA memory-to-memory transfer without any processing block in between, and verified the data at the tx and rx buffers etc. But what I need is for the data to get transformed in between....

Then I thought I might as well learn from scratch and build up my knowledge slowly. So using that basic Sobel example, I tried PIOs first, and then used a Custom Instruction. This was already a significant speed-up, but still not enough to justify using an FPGA over a normal computer. Until I get to the stage of being able to write a Sobel transform block that works with DMA (unlikely to happen soon), I have to stick to the custom instruction, and I am now at a worrying impasse ... Any help on how to write this Sobel transform block for DMA usage is greatly appreciated.

Altera_Forum · ‎05-17-2011

If your hardware takes longer to process the data than Nios II can input the data then perhaps what you could do is put a FIFO in your custom instruction so that your code just keeps calling the custom instruction shoving operands into it without reading the results. Then you can start reading the results back out. At this point though I would have switched it over to be a memory mapped component.

For your HDL at the top, one suggestion I have is to break out your registers into separate always blocks. Anything that doesn't need to be a register I recommend coding as a wire using an assign statement. The way your HDL is coded currently, it looks like it'll take 5 clock cycles to complete one result. With some buffering you can get it to perform a calculation every clock cycle (keep stuffing the result into the FIFO to be read later). Also, assuming your logic is functionally correct, you are practically 90% of the way to having a streaming component.
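
The "keep pushing operands, read the results back later" idea can be pictured with a small ring-buffer FIFO. The C sketch below is a software model only: `fifo_t`, `fifo_push` and `fifo_pop` are hypothetical names, and the depth of 16 is an arbitrary assumption, not something the custom instruction interface dictates.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_DEPTH 16u  /* assumed depth; must be a power of two in this model */

typedef struct {
    uint8_t  data[FIFO_DEPTH];
    unsigned head;   /* total number of writes so far */
    unsigned tail;   /* total number of reads so far */
} fifo_t;

/* Number of results currently waiting to be read. */
static unsigned fifo_level(const fifo_t *f)
{
    return f->head - f->tail;
}

/* Store one result. Returns 1 on success, 0 if the FIFO is full. */
static int fifo_push(fifo_t *f, uint8_t v)
{
    assert(f != 0);
    if (fifo_level(f) == FIFO_DEPTH)
        return 0;
    f->data[f->head % FIFO_DEPTH] = v;
    f->head++;
    return 1;
}

/* Read the oldest result. Returns 1 on success, 0 if the FIFO is empty. */
static int fifo_pop(fifo_t *f, uint8_t *out)
{
    assert(f != 0 && out != 0);
    if (fifo_level(f) == 0)
        return 0;
    *out = f->data[f->tail % FIFO_DEPTH];
    f->tail++;
    return 1;
}
```

The full and empty return values correspond to the conditions a hardware version would need to expose, so that software does not push into a full FIFO or read results that are not there yet.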

Altera_Forum · ‎05-19-2011

Thanks again, BadOmen, for the input. While searching a bit on FIFOs and shift registers, I came across the altshift_taps component and, after reading its documentation, I thought it might be useful for me in this case. But before I go down that road, I would like someone to advise me whether this component is worth using for my purpose and, if so, whether my reasoning below is correct.

a) Assume my image to be processed consists of 4 rows x 12 columns of 8-bit values stored in the SSRAM or SDRAM of my development board. I use altshift_taps with the parameters Width = 32 bits, Number of taps = 4 and Tap distance = 3.

b) Then I start 'feeding' the values from SDRAM into the altshift_taps component in chunks of 32 bits.

c) Now when all the 48 8-bit (or 12 32-bit) values are inside the shift register, at each of the next 3 clock cycles, the output taps will each give 32 bits (i.e. representing 4 8-bit values of each row), or 128 bits in total.

d) These 128 bits are in fact a '2-D' array of 4x4 8-bit values to which I can apply 4 Sobel filter instantiations to give 4 output 8-bit values. These values are then sent back to memory as a 32-bit chunk.

Is my reasoning above any good? Will I still be able to use the DMA transfer method in this way? Will I be able to extend this 4 rows to full frame later?

Also, concerning the hardware logic design itself, how do I know when all the values are inside the altshift_taps component and hence when the taps are ready to be used?
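
The reasoning in a) to d) can be tried out on an idealized software model of a tapped shift register. The C sketch below is not the altshift_taps implementation and is not cycle-accurate; the names `tap_shift_t`, `ts_shift`, `ts_tap` and `ts_valid` are hypothetical, and the assumption is simply that tap k sits (k+1) x tap-distance stages after the input. A counter of words shifted in is one possible way to flag when the taps hold real data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_TAPS     4   /* one tap per image row */
#define TAP_DISTANCE 3   /* 12 pixels per row = 3 words of 32 bits */
#define SR_LEN       (NUM_TAPS * TAP_DISTANCE)

typedef struct {
    uint32_t stage[SR_LEN];  /* stage[0] holds the most recent word */
    unsigned filled;         /* words shifted in, saturating at SR_LEN */
} tap_shift_t;

/* Shift one 32-bit word in; every stored word moves one stage along. */
static void ts_shift(tap_shift_t *ts, uint32_t word)
{
    memmove(&ts->stage[1], &ts->stage[0], (SR_LEN - 1) * sizeof ts->stage[0]);
    ts->stage[0] = word;
    if (ts->filled < SR_LEN)
        ts->filled++;
}

/* Tap k is modelled as the output of stage (k+1)*TAP_DISTANCE - 1. */
static uint32_t ts_tap(const tap_shift_t *ts, int k)
{
    assert(k >= 0 && k < NUM_TAPS);
    return ts->stage[(k + 1) * TAP_DISTANCE - 1];
}

/* All taps carry real data once SR_LEN words have been shifted in. */
static int ts_valid(const tap_shift_t *ts)
{
    return ts->filled == SR_LEN;
}
```

In this model, once the 12 words of the 4 x 12 image are in, the four taps hold word 0 of each of the four rows, and each further shift moves all taps on to the next word of their rows. The real component's latency and tap ordering should be checked against its documentation.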

Sorry for the basic questions again... but I don't have anyone with expertise around me to ask, and I need your help :)

PS: I don't have the 'Create groups for each tap output' option in my altshift_taps MegaWizard. Why is this?

Altera_Forum · ‎05-19-2011

Hi,

Here is my situation,

I have an electronic board with an EP2C20F484I8N and an EPCS4N. Programming completes properly, but when I test the output pins, all of them are tri-stated. When I plug the USB Blaster into its connector, the board works correctly and I see good signals.

The problem occurs when the Usb Blaster is not connected to the board.

Could you help me?

Thanks in advance

Altera_Forum · ‎05-23-2011

The DMA makes sense if you want to be able to start sending the next frame into the custom hardware while it's still working on the current frame. I don't know much about the algorithm you are attempting to implement but if you are going to feed the hardware using a DMA then it probably makes sense to pull the frame out of memory, store it in registers and process the output quickly then move onto the next frame.

Altera_Forum · ‎06-01-2011

I have tested a small verilog script that does a Sobel edge detection (see attached folder - Sobel_DE2_Cam.v and a few Megawizard generated files) by running it on a video stream coming from a camera attached to my DE2 board, and I can see the edges being picked up on the VGA display. SOPC Builder and NIOS SBT were not involved at all.

But now instead of using the camera, I want to do the same operation on a single image which I will read from my host computer and store in the SDRAM of the DE2 board. So I tried to make an SOPC Peripheral out of the verilog file to make use of DMA controllers as suggested by many forum members.

Currently, apart from clk and reset, I have all of them as avalon slave with the following signal types:

in_valid - write

in_data - writedata

out_data - readdata

Are these signal type assignments correct? Now, to be able to use it with DMA controllers, what other slave signals do I need? I saw in other examples that 'address' and 'chipselect' signals are needed, but I don't know how to use them within my code! Can someone please help me modify that Sobel_DE2_Cam into a 'DMA friendly' one to be used with one single frame of data? I've been pulling my hair out over this task for many weeks now! I badly need help.

I then intend to slot that component between two DMA controllers, and use C code along these lines below in NIOS SBT to do the processing.

#define ROW 480
#define COL 640
#define IMAGE_BUF_BYTES (ROW * COL)
#define LOAD_DMA_BASE  DMA_1_BASE
#define STORE_DMA_BASE DMA_2_BASE
#define FILTER_BASE SOBEL_DE2_BASE
 
// -- Set up the buffers
 tx_buf = (alt_u8 *)malloc(IMAGE_BUF_BYTES);
 rx_buf = (alt_u8 *)malloc(IMAGE_BUF_BYTES);
 
//Read image from computer and assign to tx_buf
 
 run_filter(tx_buf, rx_buf); // Do the Processing with DMA controller taking charge
 
//--------- functions to be used to do DMA transfer
void run_filter(alt_u8* tx_buf, alt_u8* rx_buf)
{
  load_filter(tx_buf);
  store_filter(rx_buf);
  while ((IORD_ALTERA_AVALON_DMA_STATUS (LOAD_DMA_BASE) & ALTERA_AVALON_DMA_STATUS_DONE_MSK) != ALTERA_AVALON_DMA_STATUS_DONE_MSK)
  {
   //printf("\n In 1st loop");
  }
  while ((IORD_ALTERA_AVALON_DMA_STATUS (STORE_DMA_BASE) & ALTERA_AVALON_DMA_STATUS_DONE_MSK) != ALTERA_AVALON_DMA_STATUS_DONE_MSK)
  {
   // printf("\n In 2nd loop");
  }
}
 
void load_filter(alt_u8* tx_buf)
{
 // Clear any pending interrupts
  IOWR_ALTERA_AVALON_DMA_CONTROL (LOAD_DMA_BASE, 0);
  IOWR_ALTERA_AVALON_DMA_STATUS (LOAD_DMA_BASE, 0);
  IOWR_ALTERA_AVALON_DMA_RADDRESS (LOAD_DMA_BASE, tx_buf);
  IOWR_ALTERA_AVALON_DMA_WADDRESS (LOAD_DMA_BASE, FILTER_BASE);
  IOWR_ALTERA_AVALON_DMA_LENGTH (LOAD_DMA_BASE, IMAGE_BUF_BYTES);
  IOWR_ALTERA_AVALON_DMA_CONTROL (LOAD_DMA_BASE,ALTERA_AVALON_DMA_CONTROL_WCON_MSK |ALTERA_AVALON_DMA_CONTROL_GO_MSK |ALTERA_AVALON_DMA_CONTROL_LEEN_MSK |ALTERA_AVALON_DMA_CONTROL_WORD_MSK);
}
 
void store_filter(alt_u8* rx_buf)
{
 // Clear any pending interrupts
  IOWR_ALTERA_AVALON_DMA_CONTROL (STORE_DMA_BASE, 0);
  IOWR_ALTERA_AVALON_DMA_STATUS (STORE_DMA_BASE, 0);
  IOWR_ALTERA_AVALON_DMA_RADDRESS (STORE_DMA_BASE, FILTER_BASE);
  IOWR_ALTERA_AVALON_DMA_WADDRESS (STORE_DMA_BASE, rx_buf );
  IOWR_ALTERA_AVALON_DMA_LENGTH (STORE_DMA_BASE, IMAGE_BUF_BYTES);
  IOWR_ALTERA_AVALON_DMA_CONTROL (STORE_DMA_BASE,ALTERA_AVALON_DMA_CONTROL_RCON_MSK |ALTERA_AVALON_DMA_CONTROL_GO_MSK |  ALTERA_AVALON_DMA_CONTROL_LEEN_MSK |ALTERA_AVALON_DMA_CONTROL_WORD_MSK);
}

Altera_Forum · ‎06-01-2011

I would recommend using two slave ports so that you can push data into your block and read it out concurrently. So one port would be write only (write, writedata, address) and the other would be read only (read, readdata, readdatavalid, address). Some of the signals I listed are optional; I would look them up in the Avalon spec to see whether you need them.

Altera_Forum · ‎06-13-2011

I would suggest taking a look at http://www.terasic.com and downloading the example files for the 5 MP camera module.

It is not written for a NIOS interface, but it does use ALTSHIFT_TAPS. From my limited analysis of the example, it appears to do a Bayer-pattern to RGB colorspace conversion in real time after reading the first 3 lines of the CCD sensor. Perhaps this could help answer some questions and guide you closer to a solution.

Be aware that I absolutely HATE the way they write their examples, as they are almost never commented, and they use fairly cryptic register and wire names, that only become clear after you analyze the code for a little while.