Move data from NIOS to hardware and then back to NIOS

Altera_Forum
Honored Contributor II

I am trying to do a small calculation in hardware but I'm stuck on how to get started. I'm a complete newbie, but after reading a lot of manuals and demos, I figured out that I could use a Nios II soft processor that runs C code. Using the basic computer configuration for the DE2 (it comes with the University Program), I created a Nios II BSP project. After many attempts, I managed to get everything compiled and the code below running on the target hardware.

 

int main(void)
{
    int line1[] = {1, 2, 3, 4, 5, 6, 7};
    int line2[] = {1, 2, 3, 4, 5, 6, 7};

    //--- Do this in verilog!
    int i;
    int store_val = 0;

    for (i = 0; i < 7; i++)
        store_val = store_val + (line1[i] * line2[i]);
    //--- verilog_end

    return 0;
}

 

Now the next step is that I want to do the calculation in the for-loop section in hardware using verilog rather than C. I have three main queries: 

1) How to write that for-loop function in verilog? 

2) Where do I save the verilog module? 

3) When I have the verilog equivalent, how do I move the values contained in the two C arrays to the verilog module to do the calculation and then send back 'store_val' to the BSP project? I read about DMA in other posts but I don't know how to do that process.  

 

This is confusing me. Any help is greatly appreciated. Thank you very much
14 Replies
Altera_Forum

I haven't used a for loop in Verilog before, but this link shows you how to:

http://www.asic-world.com/verilog/verilog_one_day2.html 

 

You have to save your Verilog module in the same project in which you created your processor. After that you can right-click on that file in the Files tab and select 'create symbol for this file'.

After that you can use that symbol in the bdf editor just like any other module (double-click in the bdf editor and your module should be in the project folder).

 

Sending data from your processor to your Verilog module is trickier.

What you can do is add a 16-bit PIO device to your processor and send the data through it integer by integer, because it is very impractical to send the whole array at once.

If I understood your program right, you can create two 16-bit PIO devices and send line1 and line2 separately, one pair of elements at a time, and write your Verilog to process the data as it comes in. This way you don't need a for loop in your Verilog.
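The element-by-element transfer described above might look like this on the Nios II side. It is only a minimal sketch and has not been run on hardware. The function name `pio_sum_of_products` is made up for illustration, and the three pointers stand for the data registers of two output PIOs and one input PIO (seen from the CPU), whose real base addresses come from your own SOPC Builder system. It also assumes the Verilog block accumulates one product per pair written; how the hardware detects that a new pair has arrived (for example a strobe bit) is left out.

```c
#include <stdint.h>
#include <stddef.h>

/* Write each pair of operands to two PIO data registers, then read the
 * accumulated result back from a third PIO. All three pointers are
 * assumptions: on real hardware they would be the PIO base addresses
 * assigned by SOPC Builder. */
uint32_t pio_sum_of_products(volatile uint32_t *line1_reg,
                             volatile uint32_t *line2_reg,
                             volatile uint32_t *result_reg,
                             const uint16_t *line1,
                             const uint16_t *line2,
                             size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        *line1_reg = line1[i];   /* first operand of the pair */
        *line2_reg = line2[i];   /* second operand of the pair */
    }
    return *result_reg;          /* value currently presented by the hardware */
}
```

The pointers are declared `volatile` so the compiler does not optimize away the repeated writes to the same address.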
Altera_Forum

Hi vivasam, 

 

1) For loops in Verilog aren't synthesizable; they only work in simulation. You have to write a process with a clock input. It is evaluated once on every clock cycle, so you can do the mathematical operations there. You may need a second process to count the cycles and stop when the loop is done.

2) The Verilog module has to be saved in the Quartus project folder or a subfolder. You can use the Quartus II software to create the Verilog module, and ModelSim to simulate it.

3) DMA is not the right way to communicate between nios and hardware. You can use FIFOs or dual-port RAMs to communicate; FIFOs are easier to use. You can implement these functions by using the MegaWizard in the Quartus II software; both are standard modules. If your Nios and the Verilog module run on different clocks, you should use clock-crossing FIFOs. As normad said, you can also use the PIO (a module for SOPC Builder), but be careful with the clocking.
Altera_Forum

For loops are synthesizable, but they may give different results than you expect. In this case the loop would be unrolled into one multiplier and one adder per iteration (seven of each for this 7-element loop), all working in parallel to do the operation in one clock cycle. If you want only one multiplier and are happy to spend one clock cycle per element instead, then yes, you need to get rid of the loop and write a clocked process.
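To make the difference concrete, here is a small software model of the two options, written in C rather than Verilog. It is only a sketch for reasoning about cycle counts: `mac_state`, `mac_clock` and `mac_unrolled` are made-up names, and each call to `mac_clock` stands for one clock edge of the single-multiplier version.

```c
#include <stdint.h>
#include <stddef.h>

/* State held in registers between clock edges. */
typedef struct {
    uint32_t acc;    /* running sum of products */
    size_t   index;  /* which element pair is processed next */
} mac_state;

/* One clock edge of the serial version: a single multiply-accumulate.
 * Returns 1 while work remains, 0 once all pairs are consumed. */
int mac_clock(mac_state *s, const uint16_t *a, const uint16_t *b, size_t n)
{
    if (s->index >= n) {
        return 0;                               /* done: hold the result */
    }
    s->acc += (uint32_t)a[s->index] * b[s->index];
    s->index++;
    return s->index < n;
}

/* The unrolled version: what a synthesized for loop computes in one cycle. */
uint32_t mac_unrolled(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint32_t acc = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        acc += (uint32_t)a[i] * b[i];
    }
    return acc;
}
```

Both give the same sum. The serial model needs one call per element and corresponds to one multiplier in hardware, while the unrolled loop corresponds to one multiplier per element.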

Altera_Forum

Thank you everybody for the tips but I am now more lost than before :)! OK let's start somewhere. Please correct me where I am wrong. 

 

1) Starting with normad's PIO suggestion, I inserted three 16-bit PIO ports (2 in and 1 out) in SOPC Builder. I called them line_1_in, line_2_in and result_out respectively. Each one has a Base and End address assigned by SOPC, e.g. 0x08200000-0x0820000f for the line_1_in. Then I declared these ports in the top level .v file. as: 

 

input LINE_1_DATA; 

input LINE_2_DATA; 

output RESULT_BACK; 

 

and added these lines also to the internal modules: 

.in_port_to_the_line_1_in (LINE_1_DATA), 

.in_port_to_the_line_2_in (LINE_2_DATA), 

.out_port_from_the_result_back_out (RESULT_BACK), 

 

My new queries: 

- What clock do I assign to these PIOs? Now they are at 50 MHz like all other components. 

- To go back to the C code now, how do I tell the Nios II to send data to these PIOs? I am assuming that I need to use pointers to access the memory addresses, but then I don't know how to move on to the next step. I am not sure how to get rid of the for loop either. Any suggestions please?

 

volatile int * line_1_ptr = (int *) 0x08200000; // Port_in_1 address 

volatile int * line_2_ptr = (int *) 0x08200010;// Port_in_2 address 

volatile int * result_back_ptr = (int *) 0x08200020;// Port_out_result  

 

*line_1_ptr = line1; // But line1 is an array 

*line_2_ptr = line2; // line2 is an array 

store_val = * result_back_ptr; 

 

2) How do I write a process with a clock input as suggested by nophutwern? 

 

3) This is my first attempt at writing a Verilog module. I don't even know how to compile it :), but I just wanted to know if this is the type of code that will do the calculation as the data comes in:

 

module my_sum_prod (
    // Inputs
    clk,
    line_1_in,
    line_2_in,
    // Output
    result_out
);

// Port declarations

// Inputs
input        clk;
input [15:0] line_1_in;
input [15:0] line_2_in;

// Output
output [15:0] result_out;

// Internal registers
reg [15:0] temp_sum;
reg [15:0] final_result_out;

// Power-up values (assignments to regs must sit inside a procedural block)
initial begin
    temp_sum         = 16'd0;
    final_result_out = 16'd0;
end

// On every clock edge: multiply the current inputs and add the previous
// product to the running total. Results are truncated to 16 bits.
always @(posedge clk)
begin
    temp_sum         <= line_1_in * line_2_in;
    final_result_out <= final_result_out + temp_sum;
end

assign result_out = final_result_out;

endmodule

 

4) When I manage to write a proper Verilog module, where and how do I call it in my main .c file so that it accepts the data from the Nios processor through the PIOs and returns a value again?

 

Thank you again. I'm just entering this whole FPGA world and I need step-by-step help. 

 

NB: To nophutwern: I will try the FIFO as soon as I get this methodology working. It's part of my learning process.
Altera_Forum

 

--- Quote Start ---  

 

DMA is not the right way to communicate between nios and hardware.  

--- Quote End ---  

 

 

The reason I talked about DMA is that I saw in a reference design (StratixII_DSP_Kit-v1.0.0) that the author reads an image from the flash memory card into a DMA buffer, applies an edge detector operation to the data, then sends it back to the DMA to be displayed on the VGA.

 

Suppose my data is not just 2 arrays of 7 integers but instead 2000 arrays of 1000 integers or a 2-D array of [2000][1000] if it is an image, do I still use FIFOs or DUALPORT-RAMs to communicate with hardware? 

 

My idea was to make an SOPC component with Avalon-MM slave interfaces out of the Verilog code (another thing I don't know how to do!) and 'slot it' into the data flow from the CPU data master, in between two DMA buffers, as in the reference design mentioned above. But I guess I have to do it the PIO, FIFO, or dual-port RAM way for now.
Altera_Forum

It depends on the kind of performance that you want. You can create an Avalon slave interface and have the CPU write all the integers one by one to your hardware. But if the only thing your hardware does is a multiplication, I don't think it will be a lot faster than doing it in software.

If you want better performance, it makes sense to use a DMA that will read the integers from your memory and give them to your custom hardware.

I'm not a Verilog expert, but I think that if you do a search on the forum you should find some Verilog templates for Avalon slave interfaces.
Altera_Forum

 


 

 

This simple 'sum of products' example is a starting step for me. Later on, there will be larger arrays and more calculations, something similar to an image processing task. So if I create a DMA and access the data using the DMA address in the Nios II, how do I interface that data with the simple 'sum of products' Verilog module? Do I still need those PIOs, FIFOs, etc.? Sorry for asking such basic questions.
Altera_Forum

No, if you use a DMA you don't need PIOs or dual-port memories. One way to do it would be to use an Avalon streaming (Avalon-ST) interface between the DMA and your hardware. You may need a FIFO between the two in some cases, but this can be added in SOPC Builder.

The PIOs are fine for slow systems with small amounts of data, but they are less efficient than a DMA.

Using a dual-port memory and an Avalon slave interface should be easier to build, but it seems that you will handle a lot of data, so it may be impossible to use on-chip memory to store it. That's why I'd recommend a DMA.

 

If you are new to Verilog you may want to start with something simpler than this project though.
Altera_Forum

 


 

 

Thank you for the advice. I will follow nophutwern's advice and use ModelSim to simulate my Verilog module. It is true that I don't know Verilog, and this is why I wanted to do a simple sum-of-products calculation as a starting phase. I can't change my project now... too late :(

Anyway, to go back to your suggestion of using an Avalon Streaming interface, I found an example similar to the StratixII_DSP_Kit-v1.0.0 design example, but this time it has an ST interface. This one is a ready-made SOPC component in Verilog, available in the Altera UP Video cores, called 'altera_up_avalon_video_edge_detection'. It is available from the altera_upds_setup.exe file in the FTP download area of the Altera University Program.

I don't care what calculations that component is making, but at least we know it is a module written by experts. This module takes in 8-bit image data and has an Avalon Streaming sink and source. From what I found on this forum, I have to use an SGDMA to interface with this module, something like CPU -> SGDMA (Mem to Stream) -> Altera UP module -> SGDMA (Stream to Mem).

My question is how to initiate the SGDMA transfer to stream data through the UP module. I read about alt_avalon_sgdma_open, _start, _stop etc., but I am not sure how to call these functions in my Nios II C code. Can somebody please give me example C code that does this, assuming that I have a 2-D array of integers already initialized at the top of my C code? I also have the CPU connected to the SDRAM and SRAM available on my board.

 

Thanks a lot.
Altera_Forum

You might have an easier time using this instead: http://www.alterawiki.com/wiki/modular_sgdma 

 

Assuming the UP IP core has an ST port going into the core and another ST port coming out of it, you could just slide it in between the read and write masters and do the transfer all in one shot. Just look at the Nios II software code to find out how to set up a descriptor.
Altera_Forum

 

--- Quote Start ---  

You might have an easier time using this instead: http://www.alterawiki.com/wiki/modular_sgdma 

 

 

--- Quote End ---  

 

 

Yeah the UP core has a ST port into the core and another ST port coming out of it. Before I slot the ST component in between the read and write masters and assuming that my original data is generated by the CPU and stored in the SDRAM, and the result after going through the UP core will also go in the SDRAM, do I need: 

- modular dispatcher (MM to MM), Read master and Write master like in the example file  

or  

- modular dispatcher (MM to ST) & Read master, and modular dispatcher (ST to MM) & Write master  

 

My feeling is that it is the first option but I just want a confirmation because it is my first time using a DMA type component.  

 

If suppose instead I had used the standard SG-DMA component available in SOPC, would I then need two SG-DMA components, i.e. MM to ST at transmission and ST to MM at receiver? 

 

OK, in the meantime I tried to adapt the example given on the website to my DE2-115 board, but the mSGDMA keeps spinning and my Nios program gets stuck in the loop 'while (sgdma_interrupt_fired == 0) {}'. How do I find the cause of the error? Could this error be due to the fact that I have not fully copied the SOPC example design, because I have not used the Avalon-MM Pipeline Bridge and the 'DDR SDRAM Controller with ALTMEMPHY'? I just have a CPU, an SDRAM controller and the Modular SGDMA components (all running at the same clock rate) at the moment. I did not add those components because I don't understand their purpose, but if needed, I will do it.
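One way to narrow down a hang like this is to replace the endless wait with a bounded one, so the program can report that the interrupt never arrived instead of spinning forever. The helper below is a hypothetical sketch, not part of the mSGDMA example code. `wait_for_flag` is a made-up name, and the flag is assumed to be a variable set to non-zero by the interrupt handler.

```c
/* Poll a flag set by an interrupt handler, but give up after max_polls
 * iterations instead of spinning forever.
 * Returns 0 if the flag was seen, -1 on timeout. */
int wait_for_flag(volatile int *flag, unsigned long max_polls)
{
    unsigned long polls;
    for (polls = 0; polls < max_polls; polls++) {
        if (*flag != 0) {
            return 0;
        }
    }
    return -1;
}
```

The flag has to be declared `volatile`, otherwise the compiler may read it once and never see the handler's update. A timeout suggests either that the transfer never completed or that the interrupt is not reaching the handler.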

 

Now, assuming I get that example working after adding all these extra components and clocks, what changes will I need to make to the parameter settings of my Read Master and Write Master components if the data to be transferred is now a 3 x 12 array of u8 integers, i.e. 36 bytes, defined as below in my C code?

alt_u8 my_2d_array[3][12] ; //Fill in the values 

source_buffer = &(my_2d_array[0][0]); 

 

The Data Width setting goes to 8. How about Length Width and FIFO Depth, and the other settings? What are the major changes I need to make in the example 'main.c' file?

 

Thanks
Altera_Forum

That's correct, to do this with the SGDMA on the ACDS you would need to control a pair of SGDMAs for doing MM-->ST and ST-->MM. With the mSGDMA if you wedged your block between the read and write masters then your control would be just triggering a normal DMA transfer (assuming for every input there is one output from your block). 
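As a rough picture of what "triggering a normal DMA transfer" involves, the sketch below fills in a descriptor with a source address, a destination address and a byte count. The struct layout, the `DESC_CONTROL_GO` bit and the function `build_descriptor` are all hypothetical and only illustrate the idea. The real descriptor format and control bits are defined by the mSGDMA design's own headers and documentation.

```c
#include <stdint.h>

/* Hypothetical descriptor layout for illustration only; the real field
 * order and control bits are defined by the mSGDMA design's own headers. */
typedef struct {
    uint32_t read_address;   /* where the read master fetches input */
    uint32_t write_address;  /* where the write master stores output */
    uint32_t length;         /* bytes to move; same count read and written */
    uint32_t control;        /* hypothetical: bit 0 = start transfer */
} dma_descriptor;

#define DESC_CONTROL_GO 0x1u /* hypothetical bit position */

/* Fill in a descriptor for one memory -> block -> memory transfer.
 * Returns 0 on success, -1 if the length is zero or above max_length. */
int build_descriptor(dma_descriptor *d, uint32_t src, uint32_t dst,
                     uint32_t length, uint32_t max_length)
{
    if (length == 0 || length > max_length) {
        return -1;
    }
    d->read_address  = src;
    d->write_address = dst;
    d->length        = length;
    d->control       = DESC_CONTROL_GO;
    return 0;
}
```

Because a single length field covers both the read and the write side, this only works when the block in the middle produces one output for every input, as noted above.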

 

The pipeline bridge isn't necessary, it's just included to add some additional pipelining to the design. If you are running the included software from the mSGDMA design make sure you change the settings near the top of the main.c file to represent your own system memory address base and span. I did a minor update to that software to fix a bug which would cause the source and destination buffers to overlap so make sure you have the latest main.c file. 

 

You probably don't need to modify many of the settings. The length width parameter just dictates the maximum number of bytes you can transfer in a single transfer. For example if you chose 20 bits that means you can transfer slightly less than 1MB of data in a single descriptor. The only reason why it's a parameter and not hardcoded to be 32-bits is that when a timing critical path shows up in one of the masters it can typically be solved by just reducing that length register width to something more sensible (being able to transfer ~4GB in one shot doesn't make sense in SOPC Builder which only has a 4GB space per master anyway). 
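The arithmetic behind the length width can be checked with a few lines of C. This sketch assumes the largest value an N-bit length register can hold is 2^N - 1 bytes, which matches the "slightly less than 1MB" figure for 20 bits. The function names are made up for illustration.

```c
#include <stdint.h>

/* Largest byte count that fits in a length register of the given width.
 * Intended for widths 1..32. */
uint32_t max_transfer_bytes(unsigned width_bits)
{
    if (width_bits >= 32) {
        return UINT32_MAX;                 /* 2^32 - 1, just under 4 GB */
    }
    return ((uint32_t)1 << width_bits) - 1u;
}

/* Number of descriptors needed to move total_bytes with that limit. */
uint32_t descriptors_needed(uint32_t total_bytes, unsigned width_bits)
{
    uint32_t max = max_transfer_bytes(width_bits);
    if (max == 0) {
        return 0;                          /* zero-width register: invalid */
    }
    return total_bytes / max + (total_bytes % max != 0);
}
```

With a 20-bit length register the limit is 1,048,575 bytes, so a 36-byte array fits in one descriptor with plenty of room to spare.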

 

Since you are dealing with MM to ST, the symbol size is 8 bits. So even if your component has an 8-bit input/output, you can still set up the mSGDMA for a wider data path. This will help increase your memory throughput, since multiple symbols can be fetched every clock cycle. I would use the C code only as a guideline; most of it has nothing to do with the mSGDMA, so it may end up leading to confusion. If you want to see a simpler application, check out this design example, which is configured for MM --> ST and performs frame buffering to an LCD. The only difference is that you would use MM --> MM and set up the descriptors slightly differently.

 

http://www.alterawiki.com/wiki/modular_sgdma
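The throughput point about wider data paths comes down to simple division. The sketch below counts how many memory beats are needed to fetch a buffer for a given data width, with 8-bit symbols as stated above. The function names are illustrative only, and it ignores bus latency and alignment.

```c
#include <stdint.h>

/* Symbols (bytes) carried per clock on a data path of the given width. */
uint32_t symbols_per_beat(uint32_t data_width_bits)
{
    return data_width_bits / 8u;           /* symbol size is 8 bits */
}

/* Clock beats needed to move byte_count symbols, rounding up. */
uint32_t beats_needed(uint32_t byte_count, uint32_t data_width_bits)
{
    uint32_t per_beat = symbols_per_beat(data_width_bits);
    if (per_beat == 0) {
        return 0;                          /* width below one symbol: invalid */
    }
    return (byte_count + per_beat - 1u) / per_beat;
}
```

For the 36-byte array discussed in this thread, an 8-bit data path needs 36 beats while a 32-bit data path needs 9.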
Altera_Forum

 

--- Quote Start ---  

 

If you are running the included software from the mSGDMA design make sure you change the settings near the top of the main.c file to represent your own system memory address base and span.  

 

--- Quote End ---  

 

 

OK, I found my mistake. I had the CPU Reset and Exception vectors on the SDRAM. I have now changed the DATA_SOURCE_BASE address to start further into the memory, and I can see the source and destination memory being the same after I run the program in debug mode.  
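A clash like this, where the DMA buffers land on memory the program itself occupies or on each other, can be caught with a simple range check before starting a transfer. The function below is a generic sketch. `regions_overlap` is a made-up name, and the bases and spans you pass in would come from your own system (for example the program's memory range and the buffer addresses derived from DATA_SOURCE_BASE).

```c
#include <stdint.h>

/* Return 1 if [base_a, base_a + span_a) and [base_b, base_b + span_b)
 * share at least one address, 0 otherwise. Uses 64-bit math so that
 * base + span cannot wrap around. */
int regions_overlap(uint32_t base_a, uint32_t span_a,
                    uint32_t base_b, uint32_t span_b)
{
    uint64_t end_a = (uint64_t)base_a + span_a;
    uint64_t end_b = (uint64_t)base_b + span_b;
    if (span_a == 0 || span_b == 0) {
        return 0;                          /* an empty region overlaps nothing */
    }
    return base_a < end_b && base_b < end_a;
}
```

Two 36-byte buffers placed back to back do not overlap, while a buffer that starts inside the region holding the reset and exception vectors does.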

 

Just a quick question: Is the statement ' test_counter = 0;' not supposed to be before the start of the 'do' loop? Because '(test_counter < NUMBER_OF_TESTS)' in the 'while condition' will always be true otherwise. 

 

The next step for me is to populate the source buffer with the 36 data values contained in my_2d_array, but I am struggling to do this. I've set MAXIMUM_BUFFER_SIZE to 36 and NUMBER_OF_BUFFERS to 2, but I am not confident enough with pointer notation to get the first 36 addresses after DATA_SOURCE_BASE to contain those values. Can anybody please help me with this?

 

 

 

--- Quote Start ---  

If you want to see a simpler application, check out this design example, which is configured for MM --> ST and performs frame buffering to an LCD. The only difference is that you would use MM --> MM and set up the descriptors slightly differently.

 

http://www.alterawiki.com/wiki/modular_sgdma 

--- Quote End ---  

 

 

Where is this design example? I can't find it on that website.  

 

 

 

--- Quote Start ---  

With the mSGDMA if you wedged your block between the read and write masters then your control would be just triggering a normal DMA transfer (assuming for every input there is one output from your block). 

 

--- Quote End ---  

 

 

Now that you mention to assume every input = output, I am realizing it is not the case with that UP IP block. This block takes in one 8-bit value at a time and uses altshift_taps shift register to get the data in the right format to do its processing. Anyway, this is not a problem for now as I just want to get those 36 values in my array moved around using Avalon ST and I will have to write my own core later.
Altera_Forum

Good catch, I never noticed that bug since I have always used the code in an infinite loop mode. I'll update the design some time this week to correct that. 
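The pattern in question, reduced to its bare bones, looks like the sketch below. This is not the actual main.c from the design example; `run_tests` and its body are stand-ins that only show where the counter initialization has to sit for the loop to terminate.

```c
/* Run a test body number_of_tests times. The counter is initialized once,
 * before the loop; resetting it inside the body would make the exit
 * condition always true. Returns the number of iterations executed. */
unsigned run_tests(unsigned number_of_tests)
{
    unsigned test_counter = 0;     /* set before the do loop starts */
    unsigned executed = 0;
    do {
        executed++;                /* stand-in for one DMA test pass */
        test_counter++;
    } while (test_counter < number_of_tests);
    return executed;
}
```

Note that a do loop always runs its body at least once, even when the requested number of tests is zero.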

 

I would probably write your own code to test your own hardware, since most of the mSGDMA code is there to set up random test buffers and all kinds of non-practical stuff. Your code should end up being a fraction of the length if you write it for your own application. You could run an uncached malloc (see the Nios II software handbook for more details) to allocate a pointer to some location in the heap and then dereference that pointer to populate the buffer.
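Populating the buffer from the 3 x 12 array asked about earlier could look something like this. It is a minimal sketch that builds on a host PC: `uint8_t` stands in for the HAL's `alt_u8`, `fill_source_buffer` is a made-up name, and the `buffer` pointer is assumed to point at wherever your source data should live (the uncached allocation mentioned above, or an address in your own memory map).

```c
#include <stdint.h>
#include <stddef.h>

#define ROWS 3
#define COLS 12

/* Copy a ROWS x COLS byte array into a flat buffer, row by row.
 * buffer must have room for ROWS * COLS bytes. Returns bytes written. */
size_t fill_source_buffer(volatile uint8_t *buffer,
                          uint8_t array[ROWS][COLS])
{
    size_t r, c, n = 0;
    for (r = 0; r < ROWS; r++) {
        for (c = 0; c < COLS; c++) {
            buffer[n] = array[r][c];   /* element (r, c) lands at offset r*COLS + c */
            n++;
        }
    }
    return n;
}
```

Since a 2-D C array is already stored row by row, the flat buffer ends up with the same byte order as `&my_2d_array[0][0]`; the explicit loops just make the indexing visible.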

 

Sorry I linked the wrong page, I meant this one: http://www.alterawiki.com/wiki/modular_sgdma_video_frame_buffer 

 

If the read and write lengths are not the same then you'll need to use a pair of DMAs since a DMA typically performs the same number of reads as writes. It would be possible to hack the dispatcher HDL to support different read and write lengths but I wouldn't recommend attempting that just yet.