IOWR and IORD with PIOs

Altera_Forum · ‎05-03-2011

I have a basic question concerning PIOs. Currently, I am sending four 8-bit integers over four output PIOs from Nios to hardware, which does a calculation on them and returns another 8-bit integer value. This is achieved as below.


alt_u8 val_1 = 10;
alt_u8 val_2 = 15;
alt_u8 val_3 = 20;
alt_u8 val_4 = 25;
alt_u8 result_val;
 
IOWR_ALTERA_AVALON_PIO_DATA(DATA_OUT1_BASE , val_1);
.
.
IOWR_ALTERA_AVALON_PIO_DATA(DATA_OUT4_BASE , val_4);
 
result_val = IORD_ALTERA_AVALON_PIO_DATA(RESULT_IN0_BASE);

Now instead of using 4 PIOs of width 8, I want to use 1 PIO of width 32 to send my four integer values. What are the IOWR commands that I should write assuming my new 32-bit PIO base address is DATA_OUT_32_BASE?

Also on the hardware side where the calculation is done, the input port will now be a 32-bit input, i.e. [31:0] data_32_in;. How do I separate that input signal into its 4 component integers to use in the calculation?

Altera_Forum · ‎05-03-2011

It would be something like this:

IOWR_ALTERA_AVALON_PIO_DATA(DATA_OUT_32_BASE, ((alt_u32)val_1<<24) | ((alt_u32)val_2<<16) | ((alt_u32)val_3<<8) | val_4);

But this is very inefficient!

Still better the 4 byte writes.

Regarding the hw side, every 8bit component will simply get the 8 bits it needs

i.e. (Verilog)

assign data_8_in1[7:0] = data_32_in[7:0];

assign data_8_in2[7:0] = data_32_in[15:8];

assign data_8_in3[7:0] = data_32_in[23:16];

assign data_8_in4[7:0] = data_32_in[31:24];
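
The packing on the software side and the slicing on the hardware side can be checked against each other on a host PC. The following is a minimal sketch in plain C; `pack4` and `lane` are hypothetical helper names (not part of the Altera HAL), and `uint8_t`/`uint32_t` stand in for `alt_u8`/`alt_u32`. It shows that with the write above, val_4 ends up in bits 7:0 (data_8_in1 in the Verilog) and val_1 in bits 31:24 (data_8_in4).

```c
#include <assert.h>
#include <stdint.h>

/* Pack four bytes into one 32-bit word: b3 lands in bits 31:24, b0 in bits 7:0.
   The casts avoid shifting a promoted signed int into its sign bit. */
static uint32_t pack4(uint8_t b3, uint8_t b2, uint8_t b1, uint8_t b0)
{
    return ((uint32_t)b3 << 24) | ((uint32_t)b2 << 16) |
           ((uint32_t)b1 << 8)  |  (uint32_t)b0;
}

/* Extract byte lane 0..3, where lane 0 is bits 7:0
   (the same bits as data_32_in[7:0] in the Verilog above). */
static uint8_t lane(uint32_t word, unsigned n)
{
    assert(n < 4);
    return (uint8_t)(word >> (8u * n));
}
```

For example, `pack4(val_1, val_2, val_3, val_4)` would be the value handed to the single IOWR, and `lane(word, 0)` is what the hardware sees on its lowest byte.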

Altera_Forum · ‎05-03-2011

--- Quote Start ---

It would be something like this:

IOWR_ALTERA_AVALON_PIO_DATA(DATA_OUT_32_BASE, ((alt_u32)val_1<<24) | ((alt_u32)val_2<<16) | ((alt_u32)val_3<<8) | val_4);

But this is very inefficient!

Still better the 4 byte writes.

--- Quote End ---

Thank you very much for the code.

Could you please tell me why it is inefficient? I wanted to do it this way because I could potentially need to send 16 alt_u8 integers at a time for another calculation. Instead of having 16 8-bit PIOs, I wanted 4 32-bit ones just to make things less cumbersome in SOPC builder. So do you think I should keep my 16 separate 8-bit PIOs or is there a better way to send these integers?

Altera_Forum · ‎05-03-2011

If you already have the 32-bit value (or, in general, data whose width matches the port width), the single write is efficient.

But if you must build the 32-bit word manually from single bytes (as in my example), the CPU generally has to do extra work, unless your bytes are already packed in memory in the correct order.

Your idea is correct, but you must take care to declare the 16 bytes of data so that in memory they are equivalent to four 32-bit words. Then a pointer cast lets you read them as words with no extra CPU effort.

For example:

alt_u8 array8b[16];

alt_u32 * array32b;

array32b = (alt_u32*)array8b;

Now any reference to array32b[n] is equivalent to the following (Nios II is little-endian):

(array8b[n*4] | (array8b[n*4+1]<<8) | (array8b[n*4+2]<<16) | (array8b[n*4+3]<<24))
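
The pointer cast relies on the byte array being aligned like a 32-bit word, which a plain alt_u8 array does not guarantee. One way to get that guarantee is a union, sketched below in plain C. The names `packed16_t`, `word_le` and `host_is_little_endian` are made up for this illustration, and `uint8_t`/`uint32_t` stand in for `alt_u8`/`alt_u32`.

```c
#include <assert.h>
#include <stdint.h>

/* The union gives the byte array the alignment of uint32_t,
   so reading it word by word is safe. */
typedef union {
    uint8_t  bytes[16];
    uint32_t words[4];
} packed16_t;

/* The word a little-endian CPU sees at index n, built byte by byte
   (the same expression as in the post above). */
static uint32_t word_le(const uint8_t *bytes, unsigned n)
{
    assert(n < 4);
    return  (uint32_t)bytes[n * 4]
         | ((uint32_t)bytes[n * 4 + 1] << 8)
         | ((uint32_t)bytes[n * 4 + 2] << 16)
         | ((uint32_t)bytes[n * 4 + 3] << 24);
}

/* Returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void)
{
    const union { uint32_t w; uint8_t b[4]; } probe = { 1u };
    return probe.b[0] == 1;
}
```

On a little-endian machine, `p.words[n]` and `word_le(p.bytes, n)` give the same value, so the word can be passed straight to IOWR without any shifting.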

Altera_Forum · ‎05-04-2011

OK I understand the IOWR inefficiency, and thanks for the new code. I don't think I would have been able to do that on my own so quickly... pointers and me tend not to go well together.

Now suppose I have a 2-D array of alt_u8 of size 4 x 10, which represents an image. I want to apply a convolution filter to it by 'sliding' a 3x3 window over it, collecting 8 values each time to send to hardware for its calculation. Is there a smart way to use pointers here too, given that the data are already packed in memory? I could not apply your suggested method and ended up doing it your first way (see code below).


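alt_u8 array_image[4][10]; // fill in values

for (Y = 0; Y < (4-2); Y++) {
    for (X = 0; X < (10-2); X++) {

        // hardware calculation: top row and middle-left of the 3x3 window
        IOWR_ALTERA_AVALON_PIO_DATA(DATAA_OUT1_32BITS_BASE, ((alt_u32)array_image[Y][X]<<24)|((alt_u32)array_image[Y][X+1]<<16)|((alt_u32)array_image[Y][X+2]<<8)|array_image[Y+1][X]);

        // middle-right and bottom row of the 3x3 window
        IOWR_ALTERA_AVALON_PIO_DATA(DATAA_OUT2_32BITS_BASE, ((alt_u32)array_image[Y+1][X+2]<<24)|((alt_u32)array_image[Y+2][X]<<16)|((alt_u32)array_image[Y+2][X+1]<<8)|array_image[Y+2][X+2]);

        // results back from hardware
        store_array[Y][X] = IORD_ALTERA_AVALON_PIO_DATA(RESULTA_IN_8BITS_BASE);
    }
}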

I tried using the pointers method but am I right to say that I will need to re-assign 8 values to the alt_u8 array8b inside the two for-loops at each iteration loop and hence require more cpu work?

I have new questions concerning IOWR.

- When I have two IOWRs sequentially as in my code above, do these two 'writes' happen at the same time, i.e. does the hardware module that is waiting for the data get them all at once?

- To increase calculation speed, my next step will be to instantiate 2 calculation modules in my top level .v file, and modify the for-loops such that my image is divided into 2 (i.e. 4x5 and 4x5), and data from each sub-image goes to its respective hardware module instantiation. To do this I will need another set of PIOs (2x32 bits out and 1x8 bits in). Is this the right way to do it? If I keep dividing my big image into more sub-images, e.g. 4, do I add another 2 sets of PIOs? I will end up having many sets of PIOs if my image is bigger and I want to sub-divide more!

Is there a better way to transfer the data from NIOS to the different hardware modules? Another forum member has suggested to create a wrapper module for all my instantiations and create an Avalon MM to interface to the array, but I am completely lost! Can somebody please guide me a bit through this?

Altera_Forum · ‎05-05-2011

--- Quote Start ---

- When I have two IOWRs sequentially as in my code above, do these two 'writes' happen at the same time, i.e. does the hardware module that is waiting for the data get them all at once?

--- Quote End ---

In such cases the writes happen sequentially, but the time between them can be unpredictable. It depends on your hardware and software. For example, another Avalon master could take control of the bus between them, or your program might have to service an IRQ.

In any case, I don't think the PIOs would switch all at once; you need a latch module if you want to synchronize them.

--- Quote Start ---

- To increase calculation speed, my next step will be to instantiate 2 calculation modules in my top level .v file, and modify the for-loops such that my image is divided into 2 (i.e. 4x5 and 4x5), and data from each sub-image goes to its respective hardware module instantiation. To do this I will need another set of PIOs (2x32 bits out and 1x8 bits in). Is this the right way to do it? If I keep dividing my big image into more sub-images, e.g. 4, do I add another 2 sets of PIOs? I will end up having many sets of PIOs if my image is bigger and I want to sub-divide more!

--- Quote End ---

I don't clearly understand what you mean, or whether it is really correct for your purpose. Probably yes. Dividing the main task and replicating hardware in this way normally brings a great improvement in speed, but also a great increase in FPGA resource utilization; you must find the optimal configuration for your actual system.

--- Quote Start ---

Is there a better way to transfer the data from NIOS to the different hardware modules? Another forum member has suggested to create a wrapper module for all my instantiations and create an Avalon MM to interface to the array, but I am completely lost! Can somebody please guide me a bit through this?

--- Quote End ---

What the other member suggested is generally right. PIOs are very inefficient if you want to write to an external register, and an MM slave interface is more convenient.

But in your case I see you don't have a 'real' memory interface, since your external module is totally asynchronous (and maybe combinatorial?), so I don't see any improvement in switching from PIO to Avalon MM.

However, explaining Avalon MM here is not feasible, since it would require a lot of space. You'll learn more by browsing the Nios/SOPC documentation or searching the forum.

Regards

Cris

Altera_Forum · ‎05-11-2011

--- Quote Start ---

What the other member suggested is generally right. PIOs are very inefficient if you want to write to an external register, and an MM slave interface is more convenient.

But in your case I see you don't have a 'real' memory interface, since your external module is totally asynchronous (and maybe combinatorial?), so I don't see any improvement in switching from PIO to Avalon MM.

However, explaining Avalon MM here is not feasible, since it would require a lot of space. You'll learn more by browsing the Nios/SOPC documentation or searching the forum.

Regards

Cris

--- Quote End ---

Thanks Cris. Yes, it is combinatorial, with the module just doing some addition/subtraction on eight 8-bit numbers and giving out an 8-bit result. Concerning the lack of improvement in switching from PIO to MM, I have to say that my application is for learning purposes and so I want to explore all the possibilities.

Anyway, from reading the literature, I found I could create a custom instruction that would work with combinational circuits. The custom instruction module accepts two 32-bit inputs and gives out a 32-bit output, which is more or less what my module needs. So I implemented that. But now I don't know how to pass the inputs to that custom logic module. Given that my inputs are for example 1, 2, ..., 8, I thought I could have the data in two arrays and then send these arrays as pointers by calling __builtin_custom_inpp as:

#define ALT_CI_MYCALC_C_INSTR_N 0x0
#define ALT_CI_MYMACRO(A,B) __builtin_custom_inpp(ALT_CI_MYCALC_C_INSTR_N, (A), (B))

alt_u8 small_1[4] = {1,2,3,4};
alt_u8 small_2[4] = {5,6,7,8};

alt_u8 *array_1 = &small_1;
alt_u8 *array_2 = &small_2;

int custom_int_res = 0;
custom_int_res = ALT_CI_MYMACRO((void *)array_1,(void *)array_2);

But I am not getting the right result this way. Is the use of pointers correct in this situation? If yes, can somebody please help me find where I have gone wrong in my usage of pointers?

Thanks

Altera_Forum · ‎05-11-2011

Pointers and arrays are *almost* the same thing in C.

&(small_1) // a pointer to the whole array: the same address as small_1, but of type alt_u8 (*)[4], which ain't what you want here.

&(small_1[0]) == small_1 // small_1 without an index is a pointer to the zero'th item in the array. The [] does much the same thing that the & un-does.

also, (small_1+0), (small_1+1), etc. point to individual items in the array, regardless of the size of the items

...so there isn't any need to create a new variable. Try just going

ALT_CI_MYMACRO((void *)small_1, ...

If you *really* want, you can use &(small_1[0]) instead of small_1 to make the point, but that's unnecessary. Then you would have (void *) &(small_1[0])... Array, pointerize, typecast ... whew!

Sort of like

if (!!!true!=!false)... // what?

besides, your new variables are pointers to alt_u8, not arrays of alt_u8 (it's different). You eventually typecast to void* (so who cares?), but it's still bad form.
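
To make the array/pointer relationship concrete, here is a small host-side sketch in plain C. The function name `same_address` is made up for this illustration, and `uint8_t` stands in for `alt_u8`. It shows that `small_1`, `&small_1[0]` and `&small_1` all carry the same address and differ only in type.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when all three spellings refer to the same address. */
static int same_address(void)
{
    uint8_t small_1[4] = {1, 2, 3, 4};

    uint8_t *decayed    = small_1;      /* array decays to a pointer to element 0 */
    uint8_t *first      = &small_1[0];  /* explicit pointer to element 0 */
    uint8_t (*whole)[4] = &small_1;     /* pointer to the whole 4-byte array */

    /* Pointer arithmetic and indexing reach the same element. */
    assert(*(small_1 + 2) == small_1[2]);

    return decayed == first && (void *)whole == (void *)first;
}
```

Because the addresses are identical, passing `small_1` directly is enough; the extra pointer variables only add a type mismatch that the compiler may warn about.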

Altera_Forum · ‎05-12-2011

Thanks for the explanation Donq. I have to master this pointer business quickly! I tried the changes but I am still not successful in what I am trying to do.


 
alt_u8 small_1[4] = {1,2,3,4};
alt_u8 small_2[4] = {5,6,7,8};
 
alt_u32 custom_int_res = 0;
custom_int_res = ALT_CI_MYMACRO((void *)small_1,(void *)small_2);

Perhaps I am doing something wrong in the custom logic, which is shown below; I am expecting my answer to be 1 + 2 + 3 + 5 = 11.


module mycalc_c_instr(dataa, datab, result);
input  [31:0] dataa;
input  [31:0] datab;
output [31:0] result;

assign result = dataa[7:0] + dataa[15:8] + dataa[23:16] + datab[7:0];
endmodule

Or should I try returning a pointer value instead of an integer (i.e. using __builtin_custom_pnpp)?

It's probably a stupid mistake again from me, but I can't figure it out! Any help please?

Thanks

Altera_Forum · ‎05-12-2011

What result do you get instead of the expected 11?

Try with this alternate data:

alt_u8 small_1[4]= {0x01,0x02,0x04,0x08};

alt_u8 small_2[4]= {0x10,0x20,0x40,0x80};

With this trick you can easily see which data has been added in the result: every bit set to 1 will indicate a specific addend.
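
Reading the result of the one-hot test can also be automated. The sketch below is plain C for a host PC; `count_addends` and `addend_was_used` are hypothetical helper names, and it assumes each test byte is added at most once, so the sum never exceeds 0xFF.

```c
#include <assert.h>
#include <stdint.h>

/* Count how many of the eight one-hot test bytes contributed to `sum`. */
static unsigned count_addends(uint32_t sum)
{
    unsigned count = 0;
    unsigned bit;

    assert(sum <= 0xFFu);  /* assumption: every addend is used at most once */
    for (bit = 0; bit < 8; bit++) {
        if (sum & (1u << bit)) {
            count++;
        }
    }
    return count;
}

/* Index 0..3 means small_1[0..3], index 4..7 means small_2[0..3]. */
static int addend_was_used(uint32_t sum, unsigned index)
{
    assert(index < 8);
    return (int)((sum >> index) & 1u);
}
```

For instance, a result of 0x17 (0x01 + 0x02 + 0x04 + 0x10) would mean that small_1[0], small_1[1], small_1[2] and small_2[0] were summed.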

Altera_Forum · ‎05-12-2011

--- Quote Start ---

What result do you get instead of the expected 11?

Try with this alternate data:

alt_u8 small_1[4]= {0x01,0x02,0x04,0x08};

alt_u8 small_2[4]= {0x10,0x20,0x40,0x80};

With this trick you can easily see which data has been added in the result: every bit set to 1 will indicate a specific addend.

--- Quote End ---

Thanks Cris for the handy trick. Well, I found out that no matter what data I had in the array, I was always getting 930 as result. In memory, this appeared as A2 and 03. So I deduced something was definitely wrong with my passing of inputs as pointers. Then I tried passing them as 32-bit values, i.e. using __builtin_custom_inii and


alt_u32 input1_32bits = ((alt_u32)small_1[3]<<24) | ((alt_u32)small_1[2]<<16) | ((alt_u32)small_1[1]<<8) | small_1[0];

alt_u32 input2_32bits = ((alt_u32)small_2[3]<<24) | ((alt_u32)small_2[2]<<16) | ((alt_u32)small_2[1]<<8) | small_2[0];
 
custom_int_res = ALT_CI_MYMACRO(input1_32bits ,input2_32bits );

This one gave the expected result of 11! I will keep using __builtin_custom_inii for now, and next I will try the pointer trick from your earlier post, which didn't require any CPU effort. But knowing me, be prepared to hear from me soon, as pointers and I will never be friends :)

Altera_Forum · ‎05-12-2011

Like you already found out, I was going to suggest that you might still have some other problems with pointers. If you pass a pointer, you have to dereference the pointer in the code/macro/firmware before you can get the info that the pointer points to. You were probably getting the results of the addresses of the data, rather than the values of the data.
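
The difference between passing values and passing addresses can be tried on a host PC with a software model of the logic. This is only a sketch: `model_mycalc` and `pack_le` are made-up names, and the lane selection (three low bytes of dataa plus the low byte of datab) is an assumption chosen to match the expected 1 + 2 + 3 + 5 = 11, not a confirmed copy of the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the combinational logic, under the ASSUMED lane
   selection: bytes 0, 1 and 2 of dataa plus byte 0 of datab. */
static uint32_t model_mycalc(uint32_t dataa, uint32_t datab)
{
    return (dataa & 0xFFu) + ((dataa >> 8) & 0xFFu) +
           ((dataa >> 16) & 0xFFu) + (datab & 0xFFu);
}

/* Pack a 4-byte array into the word a little-endian load would produce. */
static uint32_t pack_le(const uint8_t bytes[4])
{
    assert(bytes != 0);
    return  (uint32_t)bytes[0]        | ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
}
```

Feeding the model the packed values of {1,2,3,4} and {5,6,7,8} gives 11, whereas feeding it two addresses gives a sum of address bytes that does not change with the array contents, which fits the constant result seen earlier.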

Altera_Forum · ‎05-13-2011

I assumed your ALT_CI_MYMACRO parameters were pointers and this was wrong.

From your last post I understand it needs 32bit values.

Then change the code this way:


alt_u8 small_1[4] = {1,2,3,4};
alt_u8 small_2[4] = {5,6,7,8};
 
alt_u32 custom_int_res = 0;
custom_int_res = ALT_CI_MYMACRO( *((alt_u32 *)small_1), *((alt_u32 *)small_2) );

The (alt_u32*) cast forces the compiler to interpret small_1 as a pointer to an alt_u32 value instead of the alt_u8 you defined. The * then dereferences that pointer, yielding the pointed-to value (which is now seen as a u32).

Regards

Cris

Altera_Forum · ‎05-13-2011

--- Quote Start ---

Then change the code this way:


alt_u8 small_1[4] = {1,2,3,4};
alt_u8 small_2[4] = {5,6,7,8};
 
alt_u32 custom_int_res = 0;
custom_int_res = ALT_CI_MYMACRO( *((alt_u32 *)small_1), *((alt_u32 *)small_2) );

--- Quote End ---

Thank you so much again Cris, and for the very good explanation too. I tried your method and the operation took fewer ticks than when using '<<'.

Now I have another problem :) Those small_1 and small_2 arrays have changing values, coming from a bigger array. The 8 values are collected as a 'sliding window', as shown in the code below.


#define ROW 3
#define COL 5

// 2-D array containing unsigned 8-bit integers
alt_u8 array_image[ROW][COL] = { /* fill in values */ };

alt_u8 small_1[4];
alt_u8 small_2[4];

for (y = 0; y < (ROW-2); y++) {
    for (x = 0; x < (COL-2); x++) {
        small_1[0] = array_image[y][x];     small_1[1] = array_image[y][x+1];
        small_1[2] = array_image[y][x+2];   small_1[3] = array_image[y+1][x];

        small_2[0] = array_image[y+1][x+2]; small_2[1] = array_image[y+2][x];
        small_2[2] = array_image[y+2][x+1]; small_2[3] = array_image[y+2][x+2];

        store_result = ALT_CI_MYMACRO( *((alt_u32 *)small_1), *((alt_u32 *)small_2) );
    }
}

But the fact that I am populating the small arrays within the loops requires nios cpu time. Is there instead a smart way again using the pointers to do this? For example, suppose my 3x5 data is as such:

10 20 30 40 50

11 21 31 41 51

12 22 32 42 52

I want the following values in each small array at each loop iteration.

at x = 0, small_1 = {10,20,30,11} and small_2 = {31,12,22,32}

at x = 1, small_1 = {20,30,40,21} and small_2 = {41,22,32,42}

at x = 2, small_1 = {30,40,50,31} and small_2 = {51,32,42,52}
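
The window layout listed above can be written as a single gather function and checked on a host PC. This is a sketch in plain C, with `gather_window` and `pack4_le` as made-up names and `uint8_t`/`uint32_t` standing in for `alt_u8`/`alt_u32`. It assumes the first value of each small array belongs in bits 7:0, as a little-endian word load of that array would give.

```c
#include <assert.h>
#include <stdint.h>

#define ROW 3
#define COL 5

/* Pack four bytes so that b0 sits in bits 7:0, as a little-endian load would. */
static uint32_t pack4_le(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return  (uint32_t)b0        | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Gather the eight outer values of the 3x3 window whose top-left corner
   is (y, x); the centre value is skipped, as in the listing above. */
static void gather_window(uint8_t img[ROW][COL], unsigned y, unsigned x,
                          uint32_t *word_1, uint32_t *word_2)
{
    assert(y + 2 < ROW && x + 2 < COL);
    *word_1 = pack4_le(img[y][x],     img[y][x + 1],     img[y][x + 2],     img[y + 1][x]);
    *word_2 = pack4_le(img[y + 1][x + 2], img[y + 2][x], img[y + 2][x + 1], img[y + 2][x + 2]);
}
```

The eight values are not contiguous in memory (they span three rows), so a single pointer cast cannot fetch them; some per-window gathering like this remains necessary.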

Altera_Forum · ‎05-13-2011

Well, conceptually you can do something like this:

alt_u8 array[8] = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08 };

Then

*((alt_u32*)array) returns 0x04030201

*((alt_u32*)(array+1)) returns 0x05040302

*((alt_u32*)(array+2)) returns 0x06050403

...

and so on

For clarity, since you are not used to handling pointers, note that

*((alt_u32*)(array+n)) is the same as *((alt_u32*)&(array[n]))

In practice, I can't remember whether Nios requires 32-bit alignment for this to work as intended. This depends on the data bus architecture.

Depending on whether it does, performance will be very different: in one case the processor can do the byte-to-32-bit packing with a single memory access, while in the other it must still handle the individual bytes as you do now.

Since you already measure CPU time, you can simply put this in your code and test whether you get an improvement.
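
If alignment turns out to be a problem, a portable fallback is to copy the four bytes with memcpy instead of dereferencing a cast pointer. The sketch below is plain C; `load_u32` is a made-up helper name, and whether the compiler turns the copy into a single load depends on the target, so this is something to measure rather than assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit word starting at any byte offset of buf, aligned or not.
   The bytes are interpreted in the byte order of the CPU running the code. */
static uint32_t load_u32(const uint8_t *buf, size_t len, size_t offset)
{
    uint32_t value;

    assert(offset + sizeof value <= len);  /* stay inside the buffer */
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

On a little-endian CPU, `load_u32(array, 8, 1)` on the 8-byte array above yields 0x05040302, the same value as the cast expression, without relying on an unaligned access.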

Altera_Forum · ‎05-15-2011

Thanks for the explanation. Well, I realize that my method is not possible (at least not in a straightforward way), and sending data to hardware this way is not my final goal anyway. But it got me to understand a bit about custom instructions and a lot more about pointers, thanks to Cris :)