Code in tightly coupled memory

Altera_Forum · ‎06-07-2010

Hi all,

I placed a function in a section separate from the main code. More precisely, my code is mapped to SDRAM and I linked a function testcall() into a tightly coupled memory section which I previously defined in SOPC Builder.

So I declared:

int testcall(int n) __attribute__ ((section (".tc_code")));

Then, following a suggestion I found in another thread, I called ALT_LOAD_SECTION_BY_NAME(tc_code) before calling testcall().

My questions are:

- Is ALT_LOAD_SECTION_BY_NAME(tc_code) mandatory? The code seems to work even if I don't use it.

- If the tc_code section is located in internal FPGA memory (Cyclone III M9K blocks), roughly how much speed improvement could I expect compared with SDRAM (assume I'm using 2 TSE with SGDMA)?

Thank you

Cris

Altera_Forum · ‎06-07-2010

I guess that ALT_LOAD_SECTION_BY_NAME(tc_code) copies the code from wherever the normal loader put it into the M9K block.

Whether this is necessary depends on how the program image was linked, and how it was loaded.

The JTAG loader reads the ELF program headers, so it can write into multiple disjoint areas, in which case no copy is required (if the program headers are 'correct').

In other cases, a simpler loader might have written the code somewhere else, so the copy is needed.

Putting code (and data) in tightly coupled memory areas gives the same access times as if the data were resident in the instruction/data cache.
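
As an illustration of the copy described above, here is a minimal host-side sketch in C. It is not the HAL source: copy_section() and the array-based "load" and "run" regions are hypothetical stand-ins for the linker-provided load address, start and end of a section such as tc_code, and word-aligned regions are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy a section image from where the loader left it
 * (load address) to where the code expects to run (run address).
 * Copies word by word until the run-side end pointer is reached. */
void copy_section(const uint32_t *load, uint32_t *run_start,
                  const uint32_t *run_end)
{
    /* If load and run addresses coincide, the loader already placed the
     * section correctly and nothing needs to be copied. */
    if (load == run_start)
        return;

    while (run_start != run_end)
        *run_start++ = *load++;
}

/* Host-side demo: returns 1 if the stand-in "TCM" matches the image. */
int copy_section_demo(void)
{
    static const uint32_t image[4] = { 0x11u, 0x22u, 0x33u, 0x44u };
    uint32_t tcm[4] = { 0xffffffffu, 0xffffffffu, 0xffffffffu, 0xffffffffu };
    size_t i;

    copy_section(image, tcm, tcm + 4);

    for (i = 0; i < 4; i++)
        if (tcm[i] != image[i])
            return 0;
    return 1;
}
```

If the loader has already written the section to its run address, the two pointers are equal and the call does nothing, which would match the observation that the code can work without the explicit load.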

Altera_Forum · ‎06-07-2010

Thank you for your answer and the explanation of the ALT_LOAD_SECTION behavior.

--- Quote Start ---

Putting code (and data) in tightly coupled memory areas gives the same access times as if the data were resident in the instruction/data cache.

--- Quote End ---

This was already clear to me. I simply wondered whether I can expect any significant speed improvement from placing frequently accessed code/data in a dedicated tightly coupled memory.

Note that I already use a Nios II/f core with cache; still, I want very fast execution of this function, so I'd like to rule out cache delays due to loading/flushing.

I add one more issue to this thread:

I tried to place in tc_code section a more complex function, namely the actual function I need to speed up.

Now the processor gets stuck whenever this function is called! The previous call to testcall() is still present and it has no problem. Why?

Regards

Cris

Altera_Forum · ‎06-07-2010

If you want a function to execute as fast as possible you probably need to inspect the generated code and examine the control flow and memory accesses to avoid execution stalls and unwanted memory accesses.

Depending on the code I'd guess you can gain 20-30% by optimising the C so that gcc generates better code.

Altera_Forum · ‎06-07-2010

I just want to share my experience of working with tightly coupled memory. I'm using TCM for an ISR and a few time-critical functions. At first I forgot to also move the variables used by those functions from SDRAM to TCM (using the __attribute__ directive on the variables as well).

In my case it was necessary to disable IRQs during function execution:

alt_irq_context context __attribute__ ((section (".onchip_ram")));
context = alt_irq_disable_all();
//time critical code
alt_irq_enable_all(context);

Apart from the performance increase, another benefit of moving functions into TCM is uninterrupted access to SDRAM for other (custom) bus masters.
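
The point about variables can be sketched as follows. This is only an illustration, not the code from this post: the section names .tc_code and .tc_data, the TC_CODE/TC_DATA macros and the accumulate() function are assumptions. The attributes are applied only when building with a Nios II GCC (detected via __NIOS2__, which is assumed to be predefined by that toolchain), so the same file also builds on a host.

```c
#include <stdint.h>

/* Apply the section attributes only for Nios II builds (assumption:
 * the toolchain predefines __NIOS2__); on a host they expand to nothing. */
#if defined(__NIOS2__)
#define TC_CODE __attribute__((section(".tc_code")))
#define TC_DATA __attribute__((section(".tc_data")))
#else
#define TC_CODE
#define TC_DATA
#endif

/* State used by the time-critical function, kept out of SDRAM. */
TC_DATA uint32_t tc_total = 0;
TC_DATA uint32_t tc_calls = 0;

/* The function itself goes into the code section. */
uint32_t accumulate(uint32_t sample) TC_CODE;

uint32_t accumulate(uint32_t sample)
{
    tc_total += sample;   /* both accesses hit the TCM data section */
    tc_calls++;
    return tc_total;
}
```

Local variables still live on the stack, so if the stack is in SDRAM the function keeps touching SDRAM unless its working data is global or static and placed as above.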

Jens

Altera_Forum · ‎06-07-2010

hi Jens,

In your case, was disabling interrupts for TCM code necessary for the correct execution of code itself or only for performance?

In my case I'd allow interrupts since I need to speed up the "average" function execution time and I don't mind if it is sometimes interrupted; think that this function is almost always running, so it DOES need to be interrupted.

Can this be the reason why the simple function works while the complete one doesn't?

Same question above about placing variables in TCM.

I finally agree with your last remark: I'd expect performance increase not for the TCM itself but because sdram is now intensively accessed by other masters through dma.

Cris

Altera_Forum · ‎06-07-2010

Cris, I use the function in TCM to cyclically control and set up SGDMA descriptors. Data from a sensor must be copied to SDRAM within a limited time. If I allow IRQs then I don't receive all the data from the sensor. During this time any other master access to the SDRAM leads to loss of data.

Furthermore I use global variables for that function because stack and heap are in SDRAM.

I think functionality should not be affected by whether IRQs are allowed or not, but I couldn't test it.

I saw another effect in conjunction with dual-port RAM. At first I put the TCM and the descriptors together in one dual-port RAM. In this case I also had loss of data.

How do you create the tc_code section? The TCM section I'm using has the same name as the SOPC component (onchip_ram). This naming affects the generated linker script. If you use additional sections, then you have to provide a custom linker script.

Jens

Altera_Forum · ‎06-08-2010

--- Quote Start ---

Data from a sensor must be copied to SDRAM within a limited time. If I allow IRQs then I don't receive all the data from the sensor. During this time any other master access to the SDRAM leads to loss of data.

--- Quote End ---

So in your case the IRQ issue would be present even if you didn't use TCM and mapped the code into sdram, sram or anything else. Right? I tried disabling IRQs and in fact this is not my case.

--- Quote Start ---

How do you create the tc_code section? The TCM section I'm using has the same name as the SOPC component (onchip_ram).

--- Quote End ---

That's what I did.

I still have the same problem: simple function in TCM works, complete function hangs.

- simple function: for loop which sums the first n integers, where n is the function parameter; 0x2C bytes code

- complete function: 0xbc bytes code; this function works perfectly if I map it into sdram

- tc_code: 8 KB of space

I will try now to progressively increase the size of the simple test function in order to find out if the problem is with code size or with function content

Cris

Altera_Forum · ‎06-08-2010

Yes, that's right. In my case it was a timing problem. Send me your functions if you want and I'll have a look at them.

Jens

Altera_Forum · ‎06-08-2010

This works:

int testcall(int n)   __attribute__ ((section (".tc_code")));
int testcall(int n)
{
    int i;
    i = n;
    while (i > 0) {
        n += i--;
        IOWR_ALTERA_AVALON_PIO_DATA(IO24V_PIO_BASE, n);
        printf(".");
    }
    IOWR_ALTERA_AVALON_PIO_DATA(IO24V_PIO_BASE, 0x55);
    return n;
}

I inserted the IOWR and printf to increase code size and have a call to a function in sdram, like the 'real' case.

This DOES NOT work:


int pkt_send(char *data, int len) __attribute__ ((section (".tc_code")));
struct buffer_t 
{
    int len;
    unsigned char data[0x600];   // must be an array for memcpy(); size taken from the clamp below
} buffer_tx;
int pkt_send(char *data, int len) 
{
    short index;
    // is next buffer valid?
    index = pkt_send_count & 0x07;
    if (buffer_tx.len != 0) 
        return 0;
    if (len > 0x600)             // clamp to buffer capacity
        len = 0x600;
    memcpy(buffer_tx.data, data, len);        
    buffer_tx.len = len;  
    ++pkt_send_count;
    return len;
}

Remarks:

- buffer_tx resides in sdram

- pkt_send() works perfectly when I remove the attribute directive

- testcall() is called in the very beginning of program; pkt_send after uC RTOS tasks have been initialized

Thank you for any help

Cris

Altera_Forum · ‎06-08-2010

More info:

I switched to Debug mode and now the testcall() function doesn't work either; nor can I debug inside the functions located in the tcm section (but maybe this is a normal debugger limitation).

Altera_Forum · ‎06-08-2010

I have several functions and many variables in onchip RAM and have no problems with any code or data I moved into it. The speed increase over SDRAM is very substantial. In fact, it can be too fast - I need interpacket delays sending a UDP data stream to the PC or I overrun my 3+GHz quad-core PC.

I have 10k of onchip RAM for code and data so a not insignificant portion of my program is stored there. I did not use ALT_LOAD_SECTION_BY_NAME.

Bill

Altera_Forum · ‎06-09-2010

Cris, I tested your function in an example design on my Altera StratixII Devkit (C:\altera\91\nios2eds\examples\vhdl\niosII_stratixII_2s60_RoHS\TSE_SGDMA with the hello_ucosii software example).

I made some minor changes to your code and put it into the example.

There are two tasks which are called periodically. Before they are started by the OS, the function testcall() is called. Then every time task2 is executed, the function pkt_send() is called. Both work fine. Please check the attached projects. Is this what you wanted to do?

Jens

Altera_Forum · ‎06-09-2010

@BillA

I'm also sending (hardware generated) UDP packets to a PC. The best results are achieved if I have a point to point connection from hardware to PC with a separate network adapter just for the hardware.

For some Intel NICs (Pro 1000 series) you can increase the number of receive descriptors (see attached jpg).

In my case this significantly reduces packet loss.

Jens

Altera_Forum · ‎06-09-2010

Thank you Jens,

your sample is basically what I do. The only difference is that I have all the other application code.

Nevertheless I can't make mine work!

Latest tests I performed:

- same behaviour if these functions are mapped to tcm, sram or any other section different from sdram where main code is stored

- whenever I compile in debug mode, none of the functions work; in release mode testcall usually works.

- I also have a tc_data section, similar tightly coupled memory as tc_code, but for data. I used it both for system stack and some app data and I had no problem.

Altera_Forum · ‎06-09-2010

Hmm, that's tricky. Did you move the __attribute__ stuff into the header? Next you could try flushing the cache before (or after?) you call the function. Could you test execution of code from the other memories by changing the .text (heap, stack) section mapping in the system library properties? There can be many other reasons ... task stack size too small?

I think I would start with reduced functionality in hardware and software. (like in the example). Then try to extend it step by step.

Jens

Altera_Forum · ‎06-09-2010

I discovered that code is NOT actually loaded into tc_code section (all reads 0xff)

The small testcall works in release mode because the optimizer inlines it in the caller. The bigger pkt_send function, on the other hand, is really in tc_code.

In debug mode there is no call optimization, so neither function works.

The behavior is actually very strange: the linker map file tells me that testcall has been placed in tc_code but if I step the assembly code with the debugger I see it executes as if it is inlined into the caller.

!?!??!??

So, the ultimate question is: how can I force the tc_code (or sram, or whatever) section to be loaded?

ALT_LOAD_SECTION_BY_NAME(tc_code) is useless.

Is tc_code perhaps ignored because in the syslib properties .text is mapped to sdram, so that only this memory is loaded?
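
A check like the one described in this post (reading tc_code back and finding only 0xff) could be sketched as below. region_is_blank() is a hypothetical helper, and the 0xff fill value is taken from the observation above. On the target, the pointer and length would come from the section's start address and size, which are not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if every byte in [start, start + len)
 * equals `fill` (for example 0xff for a memory that was never loaded),
 * otherwise 0. An empty region counts as blank. */
int region_is_blank(const volatile uint8_t *start, size_t len, uint8_t fill)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (start[i] != fill)
            return 0;
    return 1;
}
```

Running such a check at startup, before the first call into the section, would turn a silent hang into an explicit error.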

Altera_Forum · ‎06-09-2010

I have attached the objdump file from the example. There you can see that the functions are mapped into tc_code (debug mode).

You can enable generating objdump files in the Nios IDE (Window->Preferences->NiosII->create objdump file)

The syslib properties have no effect on the __attribute__ directive.

Could you try to execute a small program from any RAM other than SDRAM to check whether the tc_code onchip RAM (SSRAM, ...) is working correctly? For that you must change the system library properties.

Jens

Altera_Forum · ‎06-09-2010

Try adding __attribute__((noinline)) to the function prototype. As in:

static void foo(void) __attribute__((noinline));
static void foo(void)
{
    ....
}

That will force a function call.

The function body may still exist out of line for any external calls, while the local call is inlined.
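
Combining this with the section attribute from earlier in the thread might look like the sketch below. sum_to_n() and the TC_CODE/NOINLINE macros are hypothetical; the section attribute is applied only for Nios II builds (assuming the toolchain predefines __NIOS2__), while noinline is applied under any GCC-compatible compiler.

```c
/* Section placement only on the target (assumption: __NIOS2__ is
 * predefined by the Nios II GCC); a no-op on a host build. */
#if defined(__NIOS2__)
#define TC_CODE __attribute__((section(".tc_code")))
#else
#define TC_CODE
#endif

/* Forbid inlining so that every call really jumps into the section. */
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

int sum_to_n(int n) TC_CODE NOINLINE;

/* Sums the integers 1..n; returns 0 for n <= 0. */
int sum_to_n(int n)
{
    int sum = 0;

    while (n > 0)
        sum += n--;
    return sum;
}
```

Whether the call is really emitted can then be confirmed in the objdump output rather than inferred from the linker map alone.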

Altera_Forum · ‎06-09-2010

--- Quote Start ---

@BillA

I'm also sending (hardware generated) UDP packets to a PC. The best results are achieved if I have a point to point connection from hardware to PC with a separate network adapter just for the hardware.

For some Intel NICs (Pro 1000 series) you can increase the number of receive descriptors (see attached jpg).

In my case this significantly reduces packet loss.

Jens

--- Quote End ---

Thanks Jens - this is very helpful. My 2nd NIC is an Intel-based device and also does have this RX (and TX) buffers setting. I will do some testing. I have a reliable UDP stream protocol and it should handle lots of data without loss but I was having to stall it at (go figure) every 256 packets (which is the Intel default and might be the Broadcom default which is my other NIC). I might still have to stall at this boundary but the stalls can be farther apart which is an improvement.

Bill

Altera_Forum · ‎06-09-2010

--- Quote Start ---

Could you try to execute a small program from any RAM other than SDRAM to check whether the tc_code onchip RAM (SSRAM, ...) is working correctly? For that you must change the system library properties.

--- Quote End ---

I definitely think I have some problem with the loader, with the configuration of the JTAG debugger, or inside the project.

I followed your advice and compiled your hello_world sample. This is what I obtained:

case 1

Conditions: Same memory mapping as my original project; all memory sections mapped to sdram and attribute directive used to map the 2 functions to tc_code

Result: same as my original project; tc_code not loaded

I verified that tc_code can be written and read

case 2

Conditions: mapped .text section to sram; others to sdram

Result: sram loaded and executing; functions mapped to tc_code with the attribute are not. I added a (supposedly) initialized variable with attribute(... tc_data): this is not initialized either.

case 3

Conditions: same as case 2 but mapped stack section to tc_data

Result: same as before for code, but now the tc_data variable is correctly initialized.

Now, during the loading process, I can see the tc_data addresses in the download progress log shown in ide console; before I didn't and I saw only sdram addresses.

Conclusion:

Apparently only sections which are explicitly used in the system library properties are actually loaded. Other sections mapped with the attribute directive are ignored, unless the referenced section is also used for something else.

Please, any help would be greatly appreciated because I've been stuck on this weird point for about a week.