Re: Code in tightly coupled memory

Altera_Forum · ‎06-07-2010

Hi all,

I placed a function in a section different from the main code. More precisely, my code is mapped to SDRAM and I linked a function testcall() into a tightly coupled memory section which I previously defined in SOPC Builder.

So I declared:

int testcall(int n) __attribute__ ((section (".tc_code")));

Then, following the suggestion I found in another thread I called ALT_LOAD_SECTION_BY_NAME(tc_code) before calling testcall().

My questions are:

- is the ALT_LOAD_SECTION_BY_NAME(tc_code) mandatory? The code seems to work even if I don't use it.

- if the tc_code section is located in internal FPGA memory (Cyclone III M9K blocks), roughly how much of a code speed improvement could I get compared with SDRAM (assume I'm using 2 TSE with SGDMA)?

Thank you

Cris

Altera_Forum · ‎06-09-2010

Have you checked what the elf program headers say? (objdump -p)

I've had the JTAG loader correctly loading code to multiple memory areas.

But I did have to hack at the linker script to get the code linked into the correct places.

The Altera linker script (9.0) appended all the sections onto the end of the normal data, and added function calls to copy the data into the relevant areas into the startup code.

Most of my code is linked using a completely custom linker script and loaded into internal memory blocks via a PIO interface from another processor on the board (no JTAG in sight). Once the code is loaded we remove the soft reset from the cpu.

In fact I have 2 nios - and one releases the reset on the other.

All works fine!

Altera_Forum · ‎06-09-2010

--- Quote Start ---

Have you checked what the elf program headers say? (objdump -p)

--- Quote End ---

This is what objdump says

Program Header:
    LOAD off    0x000000b4 vaddr 0x02000000 paddr 0x02000000 align 2**0
         filesz 0x000189f0 memsz 0x000189f0 flags r-x
    LOAD off    0x00018aa4 vaddr 0x020189f0 paddr 0x020189f0 align 2**0
         filesz 0x00001ec4 memsz 0x0000c378 flags rw-
    LOAD off    0x0001a968 vaddr 0x03008000 paddr 0x02024d68 align 2**0
         filesz 0x00000160 memsz 0x00000160 flags r-x
    LOAD off    0x0001aac8 vaddr 0x0300a000 paddr 0x02024ec8 align 2**0
         filesz 0x00000020 memsz 0x00000020 flags rw-

tc_code section address is 0x03008000

tc_data section address is 0x0300a000

They are present here

--- Quote Start ---

I've had the JTAG loader correctly loading code to multiple memory areas.

But I did have to hack at the linker script to get the code linked into the correct places.

--- Quote End ---

In my case it is the JTAG loader which doesn't correctly load code.

Linking seems to be ok.

Latest test:

I again inserted these at the beginning of main():

1. ALT_LOAD_SECTION_BY_NAME(tc_data);

2. ALT_LOAD_SECTION_BY_NAME(tc_code);

Now the variable in tc_data is initialized correctly after line 1 is executed, while I find tc_code still empty after execution of line 2.

<<Update>>

I discovered that I can't write tc_code memory. No error on write but it always reads 0xFF.

Why?!?

In SOPC Builder it is defined as 32-bit RAM, exactly like tc_data. Same clock; the only difference is that tc_code is connected to the Nios TC instruction port and tc_data to the Nios TC data port.

<<end Update>>
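For anyone wanting to reproduce the "write, then read back" check described in the update, here is a minimal sketch in plain C. The function name first_bad_word and the test pattern are made up for illustration; they are not part of the HAL. On a core with a data cache the read-back could be served from the cache rather than from the memory itself, so an uncached access would be needed on the target; that part is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write a distinct pattern to every word, then read
 * each word back. Returns the index of the first mismatching word, or -1
 * if every word reads back as written. */
long first_bad_word(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);

    for (size_t i = 0; i < words; i++)
        base[i] = 0xA5A50000u | (uint32_t)(i & 0xFFFFu);

    for (size_t i = 0; i < words; i++)
        if (base[i] != (0xA5A50000u | (uint32_t)(i & 0xFFFFu)))
            return (long)i;

    return -1;
}
```

A memory that silently ignores writes and always reads 0xFF, as described above, would make this return 0.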

Altera_Forum · ‎06-10-2010

The linker output:

    LOAD off    0x0001a968 vaddr 0x03008000 paddr 0x02024d68 align 2**0
         filesz 0x00000160 memsz 0x00000160 flags r-x
    LOAD off    0x0001aac8 vaddr 0x0300a000 paddr 0x02024ec8 align 2**0
         filesz 0x00000020 memsz 0x00000020 flags rw-

probably isn't what you want.

The 'vaddr' (virtual address) value is the address that the code is linked for. This is the address that other parts of the nios code will use in order to reference this area.

The 'paddr' (physical address) is the address where the loader will write the data.

Note that the 'paddr' values follow on from the previous segment - rather than matching the 'vaddr'.
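The "follow on" relationship can be checked with the numbers from the objdump output quoted earlier in the thread. The sketch below is only an illustration; the struct and function names are invented here. It shows that each paddr equals the previous segment's paddr plus its memsz, and that only the two TCM segments have a paddr different from their vaddr.

```c
#include <assert.h>
#include <stdint.h>

/* One LOAD entry from the "objdump -p" output (illustrative type). */
struct load_segment {
    uint32_t vaddr;  /* address the section is linked to run at */
    uint32_t paddr;  /* address the loader writes the bytes to */
    uint32_t memsz;  /* size occupied in memory */
};

/* Load address immediately after the end of a segment. */
uint32_t end_paddr(const struct load_segment *s)
{
    return s->paddr + s->memsz;
}

/* Non-zero when the segment is loaded somewhere other than where it runs,
 * i.e. something has to copy it at run time. */
int needs_runtime_copy(const struct load_segment *s)
{
    return s->vaddr != s->paddr;
}

/* The four segments quoted in the thread. */
const struct load_segment segments[4] = {
    { 0x02000000u, 0x02000000u, 0x000189f0u },  /* main code */
    { 0x020189f0u, 0x020189f0u, 0x0000c378u },  /* main data */
    { 0x03008000u, 0x02024d68u, 0x00000160u },  /* tc_code */
    { 0x0300a000u, 0x02024ec8u, 0x00000020u },  /* tc_data */
};
```

For example, 0x020189f0 + 0xc378 = 0x02024d68, which is exactly the paddr reported for tc_code.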

This might be what you want if you are trying to get the program loaded into a single memory block (ie a ROM image) - but something at run time has to copy the data from whatever virtual address the 'paddr' gets mapped to onto the correct virtual address.
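As a rough picture of what that run-time copy amounts to, here is a host-side sketch. The name copy_section is hypothetical, and the code only assumes that the HAL's section loader behaves along these lines: a word-by-word copy from the load address to the run address, skipped when the two are the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: copy a section word by word from where the loader put
 * it (load_addr) to where it was linked to run (run_start..run_end).
 * Nothing is copied when the section already sits at its run address. */
void copy_section(const uint32_t *load_addr, uint32_t *run_start, uint32_t *run_end)
{
    assert(run_start <= run_end);

    if (run_start == load_addr)
        return;

    while (run_start != run_end)
        *run_start++ = *load_addr++;
}
```

In the thread's terms, load_addr corresponds to the 'paddr' image and run_start to the 'vaddr' of the section; the copy can only succeed if the CPU's data master can actually write to the destination memory.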

If you are loading directly from the elf file - as the JTAG loader does - you probably want the 'paddr' and 'vaddr' values to match.

If you can't read/write the tc_code area it may be that you don't have the correct access setup in the sopc builder.

You need to give the nios cpu data port access to the tightly coupled instruction memory (at the same address) to allow the code to be written to it (you mustn't do this for the tightly coupled data areas, since that makes two slaves try to respond to the same address).

If you are loading code by some external means (ie not by running code in the nios cpu), and move all the .rodata (etc) sections from instruction memory to data memory then you don't need data access to the code area.

In fact, any build that uses tightly coupled instruction and data memory (so probably doesn't contain a data cache, and may have a minimal instruction cache to allow JTAG (etc) bootstrap) really needs the read-only data linked with the initialised read/write data and placed in the tightly coupled data memory, not the tightly coupled instruction memory. This removes the Avalon MM transfers used when accessing the rodata.

Altera_Forum · ‎06-10-2010

Yahoo!!! I eventually solved the problem!

My tc memory had only access to the Nios tc instruction port. I had to redefine it as dual port ram and connect the secondary port to Nios data master in order to allow code loading.

To summarize for anyone interested:

- tightly coupled code memory MUST have both a connection to the Nios TC instr. port and a connection to the Nios data master, at the same addresses

- tightly coupled code memory need only the connection to Nios TC data port.

- ALT_LOAD_SECTION_BY_NAME() *MUST* be called for every section not explicitly mentioned in the system library properties (namely, if you map something into it with the attribute directive); in fact the loader (JTAG too) places these objects at their paddr addresses, and the application must take care of copying them to the correct vaddr

Thank you Jens and Dsl for your support and suggestions

Cris

Altera_Forum · ‎06-10-2010

--- Quote Start ---

- tightly coupled code memory MUST have both a connection to the Nios TC instr. port and a connection to the Nios data master, at the same addresses

--- Quote End ---

Not strictly true. If you put the .rodata elsewhere, and load the code from another Avalon master (or from within the FPGA image itself), then the Nios CPU doesn't need data access to its own code.

--- Quote Start ---

- tightly coupled code memory need only the connection to Nios TC data port.

--- Quote End ---

I presume you meant 'data'. True - provided nothing external needs to initialise it.

--- Quote Start ---

- ALT_LOAD_SECTION_BY_NAME() *MUST* be called for every section not explicitly mentioned in the system library properties (namely, if you map something into it with the attribute directive); in fact the loader (JTAG too) places these objects at their paddr addresses, and the application must take care of copying them to the correct vaddr.

--- Quote End ---

This is because your linker script is wrong.

You should be able to get the linker to put the data items such that both paddr and vaddr are correct (the current vaddr value).

For instance, if you want to initialize a large block of M9K memory, and your main code is in TC memory, then you don't want the initialisation data appended to your code - there just isn't space there.

Altera_Forum · ‎06-10-2010

--- Quote Start ---

Not strictly true. If you put the .rodata elsewhere, and load the code from another Avalon master (or from within the FPGA image itself), then the Nios CPU doesn't need data access to its own code.

--- Quote End ---

Right, I had understood that. But in this case I'd need another master connected to the TCM (which is essentially what I did now through the Nios data master port), or I'd have to recompile the Quartus design with the generated TCM initialization hex file EVERY TIME I modify my code (which is not really practical for debugging...)

--- Quote Start ---

I presume you meant 'data'

--- Quote End ---

Exactly. Sorry

--- Quote Start ---

This is because your linker script is wrong.

You should be able to get the linker to put the data items such that both paddr and vaddr are correct (the current vaddr value).

For instance, if you want to initialize a large block of M9K memory, and your main code is in TC memory, then you don't want the initialisation data appended to your code - there just isn't space there.

--- Quote End ---

That's right, too. But I didn't want to use a custom linker script. I used the default settings and simply mapped a few objects to TCM.

Is there a sample linker script available? I'd like to do it this way, but I don't want to read the whole linker manual.

Another question: can the attribute directive be extended to a group of functions/variables without tagging every function?

I'll conclude (I hope) this long thread by sharing the final results of moving the code to TCM.

I'll give the execution times of the critical code, worst/best case.

Original design, everything in sdram:

205/274

Moved most critical functions to tightly coupled code memory:

160/194

As above, but also moved the stack to tightly coupled data memory:

145/158

I can say it's worth it.
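To put those figures in relative terms, here is a small worked calculation. The helper name is made up, and since the post does not state the unit of the timings, only the ratios are meaningful. It computes the reduction in tenths of a percent using integer arithmetic.

```c
#include <assert.h>

/* Reduction from "before" to "after" in tenths of a percent, rounded down
 * (integer math). For example, 292 means 29.2 %. */
unsigned reduction_permille(unsigned before, unsigned after)
{
    assert(before > 0 && after <= before);
    return (before - after) * 1000u / before;
}
```

With the numbers above, the first figure drops from 205 to 145 (about 29 %) and the second from 274 to 158 (about 42 %); moving only the critical functions to tightly coupled code memory already accounts for roughly 22 % and 29 % respectively.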

Again, thank you all

Cris

Altera_Forum · ‎07-28-2010

Hi !

Where can I find the tcm.zip example mentioned in "Using Tightly Coupled Memory with the Nios II Processor"?

Actually, I can load my ISR function to the tightly_coupled_instruction_memory. But the tightly_coupled_data_memory seems to be unused. So far I haven't seen any performance improvement.

My ISR function is really short; it executes only "OSSemPost(IRQSem);". Is tightly coupled memory even useful here, or is the VIC more applicable?

Thanks very much for any help.

Regards,

R2-D2

Altera_Forum · ‎07-28-2010

--- Quote Start ---

Where can I find the tcm.zip example mentioned in "Using Tightly Coupled Memory with the Nios II Processor"?

--- Quote End ---

I don't know where that example can be found.

You can refer to this tutorial if you haven't seen it yet.

http://www.altera.com/literature/tt/tt_nios2_tightly_coupled_memory_tutorial.pdf

--- Quote Start ---

Actually, I can load my ISR function to the tightly_coupled_instruction_memory. But the tightly_coupled_data_memory seems to be unused.

--- Quote End ---

The linker places in TCM whatever data you tell it to place there.

Have you specified any?

--- Quote Start ---

So far I haven't seen any performance improvement.

My ISR function is really short; it executes only "OSSemPost(IRQSem);". Is tightly coupled memory even useful here, or is the VIC more applicable?

--- Quote End ---

If the ISR function is that short I wouldn't expect great performance improvements with TCM. Most of the performance loss is probably due to the ISR dispatcher.

For longer functions I REALLY DO have about 30% execution speed improvements.

Regards

Altera_Forum · ‎07-28-2010

--- Quote Start ---

If the ISR function is that short I wouldn't expect great performance improvements with TCM. Most of the performance loss is probably due to the ISR dispatcher.

--- Quote End ---

Definitely be sure to include the interrupt vector custom instruction. This replaces the slow C code that calculates and jumps to the vector.

I put ISRs in TCM because I guessed that the ISR would rarely if ever be in cache and I wanted it to execute as fast as possible. TCM executes as if it's in cache.

Bill

Altera_Forum · ‎07-29-2010

--- Quote Start ---

The linker places in TCM whatever data you tell it to place there.

Have you specified any?

--- Quote End ---

I don't know how. Do you mean the following lines of code? If so, where do they belong?

# Locate the exception stack in tightly coupled data memory.

set_setting hal.linker.enable_exception_stack TRUE

set_setting hal.linker.exception_stack_memory_region_name tightly_coupled_data_memory

set_setting hal.linker.exception_stack_size 1024

--- Quote Start ---

If the ISR function is that short I wouldn't expect great performance improvements with TCM. Most of the performance loss is probably due to the ISR dispatcher.

For longer functions I REALLY DO have about 30% execution speed improvements.

--- Quote End ---

So no performance improvement is to be expected? My program code isn't that big and the cache size is 8 KB. I think the ISR already resides in the on-chip RAM. What can I expect if I use a separate exception stack? The paper mentioned some overhead. Is the tightly coupled data memory needed only for a separate exception stack?

Thanks

Regards,

Altera_Forum · ‎07-29-2010

For a single variable I need in TCM I simply use the attribute directive after the declaration.

int array_in_tcm[128] __attribute__ ((section (".tc_data")));

Same for a single function:

int function_in_tcm(int param) __attribute__ ((section (".tc_code")));

tc_data and tc_code are the names I assigned to tcm blocks in sopc builder.
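One way to keep the tagging short and in one place is to wrap the attribute in macros. This is only a sketch: the macro names TCM_CODE and TCM_DATA and the build flag USE_TCM_SECTIONS are invented for this example, and it assumes the section names .tc_code and .tc_data from this thread. It does not remove the need to tag each declaration.

```c
#include <assert.h>

/* USE_TCM_SECTIONS is a hypothetical build flag (e.g. -DUSE_TCM_SECTIONS
 * for the target build). Without it the macros expand to nothing, so the
 * same source also builds on a host PC. */
#ifdef USE_TCM_SECTIONS
#define TCM_CODE __attribute__ ((section (".tc_code")))
#define TCM_DATA __attribute__ ((section (".tc_data")))
#else
#define TCM_CODE
#define TCM_DATA
#endif

/* Variable placed in tightly coupled data memory on the target. */
int array_in_tcm[128] TCM_DATA;

/* The attribute goes on the declaration, as in the post above. */
int function_in_tcm(int param) TCM_CODE;

int function_in_tcm(int param)
{
    assert(param >= 0 && param < 128);
    return array_in_tcm[param];
}
```

Remember that, with the default linker settings discussed earlier in the thread, the sections still have to be loaded at startup before anything in them is used.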

I'm not sure, but I don't think exception stack placement is so important for performance. It is used only in case of exceptions, not in normal operation.

The application runtime stack is important, since its data is continuously accessed in function calls, especially if you use a lot of local variables.

Altera_Forum · ‎07-29-2010

Ok, now I know how to address tc_data, thanks.

But I intend to use TCM for ISRs (interrupt service routines). I need to know how to define a "Separate Hardware Interrupt Stack". I haven't found any examples.

I'd better explain my intentions in more detail.

My ISR looks like this:

static void handle_CPU2_interrupts(void* context, alt_u32 id) __attribute__ ((section (".exceptions")));

static void handle_CPU2_interrupts(void* context, alt_u32 id)
{
    OSSemPost(IRQSem); // <-- benefit from a separate exception stack?
    IOWR_ALTERA_AVALON_PIO_EDGE_CAP(CPU1_INTHAND_BASE, 0x1);
}

void task(void* pdata)
{
    [... init ...]

    while (1) {
        OSSemPend(IRQSem, 0, &err); // <-- event waiting
        [ ... do something ... ]
    }
}

SoPC:

I set the exception address, on the Core Nios II tab, in the Exception Vector: Memory: list, to tightly_coupled_instruction_memory_s1.

RTOS:

I use µC/OS-II.

Problems:

1. ISR response takes ~16 µs.

2. From OSSemPost, in the ISR, to OSSemPend, in task, it takes ~34 µs.

Any chances to reduce these times ?

My possible solutions:

-> Try TCM

-> Try VIC -> will reduce ISR response certainly (<<16µs)

If I had examples I wouldn't be struggling with TCM and could test possible performance improvements. I would appreciate any comments or suggestions. I need this for my bachelor thesis.

Thanks very much for any help.

Regards,

R2-D2

@chris72: your thread "timer interrupt with rtos-ii" is similar to my project.

Altera_Forum · ‎07-29-2010

--- Quote Start ---

Problems:

1. ISR response takes ~16 µs.

2. From OSSemPost, in the ISR, to OSSemPend, in task, it takes ~34 µs.

Any chances to reduce these times ?

--- Quote End ---

Did you use the custom opcode for interrupt vectors? It changes the interrupt dispatch from a C function that looks up the vector to a single opcode.

OSSemPost and OSSemPend delays are part of uC/OS-II. If you already have compiler optimization on (-O2 or -O3) there's not much you can do - unless you put OSSemPost and OSSemPend in TCM.

Bill

Altera_Forum · ‎07-29-2010

Hi R2-D2,

I have exactly the same situation you described: code placement, RTOS, even same isr and OSSem usage.

So, my response times are similar, too.

In addition, I use NicheStack for TCP/IP and when I load the system with network traffic I have worst case response times up to 150us.

In these cases I also see an increase in isr response to about 40us: I think this is due to OS (or TCP stack) disabling irq in some critical pieces of code.

I wrote a post asking if this behaviour is normal, but I had no answers.

For BillA

I use custom opcode for int vectors and O3 optimization level.

Then I believe R2-D2 does, too.

Regards

Cris

Altera_Forum · ‎07-29-2010

--- Quote Start ---

Did you use the custom opcode for interrupt vectors? It changes the interrupt dispatch from a C function that looks up the vector to a single opcode.

--- Quote End ---

Sorry, I don't understand this. What do you mean by custom opcode? Putting the whole ISR directly in the VIC, so that I don't need alt_irq_register?

--- Quote Start ---

OSSemPost and OSSemPend delays are part of uC/OS-II. If you already have compiler optimization on (-O2 or -O3) there's not much you can do - unless you put OSSemPost and OSSemPend in TCM.

--- Quote End ---

Sounds good to me! Can you explain in a few words how to accomplish this, please?

--- Quote Start ---

I use custom opcode for int vectors and O3 optimization level.

Then I believe R2-D2 does, too.

--- Quote End ---

The timings are based on the standard IIC (no VIC) and -O0 optimization. So I haven't done any optimization yet.

Thanks for your comments

Regards,

R2-D2

Altera_Forum · ‎07-29-2010

--- Quote Start ---

Sorry, I don't understand this. What do you mean by custom opcode? Putting the whole ISR directly in the VIC, so that I don't need alt_irq_register?

--- Quote End ---

Maybe this helps? http://www.alteraforum.org/forum/showthread.php?p=65073

And Google: ALT_CI_EXCEPTION_VECTOR_N

--- Quote Start ---

Sounds good to me! Can you explain in a few words how to accomplish this, please?

--- Quote End ---

That's covered in this thread in fact: http://www.alteraforum.com/forum/showpost.php?p=92691&postcount=19

--- Quote Start ---

The timings are based on the standard IIC (no VIC) and -O0 optimization. So I haven't done any optimization yet.

--- Quote End ---

Don't bother timing anything until you're using -O2 or -O3 - the difference is night and day.
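When re-measuring after changing the optimization level, it helps to convert raw counter readings into microseconds consistently. The helper below is a sketch with an invented name; where the tick readings come from on the target (for instance a timestamp timer) and the 50 MHz figure used in the note are assumptions, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed time in microseconds between two readings of a free-running
 * 32-bit tick counter. Unsigned subtraction handles a single wrap-around. */
uint32_t elapsed_us(uint32_t start_ticks, uint32_t end_ticks, uint32_t ticks_per_second)
{
    assert(ticks_per_second > 0);

    uint32_t delta = end_ticks - start_ticks;
    return (uint32_t)((uint64_t)delta * 1000000u / ticks_per_second);
}
```

For scale: if the counter ran at an assumed 50 MHz, the ~16 µs ISR response quoted above would correspond to 800 ticks and the ~34 µs semaphore hand-off to 1700 ticks.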

Bill

Altera_Forum · ‎07-29-2010

Actually, I know these links ;) I will take a closer look then. I will post my results when finished, or earlier if something strange happens :eek:.

Thank you very much