Re: GCC 4.1.2 Issue. New toolchain soon?

Altera_Forum · ‎02-02-2011

I'm running into a bug where the following code causes a compiler error:

class foo {
public:
        virtual const char* name() const throw();
        virtual const char* faux_name() const throw();
};
const char* foo::name() const throw() 
{
        return "bar";
}
const char* foo::faux_name() const throw() 
{
        return name();
}

Assuming it's named 'foo.cpp':

 nios2-linux-gnu-g++ -fPIC -O2 -c foo.cpp -o foo.o
foo.cpp: In member function 'virtual const char* foo::faux_name() const':
foo.cpp:15: error: Attempt to delete prologue/epilogue insn:
(insn 44 43 45 0 (set (reg:SI 22 r22)
        (plus:SI (reg:SI 22 r22)
            (reg:SI 8 r8))) -1 (nil)
    (nil))
foo.cpp:15: internal compiler error: in propagate_one_insn, at flow.c:1699
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:...> for instructions.

Normally I'd go to GCC for this, but it's a very outdated compiler version and no longer supported. They don't even use flow.c anymore. This only happens with the nios2-linux-gnu-xxx tools, and not the normal linux tools, even though they're both 4.1.2.

Are there any plans to move toolchain-mmu to a newer version of GCC? Has anybody done this themselves? Can I just substitute gcc 4.3.5 in the gcc4 toolchain build instructions?

Thanks.

Altera_Forum · ‎02-02-2011

The toolchain comes from CodeSourcery. It's the versions stated with many patches applied, most importantly to support the NIOS II architecture. There is actually a slightly newer version out there than the one available on the wiki, but it's still based on the same upstream versions. I haven't seen any sign of a future release based on newer tools unfortunately.

It should theoretically be possible to carry the CodeSourcery patches over to a newer version of GCC but this would require some expertise and a lot of effort.

I tried compiling your test with the aforementioned slightly newer version that I'm using and it actually works. I will try to get this published soon on the wiki.

Altera_Forum · ‎02-02-2011

Thank you for the reply.

Just a note: the issue only comes up with the above compiler options (-fPIC together with any optimization level other than -O0).

Altera_Forum · ‎05-04-2012

Bringing up an old issue...

Has there been an update? I haven't seen a newer version on the Wiki yet.

Altera_Forum · ‎08-29-2013

--- Quote Start ---

Bringing up an old issue...

Has there been an update? I haven't seen a newer version on the Wiki yet.

--- Quote End ---

A gcc 4.7.3 based toolchain for nios2 is available from Mentor/CodeSourcery at:

https://sourcery.mentor.com/gnutoolchain/release2499

ACDS 13.1 will also include a gcc 4.7.3 based nios2 toolchain.

Altera_Forum · ‎08-30-2013

Does anyone know if they have incorporated my patches (on the wiki) for 'small data' accesses?

I wrote them for gcc 3.4, but they applied to the gcc 4.1 version as well (the only changes I saw between 3.4 and 4.1 were regressions!).

gcc 4.1 certainly made a worse job (than 3.4) of compiling my code - although I didn't spend any time trying to work out why.

Altera_Forum · ‎08-30-2013

--- Quote Start ---

Does anyone know if they have incorporated my patches (on the wiki) for 'small data' accesses?

I wrote them for gcc 3.4, but they applied to the gcc 4.1 version as well (the only changes I saw between 3.4 and 4.1 were regressions!).

gcc 4.1 certainly made a worse job (than 3.4) of compiling my code - although I didn't spend any time trying to work out why.

--- Quote End ---

1) Le "they" -- c'est moi. At least, I was hired a few months back as gcc maintainer and liaison to Mentor/CodeSourcery, who do the heavy lifting. (As you may have detected, nios2 toolchain maintenance has been minimal since gcc 4.1.2. We're intending and expecting to do a lot better going forward, starting with the big jump to gcc 4.7.3.)

2) Ghodz you're good! How did you work out all of those?? :-) (If you have any advice/pointers for learning gcc .md stuff I'm all ears. I have a reasonable compiler background -- e.g., I maintain the Mythryl compiler for fun -- but I'm new to gcc+binutils, which have a whole little jargon/world of their own.)

3) I don't see any evidence that your patches have been applied per se, but patches 1,7,8,11,12 have been re-invented. (I managed to find 11,12 on my own, whee.)

4) I'm going to point the Mentor/CodeSourcery folx to your remaining patches. They've re-invented five of them the hard way, they'd probably prefer to do the remaining ones the easy way. :-)

5) My/our informal experience to date is that gcc 4.7.3 generates slightly but significantly better code than gcc 4.1.2, as one might expect and hope. Very roughly 5% smaller, for example, with of course significant variation around the mean. We've had a few teething problems with Nios2 custom instruction generation, but otherwise the new toolchain seems pleasingly solid.

Altera_Forum · ‎09-02-2013

I'd never looked at gcc (or any other compiler) internals before, so it was a matter of reading the on-line gcc internals docs and the code (and a certain amount of trial and error).

OTOH I've hand written assembler for quite a few cpus over the years.

In some places I just hacked the opcode strings in order to see exactly where some common instructions were generated.

I was writing a fairly small piece of code (less than 2kb) that is a multi-channel hdlc controller. It has 195 clocks to do a byte's rx and tx on each channel (doing the bit-stuffing and crc in software), so absolutely every clock counts and I needed to minimise the worst-case code paths, not the common ones.

I had a moderate incentive to optimise obvious defects in the code generator!

The big gain from fixing access to structures in 'small data' was reducing 'register pressure' by stopping the compiler allocating a register to contain the start address of the structure - sometimes it would generate the same pointer twice!

Some other stuff I noticed:

1) The gcc 4 config always puts switch statement jump tables directly into the .code segment (something about not having the appropriate relocations for PIC code). I run the nios cpu with tightly coupled instruction and data memory (no caches) and without cpu data access to the code memory - so I need the .code to be 'pure'. They could probably be written to a .rodata.switch (or .code.switch or ...) section so that the linker script can decide exactly where they end up.

2) The instruction scheduler doesn't know about the delay slots after 'ld' instructions (and a few others). I had to go to great lengths to get delay slots filled in order to avoid any stall cycles.

3) It ought to be possible to generate the switch statement jump table code as a series of rtx to aid instruction scheduling and also to move the 'add' into the load offset removing an instruction.

4) In my code the only references to the stack pointer are in the function prologue where some registers are saved - that seems a waste for a function that doesn't return! The code is compiled in a single unit and all functions are marked __attribute__((always_inline)).

5) The 'global pointer' / 'small data' stuff seems to be based on gcc support for 'page 0' addressing. Although I fixed the code for gp relative access to structures, the code for accessing 'small data' arrays still uses an extra register. I used gp as a register variable pointing to a structure because that generates better code (gcc knows about the 16bit offset in the final memory reference).

5a) I'd arranged for my nios data areas and the 'small io' to be within a 64k block (to get gp relative addressing for everything), but I'd missed a trick! I should have put the 'small io' below 0x7fff - then it could be accessed by offsets from r0. It would be nice if gcc supported such variables - probably need a gcc attribute (or a special section - attribute is probably cleaner).

6) Add a gcc attribute to mark data as 'io', and generate ldio/stio (etc) for accesses to such data.

Unfortunately I can't give you a copy of my sources.

Altera_Forum · ‎09-02-2013

Another problem I found - which might be in gcc itself, so might be fixed in later versions...

If I read a 'volatile unsigned char' value, the compiler follows the 'ldbu' instruction with one that masks the value with 0xff.

(It sometimes does it for non-volatiles as well.)

My suspicion is that 'volatile' forces a separate load of the value into a register, then the value from the register is used for the variable reference. Between the two it 'forgets' that the high bits aren't set (they might be set if there was intervening arithmetic).

Basically this meant I couldn't mark things as 'volatile', I had to use asm volatile("#comment\n":::"memory") instead - I used a fair number of those to change the instruction scheduling elsewhere.
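
For anyone wanting to try the same workaround, here is a minimal sketch of the barrier-instead-of-volatile idea. The macro and function names (compiler_barrier, read_byte_fresh) are made up for illustration, and the empty asm string is used in place of the "#comment\n" one. Note that a barrier only forces a reload from memory; it is not a full substitute for volatile in every situation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compiler-only barrier: emits no instructions, but the "memory" clobber
 * makes gcc assume that any memory may have changed. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Read a byte that is NOT declared volatile. The barrier stops gcc from
 * reusing a copy of *p that it loaded into a register earlier.
 * (Hypothetical helper, not taken from the original code.) */
static inline unsigned int read_byte_fresh(const uint8_t *p)
{
    assert(p != NULL);
    compiler_barrier();
    return *p;
}
```

Whether this avoids the extra mask after 'ldbu' on a given nios2 gcc build would have to be checked in the generated assembler.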

Altera_Forum · ‎09-03-2013

Thanks for the pointers and ideas!

--- Quote Start ---

The code is compiled in a single unit and all functions are marked __attribute__((always_inline)).

--- Quote End ---

Note that in gcc 4.7.3 (at least) always_inline has no effect unless the fn is declared inline. The gcc docs more or less say this. We've tripped over this several times.

-Jeff
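
A minimal sketch of the combination described above, with both the 'inline' keyword and the attribute present. The function itself is only filler: the name hdlc_crc_bit and the choice of the reflected 0x8408 polynomial are assumptions for illustration, not anything from the code discussed in this thread.

```c
#include <assert.h>

/* Both 'inline' and always_inline are given; per the note above, the
 * attribute alone may be ignored by gcc 4.7.3. */
static inline __attribute__((always_inline)) unsigned int
hdlc_crc_bit(unsigned int crc, unsigned int bit)
{
    assert(bit <= 1u);
    /* One bit of a reflected 16-bit CRC (polynomial value is an assumption). */
    unsigned int feedback = (crc ^ bit) & 1u;
    crc >>= 1;
    if (feedback)
        crc ^= 0x8408u;
    return crc & 0xffffu;
}
```

Checking the disassembly for a surviving call to the function is the only reliable way to confirm that inlining actually happened.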

Altera_Forum · ‎09-04-2013

The functions are marked 'inline' as well.

However, even though they are static and only called once, minor changes would stop the functions being inlined.

--- Quote Start ---

5a) I'd arranged for my nios data areas and the 'small io' to be within a 64k block (to get gp relative addressing for everything), but I'd missed a trick! I should have put the 'small io' below 0x7fff - then it could be accessed by offsets from r0. It would be nice if gcc supported such variables - probably need a gcc attribute (or a special section - attribute is probably cleaner).

--- Quote End ---

Actually I wonder if the linker could modify the instruction to use r0 instead of gp if the offset from gp is out of range but a valid offset from r0? There is a specific error message so there must be some specific code.

Altera_Forum · ‎09-04-2013

--- Quote Start ---

Actually I wonder if the linker could modify the instruction to use r0 instead of gp if the offset from gp is out of range but a valid offset from r0? There is a specific error message so there must be some specific code.

--- Quote End ---

I really like that idea! It also fits the zeitgeist of doing much more stuff at linktime -- basically Creeping Full-program Optimization. (The MLton compiler has been doing that for ages; mainstream efforts like LLVM are now getting into the game too.) The Mentor release has link-time optimization, enabled by the -flto flag. The Altera ACDS 13.1 release will not, because I haven't had time to check it out. Altera keeps me pretty busy just putting out major fires.

BTW, Mentor wasn't happy about my tossing a batch of unrelated third-party patches at them. They want me to check them out one by one myself for validity and submit them as separate issues. Understandably. And as usual I have Critical level stuff I need to do first.

“The mills of the gods grind slowly, but they grind exceedingly fine.” -- English Proverb

Altera_Forum · ‎09-05-2013

Ho hum....

If you have to 'check them out' you might as well apply them yourself!

Anyway patches 9 and 10 are trivial to verify. 9 fixed a complete stupidity.

Patch 2 (memory access costs) is straight from the gcc docs.

Patch 5 (use high and lo_sum rtx) is from the gcc docs - it says 'if you have these instructions, do this'. I may well have copied the code from one of the other cpus.

That just leaves patches 3 and 4; patch 3 is easy to test. The case for patch 4 just appeared, NFI why.

If you allocate a small structure in the 'small data' region you want the compiler to generate gp relative addresses for it.

Structure references generate 'symbol + const-offset' rtx; without these patches the compiler generates multiple instructions and can end up keeping an extra register pointing to the structure field member - significantly increasing register pressure.

If you look at the gcc code, they are very localised changes and inside a lot of conditionals that restrict when they might apply.

It is reasonable for the compiler to assume that if the start of a data item is accessible via valid offset from gp then all of that item is accessible from gp [1].

I didn't look at fixing indexing of arrays in 'small data'. Again the 16-bit offset gets added early - instead of being elided into the final memory access.

I fixed my small data arrays by defining a structure and using gp as a register variable pointing to its base (this is a very controlled memory map).

Note that the code we have would be too slow without these changes.

[1] I have a plan to have an array that extends beyond 32k from gp...

Altera_Forum · ‎09-06-2013

A quick patch to let 'small data' be accessed from r0:

--- bfd.altera/elf32-nios2.c    2009-10-21 09:00:21.000000000 +0100
+++ bfd/elf32-nios2.c   2013-09-06 09:46:07.000000000 +0100
@@ -1816,7 +1816,7 @@
              if (!nios2_elf_assign_gp (output_bfd, &gp, info))
                {
                  format = _("global pointer relative relocation at address 0x%08x when _gp not defined\n");
-                 sprintf(msgbuf, format, reloc_address);
+                 sprintf(msgbuf, format, (unsigned int)reloc_address);
                  msg = msgbuf;
                  r = bfd_reloc_dangerous;
                }
@@ -1828,9 +1828,33 @@
                  if ((signed) relocation < -32768
                      || (signed) relocation > 32767)
                    {
+#if 1  /* Allow small data be accessed from r0 as well as gp */
+                     /* Before erroring, see if we can change the instruction to use an offset
+                      * from r0 (always zero) instead.
+                      * Verify source register is r26 and the opcode is addi, ldxxx or stxxx.
+                      * (We really shouldn't see GPREL for any other instructions.) */
+                     unsigned int instruction, opcode;
+                     if ((signed)symbol_address < -32768 || (signed)symbol_address > 32767)
+                       goto gprel_fail;
+                     instruction = bfd_get_32(input_bfd, contents + rel->r_offset);
+                     if ((instruction >> 27) != 26)
+                       /* Source register is not gp */
+                       goto gprel_fail;
+                     opcode = instruction & 0x3f;
+                     /* 4 is 'addi', 0x20 collapses the 'io' variants, 0x15/0x17 are 'ldw'/'stw',
+                      * 0x8 collapses 'xxh' onto 'xxb', 5 is 'stb', 3 and 7 are 'ldbu' and 'ldb'. */
+                     if (opcode == 4 || (opcode &= ~0x20) == 0x15 || opcode == 0x17
+                           || (opcode &= ~8) == 5 || (opcode & ~4) == 3) {
+                       /* Change source register to r0 and drop in offset */
+                       instruction = (instruction & 0x07c0003f) | (symbol_address & 0xffff) << 6;
+                       bfd_put_32(input_bfd, instruction, contents + rel->r_offset);
+                       break;
+                     }
+                   gprel_fail:
+#endif
                      format = _("Unable to reach %s (at 0x%08x) from the global pointer (at 0x%08x) "
                                 "because the offset (%d) is out of the allowed range, -32678 to 32767.\n" );
-                     sprintf(msgbuf, format, name, symbol_address, gp, (signed)relocation);
+                     sprintf(msgbuf, format, name, (unsigned int)symbol_address, (unsigned int)gp, (signed)relocation);
                      msg = msgbuf;
                      r = bfd_reloc_outofrange;
                    }

There is a second check for the gp offset, but the above is the one that usually detects errors.

I've also fixed a couple of format fubars. They would cause grief if 'long' is 64 bits and the arguments aren't all passed in registers (linux amd64 and sparc64 will use registers for the arguments).

Altera_Forum · ‎09-11-2013

--- Quote Start ---

Ho hum....

If you have to 'check them out' you might as well apply them yourself!

--- Quote End ---

True enough! But we'd like to keep the Altera and Mentor releases as close as reasonable, to avoid confusing people.

[ Sorry for delayed response; I've been off breaking the build the day before RC1... *wrygrin* ]

Altera_Forum · ‎09-11-2013

Looks good -- noted! I wish I could say I'll jump right on it, but unfortunately people screaming about broken existing functionality take precedence over people with really cool new functionality... :-/

Altera_Forum · ‎09-12-2013

Hmmm....

I was only patching gcc because it was broken! Seemed quicker to work out how to fix it than raise support requests....

The change that put switch statement jump tables directly into .code is a serious bug - it makes the gcc 4.1 altera build unusable if you need pure code.

The fixes for 64bit build systems should also be treated as very important.

I'm not sure why the printf() format strings are processed by _(), but it might be worth compiling the code with #define _(x) x so that gcc detects format string errors (you might need to remove the 'format' variable as well). I suspect there are quite a few places where 'long' values get printed with 'int' formats - these will cause grief.
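
A small sketch of what that could look like, assuming the format string is passed as a literal rather than through a 'format' variable. The names vma_t, format_msg and report_gp_undefined are invented for this example; vma_t merely stands in for an address type that is wider than int on a 64-bit build host.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define _(x) x   /* identity, so gcc can see the literal format string */

typedef unsigned long long vma_t;   /* hypothetical 64-bit address type */

/* The format attribute makes gcc check the arguments against the format. */
static int format_msg(char *buf, size_t size, const char *fmt, ...)
    __attribute__((format(printf, 3, 4)));

static int format_msg(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && size > 0);
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}

/* %08x expects an unsigned int, so the wide address is cast explicitly,
 * as in the patch above. */
static int report_gp_undefined(char *buf, size_t size, vma_t reloc_address)
{
    return format_msg(buf, size,
                      _("global pointer relative relocation at address 0x%08x when _gp not defined\n"),
                      (unsigned int)reloc_address);
}
```

With -Wformat enabled, removing the cast in report_gp_undefined should produce a warning, which is the kind of error this trick is meant to expose.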

Altera_Forum · ‎03-07-2014

I've noticed many seemingly unnecessary masks of unsigned chars with 0xFF as well. I don't think this has changed with the new 4.7.3 version. dsl, is this just a small performance penalty or were there other concerns about this that made you decide you "couldn't mark things as 'volatile'"?

Another problem I noticed that I don't know if any of these patches address, is when I use the I/O byte and short built-ins, I get code like this:

addi    r17,r2,2
ldbuio  r3,0(r17)

Which I would think could simply be:

ldbuio r3,2(r2)

Which it doesn't have a problem generating for the 4-byte instruction:

ldwio r3,4(r2)

Jeff, any word on getting these patches integrated? And what is the best way to report things like this?
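
For reference, a C fragment of roughly this shape is the kind of access being described: a byte read at a small constant offset from a base pointer via the I/O builtin. The wrapper names (IO_READ8, read_reg8) are hypothetical, the __nios2__ check is assumed to be the target's predefined macro, and the plain volatile fallback is only there so the fragment also builds on a non-Nios host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__nios2__)
/* gcc's Nios II builtin for an uncached unsigned byte load. */
#define IO_READ8(p)  ((uint8_t)__builtin_ldbuio(p))
#else
/* Host fallback so the sketch compiles and runs anywhere. */
#define IO_READ8(p)  (*(const volatile uint8_t *)(p))
#endif

/* Read one byte-wide register at 'offset' bytes from 'base'. With a
 * constant offset the ideal output is a single ldbuio with the offset
 * in the immediate field. */
static inline uint8_t read_reg8(const volatile uint8_t *base, size_t offset)
{
    assert(base != NULL);
    return IO_READ8(base + offset);
}
```

Whether read_reg8(base, 2) reproduces the addi/ldbuio pair will depend on the compiler version and optimization options.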

Altera_Forum · ‎03-07-2014

--- Quote Start ---

Jeff, any word on getting these patches integrated? And what is the best way to report things like this?

--- Quote End ---

The truth is I'm the only Altera Nios2 GNU toolchain guy (which is one more than we had for half a decade...) and to date I've been pretty much continuously in crisis mode putting out fires. (For some reason Altera prefers that I favor corporate customers with large contracts over forums users...) So these patches have been molding in my wishfile, alas.

I believe the Official Procedure for getting attention paid to a patch or such is to submit it as an issue via an Altera FAE. Since I only see the arriving-in-my-mailbox end of that procedure, I'm not sure what the external interface looks like. I found a big backlog of those sitting in the bugbase when I hired on. :-)

In other news:

ACDS 14.1 has transparent support for >256MB of code, no special options like -Wl,--relax-all needed. A Major Customer requested (well, demanded) this, so it was custom-developed via a Mentor contract. It works via a linker tweak which redirects jmp/call instructions to destinations outside the 26-bit immediate field range of the jmp/call instructions to little trampoline code sequences which use jmpr/callr for full 32-bit address range capability. Took a while to make this happen, but I think the result is pretty nice, as GNU toolchain things go. (Nothing in gcc makes one gasp at the sheer elegance, exactly...)
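
To make the 256MB limit concrete, here is a rough sketch of the reachability test a linker would need before deciding to emit a trampoline. It assumes the direct call takes the top four address bits from the current PC and the rest from the 26-bit immediate scaled by 4. The function names are hypothetical, and using the call's own address as the reference PC is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A direct call can only reach targets in the same 256MB segment,
 * i.e. targets whose top four address bits match the PC's. */
static bool call_reaches(uint32_t call_pc, uint32_t target)
{
    assert((target & 3u) == 0);   /* instructions are word aligned */
    return ((call_pc ^ target) & 0xf0000000u) == 0;
}

/* The 26-bit immediate holds the word index within the segment. */
static uint32_t call_imm26(uint32_t target)
{
    return (target >> 2) & 0x03ffffffu;
}
```

When call_reaches() is false, the redirect described above would point the call at a nearby trampoline that loads the full 32-bit address and uses callr or jmpr instead.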

The Linux version of the toolchain (available only directly from Mentor, since Altera does not yet have full direct support for Nios2 Linux, although we seem to be headed that direction -- e.g. we're now the mainstream maintainer of Nios2 Linux kernel stuff, if I understand correctly) supports 64KB (up from 32KB) GOT tables for position-independent code, by biasing the base pointer to point into the middle of the GOT rather than the start, thus allowing the full -32K -> 32K immediate field range to be used. (We've been hearing from customers who use C++ OpenGL libraries and such that they've been hitting the 32KB GOT wall on Nios2 Linux. C++ seems to generate a lot of globals in at least some situations; I haven't dug into that to figure out what is really going on, since officially I'm just supporting Nios2 on bare metal.)
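
The arithmetic behind the 32KB versus 64KB GOT figures can be sketched as follows, assuming 4-byte GOT entries and a signed 16-bit offset from the base register. The function name and the bias value of 32768 are illustrative assumptions; the actual bias used by the toolchain is not stated here.

```c
#include <assert.h>

#define IMM16_MIN      (-32768L)
#define IMM16_MAX      32767L
#define GOT_ENTRY_SIZE 4L

/* Number of GOT entries reachable with a single signed 16-bit offset when
 * the base register points 'bias' bytes past the start of the GOT. */
static long got_entries_reachable(long bias)
{
    assert(bias >= 0);
    long lowest = bias + IMM16_MIN;    /* lowest reachable byte offset */
    long highest = bias + IMM16_MAX;   /* highest reachable byte offset */
    if (lowest < 0)
        lowest = 0;                    /* nothing useful before the GOT start */
    return (highest + 1 - lowest) / GOT_ENTRY_SIZE;
}
```

With no bias only the non-negative half of the offset range is usable (8192 entries, 32KB); with the base pointing into the middle of the table both halves are usable (16384 entries, 64KB).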

Additionally, the Linux Nios2 toolchain now supports the conventional -fpic vs -fPIC distinction, generating "big-GOT" code when -fPIC is specified, allowing basically unlimited GOT size (at the cost of multi-instruction sequences to access the GOT).

Internally, engineering has been asking us for better code density; we were chatting with Mentor about this and they reported that from poking around a bit examining the code, it appeared that the Nios2-platform gcc/binutils code has never been seriously tuned. (That conclusion probably won't surprise you. :-) We haven't committed to funding a project to do such tuning, but it seems to be in the air at the moment.

There's more Nios2/toolchain stuff in the tubes, but I think nothing I can post about yet. Suffice to say that we're by no stretch of the imagination abandoning Nios2 in favor of ARM (say), even though we're shipping ARM-based SOPCs. (I mention that because at least one major customer seemed to be concerned about it and to need conference-call reassurance.) ARM and Nios2 are both cool, but they address very different markets, both of which are of major interest to Altera. (Personally, I'd like to put a Nios2 in a hobby UAV, using the FPGA side for low-level computer vision processing and servo control and the Linux side for high-level flight planning etc. That sounds like balls of fun and super cost-effective, but somehow the unfun projects always seem to need to be done first. Maybe this year. Just took delivery of a 3drobotics drone with a hexcopter supposedly in the mail -- they look like good platforms to hack up for this sort of stuff.)

Oh, also worth mentioning: There's a major Altera push to upstream all of our Nios2-support stuff for the GNU toolchain. Legal has signed off on it, all the relevant code has had copyright formally signed over to FSF, rms has signed off on it, and all the major parts have been reviewed and accepted at this point; expect to see gcc 4.9 and similar-generation releases of the binutils etc with Nios2 support out of the box. (One major exception is "smallc" support in newlib: That's a huge patch which I think the newlib team is unlikely to ever accept, so I'll probably be stuck hand-merging it into mainstream newlib releases for the foreseeable future. Thpt.)

So things are trundling along pretty nicely, albeit at major-corporation release-cycle speeds rather than open source overnight-delivery speeds.

-Jeff

Altera_Forum · ‎03-10-2014

My conclusions about the extra ands with 0xff were that they happened when gcc 'found' the value in a register and assumed that this was the result of an arithmetic operation so needed masking with 0xff.

I think volatile is implemented by forcing an extra memory load - and that is promoting the 'unsigned char' to 'int' - hence the mask.

The ldio offsets might be similar to the issues I had with 'small data'. Not difficult to fix once you've worked out the code paths involved.

For my code the extra instructions were a performance problem.

I used quite a few asm volatile ("":::"memory") lines (which gcc has to assume might modify all of memory) not only to replace volatile, but also to cause values to be read early (eg before an 'if' when they are only needed in the 'then' clause) to avoid stall cycles following memory loads.

It might have been easier to write the code directly in assembler - but that is error prone even for less than 1000 instructions.

FWIW if we'd had to wait for Altera to fix the compiler to support structures in the small data segment, I suspect we'd still be waiting and wouldn't have the product we shipped several years ago. Fortunately I was able to fix gcc myself so we didn't have to raise a support request through our FAE. We don't have the resources to chase Altera to accept the fix.

This reminds me of my dealings with a major real-time OS manufacturer a few years back. I got our system running by re-implementing some functions and patching in jump instructions to the new version over the first instructions of the old copy (as well as some smaller patches).

In spite of raising these issues with their UK support, I finally discovered that the only way any support issues would get fixed in the next release was if the support group themselves had had to issue a fix. Once I'd generated the patch, we didn't need a fix from support.

Anyone else trying to do the same thing would, though.

Altera_Forum · ‎03-10-2014

Hi,

--- Quote Start ---

The truth is I'm the only Altera Nios2 GNU toolchain guy (which is one more than we had for half a decade...) and to date I've been pretty much continuously in crisis mode putting out fires. (For some reason Altera prefers that I favor corporate customers with large contracts over forums users...) So these patches have been molding in my wishfile, alas.

--- Quote End ---

--- Quote Start ---

FWIW if we'd had to wait for Altera to fix the compiler to support structures in the small data segment, I suspect we'd still be waiting and wouldn't have the product we shipped several years ago. Fortunately I was able to fix gcc myself so we didn't have to raise a support request through our FAE. We don't have the resources to chase Altera to accept the fix.

--- Quote End ---

Do these facts mean that we can never expect GNU toolchain support from Altera? :p

Kazu