Re: floating-point assist fault

ryofurue · 11-20-2004

Hello all,

The /var/log/messages file of our Linux server shows the following message
at a very high frequency.

kernel: ogcm.par(30048): floating-point assist fault at ip 400000000003b0f1,
isr 0000020000000008

This program "ogcm.par" is mine. As long as this program is running, this message
repeats. I have learned that this message is issued when the kernel emulates
floating-point operations that the hardware cannot perform.

My question is how to identify which part of my Fortran program is causing this
message and how to fix it. I've tried "-fpe0 -ftz" (with no other options), to no avail.

% ifort -V
Intel Fortran Itanium Compiler for Itanium-based applications
Version 8.1 Build 20040922 Package ID: l_fc_pc_8.1.019
. . . .

The system is an SGI Altix with a 2.4.21-sgi301r1 kernel.

Ryo

TimP · 11-20-2004

Don't -fpe0 and -ftz conflict? Do the assist faults occur with -ftz? Why are you using 8.1.019? Does the ip information not make sense, even together with -Wl,--print-map ?

ryofurue · 11-21-2004

Thank you, tim18, for your help.

> Don't -fpe0 and -ftz conflict?

I don't know. The manpage of ifort 8 ("man ifort") doesn't
say anything about the interaction between the two options.
I tried each option separately, but the result was the same
(except that the address [ip] reported in /var/log/messages
changed; see below).

> Why are you using 8.1.019?

I guess you mean we should use the newest version? I'll
ask our admin to upgrade the compiler, later. He's currently
terribly busy.

> Does the ip information not make sense, even together
> with -Wl,--print-map ?

Thanks! I think you mean the ip information shown in
/var/log/messages corresponds to the address, shown by the
-Wl,--print-map option, of the machine code (text?).

Although the address shown by the option changes depending on
which of -ftz and -fpe0 you use, the address shown in
/var/log/messages changes accordingly, and I think I've been
able to narrow down the problem to a single subroutine. I'll
check if the problem is in there or not.

I have two more questions or requests. One is that I'd like to
see the warning messages in the stderr of my program. Is this
possible? I had to ask our administrator to change the permission
of /var/log/messages from 600 to 644. Although this allows me to
see my warning messages, they are mixed with other similar messages
because another person's program is also issuing the same type of
warning messages.
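
One way to separate one program's entries from everyone else's is to filter the log by program name. What follows is only a minimal Python sketch, assuming each kernel message sits on a single line in the format quoted at the top of this thread. The function name `assist_faults`, the second program `other.exe`, its PID and its address are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: ogcm.par(30048): floating-point assist fault at ip 4000..., isr 0000...
FAULT_RE = re.compile(
    r"kernel: (?P<prog>[^\s(]+)\((?P<pid>\d+)\): "
    r"floating-point assist fault at ip (?P<ip>[0-9a-f]+), isr (?P<isr>[0-9a-f]+)"
)

def assist_faults(lines, program):
    """Count how often each fault address (ip) is reported for one program name."""
    counts = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m and m.group("prog") == program:
            counts[m.group("ip")] += 1
    return counts

# Sample input: the first and third lines use the message from this thread;
# the second line (other.exe, PID 12345) is hypothetical.
sample = [
    "kernel: ogcm.par(30048): floating-point assist fault at ip 400000000003b0f1, isr 0000020000000008",
    "kernel: other.exe(12345): floating-point assist fault at ip 4000000000001231, isr 0000020000000008",
    "kernel: ogcm.par(30048): floating-point assist fault at ip 400000000003b0f1, isr 0000020000000008",
]

for ip, n in assist_faults(sample, "ogcm.par").items():
    print(ip, n)
```

To run it on the real log, pass an open file instead of `sample`, for example `assist_faults(open("/var/log/messages"), "ogcm.par")`. This still needs read permission on the file.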

The other question is that I'd like my program to stop and show
the stack trace at the very location of this problem. Is this possible
with some combination of options? I thought -ftz and/or -fpe0 were it
(possibly with -g), but apparently they aren't.

If the answers to these questions are no, I'd like to submit them to a
wishlist (if such a thing exists).

Thank you,
Ryo

ryofurue · 11-28-2004

Hi,

> Does the ip information not make sense,
> even together with -Wl,--print-map

I'm trying to identify which part of the code is causing
the floating-point fault. But, I'm bewildered.

Now /var/log/messages says:

kernel: ogcm-test(16817): floating-point assist fault
at ip 40000000000cd9f1, isr 0000020000000008

ifort -Wl,--print-map prints:

.text 0x40000000000cce00 0xc80 sfcng-delete-this.o
0x40000000000ccec0 bdyflx_
0x40000000000cce00 sfcflx_
.text 0x40000000000cda80 0xa040 /home/amakihi/furue
/work/coco3.4-linux
/lib/libogcm.a(atmct.o)
0x40000000000cda80 tmstup_
0x40000000000d07c0 tmstpc_

[I've made one line wrap around.] Now, according to /var/log/messages,
the problem is occurring at cd9f1, which is after the top of subroutine
sfcflx and before the top of subroutine tmstup. That means, I think,
the problem is in subroutine sfcflx.

But, in fact, I commented out all the executable statements of sfcflx
and inserted a single write statement, so the subroutine now looks like this:

subroutine sfcflx(...arguments...)
... declaration of arguments and constants ...
REAL*8 TSFC(NXYDIM, NTDIM), TDMP(NXYDIM, NTDIM)
COMMON /WORK/ TSFC, TDMP
write(0,*) "sfcflx:"
end

Does this mean the write statement is causing the floating-point
fault ? :-) I don't know how to proceed any further. Could
somebody help me out?

Once again, is it possible to abort the program at the very site
of the fault? I'd love to see a SIGFPE, for example, raised on
the spot. Then I could use a debugger to identify the problematic
line in the code.

Thank you,
Ryo

Lorri_M_Intel · 11-29-2004

Let me make one observation. Depending on what the declarations are, there may be code in your subroutine above the write statement. That is, if there are automatic variables being created based on arguments passed in, there is code output to generate those variables.

If you don't get some help on generating a SIGFPE, you could try removing local declarations until you find the problematic one, or you could get an assembly language dump of the file and inspect it for likely candidates.

- Lorri

ryofurue · 11-30-2004

Thanks, Lorri, for your suggestion.

I cut down the offending subroutine to the one attached
at the end of this message, but the problem persists.
We don't have any automatic variables.

The only conjecture I can make now is that the
address ("ip") reported in /var/log/messages
does not correspond to the one in the "-Wl,--print-map"
listing from the linker. I double-checked that the
address reported in /var/log/messages is after the
top of the offending subroutine and before the one
listed next in the linker listing.

Oh, and the compiler options I use are

-132 -convert big_endian
-ftz -Wl,--print-map -zero -check -O0

I'm stuck. I hope somebody knows better.

Thank you,
Ryo
=============================================


      SUBROUTINE SFCFLX(
     O                 FT,   TAUX,   TAUY,
     I                  T)
      IMPLICIT NONE

      INTEGER NTDIM
      PARAMETER(NTDIM =      2)
      INTEGER NXDIM, NYDIM, NZDIM
      PARAMETER(NXDIM  =     56, NYDIM  =     56, NZDIM  =     38)
      INTEGER NXYDIM
      PARAMETER(NXYDIM = NXDIM*NYDIM)

      REAL*8      FT(NXYDIM, NTDIM)
      REAL*8    TAUX(NXYDIM),   TAUY(NXYDIM)
      REAL*8       T(NXYDIM, NZDIM, NTDIM)

      write(0,*) "---sfcflx---"

      RETURN
      END

ryofurue · 12-02-2004

I finally solved the problem.

1) How to stop the program immediately at the site
of the fault?

There is a command "prctl" on the SGI Altix system, with which
you can specify what to do for the two notorious kernel
emulations: unaligned access and floating-point assist fault.
For example, if your Fortran program is run as


   $ prctl --fpemu=signal  your_program

the floating-point assist fault condition triggers SIGFPE. So,
if you compile your program with -g and run it under gdb as


   $ prctl --fpemu=signal gdb your_program

you'll find which line of your code is causing the fault.

On ordinary Linux, there's a system call "prctl". The command
on the Altix may be a wrapper around it. (I found Solaris 9 also has
a command of this name.) I wonder what other Linux systems on Itanium
machines are doing. This problem should be common to all Linux
distributions on Itanium systems.

2) What does the "ip" information in /var/log/messages mean?

I found that the listing from "-Wl,--print-map" does not match
what /var/log/messages says. Or else I'm missing something.
The messages file says the problem occurs at "ip 400000000005d6e1"
and the listing from the linker has this:


 [...]
 .text  0x400000000005ca00 0x3080 sfcng-delete-this.o
        0x400000000005db80           bdyflx_
        0x400000000005ca00           sfcflx_
 .text  0x400000000005fa80 0x7f40 libogcm.a(atmct.o)
        0x400000000005fa80           tmstup_
        0x4000000000065e80           tmstpc_
 [...]

This address 5d6e1 is after the top of sfcflx_ (5ca00) and before
bdyflx_ (5db80). That means the problem is in sfcflx_. At least
so I thought.

The fact is that the problem was in bdyflx_.
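
The address comparison described above amounts to finding the symbol with the highest start address that is not above the reported ip. A minimal Python sketch of that lookup, assuming the four symbol addresses from the listing above (the name `symbol_for` is hypothetical):

```python
from bisect import bisect_right

# Symbol start addresses copied from the linker map quoted above,
# sorted by address (the map itself does not list them in address order).
SYMBOLS = sorted([
    (0x400000000005CA00, "sfcflx_"),
    (0x400000000005DB80, "bdyflx_"),
    (0x400000000005FA80, "tmstup_"),
    (0x4000000000065E80, "tmstpc_"),
])

def symbol_for(ip, symbols=SYMBOLS):
    """Return the symbol with the highest start address <= ip, or None."""
    starts = [addr for addr, _ in symbols]
    i = bisect_right(starts, ip) - 1
    return symbols[i][1] if i >= 0 else None

print(symbol_for(0x400000000005D6E1))
```

With these numbers the lookup returns sfcflx_, the same reading as the manual comparison. So it reproduces what the map suggests, not where the fault turned out to be.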

I hope this will be helpful for other people having the same
problem.

Ryo