ifx: no debug info on AMD CPU

artu_72 · ‎01-08-2024

Hi,

With debugging options turned on, ifx version 2024.0.2 does not show debug info on an AMD Zen 3 CPU (Ryzen 7 5800X). I am using the options

-fpe0 -O0 -g -traceback -debug full -check all,nouninit -align all -gen-interfaces -warn all,noexternals -warn stderrors -stand f18

and this is the resulting traceback (even using gdb)

forrtl: severe (59): list-directed I/O syntax error, unit -130, file ...
Image              PC                Routine            Line        Source
auto-ddm_intel_V_  000000000058F2EB  Unknown               Unknown  Unknown
auto-ddm_intel_V_  0000000000432328  Unknown               Unknown  Unknown
auto-ddm_intel_V_  000000000040582B  Unknown               Unknown  Unknown
auto-ddm_intel_V_  0000000000405775  Unknown               Unknown  Unknown
auto-ddm_intel_V_  000000000040529D  Unknown               Unknown  Unknown
libc.so.6          000014AC83E4614A  Unknown               Unknown  Unknown
libc.so.6          000014AC83E4620B  __libc_start_main     Unknown  Unknown
auto-ddm_intel_V_  00000000004051B5  Unknown               Unknown  Unknown

Of course, on an Intel CPU the stack is correctly shown.

Ron_Green · ‎01-09-2024

Traceback calls into glibc. This is the most probable cause. What OS distro, version and glibc are you using on the AMD system and the Intel system?

artu_72 · ‎01-09-2024

Hi Ron, thank you for the reply.

I am using the Fedora 39 distro on both PCs (Intel and AMD). The glibc version is 2.38.14. Of course, the ifx version is also the same.

Ron_Green · ‎01-09-2024

I don't have an AMD system to test. Fedora 39 is not supported until the next update, 2024.1.0, but let's explore this issue and try a few things.

The error is on an I/O to unit -130. I assume that from the Intel traceback you found the I/O routine for the fault. Is it a read or a write, probably an internal read or write? In general we do not use negative unit numbers, only internally and only small integer numbers. So from the Intel traceback, what is the I/O statement causing this? Is it possible the unit is unconnected?

Next, let's keep it simple: you want to know if you can get a traceback. So set the complex code aside and try a simple test to see if traceback works:

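subroutine crashme(a, b)
  real :: a, b
  b = a/b             ! b was never passed by the caller, so this access faults
end

program crash
  external crashme    ! no explicit interface, so the argument mismatch is not checked
  real a
  a = 42.0
  call crashme(a)     ! deliberately omits the second argument
  print*, "a is ", a
end program crash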

This will give a traceback for a segfault.

Compile and run as follows, assuming you name the file 'cause_traceback.f90':

$ ifx -g -O0 -traceback cause_traceback.f90 -V -what
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.0.0 Build 20231017
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

 Intel(R) Fortran 24.0-1238.2
GNU ld version 2.39-9.fc38
$ ./a.out
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
libc.so.6          00007FA43976BB70  Unknown               Unknown  Unknown
a.out              00000000004051C8  crashme                     3  cause_traceback.f90
a.out              000000000040528D  crashme_.t56p               0  cause_traceback.f90
a.out              0000000000405202  crash                      10  cause_traceback.f90
a.out              000000000040519D  Unknown               Unknown  Unknown
libc.so.6          00007FA439755B4A  Unknown               Unknown  Unknown
libc.so.6          00007FA439755C0B  __libc_start_main     Unknown  Unknown
a.out              00000000004050B5  Unknown               Unknown  Unknown

There are a lot of stack corruption issues that can confuse the traceback: stack alignments, sizes, allocations, glibc differences, etc. can all affect it. This simple case should work on your AMD system.

artu_72 · ‎01-11-2024

In order:

- I/O unit number -130: this number is assigned by the newunit=lu specifier in the open statement. It is in the Fortran standard and documented in the Intel Fortran manual. The unit is connected; I have verified with gfortran that the error was raised by a wrong format when reading.

- Your simple program cause_traceback.f90 works well on my PC, giving the same output as yours (except for the PC column, of course).

Meanwhile, I have compiled another big project (~100 routines) and it dumps the stack trace flawlessly in case of error. Perhaps in my troublesome previous code there is some hidden feature that prevents the stack dump.

Ron_Green · ‎01-11-2024

I/O units: yes, I should have been more exact about negative unit numbers. When I said we use them "internally", I should have elaborated that we use negative numbers for newunit. This is because negatives are typically not used by customers via an explicit unit=-#. So we decided to implement 'newunit' to return and use negative numbers for these units. It's an internally generated unit number.

For traceback, it is not uncommon for a badly corrupted stack to yield no traceback. If the stack is completely obliterated or has corrupted return addresses, the traceback will fail. In other words, the call frames on the stack may have been overwritten somehow. Or perhaps the stack was exhausted?

There is no difference in our code generation for traceback between AMD and Intel. And as you have seen, some or most apps will give the traceback.

Steve_Lionel · ‎01-12-2024

FWIW, the standard requires NEWUNIT to return a negative unit number. There are some that have been traditionally used by DEC/Compaq/Intel Fortran for certain predefined units used by PRINT, unit * and internal I/O - so NEWUNIT starts at around -128 (from memory) and goes down from there.
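
To make the NEWUNIT discussion concrete, here is a minimal sketch (not the original poster's code) that opens a scratch file with NEWUNIT, prints the unit number the runtime chose, and then provokes a list-directed read error. The program name, the variable names and the scratch-file contents are illustrative assumptions. Using iostat= and iomsg= lets the program report the failing unit and message itself, which can help locate a bad read even when the traceback is unusable.

```fortran
program newunit_demo
  implicit none
  integer :: lu, ios, n
  character(len=256) :: msg

  ! NEWUNIT= lets the runtime pick the unit number; the value is negative.
  open(newunit=lu, status='scratch', action='readwrite', form='formatted', &
       iostat=ios, iomsg=msg)
  if (ios /= 0) then
     print *, 'open failed: ', trim(msg)
     stop 1
  end if
  print *, 'unit chosen by NEWUNIT: ', lu

  ! Write text that is not a valid integer, then read it back list-directed.
  write(lu, '(a)') 'not-a-number'
  rewind(lu)
  read(lu, *, iostat=ios, iomsg=msg) n

  ! With iostat= present the runtime returns an error code instead of aborting.
  if (ios /= 0) then
     print *, 'read failed on unit ', lu, ': ', trim(msg)
  else
     print *, 'read n = ', n
  end if

  close(lu)
end program newunit_demo
```

The exact unit number and the wording of the error message are compiler-dependent, so none are shown here. Without iostat=, the same bad read would terminate the program with a runtime error like the one at the top of this thread.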