Intel® Fortran Compiler

xtb: ICE during compilation of constrain_pot.f90

foxtran
New Contributor I

Hello!

I tried to compile xtb (https://github.com/grimme-lab/xtb). Unfortunately, ifx 2024.1.0 fails with an internal compiler error (ICE):

          #0 0x00000000023916e2
          #1 0x00000000023f5bc7
          #2 0x00000000023f5cf0
          #3 0x00007f232544fb50
          #4 0x0000000004394b15
          #5 0x0000000004395724
          #6 0x0000000004394bd1
          #7 0x0000000004391eec
          #8 0x000000000439099c
          #9 0x000000000438ffe2
         #10 0x000000000339c5af
         #11 0x000000000339c42d
         #12 0x000000000272fcd5
         #13 0x00000000023452ed
         #14 0x0000000002736dbd
         #15 0x000000000234505d
         #16 0x000000000272ea2a
         #17 0x000000000232eacf
         #18 0x000000000232cf52
         #19 0x00000000022d97eb
         #20 0x00000000024b171c
         #21 0x00007f232543bd85 __libc_start_main + 229
         #22 0x00000000021134e9

/home/c_lokgi/workdir/xtb/src/constrain_pot.f90: error #5633: **Internal compiler error: segmentation violation signal raised** Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.
compilation aborted for /home/c_lokgi/workdir/xtb/src/constrain_pot.f90 (code 3)

Steps to reproduce:

git clone https://github.com/grimme-lab/xtb.git
cd xtb
cmake -Bbuild_ifx -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_Fortran_COMPILER=ifx -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-g" -DCMAKE_CXX_FLAGS="-g" -DCMAKE_Fortran_FLAGS="-g"
cd build_ifx
make -j$(nproc)


I will post what I find in this thread.

I do not know whether this is fixed in later ifx releases, so anyone is welcome to try to reproduce it.


foxtran
New Contributor I

Here is the diff that resolves the issue:

diff --git a/src/constrain_pot.f90 b/src/constrain_pot.f90
index 1ce0cdd..f45f7e3 100644
--- a/src/constrain_pot.f90
+++ b/src/constrain_pot.f90
@@ -173,7 +173,7 @@ subroutine constrain_dist(fix,n,at,xyz,g,e)
    real(wp),intent(inout) :: e
    real(wp),intent(inout) :: g(3,n)

-   integer i,j,k,l,m,mm
+   integer i,j,k,l,m,mm,idx
    real(wp)rij(3),dum,r0,r,vp(3),ra(3),rb(3)
    real(wp)va(3),vb(3),vc(3),vab(3),vcb(3),deda(3),dedc(3),dedb(3)
    real(wp)dda(3),ddb(3),ddc(3),ddd(3)
@@ -198,8 +198,10 @@ subroutine constrain_dist(fix,n,at,xyz,g,e)
       e=e+fix%fc*dum
       ff=fix%fc*fix%expo(m)*dum2
       dum=ff/r
-      g(:,j)=g(:,j)+dum*rij
-      g(:,i)=g(:,i)-dum*rij
+      do idx = 1, 3
+        g(idx,j)=g(idx,j)+dum*rij(idx)
+        g(idx,i)=g(idx,i)-dum*rij(idx)
+      end do

    enddo

 

@Igor_V_Intel, could you please take a look at why the ICE happens?

foxtran
New Contributor I

A simpler patch is to check for `i == j` and skip those pairs. That condition is never true in practice, but the check keeps the compiler from failing internally (most probably in vectorization).

See the patch here: https://github.com/grimme-lab/xtb/pull/1106/files
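
For readers who want to see the two workarounds side by side, here is a minimal standalone sketch. It is not the xtb code: the names `ia`, `ja`, `npair`, and the values of `rij` and `dum` are hypothetical, and the `i == j` guard is only my assumption about the shape of the check, not a copy of the linked pull request. It shows that the array-section update, the explicit component loop, and the guarded version give the same gradient.

```fortran
program gradient_update_demo
   implicit none
   integer, parameter :: wp = selected_real_kind(15)
   integer, parameter :: n = 4          ! hypothetical atom count
   integer, parameter :: npair = 2      ! hypothetical number of constrained pairs
   integer :: ia(npair), ja(npair)      ! hypothetical pair index lists
   real(wp) :: g_ref(3,n), g_loop(3,n), g_skip(3,n)
   real(wp) :: rij(3), dum
   integer :: m, i, j, idx

   ia = [1, 2]
   ja = [3, 4]
   rij = [0.5_wp, -1.0_wp, 2.0_wp]      ! made-up distance vector
   dum = 0.25_wp                        ! made-up prefactor
   g_ref = 0.0_wp
   g_loop = 0.0_wp
   g_skip = 0.0_wp

   do m = 1, npair
      i = ia(m)
      j = ja(m)

      ! Original form: whole-column array sections.
      g_ref(:,j) = g_ref(:,j) + dum*rij
      g_ref(:,i) = g_ref(:,i) - dum*rij

      ! Workaround 1: explicit loop over the three Cartesian components.
      do idx = 1, 3
         g_loop(idx,j) = g_loop(idx,j) + dum*rij(idx)
         g_loop(idx,i) = g_loop(idx,i) - dum*rij(idx)
      end do

      ! Workaround 2 (assumed shape): skip identical indices,
      ! then keep the array-section form unchanged.
      if (i == j) cycle
      g_skip(:,j) = g_skip(:,j) + dum*rij
      g_skip(:,i) = g_skip(:,i) - dum*rij
   end do

   ! All three variants must agree element by element.
   if (any(g_ref /= g_loop) .or. any(g_ref /= g_skip)) error stop 'updates differ'
   print '(a)', 'all three forms give the same gradient'
end program gradient_update_demo
```

Since the three variants are numerically identical, either workaround should be safe to apply; which one avoids the ICE depends on the compiler version.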

Ron_Green
Moderator

Thanks for the git info.  I am able to see the ICE and am investigating.  

Ron_Green
Moderator

This seems to be an issue we've been having with some codes and -g -traceback. If you remove those options it should build, but of course I will isolate something to take to the developers. For me it is crashing on a DO CONCURRENT at src/disp/ncoord.f90, line 437. Again, it's the -g -traceback options triggering it.

xtb/src/disp/ncoord.f90(437): error #5623: **Internal compiler error: internal abort** Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.
      do concurrent(tx = -rep_cn(1):rep_cn(1), &
------^

Ron_Green
Moderator

The more recent compiler I'm using has trouble with the mask expression that follows the last index spec. I can work around the issue with this change to src/disp/ncoord.f90:

before

 437       do concurrent(tx = -rep_cn(1):rep_cn(1), &
 438             &       ty = -rep_cn(2):rep_cn(2), &
 439             &       tz = -rep_cn(3):rep_cn(3), &
 440             &       tx.ne.0 .or. ty.ne.0 .or. tz.ne.0)

after (this compiles without issue even with -g -traceback)

 437       do concurrent(tx = -rep_cn(1):rep_cn(1), &
 438             &       ty = -rep_cn(2):rep_cn(2), &
 439             &       tz = -rep_cn(3):rep_cn(3), &
 440             &       (tx.ne.0 .or. ty.ne.0 .or. tz.ne.0))

 

The issue is that the parser does not recognize your mask expression without the parentheses, and only when -g -traceback is used. Those options cause a different code path in the parser and in the IR we generate.
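
As an illustration of the parenthesized form, here is a small standalone program. It is a sketch, not the reproducer from ncoord.f90: the `visited` array and the value of `rep_cn` are made up for the example. It marks every lattice translation except the origin, so for `rep_cn = [1, 1, 1]` it should mark 26 of the 27 cells.

```fortran
program mask_demo
   implicit none
   integer :: rep_cn(3), tx, ty, tz
   logical, allocatable :: visited(:,:,:)   ! hypothetical bookkeeping array

   rep_cn = [1, 1, 1]
   allocate(visited(-rep_cn(1):rep_cn(1), -rep_cn(2):rep_cn(2), -rep_cn(3):rep_cn(3)))
   visited = .false.

   ! The mask expression is wrapped in parentheses, as in the workaround.
   ! Each iteration writes a distinct element, so iterations are independent.
   do concurrent(tx = -rep_cn(1):rep_cn(1), &
         &       ty = -rep_cn(2):rep_cn(2), &
         &       tz = -rep_cn(3):rep_cn(3), &
         &       (tx.ne.0 .or. ty.ne.0 .or. tz.ne.0))
      visited(tx, ty, tz) = .true.
   end do

   ! 3*3*3 cells minus the masked-out origin.
   if (count(visited) /= 26) error stop 'unexpected count'
   print '(a,i0)', 'translations visited: ', count(visited)
end program mask_demo
```

The extra parentheses do not change the meaning of the mask, so the change is safe to keep for compilers that accept both forms.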

 

I am doing some more tests with the public 2024.2.0 and upcoming 2025.0 early builds.

Ron_Green
Moderator

Unfortunately, the only compiler that is able to compile ncoord.f90 is our main branch, which is what we will use for 2025.1. But this compiler will not be public until late Q1 2025.
I tested 2024.1.0, 2024.2.0, and a 2025.0.0 preview build (not public). All 3 get the ICE. The main branch nightly build needs the parens around the mask expression to work.

 

All of them will work if you can remove -g -traceback. 

 

I'd like to isolate this a bit further: the mask expression should not require the parens, and there is something lurking with -g -traceback. We've had other issues with traceback lately, as we have made some major changes to how we create IR for traceback.

foxtran
New Contributor I

@Ron_Green, do you see the issue with constrain_pot.f90 without my patch with your versions of ifx? 

Ron_Green
Moderator

I put the original, pre-patched constrain_pot.f90 back into src.  Yes, it crashes with every compiler I have.  I will come back to that one and triage it later.  I think you're right, it may be something in the vectorizer.

Ron_Green
Moderator

Following up on my research on disp/ncoord.f90

I isolated this down to a single-file reproducer of around 20 source lines. Unfortunately our servers are down for maintenance today, so I will share the reproducing code next week when they come back online. The root cause is two DO CONCURRENT constructs, one after the other, combined with debug symbol creation and other syntactic patterns. It is a unique pattern that we have only seen in one other application.

 

This bug is a regression.  2024.x has no problem with ncoord.f90.  Unfortunately, the upcoming 2025.0.x versions will hit an Internal Compiler Error (ICE) when any of the options "-g", "-traceback", or "-g -traceback" are used along with "-qopenmp" or "-fiopenmp" for the particular code pattern found in ncoord.f90.

This is the bad news.  For XTB users, you should advise them to either use 2024.x version compilers, or avoid using compiler options "-g", "-g -traceback", or "-traceback" if they are building with OpenMP with the 2025.0.x compilers (to be released in late October or November).  For you to research:  I see "-g" in the CMAKE flag options, but where is "-traceback" coming from?  I see -traceback in the cmake VERBOSE output.  That option needs to be avoided with 2025.0.x as well, as it implies debug symbol creation for the traceback. 

 

Now for the good news: internal testing found this bug as well in an unrelated code. We fixed this in the development code branch for 2025.1.0, which is due March/April 2025. We traced the fix back to a specific edit and understand the root cause and its fix. That took some time to find, as the fix was adjacent but not directly related to what I found in ncoord. Fortunately, the fix works for the XTB failure and the adjacent failure.

I can't get this backported to 2025.0.x. The code branch for 2025.0.x has been taken and stabilized. The fix for the ncoord bug requires a string of other changes we've made on our dev branch that will not go into 2025.0.x versions. A backport from our dev branch for this fix would destabilize and probably break the 2025.0 code branch. The conclusion is that we'll have to wait for the 2025.1.0 update release at the end of Q1 2025.

 

I think the best course for XTB is either to avoid debug options (-g or -traceback) with the upcoming 2025.0.0 and any patch releases such as 2025.0.1, which will not have the fix for this regression, or to recommend 2024.x until 2025.1.0 is released. Without debug symbols, xtb will build fine with 2025.0.x, but with debug symbols and OpenMP you will hit this ICE. I don't see any other issues with xtb on 2025.0.x branch builds, provided they are built without debug symbols AND with the code patch you provided for constrain_pot.f90. As I said, you WILL see this fix appear in the 2025.1.0 Update Release around March/April 2025; that compiler should work for xtb with debug builds.

 

I will now go back and triage what you reported, which looks like a vectorizer bug in constrain_pot.f90. We need to understand this break as well and get it fixed.

foxtran
New Contributor I

Is there any progress with constrain_pot.f90? 

I've found one more bug with xTB, but in another file. This OpenMP section fails at runtime:
https://github.com/grimme-lab/xtb/blob/e39b5caab37b6b02cc63592a8acff14952ad5266/src/type/calculator.f90#L150-L207

More precisely, it fails at this line:
https://github.com/grimme-lab/xtb/blob/e39b5caab37b6b02cc63592a8acff14952ad5266/src/type/calculator.f90#L166

Steps to reproduce:

 

git clone git@github.com:grimme-lab/xtb.git
# Apply patch for constrain_pot.f90
cmake -Bbuild_rel_llvm -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_Fortran_COMPILER=ifx 
cd build_rel_llvm
make -j 47 && ctest . --parallel 12 -R xtb/hessian --verbose

 

 

Some interesting related links:
https://github.com/dftd4/dftd4/issues/259
https://github.com/dftd3/simple-dftd3/blob/50245008d0bf596e6da98a3f7bdeec4e6bc284fa/src/dftd3/ncoord.f90#L151

For some reason, the developers there added `schedule(runtime)` to avoid some issues with Intel compilers. You have probably already found this.
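
To show what `schedule(runtime)` does, here is a minimal sketch. It is not taken from dftd4 or simple-dftd3: the array `cn` and the size `n` are hypothetical. The clause defers the choice of loop schedule to the `OMP_SCHEDULE` environment variable (or `omp_set_schedule`) instead of fixing it at compile time.

```fortran
program schedule_runtime_demo
   implicit none
   integer, parameter :: dp = selected_real_kind(15)
   integer, parameter :: n = 1000       ! hypothetical problem size
   real(dp) :: cn(n)                    ! hypothetical per-atom result array
   integer :: i

   ! The schedule is read at run time, e.g. from OMP_SCHEDULE.
   ! Without OpenMP enabled, the directives are plain comments and the
   ! loop runs serially with the same result.
   !$omp parallel do default(none) schedule(runtime) shared(cn) private(i)
   do i = 1, n
      cn(i) = real(i, dp)
   end do
   !$omp end parallel do

   ! 1 + 2 + ... + 1000 = 500500, independent of the schedule chosen.
   if (sum(cn) /= 500500.0_dp) error stop 'unexpected sum'
   print '(a,f0.1)', 'sum = ', sum(cn)
end program schedule_runtime_demo
```

With OpenMP enabled, running it as `OMP_SCHEDULE="static" ./a.out` or `OMP_SCHEDULE="dynamic,4" ./a.out` switches the schedule without recompiling, which makes it easier to test whether a runtime failure depends on the schedule.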

I am still using ifx 2024.1.
