Hello!
I tried to compile xtb (https://github.com/grimme-lab/xtb). Unfortunately, with ifx 2024.1.0 I got an internal compiler error (ICE):
#0 0x00000000023916e2
#1 0x00000000023f5bc7
#2 0x00000000023f5cf0
#3 0x00007f232544fb50
#4 0x0000000004394b15
#5 0x0000000004395724
#6 0x0000000004394bd1
#7 0x0000000004391eec
#8 0x000000000439099c
#9 0x000000000438ffe2
#10 0x000000000339c5af
#11 0x000000000339c42d
#12 0x000000000272fcd5
#13 0x00000000023452ed
#14 0x0000000002736dbd
#15 0x000000000234505d
#16 0x000000000272ea2a
#17 0x000000000232eacf
#18 0x000000000232cf52
#19 0x00000000022d97eb
#20 0x00000000024b171c
#21 0x00007f232543bd85 __libc_start_main + 229
#22 0x00000000021134e9
/home/c_lokgi/workdir/xtb/src/constrain_pot.f90: error #5633: **Internal compiler error: segmentation violation signal raised** Please report this error along with the circumstances in which it occurred in a Software Problem Report. Note: File and line given may not be explicit cause of this error.
compilation aborted for /home/c_lokgi/workdir/xtb/src/constrain_pot.f90 (code 3)
Steps to reproduce:
git clone https://github.com/grimme-lab/xtb.git
cd xtb
cmake -Bbuild_ifx -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_Fortran_COMPILER=ifx -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-g" -DCMAKE_CXX_FLAGS="-g" -DCMAKE_Fortran_FLAGS="-g"
cd build_ifx
make -j$(nproc)
I will post what I find in this thread.
I do not know whether this has been fixed in later ifx releases, so anyone is welcome to try to reproduce it.
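When reproducing, it can help to see every file that trips the ICE rather than only the first one. The sketch below assumes the build output was saved to a log file and that ICE lines look like the error messages quoted in this thread; `ice_files` is a hypothetical helper name, not part of xtb or the Intel tools.

```shell
# Hypothetical helper: list the source files that an ifx build log blames
# for an internal compiler error. Handles both "file: error #N:" and
# "file(line): error #N:" prefixes, as seen in the messages in this thread.
ice_files() {
    # $1: path to a saved build log
    grep 'Internal compiler error' "$1" \
        | sed -E 's/^[[:space:]]*([^:(]+)(\([0-9]+\))?: error #[0-9]+:.*/\1/' \
        | sort -u
}
```

A possible use is `make -k -j$(nproc) 2>&1 | tee build.log` followed by `ice_files build.log`; `-k` asks make to keep going after a failed target, so later failures also land in the log.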
Here is a diff that resolves the issue:
diff --git a/src/constrain_pot.f90 b/src/constrain_pot.f90
index 1ce0cdd..f45f7e3 100644
--- a/src/constrain_pot.f90
+++ b/src/constrain_pot.f90
@@ -173,7 +173,7 @@ subroutine constrain_dist(fix,n,at,xyz,g,e)
real(wp),intent(inout) :: e
real(wp),intent(inout) :: g(3,n)
- integer i,j,k,l,m,mm
+ integer i,j,k,l,m,mm,idx
real(wp)rij(3),dum,r0,r,vp(3),ra(3),rb(3)
real(wp)va(3),vb(3),vc(3),vab(3),vcb(3),deda(3),dedc(3),dedb(3)
real(wp)dda(3),ddb(3),ddc(3),ddd(3)
@@ -198,8 +198,10 @@ subroutine constrain_dist(fix,n,at,xyz,g,e)
e=e+fix%fc*dum
ff=fix%fc*fix%expo(m)*dum2
dum=ff/r
- g(:,j)=g(:,j)+dum*rij
- g(:,i)=g(:,i)-dum*rij
+ do idx = 1, 3
+ g(idx,j)=g(idx,j)+dum*rij(idx)
+ g(idx,i)=g(idx,i)-dum*rij(idx)
+ end do
enddo
@Igor_V_Intel, could you please have a look at why the ICE happens?
A simpler patch is to check for `i == j` and skip those pairs. That condition should never be true in practice, but the check keeps the compiler from failing internally (most probably in vectorization).
See the patch here: https://github.com/grimme-lab/xtb/pull/1106/files
Thanks for the git info. I am able to see the ICE and am investigating.
This seems to be an issue we've been having with some codes and -g -traceback. If you remove those options it should build, but of course I will isolate something to take to the developers. For me it is crashing on a DO CONCURRENT at src/disp/ncoord.f90 line 437. Again, it's the -g -traceback options triggering it.
xtb/src/disp/ncoord.f90(437): error #5623: **Internal compiler error: internal abort** Please report this error along with the circumstances in which it occurred in a Software Problem Report. Note: File and line given may not be explicit cause of this error.
do concurrent(tx = -rep_cn(1):rep_cn(1), &
------^
The more recent compiler I'm using has the issue with the mask expression after the last index spec. I can work around the issue with this change to src/disp/ncoord.f90:
before
437 do concurrent(tx = -rep_cn(1):rep_cn(1), &
438 & ty = -rep_cn(2):rep_cn(2), &
439 & tz = -rep_cn(3):rep_cn(3), &
440 & tx.ne.0 .or. ty.ne.0 .or. tz.ne.0)
after, this compiles w/o issue even with -g -traceback
437 do concurrent(tx = -rep_cn(1):rep_cn(1), &
438 & ty = -rep_cn(2):rep_cn(2), &
439 & tz = -rep_cn(3):rep_cn(3), &
440 & (tx.ne.0 .or. ty.ne.0 .or. tz.ne.0))
The issue is that the parser does not recognize your mask expression without the parentheses, and only when -g -traceback is used. Those options cause a different code path in the parser and in the IR we generate.
I am doing some more tests with the public 2024.2.0 and upcoming 2025.0 early builds.
Unfortunately, the only compiler that is able to compile ncoord.f90 is our main branch, which is what we will use for 2025.1. But this compiler will not be public until later in Q1 2025.
I tested 2024.1.0, 2024.2.0, and a 2025.0.0 preview build (not public). All three get the ICE. The main-branch nightly build needs the parens around the mask expression to work.
All of them will work if you can remove -g -traceback.
I'd like to isolate this a bit further so that the mask expression does not require the parens, and there is something lurking with -g -traceback. We've had other issues with traceback lately, as we have made some major changes to how we create IR for traceback.
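As a rough sketch of the "remove -g -traceback" workaround, a flag string can be filtered before it is handed to CMake. `strip_debug_flags` is a hypothetical helper, and it only affects flags passed on the command line; flags added by the project's own CMake files would not be touched.

```shell
# Hypothetical helper: drop -g and -traceback from a space-separated flag
# string and print the remaining flags on one line.
strip_debug_flags() {
    out=""
    for flag in $1; do            # intentional word splitting on spaces
        case "$flag" in
            -g|-traceback) ;;     # skip the options reported to trigger the ICE
            *) out="${out:+$out }$flag" ;;
        esac
    done
    printf '%s\n' "$out"
}
```

For example, `-DCMAKE_Fortran_FLAGS="$(strip_debug_flags "$FFLAGS")"` could replace the `-DCMAKE_Fortran_FLAGS="-g"` argument in the configure step, assuming `FFLAGS` holds the flags you would otherwise pass.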
I put the original, pre-patched constrain_pot.f90 back into src. Yes, it crashes with every compiler I have. I will come back to that one and triage it later. I think you're right, it may be something in the vectorizer.
Following up on my research on disp/ncoord.f90
I isolated this down to a single-file reproducer of around 20 source lines. Unfortunately our servers are down for maintenance today. I will share the reproducing code next week when they come back online. The root cause is two DO CONCURRENT constructs, one after the other, combined with debug symbol creation and other syntactic patterns. It is a unique pattern that we have seen in only one other application.
This bug is a regression. 2024.x has no problem with ncoord.f90. Unfortunately, the upcoming 2025.0.x versions will hit an Internal Compiler Error (ICE) when any of the options "-g", "-traceback", or "-g -traceback" are used along with "-qopenmp" or "-fiopenmp" for the particular code pattern found in ncoord.f90.
This is the bad news. For XTB users, you should advise them to either use 2024.x version compilers, or avoid using compiler options "-g", "-g -traceback", or "-traceback" if they are building with OpenMP with the 2025.0.x compilers (to be released in late October or November). For you to research: I see "-g" in the CMAKE flag options, but where is "-traceback" coming from? I see -traceback in the cmake VERBOSE output. That option needs to be avoided with 2025.0.x as well, as it implies debug symbol creation for the traceback.
Now for the good news: internal testing found this bug as well in an unrelated code. We fixed this in the development code branch for 2025.1.0, which is due March/April 2025. We traced the fix back to a specific edit and understand the root cause and its fix. That took some time to find, as the fix was adjacent but not directly related to what I found in ncoord. Fortunately, the fix works for the XTB failure and the adjacent failure.
I can't get this backported to 2025.0.x. The code branch for 2025.0.x has been taken and stabilized. The fix for the ncoord bug requires a string of other changes we've made on our dev branch that will not go into 2025.0.x versions. A backport from our dev branch for this fix would destabilize and probably break the 2025.0 code branch. The conclusion is that we'll have to wait for the 2025.1.0 update release at the end of Q1 2025.
I think the best course for XTB is not to use debug, -g, or -traceback options with the upcoming 2025.0.0 or with any patch releases like 2025.0.1, which will not have the fix for this regression. Without the debug symbols, xtb will build fine with 2025.0.x. But with debug symbols and OpenMP you will hit this ICE. I don't see any other issues with xtb on 2025.0.x branch builds without debug symbols AND with the code patch you provided for constrain_pot.f90. But as I said, you WILL see this fix appear in the 2025.1.0 Update Release around March/April 2025, and that compiler should work for xtb with debug builds. Perhaps recommend using 2024.x until 2025.1.0 is released.
I will now go back and triage what you reported, which looks like a vectorizer bug in constrain_pot.f90. We need to understand this break as well and get it fixed.
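On the open question above of where `-traceback` comes from, one way to look is to search the CMake files of the checkout (or the generated build tree) for the flag. This is only a sketch: `find_flag` is a hypothetical helper, and the `flags.make` pattern assumes CMake's Makefile generator, which records per-target compiler flags in files of that name.

```shell
# Hypothetical helper: show which CMake-related files under a directory
# mention a given compiler flag, with file name and line number.
find_flag() {
    # $1: flag to look for (e.g. -traceback), $2: directory (default: .)
    grep -rnF \
        --include='CMakeLists.txt' --include='*.cmake' --include='flags.make' \
        -e "$1" "${2:-.}"
}
```

Running `find_flag -traceback .` from the top of the xtb checkout would list candidate source locations, and `find_flag -traceback build_ifx` would show which targets actually receive the flag. The function returns a non-zero status when nothing matches.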