Intel® Fortran Compiler

ifx OpenMP compilation segmentation fault

jvo203
New Contributor I
4,734 Views

The attached "fixed_array.f90" code compiles fine with gfortran and ifort but fails with the new ifx.

 

No problems with ifort:

chris@capricorn:/tmp> ifort -Ofast -qopenmp fixed_array.f90 -o fixed_array.o -c

 

A segmentation fault with ifx:

chris@capricorn:/tmp> ifx -Ofast -qopenmp fixed_array.f90 -o fixed_array.o -c
#0 0x0000000001fc5727 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x1fc5727)
#1 0x0000000001fc5850 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x1fc5850)
#2 0x00007f191b2bf420 __restore_rt (/lib64/libc.so.6+0x3e420)
#3 0x00000000039d23ae (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x39d23ae)
#4 0x00000000039ce5f5 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x39ce5f5)
#5 0x00000000039a2473 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x39a2473)
#6 0x00000000039a22f3 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x39a22f3)
#7 0x0000000002e97add (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x2e97add)
#8 0x00000000022ebeec (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x22ebeec)
#9 0x0000000002e84fbd (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x2e84fbd)
#10 0x00000000022f30b7 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x22f30b7)
#11 0x0000000002e8519d (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x2e8519d)
#12 0x00000000022eaa8a (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x22eaa8a)
#13 0x0000000001f08104 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x1f08104)
#14 0x0000000001f06b83 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x1f06b83)
#15 0x0000000001eb5859 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x1eb5859)
#16 0x00000000020797c5 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x20797c5)
#17 0x00007f191b2a8bb0 __libc_start_call_main (/lib64/libc.so.6+0x27bb0)
#18 0x00007f191b2a8c79 __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x27c79)
#19 0x0000000001cf1729 (/opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/xfortcom+0x1cf1729)

fixed_array.f90: error #5633: **Internal compiler error: segmentation violation signal raised** Please report this error along with the circumstances in which it occurred in a Software Problem Report. Note: File and line given may not be explicit cause of this error.
compilation aborted for fixed_array.f90 (code 3)

 

It doesn't matter whether "-qopenmp" or "-fopenmp" is used. OpenMP comes into play in other source files that are part of a large mixed-source C / FORTRAN project.

 

The error goes away without the OpenMP flag:

chris@capricorn:/tmp> ifx -Ofast fixed_array.f90 -o fixed_array.o -c

 

The Intel CPU is "model name : Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz", OS is 64-bit openSUSE Tumbleweed.

chris@capricorn:/tmp> uname -a
Linux capricorn 6.2.10-1-default #1 SMP PREEMPT_DYNAMIC Thu Apr 6 10:36:55 UTC 2023 (ba7816e) x86_64 x86_64 x86_64 GNU/Linux

0 Kudos
23 Replies
Ron_Green
Moderator
4,214 Views

I think I know where the problem is. There was a similar issue not too long ago.

I'll get a bug report started on this.

jvo203
New Contributor I
4,210 Views
0 Kudos
Ron_Green
Moderator
4,629 Views

Bug ID is CMPLRLLVM-46969


Reproducer reduced to this:


subroutine to_fixed()
   implicit none
   integer(kind=4) :: i, j

   ! ICE needs DO CONCURRENT with an internal BLOCK, and must be 2D
   ! Interesting: only the 2D do concurrent causes the ICE
   ! for 1D this does not ICE
   ! do concurrent(i=1:16)   ! no ICE for this 1D case
   do concurrent(j=1:16, i=1:16)
      block
         integer :: x1
      end block
   end do

end subroutine to_fixed


jvo203
New Contributor I
4,122 Views

Splendid! It's nice to see that Intel does care about improving its products.

0 Kudos
Ron_Green
Moderator
3,909 Views

This bug is fixed in 2023.2.0.


jvo203
New Contributor I
3,893 Views

Thank you so much for fixing this. I have just checked it out. Everything compiles with the new 2023.2.0 icx / ifx and, more importantly, the program seems to work fine.

0 Kudos
jvo203
New Contributor I
3,798 Views

Unfortunately, the DO CONCURRENT is now causing a runtime segmentation fault with the new ifx when compiled with OpenMP. Please see the attached reproducer code; the program either hangs indefinitely or terminates with a segmentation fault.

0 Kudos
jvo203
New Contributor I
3,769 Views

Upon more extensive testing of the whole application, "DO CONCURRENT" is still causing a segmentation fault. I have already raised a new issue and attached a simplified reproducer FORTRAN code. Unfortunately, my new post, entitled "DO CONCURRENT segmentation fault with ifx (IFX) 2023.2.0 20230721", is being rejected (marked) as spam!

 

The big C / FORTRAN application compiled with icx / ifx terminates with the following segmentation fault, caused by the "DO CONCURRENT":

 

Thread 13 "MHD-single" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe03fe840 (LWP 6702)]
0x00007fffeb887b83 in __kmp_fork_barrier (gtid=-510132800, tid=-342298832) at ../../src/kmp_barrier.cpp:2540
2540 ../../src/kmp_barrier.cpp: No such file or directory.
(gdb) bt
#0  0x00007fffeb887b83 in __kmp_fork_barrier (gtid=-510132800, tid=-342298832) at ../../src/kmp_barrier.cpp:2540
#1  0x00007fffeb8d2778 in __kmp_fork_call (loc=0x7fffe197fdc0, gtid=-342298832, call_context=(unknown: 0xe197fe40), argc=0, microtask=0x0, invoker=0x0, ap=0x7fffe03fbca0)
    at ../../src/kmp_runtime.cpp:2649
#2  0x00007fffeb88a033 in __kmpc_fork_teams (loc=0x7fffe197fdc0, argc=-342298832, microtask=0x7fffe197fe40) at ../../src/kmp_csupport.cpp:499
#3  0x0000000000614a23 in fixed_array::to_fixed (compressed=<error reading variable: value requires 211410 bytes, which is more than max-value-size>,
    x=<error reading variable: value requires 705600 bytes, which is more than max-value-size>, pmin=<optimized out>, pmax=-0.281975746, ignrval=-1.00000002e+30, datamin=-0.281975746,
    datamax=-5.5271781e+19) at src/fixed_array.f90:50
#4  0x00000000006555fd in fits_mp_read_fits_file_.DIR.OMP.PARALLEL.2.split7647 () at src/fits.f90:4457
#5  0x00007fffeb963493 in __kmp_invoke_microtask () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#6  0x00007fffeb8d1533 in __kmp_invoke_task_func (gtid=-510132800) at ../../src/kmp_runtime.cpp:8273
#7  0x00007fffeb8d0470 in __kmp_launch_thread (this_thr=0x7fffe197fdc0) at ../../src/kmp_runtime.cpp:6648
#8  0x00007fffeb9641ff in _INTERNAL1ebb3278::__kmp_launch_worker (thr=0x7fffe197fdc0) at ../../src/z_Linux_util.cpp:559
#9  0x00007fffeb690c64 in start_thread () from /usr/lib64/libc.so.6
#10 0x00007fffeb718550 in clone3 () from /usr/lib64/libc.so.6
 
Anyway, please see the newly-raised issue for a short reproducer program.
0 Kudos
jimdempseyatthecove
Honored Contributor III
3,783 Views

This may be a stack overflow situation.

Set OMP_STACKSIZE to a larger size (the default is 4 MB).

Jim Dempsey

 

0 Kudos
jvo203
New Contributor I
3,762 Views

I have set the OpenMP stack size to 1 GB, with no luck. A segmentation fault still occurs:

Thread 3 "intel_fixed_arr" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff5ebf87c0 (LWP 20051)]
0x00007ffff758074d in rml::internal::BackRefMain::findFreeBlock() () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
(gdb) bt
#0 0x00007ffff758074d in rml::internal::BackRefMain::findFreeBlock() () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#1 0x00007ffff758059c in rml::internal::BackRefIdx::newBackRef(bool) () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#2 0x00007ffff7574290 in rml::internal::ExtMemoryPool::mallocLargeObject(rml::internal::MemoryPool*, unsigned long) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#3 0x00007ffff756edb3 in rml::internal::MemoryPool::getFromLLOCache(rml::internal::TLSData*, unsigned long, unsigned long) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#4 0x00007ffff756f727 in scalable_aligned_malloc () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#5 0x000000000041a050 in for_allocate_handle ()
#6 0x00000000004087fb in main::to_fixed (compressed=<error reading variable: value requires 211410 bytes, which is more than max-value-size>,
x=<error reading variable: value requires 705600 bytes, which is more than max-value-size>, datamin=0, datamax=1) at intel_fixed_array.f90:91
#7 0x0000000000407b3c in MAIN__.DIR.OMP.PARALLEL.LOOP.2591.split596 () at intel_fixed_array.f90:54
#8 0x00007ffff7563493 in __kmp_invoke_microtask () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#9 0x00007ffff74d1533 in __kmp_invoke_task_func (gtid=-140836864) at ../../src/kmp_runtime.cpp:8273
#10 0x00007ffff74d0470 in __kmp_launch_thread (this_thr=0x7ffff79b0000) at ../../src/kmp_runtime.cpp:6648
#11 0x00007ffff75641ff in _INTERNAL1ebb3278::__kmp_launch_worker (thr=0x7ffff79b0000) at ../../src/z_Linux_util.cpp:559
#12 0x00007ffff7290c64 in start_thread () from /lib64/libc.so.6
#13 0x00007ffff7318550 in clone3 () from /lib64/libc.so.6

 

Even setting the stack size to 10 GB does not help; this time the program hangs, and interrupting it (SIGINT) gives the following backtrace:

Thread 1 "intel_fixed_arr" received signal SIGINT, Interrupt.
0x00007ffff757f1d0 in rml::internal::Backend::IndexedBins::getFromBin(int, rml::internal::BackendSync*, unsigned long, bool, bool, bool, int*) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
(gdb) bt
#0 0x00007ffff757f1d0 in rml::internal::Backend::IndexedBins::getFromBin(int, rml::internal::BackendSync*, unsigned long, bool, bool, bool, int*) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#1 0x00007ffff757eeb0 in rml::internal::Backend::IndexedBins::findBlock(int, rml::internal::BackendSync*, unsigned long, bool, bool, int*) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#2 0x00007ffff757ecc3 in rml::internal::Backend::genericGetBlock(int, unsigned long, bool) () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#3 0x00007ffff757fe7d in rml::internal::Backend::getLargeBlock(unsigned long) () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#4 0x00007ffff75742a4 in rml::internal::ExtMemoryPool::mallocLargeObject(rml::internal::MemoryPool*, unsigned long) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#5 0x00007ffff756edb3 in rml::internal::MemoryPool::getFromLLOCache(rml::internal::TLSData*, unsigned long, unsigned long) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#6 0x00007ffff756f727 in scalable_aligned_malloc () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#7 0x000000000041a050 in for_allocate_handle ()
#8 0x00000000004087fb in main::to_fixed (compressed=<error reading variable: value requires 211410 bytes, which is more than max-value-size>,
x=<error reading variable: value requires 705600 bytes, which is more than max-value-size>, datamin=0, datamax=1) at intel_fixed_array.f90:91
#9 0x0000000000407b3c in MAIN__.DIR.OMP.PARALLEL.LOOP.2591.split596 () at intel_fixed_array.f90:54
#10 0x00007ffff7563493 in __kmp_invoke_microtask () from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#11 0x00007ffff74d1533 in __kmp_invoke_task_func (gtid=-142475232) at ../../src/kmp_runtime.cpp:8273
#12 0x00007ffff74d27b8 in __kmp_fork_call (loc=0x7ffff7820020 <_INTERNALae4d1d49::rml::internal::defaultMemPool_space+224>, gtid=75, call_context=fork_context_intel, argc=212992, microtask=0x0,
invoker=0x34000, ap=0x7fffffffbbf0) at ../../src/kmp_runtime.cpp:2673
#13 0x00007ffff7489d23 in __kmpc_fork_call (loc=0x7ffff7820020 <_INTERNALae4d1d49::rml::internal::defaultMemPool_space+224>, argc=75, microtask=0x1) at ../../src/kmp_csupport.cpp:350
#14 0x00000000004085a9 in main () at intel_fixed_array.f90:50
#15 0x0000000000406d2d in main ()
#16 0x00007ffff722abf0 in __libc_start_call_main () from /lib64/libc.so.6
#17 0x00007ffff722acb9 in __libc_start_main_impl () from /lib64/libc.so.6
#18 0x0000000000406c35 in _start () at ../sysdeps/x86_64/start.S:115

0 Kudos
jvo203
New Contributor I
3,778 Views

Thank you for the suggestion; I will try it out tomorrow. However, one note of caution: the attached program executes fine with the classic Intel ifort as well as with gfortran, all without tinkering with the OpenMP stacksize. It also passes the valgrind checker with zero memory issues when compiled with ifort. The trouble happens only with ifx.

Here are the Makefile compiler flags used with ifort / ifx:

# Intel FORTRAN
IFORT = ifx # ifort or ifx
IFLAGS = -g -Ofast -xHost -axAVX -qopt-report=2 -qopenmp

 

And gfortran:

# GCC FORTRAN
FORT = gfortran
FLAGS = -march=native -g -Ofast -fPIC -fno-finite-math-only -funroll-loops -ftree-vectorize -fopenmp

 

The ifort-compiled version executes fine each and every time; the ifx one does not.

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,715 Views

In examining the error trace I find:

#8 0x00000000004087fb in main::to_fixed (compressed=<error reading variable: value requires 211410 bytes, which is more than max-value-size>,
x=<error reading variable: value requires 705600 bytes, which is more than max-value-size>, datamin=0, datamax=1) at intel_fixed_array.f90:91

 

Function to_fixed is called from line 54 within a parallel region.

Function to_fixed contains a "do concurrent" at line 93.

 

Should "do concurrent" code generate a parallel region, you will now have nested parallelism. Is this what you intended to do? Note, without explicit stating number of threads at PARALLEL DO and DO CONCURRENT (and do concurrent is parallelized), this would require 2x the number of hardware threads on your system.

 

As a simple test, replace the do concurrent with nested do loops (over j and i).
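
For illustration only, a minimal sketch of what that substitution might look like; the array name, bounds, and loop body below are placeholders, not taken from the attached reproducer:

! A minimal sketch of the suggested test: the original 2D DO CONCURRENT
! replaced by ordinary nested DO loops. The array, bounds, and loop body
! are placeholders, not the code from the attached reproducer.
subroutine to_fixed_test(x, nx, ny)
   implicit none
   integer, intent(in) :: nx, ny
   real, intent(inout) :: x(nx, ny)
   integer :: i, j

   ! Original form (which ifx appears to parallelize here when OpenMP is enabled):
   ! do concurrent (j = 1:ny, i = 1:nx)
   !    x(i, j) = 2.0 * x(i, j)
   ! end do

   ! Plain nested loops for the test; no nested parallel region is created.
   do j = 1, ny
      do i = 1, nx
         x(i, j) = 2.0 * x(i, j)
      end do
   end do
end subroutine to_fixed_test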

 

Jim Dempsey

 

 

jvo203
New Contributor I
3,694 Views

Thank you; yes, removing nested OpenMP parallelism fixes the runtime segmentation fault. Either running the "do concurrent" inside a non-OpenMP outer "do k = 1, nz" loop, or replacing the "do concurrent" with nested do loops (over j and i), removes the seg. fault. For performance reasons, having the outer loop "do k = 1, nz" use OpenMP is preferable to using the inner OpenMP "do concurrent".

 

The other compilers that I've tried (gfortran and ifort) do not parallelise "do concurrent" with OpenMP, so the code worked fine. The new ifx from Intel does parallelise "do concurrent". The only other Fortran compiler that does use OpenMP in "do concurrent" is nvfortran.

 

This raises a few questions / points:

 

1. Up until now, the recommended practice has been to use "do concurrent", as this enables compilers to do some extra optimizations (register use? pipelining?), even if the code did not really execute concurrently. In other words, the name "do concurrent" was a bit misleading: the loop was not really concurrent, but it was still a "better quality" do loop compared with a plain "do" (a minimal illustration follows after point 2 below).

 

2. On the subject of nested OpenMP parallelism: as far as I can remember, both gfortran and ifort allowed it, even if it wasn't recommended because it places undue pressure on the OpenMP thread pool. They never "bailed out" on me with a hard segmentation fault. Of course, manually (and carefully) setting num_threads in the outer and inner parallel loops/regions would alleviate the extra stress. It seems the new ifx compiler is less tolerant of nested OpenMP parallelism. Or perhaps the lack of support for nested parallelism is an outright bug. Is it a feature or is it a bug? My hunch is that it's a bug in the current ifx; nested parallelism should work.
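
To make point 1 concrete, here is a minimal, purely illustrative pair of loops (not taken from the real application); whether the DO CONCURRENT version is vectorized, parallelized, or simply run serially is entirely up to the compiler and its flags:

program do_vs_do_concurrent
   implicit none
   integer, parameter :: n = 1024
   real :: x(n), y(n)
   integer :: i

   x = 1.0
   y = 2.0

   ! Plain DO loop: explicitly sequential semantics.
   do i = 1, n
      y(i) = y(i) + 2.0*x(i)
   end do

   ! DO CONCURRENT: the programmer asserts that the iterations are
   ! independent, so the compiler MAY vectorize or parallelize them,
   ! but it is not required to run anything concurrently.
   do concurrent (i = 1:n)
      y(i) = y(i) + 2.0*x(i)
   end do

   print *, sum(y)
end program do_vs_do_concurrent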

0 Kudos
IanH
Honored Contributor III
3,637 Views

@jvo203 wrote:

...

1. Up until now the recommended practice has been to use "do concurrent" as this enables compilers to do some extra optimizations (register use? pipelining?), even if the code did not really execute concurrently. In other words, the name "do concurrent" was a bit misleading, the loop was not really concurrent but still, it was a "better quality" "do" loop compared with a plain "do".


Separate from the discussion on compiler/runtime bugs, I dispute the first part of this. The recommended practice is to write code that clearly expresses the author's intention, and then let the compiler and associated tools work out the best way of implementing things (perhaps with the odd nudge). Production compilers have lots of smarts in them around how to optimise normal do loops; they already need to "understand" what the code inside the loop is doing for many of those optimisations, so I am not sure that adding the concurrent keyword in source really adds much to that.

I also dispute that the name "do concurrent" is a bit misleading... I reckon it is very misleading!

0 Kudos
jvo203
New Contributor I
3,630 Views

"I also dispute that the name "do concurrent" is a bit misleading... I reckon it is very misleading!"

 

Amen to that! It's all so dependent on the compiler implementation and/or compiler flags.

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,679 Views

>>nested parallelism should work.

You, as the software engineer, have the responsibility to be cognizant of the requirements of nested parallelism such that it functions properly. By this I mean structure it such that you either do not have oversubscription .OR. the oversubscription is organized so as not to adversely affect performance.

 

Assume you want to incorporate nested parallelism: outer level PARALLEL DO, inner level DO CONCURRENT. And assume your platform has 16 available hardware threads (logical processors). You can efficiently partition the threads as:

8 for outer, 2 for inner (product is 16)
4 for outer, 4 for inner "
2 for outer, 8 for inner "
6 for outer, 2 for inner (product is less than 16)
2 for outer, 6 for inner "
16 for outer, 1 for inner
1 for outer, 16 for inner

 

For each case, precede the parallel region with CALL OMP_SET_NUM_THREADS(nn), where nn is the desired number of threads for the following parallel region.
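
For example, a minimal sketch of the 8-outer / 2-inner split might look roughly like this; the array, bounds, and loop body are placeholders, and it assumes nested parallelism is allowed (e.g. via omp_set_max_active_levels(2)):

program nested_partition
   use omp_lib
   implicit none
   integer, parameter :: nx = 16, ny = 16, nz = 64
   real :: a(nx, ny, nz)
   integer :: i, j, k

   a = 0.0
   call omp_set_max_active_levels(2)   ! allow one level of nesting

   call omp_set_num_threads(8)         ! 8 threads for the outer region
!$omp parallel do private(i, j)
   do k = 1, nz
      ! 2 threads for each inner region: 8 x 2 = 16 hardware threads in total
!$omp parallel do num_threads(2) private(i)
      do j = 1, ny
         do i = 1, nx
            a(i, j, k) = a(i, j, k) + 1.0
         end do
      end do
!$omp end parallel do
   end do
!$omp end parallel do

   print *, sum(a)
end program nested_partition

Here the product of the outer and inner thread counts matches the 16 hardware threads, so there is no oversubscription.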

 

Note that you can also use the function OMP_IN_PARALLEL in programs like yours:

 

 

!$ if(omp_in_parallel()) call omp_set_num_threads(1)
DO CONCURRENT ...
  ...
end do
!$ if(omp_in_parallel()) call omp_set_num_threads(omp_get_max_threads())

 

 

Or some variation on that (like saving and restoring the number of threads of the outer region).
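
For example, the save-and-restore variation might look roughly like this (an untested sketch; the array and loop body are placeholders):

subroutine scale_tile(a)
!$ use omp_lib
   implicit none
   real, intent(inout) :: a(16, 16)
   integer :: i, j, nsave

   nsave = 1
!$ nsave = omp_get_max_threads()                       ! remember the caller's setting
!$ if (omp_in_parallel()) call omp_set_num_threads(1)  ! serialize the inner loop

   do concurrent (j = 1:16, i = 1:16)
      a(i, j) = 2.0 * a(i, j)
   end do

!$ call omp_set_num_threads(nsave)                     ! restore for the outer region
end subroutine scale_tile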

 

Jim Dempsey

 

EDIT: I assume you read Ron Green's comment about the 2D DO CONCURRENT issue above.

 

0 Kudos
jvo203
New Contributor I
3,674 Views

Note, you can also use the function OMP_IN_PARALLEL in programs like yours

 

!$ if(omp_in_parallel()) call omp_set_num_threads(1)
DO CONCURRENT ...
  ...
end do
!$ if(omp_in_parallel()) call omp_set_num_threads(omp_get_max_threads())

Thanks for that; I wasn't aware of "omp_in_parallel()". That's a neat way to switch at runtime between parallelism and no parallelism in the inner loop, depending on whether the inner loop is being called from within a parallel region or a serial one. That's the way to go!

 

Regarding my saying "nested parallelism should work": I mean the ifx-compiled program should not really seg. fault even in an over-subscribed case (as long as there is plenty of RAM, the OMP stack is sufficiently large, etc.). I don't think gfortran or ifort would seg. fault in an over-subscribed case.

 

In the large C / FORTRAN program that this reduced example has been taken from, I do indeed use nested OpenMP regions and set num_threads at runtime for both the inner and outer regions so that there is no over-subscription. It's just that I don't remember gfortran or ifort ever causing a seg. fault due to possibly over-subscribed nested parallelism.

After simply recompiling with icx / ifx instead of gcc / gfortran, and without changing or adjusting the "do concurrent" statement, the program started to seg. fault on the "do concurrent" statement. The seg. fault occurs even with relatively small problem sizes where one would have thought there should not be any running-out-of-memory problems (say, with the OMP stack set to 10 GB). So the seg. fault here was a bit of a surprise.

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,652 Views

>>not segfault...as long as there is plenty of RAM, the OMP stack is sufficiently large etc

That is correct. 

 

Ron Green pointed out a bug related to dual indexing a DO CONCURRENT.

 

You can also consider the preamble test as:

!$ if(omp_in_parallel()) call omp_set_num_threads(max(omp_get_max_threads() / omp_get_num_threads(), 1))

Then you can experiment with tuning in one place in the program (!$omp parallel... num_threads(nn)...).

 

Jim Dempsey

 

0 Kudos
jvo203
New Contributor I
3,630 Views

Yes, Ron Green raised a compiler bug report for the 2D DO CONCURRENT compile-time bug, and it has been fixed in the latest ifx version (the code now compiles fine).

What we are dealing with now is a runtime 2D DO CONCURRENT bug (which may or may not be related to the previously raised compile-time bug). Whilst a workaround has been identified in this thread (disabling nested parallelism), a runtime bug is still a bug.

I've just drastically reduced the dimensionality of the problem:

! 3D data cube dimensions
! integer, parameter :: nx = 420, ny = 420, nz = 1908
integer, parameter :: nx = 20, ny = 20, nz = 8
 
but the runtime ifx bug still persists:
 
gdb-oneapi ./intel_fixed_array
(...)
Thread 1 "intel_fixed_arr" received signal SIGSEGV, Segmentation fault.
0x00007ffff5d6d9ec in rml::internal::Bin::addPublicFreeListBlock(rml::internal::Block*) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
(gdb) bt
#0 0x00007ffff5d6d9ec in rml::internal::Bin::addPublicFreeListBlock(rml::internal::Block*) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#1 0x00007ffff5d6d9ba in rml::internal::Block::freePublicObject(rml::internal::FreeObject*) ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#2 0x00007ffff5d707ef in scalable_aligned_free ()
from /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so
#3 0x00007ffff5cd90e5 in __kmp_internal_end_library (gtid_req=0) at ../../src/kmp_runtime.cpp:6836
#4 0x00007ffff7fca102 in _dl_call_fini () from /lib64/ld-linux-x86-64.so.2
#5 0x00007ffff7fce13e in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#6 0x00007ffff5a43c55 in __run_exit_handlers () from /lib64/libc.so.6
#7 0x00007ffff5a43dd0 in exit () from /lib64/libc.so.6
#8 0x00007ffff5a2abf7 in __libc_start_call_main () from /lib64/libc.so.6
#9 0x00007ffff5a2acb9 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x00000000004011f5 in _start () at ../sysdeps/x86_64/start.S:115
(gdb) q
A debugging session is active.
 
I also tried another test with the deprecated ifort: adding the "-parallel" flag to force parallelization of DO CONCURRENT. The result: nested parallelization works fine in ifort; here is the optimization report. At least the DO CONCURRENT OpenMP part seems to have been parallelised. The auto-vectorization etc. is another matter.
 

OpenMP Construct at intel_fixed_array.f90(92,7)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at intel_fixed_array.f90(92,7)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at intel_fixed_array.f90(92,7)
remark #17102: loop was not parallelized: not a parallelization candidate
remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
LOOP END

 

So given the following points:

 

1. With ifort, the 2D DO CONCURRENT statement works fine with nested parallelism.

 

2. ifx is Intel's new flagship Fortran compiler. As the successor to ifort, ifx too should be able to cope with a 2D DO CONCURRENT in a nested-parallelism context without causing a segmentation fault.

 

3. Given Ron Green's prior compile-time bug note, there is a strong possibility of a residual bug affecting 2D DO CONCURRENT in ifx.

 

would you please be able to raise a new runtime bug report against the ifx compiler? The above points suggest the possibility of a runtime compiler bug in ifx. As a normal Intel forum user I don't have the authority to create bug reports in the internal Intel systems; only a forum moderator (Ron Green?) can do so.

 

On a related note, I fully accept and am grateful for the advice about trying to avoid nested parallelism, or manually controlling the number of OpenMP threads when nested parallelism cannot be avoided.

0 Kudos