OpenMP Segmentation Fault on XE Max GPU

lamb · ‎03-30-2021

I am trying to find the cause of a segmentation fault when running several of the SPEC Accel OpenMP benchmarks. Specifically, I have been working with the 552.ep (embarrassingly parallel) benchmark.

OS: x86_64 GNU/Linux
CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Accelerator: Intel Iris Xe MAX
Toolkit: Intel(R) oneAPI DPC++ Compiler 2021.1.2 (2020.10.0.1214)

Environment setup, compilation, and debugger commands:

$ ulimit -s unlimited
$ export IGC_EnableDPEmulation=1
$ export OverrideDefaultFP64Settings=1
$ source /opt/intel/oneapi/setvars.sh

$ icx -g -Wall -O3 -I. -fiopenmp -fopenmp-targets=spir64 *.c -o ep -lm
$ gdb ./ep

Here is the result of running gdb (run). I can install the missing debuginfos and update the question if that would help in finding a solution.

(gdb) run
Starting program: /552.pep/src/ep 
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-127.el8.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
[New Thread 0x155550008700 (LWP 811436)]
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
[New Thread 0x15554d9ac700 (LWP 811437)]
 Reading from input file ep.input


 NAS Parallel Benchmarks (NPB3.3-OPENMP-C) - EP Benchmark

 Number of random numbers generated:        67108864

Thread 1 "ep" received signal SIGSEGV, Segmentation fault.
0x000015553f455bca in vISA::G4_SrcRegRegion::computeLeftBound() () from /lib64/libigc.so.1
Missing separate debuginfos, use: yum debuginfo-install intel-gmmlib-20.4.1-i482.el8.x86_64 intel-igc-core-1.0.6410-i509.el8.x86_64 intel-igc-opencl-1.0.6410-i509.el8.x86_64 libedit-3.1-23.20170329cvs.el8.x86_64 libgcc-8.3.1-5.1.el8.x86_64 libstdc++-8.3.1-5.1.el8.x86_64 libxml2-2.9.7-8.el8.x86_64 ncurses-libs-6.1-7.20180224.el8.x86_64 xz-libs-5.2.4-3.el8.x86_64

(gdb) bt
#0  0x000015553f455bca in vISA::G4_SrcRegRegion::computeLeftBound() () from /lib64/libigc.so.1
#1  0x000015553f34ef81 in vISA::IR_Builder::createSrcRegRegion(G4_SrcModifier, G4_RegAccess, vISA::G4_VarBase*, short, short, RegionDesc const*, G4_Type, G4_AccRegSel) () from /lib64/libigc.so.1
#2  0x000015553f360739 in VISAKernelImpl::CreateVISASrcOperand(_VISA_VectorOpnd*&, _VISA_GenVar*, VISA_Modifier, unsigned short, unsigned short, unsigned short, unsigned char, unsigned char) () from /lib64/libigc.so.1
#3  0x000015553f256b97 in IGC::CEncoder::GetSourceOperand(IGC::CVariable*, IGC::SModifier const&) () from /lib64/libigc.so.1
#4  0x000015553f25948f in IGC::CEncoder::DataMov(ISA_Opcode, IGC::CVariable*, IGC::CVariable*) () from /lib64/libigc.so.1
#5  0x000015553f297893 in IGC::EmitPass::UniformCopy(IGC::CVariable*, IGC::CVariable*&, IGC::CVariable*, bool) () from /lib64/libigc.so.1
#6  0x000015553f2b9a7b in IGC::EmitPass::emitVectorStore(llvm::StoreInst*, llvm::Value*, llvm::ConstantInt*) () from /lib64/libigc.so.1
#7  0x000015553f2c0a1c in IGC::EmitPass::runOnFunction(llvm::Function&) () from /lib64/libigc.so.1
#8  0x00001555404265ce in llvm::FPPassManager::runOnFunction(llvm::Function&) () from /lib64/libigc.so.1
#9  0x0000155540426e41 in llvm::FPPassManager::runOnModule(llvm::Module&) () from /lib64/libigc.so.1
#10 0x0000155540427298 in llvm::legacy::PassManagerImpl::run(llvm::Module&) () from /lib64/libigc.so.1
#11 0x000015553f0a9ef9 in void IGC::CodeGen<IGC::OpenCLProgramContext>(IGC::OpenCLProgramContext*, llvm::MapVector<llvm::Function*, IGC::CShaderProgram*, llvm::DenseMap<llvm::Function*, unsigned int, llvm::DenseMapInfo<llvm::Function*>, llvm::detail::DenseMapPair<llvm::Function*, unsigned int> >, std::vector<std::pair<llvm::Function*, IGC::CShaderProgram*>, std::allocator<std::pair<llvm::Function*, IGC::CShaderProgram*> > > >&) () from /lib64/libigc.so.1
#12 0x000015553f079635 in IGC::CodeGen(IGC::OpenCLProgramContext*) () from /lib64/libigc.so.1
#13 0x000015553ef40e67 in TC::TranslateBuild(TC::STB_TranslateInputArgs const*, TC::STB_TranslateOutputArgs*, TC::TB_DATA_FORMAT, IGC::CPlatform const&, float) [clone .part.314] () from /lib64/libigc.so.1
#14 0x000015553eff261d in IGC::IgcOclTranslationCtx<0ul>::Impl::Translate(unsigned long, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, unsigned int, void*) const () from /lib64/libigc.so.1
#15 0x00001555506df59a in NEO::CompilerInterface::build(NEO::Device const&, NEO::TranslationInput const&, NEO::TranslationOutput&) ()
   from /lib64/libze_intel_gpu.so.1
#16 0x0000155550672a8e in L0::ModuleTranslationUnit::buildFromSpirV(char const*, unsigned int, char const*, char const*, _ze_module_constants_t const*) () from /lib64/libze_intel_gpu.so.1
#17 0x00001555506740ac in L0::ModuleImp::initialize(_ze_module_desc_t const*, NEO::Device*) () from /lib64/libze_intel_gpu.so.1
#18 0x0000155550674413 in L0::Module::create(L0::Device*, _ze_module_desc_t const*, L0::ModuleBuildLog*, L0::ModuleType) ()
   from /lib64/libze_intel_gpu.so.1
#19 0x000015555065fa48 in L0::DeviceImp::createModule(_ze_module_desc_t const*, _ze_module_handle_t**, _ze_module_build_log_handle_t**, L0::ModuleType) () from /lib64/libze_intel_gpu.so.1
#20 0x000015555129748e in __tgt_rtl_load_binary () from /opt/intel/oneapi/compiler/2021.1.2/linux/lib/libomptarget.rtl.level0.so
#21 0x00001555554ec6ba in DeviceTy::load_binary(void*) () from /opt/intel/oneapi/compiler/2021.1.2/linux/lib/libomptarget.so
#22 0x00001555554f8159 in CheckDeviceAndCtors(long) () from /opt/intel/oneapi/compiler/2021.1.2/linux/lib/libomptarget.so
#23 0x00001555554eeb63 in __tgt_target_data_begin_mapper () from /opt/intel/oneapi/compiler/2021.1.2/linux/lib/libomptarget.so
#24 0x0000000400000000 in ?? ()
#25 0x0000000000000000 in ?? ()

I'm wondering if I'm having the same issue as this post, and if so, if there's a workaround without modifying the source code:

https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/OpenMP-target-data-to-intel-GPU-segmentation-fault-when-array/m-p/1186086#M6856

Also, I'm intending to use the level0 backend, but I see "OpenCLProgramContext" in the gdb output. Does this mean I may be using the OpenCL backend by mistake?

Thank you!

RahulV_intel · ‎03-31-2021

Hi,

By default, the backend is set to level0. You can also set the backend explicitly using the environment variable LIBOMPTARGET_PLUGIN=OPENCL (or LEVEL0).

Can you try running on the OpenCL backend and let me know if it works?

Also, could you please attach the debug logs by setting the environment variable LIBOMPTARGET_DEBUG=2?
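
For reference, one way to collect those logs for both backends is sketched here. It relies only on the two variables named above (LIBOMPTARGET_PLUGIN and LIBOMPTARGET_DEBUG); the helper name run_with_backend and the log file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: run a command under a chosen libomptarget backend with debug
# output enabled, saving stdout and stderr to a per-backend log file.
# run_with_backend and the omptarget_<backend>.log names are hypothetical.
run_with_backend() {
    backend=$1
    shift
    env LIBOMPTARGET_PLUGIN="$backend" LIBOMPTARGET_DEBUG=2 "$@" \
        > "omptarget_${backend}.log" 2>&1
}

# Example (assumes the ep binary built earlier is in the current directory):
#   run_with_backend LEVEL0 ./ep
#   run_with_backend OPENCL ./ep
```

Writing each run to its own file keeps the two backends' output separate, which makes it easier to attach and compare them.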

In the other forum thread you linked, the segmentation fault occurs because the allocation exceeds the memory limit on the device side. Please make sure that your memory allocations stay within the device limits. (You can get the device info from the "clinfo" command.)

Please attach a small reproducer for your issue if possible.

Thanks,

Rahul

lamb · ‎03-31-2021

Thanks for the response! I've attached text files with the output from running with LIBOMPTARGET_DEBUG=2 under both LIBOMPTARGET_PLUGIN=OPENCL and LIBOMPTARGET_PLUGIN=LEVEL0.

It seems like I'm hitting the same error with OpenCL (gdb still reports SIGSEGV in vISA::G4_SrcRegRegion::computeLeftBound()).

I'll see if I can create a small code example that reproduces the error. I'll also see if I can temporarily reduce the memory size in the original application and see if that changes the behavior. Here is the information provided by clinfo:

Platform Name                                   Intel(R) OpenCL HD Graphics
  Device Name                                     Intel(R) Graphics [0x4905]
  Global memory size                              6811549696 (6.344GiB)
  Max memory allocation                           3405774848 (3.172GiB)
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             3405774848 (3.172GiB)
  Global Memory cache size                        1048576 (1024KiB)
  Global Memory cache line size                   64 bytes
  Max constant buffer size                        3405774848 (3.172GiB)
  Max size of kernel argument                     2048 (2KiB)

To see if I'm exceeding the device memory, I'm guessing I need to do the math on the variables in the OpenMP data clauses and compare?
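
A rough sketch of that comparison, using the "Max memory allocation" value from the clinfo output above. The element count and element size are assumptions for illustration only (67108864 doubles of 8 bytes each); the real numbers have to come from the arrays named in the benchmark's map clauses, and the helper name check_buffer is hypothetical.

```shell
#!/bin/sh
# "Max memory allocation" in bytes, copied from the clinfo output above.
max_alloc=3405774848

# check_buffer ELEMENT_COUNT ELEMENT_SIZE_BYTES
# Prints the size of one mapped buffer and whether it fits in max_alloc.
# (Hypothetical helper; the arguments are whatever the map clause implies.)
check_buffer() {
    bytes=$(( $1 * $2 ))
    if [ "$bytes" -le "$max_alloc" ]; then
        echo "$bytes bytes: fits within max allocation"
    else
        echo "$bytes bytes: exceeds max allocation"
    fi
}

# Assumed example: one array of 67108864 doubles (8 bytes each).
check_buffer 67108864 8
```

Each mapped buffer would be checked against "Max memory allocation" on its own, and the sum of all buffers against "Global memory size".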

lamb · ‎03-31-2021

Also, this may or may not be related, but I'm having trouble figuring out if my device supports double-precision. According to the results of clinfo, it seems like it could:

Platform Name                                   Intel(R) OpenCL HD Graphics
  Device Name                                     Intel(R) Graphics [0x4905]
  Device Version                                  OpenCL 3.0 NEO
  Device OpenCL C Version                         OpenCL C 1.2
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Device Extensions                               (no cl_khr_fp64)
  Device Extensions with Version                  cl_khr_fp64                                                      0x400000 (1.0.0)

However, if I try to run the application above without setting the "OverrideDefaultFP64Settings=1" and "IGC_EnableDPEmulation=1" environment variables (from here: https://community.intel.com/t5/Intel-DevCloud/Iris-Xe-MAX-node-is-missing-double-precision-support/td-p/1247876), I get the following errors:

error: double type is not supported on this platform
error: backend compiler failed build.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

Is there a way to somehow use the "Device Extensions with Version" for cl_khr_fp64 support?

lamb · ‎04-04-2021

I now realize I ran the above clinfo command after already setting the "OverrideDefaultFP64Settings=1" and "IGC_EnableDPEmulation=1" environment variables, which is why it reported double-precision support. If they are not set, clinfo does not report double-precision support, which is the expected behavior.
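
For anyone wanting to repeat the check, here is a small sketch that filters clinfo-style output down to the lines that mention double-precision support. The function name fp64_lines is made up, and the sample input is two lines copied from the clinfo output earlier in this thread.

```shell
#!/bin/sh
# Keep only the lines of clinfo-style output that mention fp64 or
# double-precision support. fp64_lines is a hypothetical helper name.
fp64_lines() {
    grep -i -E 'fp64|double-precision'
}

# Demonstration on two lines copied from the clinfo output in this thread.
printf '%s\n' \
    'Device Version                                  OpenCL 3.0 NEO' \
    'Double-precision Floating-point support         (cl_khr_fp64)' \
    | fp64_lines
```

On a real system the same filter can be applied to "clinfo | fp64_lines" in a fresh shell, and then to "env OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1 clinfo | fp64_lines", to see the difference the two variables make.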

RahulV_intel · ‎04-05-2021

Hi,

Double-precision computation is disabled by default on the GPU. To enable it, you must set both environment variables to 1 (OverrideDefaultFP64Settings=1 and IGC_EnableDPEmulation=1).

>>I'll see if I can create a small code example that reproduces the error. I'll also see if I can temporarily reduce the memory size in the original application and see if that changes the behavior.

Yes, a small reproducer will help. Also, let me know if it works after reducing the memory.

Thanks,

Rahul

RahulV_intel · ‎04-14-2021

Hi,

Do you have any updates on this? Were you able to replicate the issue with a small reproducer?

Let us know if you face any issues.

Thanks,

Rahul

RahulV_intel · ‎04-26-2021

Hi,

I have not heard back from you, so I will go ahead and close this thread from my end. Intel will no longer monitor this thread. Feel free to post a new query if you require further assistance from Intel.

Thanks,

Rahul