Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Strange crashes of OMP-parallelized FFT code on Itanium

ewmayer
Beginner
437 Views
A small bit of background:
I'm a relative novice to OpenMP - I decided on it because my literature
survey seemed to indicate that it requires far less code intervention
than, e.g., direct Posix thread coding would (though it seems OpenMP is
basically designed to be a user-friendly macro-ization of Posix threads.)

I have a large-integer-arithmetic C code that I'm currently trying
to parallelize. The key operation is a big-int multiply algorithm that
uses a double-precision FFT to effect the multiply - we're talking
about multi-megabit inputs here, i.e. FFTs on vectors of on the
the order of a million doubles. The FFT algorithm first does an initial
radix-R pass through the length-N vectors of doubles (this is not yet
parallelized, though eventually it will be), subsequent to which the
bulk of the work can be done by operating on R independent chunks,
each consisting of N/R doubles which are contiguous in memory. That
chunk-processing step is contained in the following short piece of
code, which I've stripped down to make it simpler to read:
  omp_set_num_threads(NTHREADS);
  #pragma omp parallel for default(shared) schedule(dynamic) nowait
  {
    for(i = 0; i < R; i += 2)
    {
	process_chunk(a,i, {bunch of other scalar and array arguments, all read-only});
    }
  }	/* end of #pragma omp parallel */
Key notes about the above piece of code:

* Each of the iterations through the above loop accesses a block
of the main data array a[] that is not touched by any of the other
threads. Except for a[] and the loop index i, all other arguments
to the process_chunk function are read-only (but not explicitly
declared const, since they need to be initialized and in some
cases permuted prior to their being used in the FFT), and are
assumed shared by all threads.

* If I understand the OpenMP documentation correctly, the loop
index i is thread-private by default, and all other variables
that need to be private are buried within the process_chunk()
function, so nothing need be explicitly declared private above.

The code builds fine (I'm using the Intel v9 C compiler) in both single-
and multi-threaded mode (i.e. without and with the -openmp flag and the
#defines that flag activates within the source). In single-threaded mode it
runs just fine and has been subject to extensive testing. When built
with -openmp, the compiler gives no warnings about anything related
to the above loop or the OpenMP pragmas, but the resulting executable
does not run reliably: sometimes it crashes outright, other times it runs
but gives incorrect outputs on known test cases, and still other times it
works fine for a while, then either crashes or begins outputting incorrect
results. I hope to get access to the Intel Thread analysis tools sometime
in the near future, but as it's a system I only have remote access to and
the tools are not installed there currently, it's not guaranteed. I also
have an account at the HP testdrive program and sent e-mail to the
sysadmin there asking if the thread tools are installed on any of
their systems (they are not on any of the Itanium systems, that much
I know), but haven't received a reply. In the meantime I thought I'd
post here and see if anyone has any ideas as to what might be causing
the behavior I'm seeing.

Since I don't currently have access to the thread tools, the best I've
been able to do is a small amount of debugging via building with the
-g flag and examination of the resulting coredump file. A sample of
the build & test output is appended below. I'd be happy to send a
zipfile containing the entire source archive to anyone who desires
it - just e-mail me at ewmayer@aol.com .

Thanks for any help,
Ernst Mayer

Here is some sample build and test output - the USE_THREADS and ERR_CHECK
flags are internal to the code (i.e. they're not compiler-related):
# icc -o Mlucas -g -static *.c -lm -DUSE_THREADS -openmp -DERR_CHECK

mers_mod_square.c(1417) : (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
mers_mod_square.c(1415) : (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
radix16_ditN_cy_dif1.c(556) : (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
radix16_ditN_cy_dif1.c(554) : (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
radix16_ditN_cy_dif1.c(1549) : (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
radix16_ditN_cy_dif1.c(1547) : (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.


# ./Mlucas

    Mlucas 2.8x

    http://hogranch.com//mayer/README.html#2

Itanium, compiled with Intel C compiler Version 900.
INFO: using 64-bit-double form of rounding constant
INFO: Using subroutine form of MUL_LOHI
 looking for nthreads.ini file...
 NTHREADS = 16
 looking for worktodo.ini file...
 worktodo.ini file found...checking next exponent in range...
 p = 34807387
 restart file p34807387 found...reading...
Restarting M34807387 at iteration 6000
Killed


 # idb Mlucas core.8622

Intel Debugger for Itanium -based Applications, Version 9.0-16 , Build 20051202
------------------
object file name: Mlucas
core file name: core.8622
Reading symbolic information
from /home/reixt/MLucas/Mlucas_src/Mlucas...done
Core file produced from executable Mlucas
Initial part of arglist: ./Mlucas
Thread 1 terminated at PC 0x4000000000878362 by signal ABRT
(idb) where
>0  0x4000000000878362 in __kill(...) in /home/reixt/MLucas/Mlucas_src/Mlucas
#1  0x4000000000869ba0 in pthread_kill(thread=163851, signo=6)  "signals.c":69
#2  0x4000000000869c20 in __pthread_raise(sig=163851) "signals.c":200
#3  0x4000000000878190 in raise(sig=163851) "../linuxthreads/sysdeps/unix/sysv/linux/raise.c":32
#4  0x4000000000878d40 in abort() "../sysdeps/generic/abort.c":117
#5  0x4000000000841ab0 in __kmp_do_abort(...) in /home/reixt/MLucas/Mlucas_src/Mlucas
#6  0x4000000000840c30 in __kmp_wait_sleep(...) in /home/reixt/MLucas/Mlucas_src/Mlucas
#7  0x40000000008430d0 in __kmp_linear_barrier_release(...) in /home/reixt/MLucas/Mlucas_src/Mlucas
#8  0x400000000084efc0 in __kmp_fork_barrier(...) in /home/reixt/MLucas/Mlucas_src/Mlucas
#9  0x400000000084f0a0 in __kmp_launch_thread(...) in /home/reixt/MLucas/Mlucas_src/Mlucas
#10 0x40000000008398d0 in __kmp_launch_worker(...) in /home/reixt/MLucas/Mlucas_src/Mlucas
#11 0x4000000000863c70 in pthread_start_thread(arg=0x2800b) "manager.c":257
#12 0x40000000008b10f0 in __clone2(...) in /home/reixt/MLucas/Mlucas_src/Mlucas
getRegFromUnwindContext: Can't get Gr0 from UnwindContext, using 0
3 Replies
TimP
Honored Contributor III
It looks like you have an outdated version of the compiler.
Even if you have a reason for dynamic scheduling, you should try static for comparison.
You are correct: Intel Thread Checker is intended to diagnose source-code-related
problems in parallelization, more efficiently than most of us could do otherwise.
ewmayer
Beginner
'icc -v' on the system I'm using indicates version 9.0 of the compiler -
I was under the impression that was pretty up-to-date as far as OpenMP
support is concerned.

Using static rather than dynamic scheduling in the parallel for loop
still gives a crash during self-tests, but the idb diagnosis has changed:

Intel Debugger for Itanium -based Applications, Version 9.0-14, Build 20051007
------------------
object file name: Mlucas
core file name: core.17232
eidb(17237): unaligned access to 0x200000000486040c, ip=0x4000000001ce3f31
eidb(17237): unaligned access to 0x2000000004860414, ip=0x4000000001ce3f31
eidb(17237): unaligned access to 0x200000000486041c, ip=0x4000000001ce3f31
eidb(17237): unaligned access to 0x2000000004860424, ip=0x4000000001ce3f31
Reading symbolic information from /house/ewm/src/C/IA64_LINUX/Mlucas...done
Core file produced from executable Mlucas
Initial part of arglist: Mlucas -s m
Thread 1 terminated at PC 0x4000000000251f02 by signal ABRT

Message Edited by ewmayer on 01-20-2006 03:48 PM

Henry_G_Intel
Employee

Hello Ernst,

Please try removing the braces around the loop that you want to parallelize, i.e.:

   #pragma omp parallel for default(shared) schedule(dynamic)
    for(i = 0; i < R; i += 2)
    {
	process_chunk(a,i, {bunch of other scalar and array arguments, all read-only});
    }

The braces are unnecessary and could be confusing the compiler. The nowait keyword is also meaningless in this context because you cannot remove the implied barrier on an OpenMP parallel region. The nowait keyword only removes implied barriers on OpenMP worksharing constructs. I also agree with Tim that you should try analyzing your code using Intel Thread Checker. If a race condition is causing the crash, Thread Checker should help you debug it.

Best regards,

Henry

Message Edited by hagabb on 01-26-2006 04:33 PM
