Solved: Looks like that attaching the

Martin1 · ‎12-19-2017

Looking for a rare segfault during deallocation in our software we have used valgrind and have come upon some error messages, which possibly are due to a compiler bug. We are using ifort "Version 18.0.1.163 Build 20171018" with compiler flags
"-extend_source -recursive -assume byterecl -fp-model fast=2 -ftz -rcd -align all -static-intel -xSSE4.2 -qopenmp -qopenmp-link static -real-size 64 -O3 -debug -traceback -warn all".

The error messages from valgrind look like the following
(routine and modules names of our own software have been replaced by short generic names):

    ==30890== Conditional jump or move depends on uninitialised value(s)
    ==30890==    at 0xD23B26: __kmp_external_scalable_free (in path_to_exec)
    ==30890==    by 0xBE4379: for_dealloc_allocatable (in path_to_exec)
    ==23148==    by 0x6A6FA4: abc01_ (A.f90:123)
    ==23148==    by 0x6E8964: abc02_ (B.f90:456)
    ==23148==    by 0x6B0114: abc03_ (C.f90:789)
    ...

    ==23148==  Uninitialised value was created by a stack allocation
    ==23148==    at 0x453330: def_ (D.f90:369)

Compared to the frequency with which abc01 is called, the message is pretty rare. However, we have observed the message at quite a number of very different locations in our software.

The code has been compiled with "-qopenmp", but execution is with just one thread and the messages are easily reproducible.
Unfortunately, we have not been able to create a small test case. The problem occurs deep within an omp parallel region. However, the origin of the value in "def" (reported in the last two lines by valgrind) has nothing to do with the "abcNN" routines. The "def" routine might or might not have been called from within the parallel region.

I have carefully investigated one occurence using gdb and it looks like a bug within the internal allocation/deallocation routines of ifort either not initialising some bit
or accessing the wrong bit. We have a small type

   type hashmap
      integer(4), dimension(:), allocatable :: keys
      integer(4) :: hmod
   end type hashmap

and a local variable "type(hashmap) :: hmap" in "abc02". This variable is initialised, used and disposed in "abc02". Deallocation of hmap%keys itself is done in "abc01". In my case hmap resides at memory address 0x7fffffff8c60 (output from gdb "p &hmap"). The first 72 bytes are indeed the array descriptor data (as documented by intel). The first 8 bytes contain the keys pointer which in my case has value 0x7ffff5425600 (output from gdb "p &(hmap%keys)" as well as "x/9xg"). The other values (as far as documented) have proper values assigned to during initialisation. I have verified that the keys pointer in the hmap record, is the same throughout the routine (and hence not overwritten by some out-of-bounds access or whatever).

The deallocation of hmap%keys, after some branching and jumping around, finally ends up here:

Output from gdb is (last line is the offending conditional jump):

│0xd23ae0 <__kmp_external_scalable_free>                 push   %r12
│0xd23ae2 <__kmp_external_scalable_free+2>               push   %r13
│0xd23ae4 <__kmp_external_scalable_free+4>               push   %r15
│0xd23ae6 <__kmp_external_scalable_free+6>               push   %rbx
│0xd23ae7 <__kmp_external_scalable_free+7>               push   %rbp
│0xd23ae8 <__kmp_external_scalable_free+8>               sub    $0x10,%rsp
│0xd23aec <__kmp_external_scalable_free+12>              mov    %rdi,%rbx
│0xd23aef <__kmp_external_scalable_free+15>              mov    0x5be612(%rip),%r12        # 0x12e2108 <_ZN3rml8internal14defaultMemPoolE>
│0xd23af6 <__kmp_external_scalable_free+22>              test   %r12,%r12
│0xd23af9 <__kmp_external_scalable_free+25>              je     0xd23ca0 <__kmp_external_scalable_free+448>
│0xd23aff <__kmp_external_scalable_free+31>              test   %rbx,%rbx
│0xd23b02 <__kmp_external_scalable_free+34>              je     0xd23ca0 <__kmp_external_scalable_free+448>
│0xd23b08 <__kmp_external_scalable_free+40>              test   $0x3f,%rbx
│0xd23b0f <__kmp_external_scalable_free+47>              jne    0xd23b51 <__kmp_external_scalable_free+113>
│0xd23b11 <__kmp_external_scalable_free+49>              mov    -0x8(%rbx),%rax
│0xd23b15 <__kmp_external_scalable_free+53>              lea    -0x10(%rbx),%rbp
│0xd23b19 <__kmp_external_scalable_free+57>              mov    %rax,(%rsp)
│0xd23b1d <__kmp_external_scalable_free+61>              shr    $0x20,%rax
│0xd23b21 <__kmp_external_scalable_free+65>              movzbl %al,%eax
│0xd23b24 <__kmp_external_scalable_free+68>              test   $0x1,%al
│0xd23b26 <__kmp_external_scalable_free+70>              je     0xd23b51 <__kmp_external_scalable_free+113>

The jump "je" in the last line is the offending conditional jump. If I do not misread the code, it depends on bit 32 of the 64 bit value at -0x8(rbx),
where rbx contains the pointer hmap%keys. As keys is an array of 32 bit integers with lower bound 0, "mov -0x8(%rbx),%rax" at 0xd23b11 moves hmap%keys(-2:-1) into %rax. Of course, (-2,-1) is not within the array range, but nevertheless I can write some random values to hmap%keys(-2:-1) right before the deallocation, and I can see that the same two 32bit values are read into rax at 0xd23b11.

Can anybody confirm that this indeed might be a (serious?) bug within ifort libraries, which can lead to a segfault, failure or other undefined behaviour during deallocation?
Please let me know if any more information is required. As already mentioned, we have not been able to produce a small example producing the same error messages from valgrind.

As a side note, I am also confused that the array hmap%keys looks like to reside on the stack ("Uninitialised value was created by a stack allocation"). It should be on the heap. Or is it because a thread local stack (used within routine "def" which is in a omp parallel region) is allocated on the heap?

Best regards Martin

Janus · ‎01-09-2018

Submitted to the Online Service Center as request #03202444 ...

View solution in original post

jimdempseyatthecove · ‎12-21-2017

From my little (incomplete) understanding on this is the TBB scalable allocator code was appropriated by the Fortran runtime system development team to help implement OpenMP Tasking features, principally for fast heap allocation of private variables. By chance, is hmap a task private entity: !$OMP TASK PRIVATE(hmap)?

If so, consider using: !$OMP TASK FIRSTPRIVATE(hmap)?

Jim Dempsey

Martin1 · ‎12-21-2017

Thanks for the answer. Unfortunately, we are not yet using tasks. This particular case is within an "omp parallel" directive implemented using solely openmp 2.5 directives. Moreover, between the routine containing the "omp parallel" and routine abc02, where hmap is a local variable, there are more than 10 routine calls in various modules in between.

Meanwhile we have also compiled the code without -qopenmp and the problem disappears (but then, the kmp prefix points to an openmp problem, so this is not surprising). We have also tested with ifort10, ifort14 and some recent gfortran version, none of which show any

Is there any documentation on the allocation routines (in particular the meaning of the 64bits just prior to the memory location the array base pointer points to), which could help understand me, why these bits are not initialised? Knowing for what they are intended and where I should expect them to be initialised could possibly help to step through the allocator and see where things go wrong.

I might also mention that right at the start we see this valgrind error message, which is probably unrelated, but maybe not (at least the allocate/deallocate routines are also querying threadId via some kmp calls as far as I remember):

==23485== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==23485==    at 0x56C39F9: syscall (syscall.S:38)
==23485==    by 0xD12987: __kmp_affinity_determine_capable (in path_to_executable)
==23485==    by 0xCD9EA0: __kmp_env_initialize(char const*) (in path_to_executable)
==23485==    by 0xCC5D12: __kmp_middle_initialize (in path_to_executable)
==23485==    by 0xCB3F14: omp_get_max_threads (in path_to_executable)
...
==23485==    by 0x40577D: main (in path_to_executable)
==23485==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

(I should mention that omp_get_max_threads nevertheless returns the correct value.)

jimdempseyatthecove · ‎12-22-2017

>>Is there any documentation on the allocation routines (in particular the meaning of the 64bits just prior to the memory location the array base pointer points to),

Allocations typically include a hidden node header that at least holds a size but can hold additional information. For example the call of free(somePointer); does not contain the size of the free. The size of the node to be returned is contained in hidden location(s) in the node header.

I am somewhat concerned about: Syscall param sched_setaffinity(mask) points to unaddressable byte(s)

The call should be sched_setaffinity(mask*), not the mask itself. The Intel compiler (from my understanding) supplies the OpenMP shared library, but does not supply the pthread shared library (nor numa library). I suggest you check to see if you have the correct pthread shared library installed.

Jim Dempsey

Martin1 · ‎01-01-2018

I have looked more closely into the syscall matter, and I found a hint that this is a false positive from valgrind and probably has nothing to do with the deallocation problem. From what I can see there is a sched_setaffinity routine in the pthread library, but this is not touched here. The backtrace does not contain any pthread stuff, but only kmp routines from the openmp compiler runtime routines. In fact __kmp_affinity_determine_capable can also be found in the llvm/openmp repository (code was contributed by intel as far as I know?) and this directly makes the syscall. Please correct me if I am wrong.

Martin1 · ‎01-02-2018

Finally I have been able to produce a small example which shows the valgrind warning. However, it does not crash. Nevertheless, having an uninitialised value at this point in <__kmp_external_scalable_free+70> is rather disconcerting and does not look like a false positive from valgrind. It would be helpful to know whether this can be the cause of our rare and random segfaults at deallocation or if it is a cold lead.

The attached code has been compiled with

 ifort -recursive -xSSE2 -qopenmp -qopenmp-link static -O0 test_dealloc.f90 -o test_dealloc

and executed with

 valgrind --tool=memcheck ./test_dealloc

using just one thread (export OMP_NUM_THREADS=1)!

(Adding --track-origin=yes to valgrind gives some cryptic and seemingly random information about the origin of the uninitialised value.)

Using -xAVX or some other instruction set instead of -xSSE2 has no influence. Also, using -O3 does not change the behaviour. Omitting "-qopenmp-link static" changes the backtrace output slightly (instead of __kmp_scalable_free the offending condition jump appears in scalable_free (frontend.cpp:2868)).

For the code I have used a simple random number generator (I could not produce the problem with the randomand seed routines, but I did not try hard). Also with fixed rather than random sizes I have not been able to produce the problem, but did not try much. Furthermore if the depth in the first step is reduced from 100 to 10 or in the second step from 1000 to 100 then the warning disappears.

The complete output from the program as well as valgrind is

==1315== Memcheck, a memory error detector
==1315== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1315== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==1315== Command: ./test_dealloc
==1315== 
==1315== Conditional jump or move depends on uninitialised value(s)
==1315==    at 0x57884D: __intel_sse2_strcpy (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x456221: for__open_proc (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x442921: for__open_default (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x416BAD: for_write_seq_lis (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x40ED79: MAIN__ (test_dealloc.f90:99)
==1315==    by 0x40E6CD: main (in /home/mstein/tmp/ibug/test_dealloc)
==1315== 
 first step
==1315== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==1315==    at 0x5437C59: syscall (in /lib64/libc-2.18.so)
==1315==    by 0x4F13E2: __kmp_affinity_determine_capable (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x4BA0B9: __kmp_env_initialize(char const*) (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x4A9ABD: _INTERNAL_25_______src_kmp_runtime_cpp_148a49fc::__kmp_do_serial_initialize() (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x49F444: __kmp_get_global_thread_id_reg (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x40ED90: MAIN__ (test_dealloc.f90:101)
==1315==    by 0x40E6CD: main (in /home/mstein/tmp/ibug/test_dealloc)
==1315==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1315== 
 second step
==1315== Conditional jump or move depends on uninitialised value(s)
==1315==    at 0x503E46: __kmp_external_scalable_free (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x415E79: for_dealloc_allocatable (in /home/mstein/tmp/ibug/test_dealloc)
==1315==    by 0x40EC88: aux_mp_snd_alloc_ (test_dealloc.f90:79)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315==    by 0x40EBF8: aux_mp_snd_alloc_ (test_dealloc.f90:77)
==1315== 
==1315== 
==1315== HEAP SUMMARY:
==1315==     in use at exit: 32 bytes in 1 blocks
==1315==   total heap usage: 16 allocs, 15 frees, 16,745 bytes allocated
==1315== 
==1315== LEAK SUMMARY:
==1315==    definitely lost: 0 bytes in 0 blocks
==1315==    indirectly lost: 0 bytes in 0 blocks
==1315==      possibly lost: 0 bytes in 0 blocks
==1315==    still reachable: 32 bytes in 1 blocks
==1315==         suppressed: 0 bytes in 0 blocks
==1315== Rerun with --leak-check=full to see details of leaked memory
==1315== 
==1315== For counts of detected and suppressed errors, rerun with: -v
==1315== Use --track-origins=yes to see where uninitialised values come from
==1315== ERROR SUMMARY: 6 errors from 3 contexts (suppressed: 0 from 0)

Martin1 · ‎01-02-2018

Looks like that attaching the f90 file went wrong. I just include it as code:

module aux

implicit none

public first_uninit
public snd_alloc



contains


   subroutine randInt(rr, rmax, seed)
! routine arguments
      integer(4), intent(out) &
            :: rr
      integer(4), intent(in) &
            :: rmax
      integer(8), intent(inout) &
            :: seed
! ### code ###
      integer(8), parameter &
            :: a1 =  21_8, &
               a2 = -35_8, &
               a3 =   4_8

      if (seed /= 0_8) then
         seed = ieor(seed, ishft(seed, a1))
         seed = ieor(seed, ishft(seed, a2))
         seed = ieor(seed, ishft(seed, a3))
      else
         seed = 1_8
      end if

      ! random integer number with 1 <= rr <= rmax
      rr = 1 + int(modulo(seed, int(rmax, 8)), 4)
   end subroutine randInt




   subroutine first_uninit(depth, seed)
      integer(4), intent(in) &
            :: depth
      integer(8), intent(inout) &
            :: seed

      integer(4), dimension(:), pointer &
            :: a
      integer(4) &
            :: N, d

      call randInt(N, 4000, seed)
      d = depth - 1
      allocate(a(1:N))
      if (d > 0) then
         call first_uninit(d, seed)
      end if
      deallocate(a)
   end subroutine first_uninit


   subroutine snd_alloc(depth, seed)
      integer(4), intent(in) &
            :: depth
      integer(8), intent(inout) &
            :: seed

      integer(4), dimension(:), pointer &
            :: a
      integer(4) &
            :: N, d

      call randInt(N, 16, seed)
      allocate(a(1:N))
      d = depth - 1
      if (d > 0) then
         call snd_alloc(d, seed)
      end if
      deallocate(a)
   end subroutine snd_alloc


end module aux





program test_dealloc

use aux

implicit none


   integer(8) :: seed
   integer(4) :: depth

   write(*,*) 'first step'
   depth = 100
!$omp parallel default(shared) private(seed)
   seed = 1_8
   call first_uninit(depth, seed)
!$omp end parallel


   write(*,*) 'second step'
   depth = 1000
!$omp parallel default(shared) private(seed)
   seed = 1_8
   call snd_alloc(depth, seed)
!$omp end parallel


end program test_dealloc

Martin1 · ‎01-02-2018

Just in case the warning cannot be reproduced with the attached program: We tested it on a different machine (different Linux distro, different processor), and it was necessary to increase the two depth counters (each by a factor of 10) to reproduce the warning.

Janus · ‎01-03-2018

Here is a somewhat reduced case, which should do essentially the same as Martin's example (with slightly different parameters).

program test_dealloc

   implicit none
   integer, parameter :: seed(1:2) = 1
   call random_seed(put = seed)

   write(*,*) 'first step'
!$omp parallel
   call alloc_dealloc(100, 4000)
!$omp end parallel


   write(*,*) 'second step'
!$omp parallel
   call alloc_dealloc(10000, 16)
!$omp end parallel

contains

   recursive subroutine alloc_dealloc(depth, maxsz)
      integer, intent(in) :: depth, maxsz

      integer, dimension(:), allocatable :: a
      integer :: N
      real :: r

      call random_number(r)
      N = r*maxsz

      allocate(a(1:N))
      if (depth > 1) &
         call alloc_dealloc(depth - 1, maxsz)
      deallocate(a)
   end subroutine

end

I can reproduce the valgrind error in scalable_free with ifort versions 18, 17 and 16, compiling just with "ifort -qopenmp".

Janus · ‎01-04-2018

I do see the bug with ifort version 16.0.4.20160811 (and later versions, i.e. 17.x and 18.x). I don't see it with 15.0.7.20160518 (and earlier versions, e.g. 14.x).

Janus · ‎01-04-2018

Here is another demonstrator, which is a variant of the previous one, but might be better suited for reproduction:

program test_dealloc

   implicit none
   integer :: i, seed(1:2) = (/ 1704024938 , 19740874 /)
   call random_seed(put = seed)

!$omp parallel do
   do i=1,10
      call random_seed(get = seed)
      print *, i, seed
      call alloc_dealloc(  6, 1024)
      call alloc_dealloc(200,  128)
   end do
!$omp end parallel do

contains

   recursive subroutine alloc_dealloc(depth, maxsz)
      integer, intent(in) :: depth, maxsz

      integer, dimension(:), allocatable :: a
      integer :: N
      real :: r

      call random_number(r)
      N = max(nint(r*maxsz), 1)

      allocate(a(1:N))
      if (depth > 1) &
         call alloc_dealloc(depth - 1, maxsz)
      deallocate(a)
   end subroutine

end

It employs random numbers, but due to the fixed seed it is completely deterministic. To reproduce, do the following:

$ export OMP_NUM_THREADS=1
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.1.163 Build 20171018
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.

$ ifort -qopenmp -g -O0 test.f90 
$ valgrind ./a.out 
==8593== Memcheck, a memory error detector
==8593== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==8593== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==8593== Command: ./a.out
==8593== 
==8593== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==8593==    at 0x581A499: syscall (syscall.S:38)
==8593==    by 0x5211F07: __kmp_affinity_determine_capable (z_Linux_util.cpp:185)
==8593==    by 0x51F2860: __kmp_env_initialize(char const*) (kmp_settings.cpp:5463)
==8593==    by 0x51DDB2A: _INTERNAL_25_______src_kmp_runtime_cpp_6bf59e22::__kmp_do_serial_initialize() (kmp_runtime.cpp:6804)
==8593==    by 0x51D29DF: __kmp_get_global_thread_id_reg (kmp_runtime.cpp:290)
==8593==    by 0x40311F: MAIN__ (test.f90:7)
==8593==    by 0x4030AD: main (in /home/janus/fort/ifort_bugs/dealloc/a.out)
==8593==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==8593== 
==8593== Conditional jump or move depends on uninitialised value(s)
==8593==    at 0x486FCD: __intel_sse2_strcpy (in /home/janus/fort/ifort_bugs/dealloc/a.out)
==8593==    by 0x44B241: for__open_proc (in /home/janus/fort/ifort_bugs/dealloc/a.out)
==8593==    by 0x437941: for__open_default (in /home/janus/fort/ifort_bugs/dealloc/a.out)
==8593==    by 0x40B3BD: for_write_seq_lis (in /home/janus/fort/ifort_bugs/dealloc/a.out)
==8593==    by 0x40343A: MAIN__ (test.f90:10)
==8593==    by 0x52087A2: __kmp_invoke_microtask (in /Intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64_lin/libiomp5.so)
==8593==    by 0x51D69C6: __kmp_fork_call (kmp_runtime.cpp:2041)
==8593==    by 0x519FABB: __kmpc_fork_call (kmp_csupport.cpp:342)
==8593==    by 0x4031DA: MAIN__ (test.f90:7)
==8593==    by 0x4030AD: main (in /home/janus/fort/ifort_bugs/dealloc/a.out)
==8593== 
           1  1704024938    19740874
           2  1047937221  1153092462
           3   527593105  1792050724
==8593== Conditional jump or move depends on uninitialised value(s)
==8593==    at 0x524BF76: scalable_free (frontend.cpp:2868)
==8593==    by 0x40A689: for_dealloc_allocatable (in /home/janus/fort/ifort_bugs/dealloc/a.out)
==8593==    by 0x403784: test_dealloc_IP_alloc_dealloc_ (test.f90:31)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593==    by 0x4036F4: test_dealloc_IP_alloc_dealloc_ (test.f90:30)
==8593== 
           4   806825839  1365261861
           5   427203932  1620878896
           6   740380569   545871917
           7   509122826   688125236
           8  1682026429   923678342
           9   555746241  1248718469
          10   468452243   127852858

The first two valgrind messages look strange, but the actual thing I'm worrying about is the third one, namely the "Conditional jump or move depends on uninitialised value" that occurs in the third iteration of the loop. Although this bug seems a bit elusive, it always happens at exactly the same place, both with ifort version 18.0.1.163 and 17.0.5.239, and independent of optimization level. This is on Ubuntu 16.04 with a Westmere CPU (Intel Core i7 990X).

It would be great to get some feedback on whether other people are able to reproduce this and whether it actually is an ifort bug.

Best regards,

Janus

Steve_Lionel · ‎01-04-2018

My advice is to submit this through the Online Service Center, providing your examples.

Janus · ‎01-05-2018

Steve Lionel (Ret.) wrote:

My advice is to submit this through the Online Service Center, providing your examples.

Thanks for the comment, Steve. Will do.

While I could consistently reproduce my previous example on a single machine, I tried to reproduce it also on other machines, which worked for some of them, but not all. So here I'm posting yet another variant of the example, which shows the valgrind error in scalable_free with ifort 16 on all three machines I tried (running Ubuntu or OpenSUSE on Westmere or Haswell arch), but with ifort 18 only on two of them:

program test_dealloc

   implicit none
   integer :: i, seed(1:2) = 1
   call random_seed(put = seed)

!$omp parallel do
   do i=1,60
      call random_seed(get = seed)
      print *, i, seed
      call alloc_dealloc(40000, 8192)
      call alloc_dealloc(40000,    1)
   end do
!$omp end parallel do

contains

   recursive subroutine alloc_dealloc(depth, maxsz)
      integer, intent(in) :: depth, maxsz

      integer, dimension(:), allocatable :: a
      integer :: N
      real :: r

      call random_number(r)
      N = max(nint(r*maxsz), 1)

      allocate(a(1:N))
      if (depth > 1) &
         call alloc_dealloc(depth - 1, maxsz)
      deallocate(a)
   end subroutine

end

As before, compile with "ifort -qopenmp", set "OMP_NUM_THREADS=1" and run via "valgrind ./a.out". If anyone reading this can reproduce the problem, please let me know!

Cheers,

Janus

Janus · ‎01-09-2018

Submitted to the Online Service Center as request #03202444 ...

Martin1 · ‎04-30-2018

So far, we have not got any help with this issue through the service centre, as reproducibility with small test cases seems to be rather difficult.

Testing with the ifort19 beta release I can still see the problems. However, this time I found out that scalable_free in frontend.cpp is from the tbbmalloc subsystem, which is an open source library, so I can actually study the source code of the mempool management. Now, it would be even more helpful if I can link my own (and possibly slightly modified) copy of the TBB library with ifort to track down the problem. Is this possible? And what command line options do I need for that?

At least this now looks like it might be a problem within the TBB library.

Martin1 · ‎04-30-2018

Yet another question as line numbers of routines in src/tbbmalloc/frontend.cpp from the public repository do not exactly match up with the debug information embedded in the executable compiled with ifort19_beta: which version of tbb is used by ifort19_beta?

Steve_Lionel · ‎04-30-2018

I can't help with your original issue, but I will note that the open-source version of TBB is not identical to the product version. Exactly how it differs, I don't know. My guess is that it won't be feasible to substitute your own version for the one called by the run-time library.

Martin1 · ‎04-30-2018

Thanks for the answer. Reading the open source version and matching it with the disassemble output from gdb was very helpful, even without being able to change the tbb code and link against. I am not yet finished with a testcase, but so far I can allocate two arrays, such that the second array overlaps with what the tbbmalloc module considers as LargeObjectHdr of the first array (both of which are not large objects in the sense of tbbmalloc). So essentially I can write data (in a valid fortran fashion) into the second array and influence the isLargeObject check used in the dealloc routine for the first array. Only the number of tests (4 or 5) in isLargeObject makes it very improbable, that the test gives a false positive, declaring a small object to be a large object. Considering that for the large amount of allocates and deallocates for a code, which runs for hours, it seems feasible that in rare cases we indeed make an unlucky hit and have a segfault.

Anyway, it seems insane to use data the allocator has no control over in such a way and fight the problems by piling checks upon checks. I will see that I can finish my test case and rig the isLargeObject function to provoke a SEGFAULT reliably.

Martin1 · ‎05-02-2018

After further studying the tbb source code with some fortran code pieces I think I can wrap up the original question. A conditional jump depending on uninitialised (or more generally on random) data is indeed done willfully. It is well guarded and should work in general (as it indeed does). Nevertheless it leaves a bad taste, in particular as it is not easy by looking at the code to verify that this supposed header data object cannot be (accidentially or however) manipulated to provoke a crash. Consider the following piece of code

! assume that tbb works with cacheline size of 64 bytes
integer(8), dimension(:), pointer :: x, y

! allocate to "small objects" (in the sense of tbb)
allocate(x(1:8))
allocate(y(1:8))

! now loc(x) - loc(y) == 64 in most cases, i.e. x and y are located adjacently in memory

! write LargeObjectHdr data for x (suggesting that x is a large object)
y(7) = suitable_value1
y(8) = suitable_value2

! this deallocate read y(7) and y(8) and decides whether x is a small or large object
! if it thinks that x is a large object, then we got a SIGSEGV
deallocate(x)

How does tbb decides whether x is large or small? It uses isLargeObject() function from src/tbbmalloc/frontend.cpp:

bool isLargeObject(void *object)
{
    if (!isAligned(object, largeObjectAlignment))
        return false;
    LargeObjectHdr *header = (LargeObjectHdr*)object - 1;
    BackRefIdx idx = memOrigin==unknownMem? safer_dereference(&header->backRefIdx) :
        header->backRefIdx;

    return idx.isLargeObject()
        // in valid LargeObjectHdr memoryBlock is not NULL
        && header->memoryBlock
        // in valid LargeObjectHdr memoryBlock points somewhere before header
        // TODO: more strict check
        && (uintptr_t)header->memoryBlock < (uintptr_t)header
        && getBackRef(idx) == header;
}

The last return statement contains a list of checks, but only the very last check is really relevant (the other ones can easily be rigged, and are there possibly only for performance reasons). The backRef check looks up a pointer in a backRefBl attribute of a global backRefMaster object. If these backRef pointers are managed properly, then the check (even if uninitialised or manipulated data is used), should work fine. Please note that whoever wrote this code did not seem to be entirely sure her/himself (TODO comment).

Playing around with small and large object I am not entirely convinced that the backRefs are indeed managed properly. In fact, if a large object is deallocated, then the backRef entry in the backRefMaster table is not nulled (or made invalid). Moreover the LargeObjectHdr data of the now freed large object is not nulled as well. So I only need to place a small object onto the memory location of a large object, and a subsequent deallocate of the small object might pass the isLargeObject test and then lead to a SIGSEGV. I have not studied the tbb code sufficiently to decide whether this scenario is possible. With sufficiently manly allocates I am able to find a small object placed onto the previous location of a large object, but at this point, the backRef entry for the large object has been changed somewhere in between, so the isLargeObject test fails as it should. But this still leaves a bad feeling.

We will try to clarify (through the support channels) whether the tbb code might contain a bug here. In particular, with these hairy header checks and usage of random data it might be a good idea to clear the backRef entry right upon deallocation to avoid any possibility of a bug. Also (if possible) adding annotations for valgrind (which is a widely used tool after alll) to avoid these warnings would be nice as well.

Alexey-Kukanov · ‎07-12-2018

Hi Martin,

Playing around with small and large object I am not entirely convinced that the backRefs are indeed managed properly. In fact, if a large object is deallocated, then the backRef entry in the backRefMaster table is not nulled (or made invalid). Moreover the LargeObjectHdr data of the now freed large object is not nulled as well. So I only need to place a small object onto the memory location of a large object, and a subsequent deallocate of the small object might pass the isLargeObject test and then lead to a SIGSEGV. I have not studied the tbb code sufficiently to decide whether this scenario is possible. With sufficiently manly allocates I am able to find a small object placed onto the previous location of a large object, but at this point, the backRef entry for the large object has been changed somewhere in between, so the isLargeObject test fails as it should. But this still leaves a bad feeling.

Let me ensure you that everything is handled correctly. Particularly, the back reference is cleared when the large object is evicted from the shared large object cache and returned to the allocation backend, in the Backend::returnLargeObject() procedure. Before that point, the space of a large object cannot be reused for small object slabs.

We will consider the idea of adding annotations for valgrind.

In case of more questions, I suggest to move the discussion to the TBB forum at https://software.intel.com/en-us/forums/intel-threading-building-blocks.

Martin1 · ‎07-12-2018

@Alexey: Thanks for checking. Annotations would be really nice to remove false positives. Otherwise, our software works with ifort19 beta without any strange segfaults. Not sure what the original problem was, either some subtle bug with ifort17/18 or with our software...

valgrind: "conditional jump depends on uninitialised value" in __kmp_external_scalable_free