I've been hunting a very elusive bug for weeks. Finally, I decided to go down to the assembly level to track it down. I am not allowed to share or post the code, but I am quite puzzled by the assembly code. To simplify, the subroutine looks like this:
    subroutine anonymized(this, k)
      implicit none
      class(my_type), intent(inout) :: this
      integer, intent(in) :: k
      real(8) :: aux
      integer :: i1, i2

      aux = this%something
      ...
      do i1 = 1, this%n
        do i2 = 1, this%m
          if (this%value(i1) < 1.0e-10) then
            ...
and the code crashes at the first comparison of this%value(i1). The crash is only observable with some flags such as -O2 -heap-arrays 0. If I try to print the value of this%value(i1) just before it is used, the code runs fine to completion and the bug disappears. Sometimes, when I change the code that comes *after* this part, the bug disappears too. It just drives me crazy.
So I had a look at the assembly code. The beginning of this code is given here.
Dump of assembler code for function __anonymized:
=> 0x0000000000522970 <+0>:   push   %rbp
   0x0000000000522971 <+1>:   mov    %rsp,%rbp
   0x0000000000522974 <+4>:   push   %r12
   0x0000000000522976 <+6>:   push   %r13
   0x0000000000522978 <+8>:   push   %r14
   0x000000000052297a <+10>:  push   %r15
   0x000000000052297c <+12>:  push   %rbx
   0x000000000052297d <+13>:  sub    $0x148,%rsp
   0x0000000000522984 <+20>:  mov    (%rdi),%rbx
   0x0000000000522987 <+23>:  mov    %rsi,-0x80(%rbp)
   0x000000000052298b <+27>:  mov    %rdi,-0x78(%rbp)
   0x000000000052298f <+31>:  mov    0x79c58(%rbx),%rdx
   0x0000000000522996 <+38>:  neg    %rdx
   0x0000000000522999 <+41>:  movslq 0x7a6a8(%rbx),%rcx
   0x00000000005229a0 <+48>:  add    %rcx,%rdx
   0x00000000005229a3 <+51>:  mov    0x79c18(%rbx),%rax
   0x00000000005229aa <+58>:  movsd  0x7a688(%rbx),%xmm0
   0x00000000005229b2 <+66>:  mov    0x79ba0(%rbx),%r8d
   0x00000000005229b9 <+73>:  mov    %rcx,-0x88(%rbp)
   0x00000000005229c0 <+80>:  mulsd  (%rax,%rdx,8),%xmm0
   0x00000000005229c5 <+85>:  mov    %r8d,-0x48(%rbp)
   0x00000000005229c9 <+89>:  mov    0x79bc0(%rbx),%ecx
   0x00000000005229cf <+95>:  test   %r8d,%r8d
   0x00000000005229d2 <+98>:  jle    0x527bcf <__anonymized+21087>
   0x00000000005229d8 <+104>: mov    %ecx,%r13d
   0x00000000005229db <+107>: xor    %r12d,%r12d
   0x00000000005229de <+110>: and    $0xfffffff8,%r13d
   0x00000000005229e2 <+114>: pxor   %xmm2,%xmm2
   0x00000000005229e6 <+118>: movslq -0x48(%rbp),%rax
   0x00000000005229ea <+122>: pxor   %xmm3,%xmm3
   0x00000000005229ee <+126>: movslq %r13d,%r10
   0x00000000005229f1 <+129>: movslq %ecx,%rdx
   0x00000000005229f4 <+132>: movsd  0x13bb9c(%rip),%xmm1        # 0x65e598
   0x00000000005229fc <+140>: mov    %rax,-0x40(%rbp)
   0x0000000000522a00 <+144>: mov    %r10,-0x160(%rbp)
   0x0000000000522a07 <+151>: mov    %r13d,-0x168(%rbp)
   0x0000000000522a0e <+158>: mov    %ecx,-0x30(%rbp)
   0x0000000000522a11 <+161>: cmpl   $0x0,-0x30(%rbp)
   0x0000000000522a15 <+165>: jle    0x522d10 <__anonymized+928>
   0x0000000000522a1b <+171>: neg    %r11
   0x0000000000522a1e <+174>: add    %r12,%r11
   0x0000000000522a21 <+177>: mov    0x79fd8(%rbx),%rdi
   0x0000000000522a28 <+184>: mov    0x7a018(%rbx),%r8
   0x0000000000522a2f <+191>: mov    0x7a260(%rbx),%rsi
   0x0000000000522a36 <+198>: comisd 0x8(%rdi,%r11,8),%xmm1
The code crashes on the comisd instruction. It seems that neither jle is taken (I am a beginner at assembly code). On the comisd line, 0x8(%rdi,%r11,8) is obviously trying to access the array at index r11. I have checked %rdi, which contains the right address. But what is surprising is that r11 is set to 140737488332700 at the beginning of the function and is only touched by the neg at 0x0000000000522a1b. So it looks to me as if the register %r11 is never initialized.
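To see why that operand faults, the address arithmetic can be replayed by hand. The following is only a rough sketch in Python (not part of the original program): it models the neg/add pair and the AT&T operand 0x8(%rdi,%r11,8) with 64-bit wraparound. The value of rdi is a made-up placeholder, since the real array base is not given here, and the helper names effective_address and to_signed are illustrative.

```python
MASK64 = (1 << 64) - 1  # 64-bit registers wrap modulo 2**64


def effective_address(base, index, scale, disp):
    """Address referenced by the AT&T operand disp(base,index,scale)."""
    return (base + index * scale + disp) & MASK64


def to_signed(value):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    return value - (1 << 64) if value >= (1 << 63) else value


r11_on_entry = 140737488332700    # value observed at function entry
r12 = 0                           # xor %r12d,%r12d at <+107>

r11 = (-r11_on_entry) & MASK64    # neg %r11     at <+171>
r11 = (r11 + r12) & MASK64        # add %r12,%r11 at <+174>

rdi = 0x0000000000800000          # ASSUMPTION: placeholder array base address

addr = effective_address(rdi, r11, 8, 0x8)   # comisd operand at <+198>
offset = to_signed((addr - rdi) & MASK64)    # signed distance from the array base

print(hex(r11_on_entry))   # 0x7fffffffa79c
print(offset)              # -1125899906661592
```

In hex, the entry value of r11 is 0x7fffffffa79c, which resembles a leftover stack address rather than an array index (that reading is a guess). After the neg, the load lands roughly 1 PiB below the array base whatever rdi actually holds, which would be consistent with the segfault on comisd.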
What do you think of that?
Have you checked your code with all of the compiler's check options turned on, i.e. with -check=all?
If you open a support ticket with Intel Support, you can send them the code confidentially.
>>The crash is only observable with some flags such as -O2...
>>But what is surprising, is that r11 is set to (edit: has value of) 140737488332700 at the beginning of the function...
Correct. r11 is not initialized (set) within the function.
This looks like a bug in the compiler. What version of the compiler are you using?
I suspect r11 should have been ecx (rcx), the count remaining, to be subtracted (using neg, add) from total count in r12 to produce the index.
Also, the assembly code does not look O2-optimized?!? (or it is poorly optimized)
Hi, and thanks Jim for your hint.
I believe that I have found a bug in the compiler. Here is the code to reproduce it. First, main.f90:
    program main
      use my_module
      implicit none
      type(my_type) :: obj
      integer :: i0, i1

      obj%n = 1
      allocate(obj%x1(obj%n))
      allocate(obj%x2(obj%n, obj%n))
      do i0 = 1, obj%n
        obj%x1(i0) = 1.0
        do i1 = 1, obj%n
          obj%x2(i0, i1) = 1.0
        end do
      end do
      call obj%f()
    end program main
Now, module.f90:
    module my_module
      implicit none

      type, public :: my_type
        integer, public :: n
        real(8), allocatable, public :: x1(:)
        real(8), allocatable :: x2(:,:)
      contains
        procedure :: f
      end type my_type

    contains

      subroutine f(this)
        implicit none
        class(my_type), intent(inout) :: this
        integer :: k0, k1

        do k0 = 1, this%n
          do k1 = 1, this%n
            if (this%x1(k0) < 1.0e-10) then
              this%x2(k0, k1) = 0.0
            else
              this%x2(k0, k1) = this%x2(k0, k1) / this%x1(k0)
            endif
          enddo
        enddo
        return
      end subroutine f

    end module my_module
When the code is compiled with ifort 126.96.36.199 on Linux, with
ifort -g -O2 module.f90 main.f90 -o main
the program segfaults when running.
Can anyone reproduce the bug on their machine?
This sure looks like a code generator/optimizer bug to me. I can reproduce it on Windows with just default (O2) optimization.
As best as I can tell, when optimized the compiler is getting confused as to which register to use for the passed-object THIS argument. I encourage you to report this to Intel via http://software.intel.com/sites/support/ (Click Priority Support).
Looking at the assembly code, it feels like advanced vectorization instructions may be involved in the issue.
Try disabling vectorization with -no-vec, or specifying a different architecture with -arch or a similar compiler option.
Compilers are becoming more and more vectorization-aware, but at the same time they are losing reliability in very simple circumstances.