Hi Steve and Jim,

Diego_Caballero · ‎11-27-2014

Hi,

consider the following code

SUBROUTINE FOO(lower, upper, a1, a2, a3)
    IMPLICIT NONE
    INTEGER(4) :: lower, upper, a1, a2, a3
    INTEGER(4) :: idx

    idx = lower
    DO WHILE (idx <= upper)
      SELECT CASE (idx)
        CASE (0)
          a1 = a1 + 2
        CASE (1)
          a2 = a2 + 4
        CASE (2)
          a3 = a3 + 6
      END SELECT
      idx = idx + 1
    END DO
END SUBROUTINE FOO

compiled with intel fortran 14.0.2 or 15.0.0 using the following command:

ifort -c -fopenmp -O2 test.f90

the compiler emits an extra store on the location of the varable a3. We can confirm this upon examination of the assembly code.

ifort -S -fopenmp -O2 test.f90

this generates a

test.s

which contains the following assembly code

foo_:
# parameter 1: %rdi
# parameter 2: %rsi
# parameter 3: %rdx
# parameter 4: %rcx
# parameter 5: %r8
..B1.1:                         # Preds ..B1.0
        movslq    (%rdi), %r11                                  #7.5
        movslq    (%rsi), %rax                                  #8.5
        xorl      %esi, %esi                                    #8.5
        cmpq      %rax, %r11                                    #8.19
        jg        ..B1.10       # Prob 9%                       #8.19
                                # LOE rax rdx rcx rbx rbp rsi r8 r11 r12 r13 r14 r15
..B1.2:                         # Preds ..B1.1
        subq      %r11, %rax                                    #8.5
        movl      (%r8), %edi                                   #15.11
        incq      %rax                                          #8.5
        movl      (%rcx), %r9d                                  #13.11
        movl      (%rdx), %r10d                                 #11.11
        movq      %rbx, -24(%rsp)                               #8.5
        movq      %rbp, -16(%rsp)                               #8.5
                                # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d
..B1.3:                         # Preds ..B1.8 ..B1.2
        movq      %r11, %rbx                                    #17.7
        addq      %rsi, %rbx                                    #17.7
        je        ..B1.7        # Prob 25%                      #9.20
                                # LOE rax rdx rcx rbx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d
..B1.4:                         # Preds ..B1.3
        cmpq      $1, %rbx                                      #9.20
        jne       ..B1.6        # Prob 66%                      #9.20
                                # LOE rax rdx rcx rbx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d
..B1.5:                         # Preds ..B1.4
        addl      $4, %r9d                                      #13.11
        movl      %r9d, (%rcx)                                  #13.11
        jmp       ..B1.8        # Prob 100%                     #13.11
                                # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d
..B1.6:                         # Preds ..B1.4
        cmpq      $2, %rbx                                      #15.11
        lea       6(%rdi), %ebp                                 #15.11
        cmove     %ebp, %edi                                   #15.11 #### <---- ISSUE HERE
        jmp       ..B1.8        # Prob 100%                     #15.11
                                # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d
..B1.7:                         # Preds ..B1.3
        addl      $2, %r10d                                     #11.11
        movl      %r10d, (%rdx)                                 #11.11
                                # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d
..B1.8:                         # Preds ..B1.7 ..B1.5 ..B1.6
        incq      %rsi                                          #8.5
        cmpq      %rax, %rsi                                    #8.5
        jb        ..B1.3        # Prob 82%                      #8.5
                                # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d
..B1.9:                         # Preds ..B1.8
        movq      -24(%rsp), %rbx                               #
        movq      -16(%rsp), %rbp                               #
        movl      %edi, (%r8)                                   #15.11 #### <---- ISSUE HERE
                                # LOE rbx rbp r12 r13 r14 r15
..B1.10:                        # Preds ..B1.1 ..B1.9
        ret                                                     #19.1

As you can see in block B1.6 (line 40 above) there is a conditional move (cmove) which implements the branch for the CASE(2) in the Fortran code. This conditional move cannot directly store a 32-bit value to a 64-bit address so it keeps in another register instead. This temporary register is finally stored in memory at the end of the function, in block B1.9 (line 55 above). Unfortunately this store is always executed when the loop is non-empty

When this code is run in a multiple thread environment, this extra store causes a data-race: imagine that we have 3 threads and each thread executes one different branch. If the thread that runs CASE(2) is not the last, threads executing CASE(0) and CASE(1) will overwrite the value of A3.

It is interesting to note that, changing the loop above from

DO WHILE( ... )

to

DO idx = lower, upper

does not expose this issue. Also, this problem does not happen if we use -O1 rather than -O2. Flag -common-args seems also to avoid this problem (although our code does not expose aliasing in the dummy arguments), I guess this is by chance.

Given that Fortran does not make any assumption on the memory model, the emitted code is probably legal (as it is from a single-thread point of view), is there a way to ensure that Intel Fortran does not make this sort of extra stores in memory?

Kind regards

Steven_L_Intel1 · ‎11-27-2014

a1, a2 and a3 are dummy arguments, so the language assumes that there are no other references to the associated memory during execution of the subroutine. If this is a threaded application, you should at least add RECURSIVE to the subroutine declaration to make sure that no static storage is used. (If you are using OpenMP, this is implied but good practice anyway.)

If the storage is potentially referenced by other threads, add VOLATILE to let the compiler know. But perhaps you need to use synchronization to protect the storage. You can't guarantee that the compiler will generate single instructions and even if it does, you can't guarantee that these are atomic in operation.

jimdempseyatthecove · ‎11-27-2014

What are the values of lower and upper on the calls from each of your threads?

The cmove is not an issue, this is register to register.

You have 3 movs that write to memory. Those to (%rdx), (%rcx) and (%r8).

If your lower==upper for each thread, and each thread has different values for lower/upper (but lower==upper) then you should not have a race condition. You may have cache line evictions. When lower==upper, then either only one write will occur or no writes will occur.

Jim Dempsey

jimdempseyatthecove · ‎11-27-2014

In taking a re-look at the disassembly it is wrong.

Whenever lower==upper==0, or lower==upper==1, or lower==0 upper==1, the code cannot exit to ..B.10 without passing through ..B.9 (and incorrectly writing to (%r8)/a3

Compiler bug.

Jim Dempsey

Diego_Caballero · ‎12-01-2014

Hi Steve and Jim,

thanks for your answers.

I also think it is a compiler bug due to the arguable decision of converting a conditional write into an unconditional one when the compiler knows that the code has a chance of being run in a multithread context (like OpenMP).

Kind regards,

Steven_L_Intel1 · ‎12-01-2014

This is a bug in your code, not a compiler bug. The language disallows any effects on a dummy argument except through the dummy argument. The compiler is not required to assume that some other thread might store to the same memory. If you want that, add VOLATILE for the arguments. In a threaded program, you would not ordinarily have two threads updating the same storage without synchronization.

Extra stores emitted by the compiler and data-races