- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
consider the following code
SUBROUTINE FOO(lower, upper, a1, a2, a3) IMPLICIT NONE INTEGER(4) :: lower, upper, a1, a2, a3 INTEGER(4) :: idx idx = lower DO WHILE (idx <= upper) SELECT CASE (idx) CASE (0) a1 = a1 + 2 CASE (1) a2 = a2 + 4 CASE (2) a3 = a3 + 6 END SELECT idx = idx + 1 END DO END SUBROUTINE FOO
compiled with intel fortran 14.0.2 or 15.0.0 using the following command:
ifort -c -fopenmp -O2 test.f90
the compiler emits an extra store on the location of the varable a3. We can confirm this upon examination of the assembly code.
ifort -S -fopenmp -O2 test.f90
this generates a
test.s
which contains the following assembly code
foo_: # parameter 1: %rdi # parameter 2: %rsi # parameter 3: %rdx # parameter 4: %rcx # parameter 5: %r8 ..B1.1: # Preds ..B1.0 movslq (%rdi), %r11 #7.5 movslq (%rsi), %rax #8.5 xorl %esi, %esi #8.5 cmpq %rax, %r11 #8.19 jg ..B1.10 # Prob 9% #8.19 # LOE rax rdx rcx rbx rbp rsi r8 r11 r12 r13 r14 r15 ..B1.2: # Preds ..B1.1 subq %r11, %rax #8.5 movl (%r8), %edi #15.11 incq %rax #8.5 movl (%rcx), %r9d #13.11 movl (%rdx), %r10d #11.11 movq %rbx, -24(%rsp) #8.5 movq %rbp, -16(%rsp) #8.5 # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d ..B1.3: # Preds ..B1.8 ..B1.2 movq %r11, %rbx #17.7 addq %rsi, %rbx #17.7 je ..B1.7 # Prob 25% #9.20 # LOE rax rdx rcx rbx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d ..B1.4: # Preds ..B1.3 cmpq $1, %rbx #9.20 jne ..B1.6 # Prob 66% #9.20 # LOE rax rdx rcx rbx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d ..B1.5: # Preds ..B1.4 addl $4, %r9d #13.11 movl %r9d, (%rcx) #13.11 jmp ..B1.8 # Prob 100% #13.11 # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d ..B1.6: # Preds ..B1.4 cmpq $2, %rbx #15.11 lea 6(%rdi), %ebp #15.11 cmove %ebp, %edi #15.11 #### <---- ISSUE HERE jmp ..B1.8 # Prob 100% #15.11 # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d ..B1.7: # Preds ..B1.3 addl $2, %r10d #11.11 movl %r10d, (%rdx) #11.11 # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d ..B1.8: # Preds ..B1.7 ..B1.5 ..B1.6 incq %rsi #8.5 cmpq %rax, %rsi #8.5 jb ..B1.3 # Prob 82% #8.5 # LOE rax rdx rcx rsi r8 r11 r12 r13 r14 r15 edi r9d r10d ..B1.9: # Preds ..B1.8 movq -24(%rsp), %rbx # movq -16(%rsp), %rbp # movl %edi, (%r8) #15.11 #### <---- ISSUE HERE # LOE rbx rbp r12 r13 r14 r15 ..B1.10: # Preds ..B1.1 ..B1.9 ret #19.1
As you can see in block B1.6 (line 40 above) there is a conditional move (cmove) which implements the branch for the CASE(2) in the Fortran code. This conditional move cannot directly store a 32-bit value to a 64-bit address so it keeps in another register instead. This temporary register is finally stored in memory at the end of the function, in block B1.9 (line 55 above). Unfortunately this store is always executed when the loop is non-empty
When this code is run in a multiple thread environment, this extra store causes a data-race: imagine that we have 3 threads and each thread executes one different branch. If the thread that runs CASE(2) is not the last, threads executing CASE(0) and CASE(1) will overwrite the value of A3.
It is interesting to note that, changing the loop above from
DO WHILE( ... )
to
DO idx = lower, upper
does not expose this issue. Also, this problem does not happen if we use -O1 rather than -O2. Flag -common-args seems also to avoid this problem (although our code does not expose aliasing in the dummy arguments), I guess this is by chance.
Given that Fortran does not make any assumption on the memory model, the emitted code is probably legal (as it is from a single-thread point of view), is there a way to ensure that Intel Fortran does not make this sort of extra stores in memory?
Kind regards
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
a1, a2 and a3 are dummy arguments, so the language assumes that there are no other references to the associated memory during execution of the subroutine. If this is a threaded application, you should at least add RECURSIVE to the subroutine declaration to make sure that no static storage is used. (If you are using OpenMP, this is implied but good practice anyway.)
If the storage is potentially referenced by other threads, add VOLATILE to let the compiler know. But perhaps you need to use synchronization to protect the storage. You can't guarantee that the compiler will generate single instructions and even if it does, you can't guarantee that these are atomic in operation.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
What are the values of lower and upper on the calls from each of your threads?
The cmove is not an issue, this is register to register.
You have 3 movs that write to memory. Those to (%rdx), (%rcx) and (%r8).
If your lower==upper for each thread, and each thread has different values for lower/upper (but lower==upper) then you should not have a race condition. You may have cache line evictions. When lower==upper, then either only one write will occur or no writes will occur.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In taking a re-look at the disassembly it is wrong.
Whenever lower==upper==0, or lower==upper==1, or lower==0 upper==1, the code cannot exit to ..B.10 without passing through ..B.9 (and incorrectly writing to (%r8)/a3
Compiler bug.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Steve and Jim,
thanks for your answers.
I also think it is a compiler bug due to the arguable decision of converting a conditional write into an unconditional one when the compiler knows that the code has a chance of being run in a multithread context (like OpenMP).
Kind regards,
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This is a bug in your code, not a compiler bug. The language disallows any effects on a dummy argument except through the dummy argument. The compiler is not required to assume that some other thread might store to the same memory. If you want that, add VOLATILE for the arguments. In a threaded program, you would not ordinarily have two threads updating the same storage without synchronization.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page