Intel® Fortran Compiler
Bug with -ipo and -parallel ?

velvia
Beginner
Hi,
I've found a bug that shows up in the latest version of ifort with the following code, made up of 2 files, when compiled with the -ipo and -parallel options. The code looks pretty strange (especially the use of the module shared, which is useless here except for triggering the bug), but it is the minimum needed to reproduce the bug.
main.f90 :
[fortran]! This program should write sin(sin(1)) to the screen, which is 0.745624
!
! But, when compiled with
!   ifort -ipo -c module.f90 -o module.o
!   ifort -ipo -c main.f90 -o main.o
!   ifort -parallel module.o main.o -o main
! with ifort (IFORT) 12.0.3 20110309 on Linux and Mac, it prints 1.
!
! It runs fine with ifort (IFORT) 11.1 20091130 on Linux.

program main
	use shared
	implicit none

	integer, parameter :: dp = 8

	! If you replace this with n = 22, it gives the correct result
	integer, parameter :: n = 23

	real(dp), dimension(n) :: tab
	
	integer :: j

	! If you comment out this line, it gives the correct result
	a = 3

	tab = 1.0_dp
	
	do j = 1,2
		tab = sin(tab)
	end do

	write (*,*) tab(n)
end program main[/fortran]
and module.f90
[fortran]module shared
	implicit none
	integer :: a
end module shared
[/fortran]
Everything is compiled with
[bash]ifort -ipo -c module.f90 -o module.o
ifort -ipo -c main.f90 -o main.o
ifort -parallel module.o main.o -o main[/bash]
It seems (according to -par-report2) that the compiler parallelizes the loop "do j = 1,2", which is rather surprising.
The bug shows up on Linux and Mac with ifort 12.0.3 20110309, but it does not show up on Linux with ifort 11.1 20091130.
Best regards,
Francois
Ron_Green
Moderator
I will look into this. I can see how in a real application you'd have a mix of parallel- and non-parallel-compiled code to link.

Let me see if I can isolate this a bit more, find workarounds, and get a bug report going.

ron
Ron_Green
Moderator
bug ID is DPD200168823

Workaround: use the option -nolib-inline when compiling main.f90. Please try this on your real code as well and let me know if it fixes the full application.
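
In other words, something along these lines (only the main.f90 compile line changes from your original commands):
[bash]ifort -ipo -c module.f90 -o module.o
ifort -ipo -nolib-inline -c main.f90 -o main.o
ifort -parallel module.o main.o -o main[/bash]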

Like you, I was surprised that ANY change to the code would make the problem go away. Thank you for a very compact reproducer!

ron
velvia
Beginner
Thank you.
There is no real code yet, as I am currently learning Fortran and what can be done to parallelize code using automatic parallelization.
But your answer was of great help. It helped me realize that sin was responsible for the bug. What is interesting is that you get the same problem with cos or exp, but not with sqrt.
Best regards,
Francois
velvia
Beginner
Hello,
Here is another bug, this time without sin or any such function. The source file is:
test.f90
[fortran]module shared
	implicit none

	integer, parameter :: n = 1000
	integer, dimension(n) :: tab
	
	integer :: a	
contains
	function f()
		integer :: f

		a = a+1
		f = a
	end function f
end module shared

program main
	use shared
	implicit none

	integer :: i,j

	do j = 1,2
		a = 0
		do i = 1,n
			tab(i) = f()
		end do
	end do
	
	write (*,*) tab(n),a
end program main[/fortran]
When I compile it with
[bash]ifort -parallel -par-report2 test.f90 -o test[/bash]
it refuses to parallelize the do loops, which is the right thing to do. But if you compile it with
[bash]ifort -ipo -parallel -par-report2 test.f90 -o test[/bash]
it parallelizes the first do loop (the one over j), which is extremely surprising. What is even more surprising is that it says the second loop is not parallelized because of an "insufficient inner loop"! That loop can't be parallelized, because f has side effects.
Anyway, the second program does not give the expected result.
And I am very surprised that -ipo has an effect when compiling a program made up of a single file.
Best regards,
Francois
PS: I am using ifort (IFORT) 12.0.3 20110309 on Mac OS X.
jimdempseyatthecove
Honored Contributor III
The outer loop should not be parallelized either, because each thread would then be using the same a, which is not local to the thread. The programmer (you) should realize this and not use -parallel on this source file. The compiler is attempting to do what it is told to do.

Also, the inner loop should not be auto-parallelized (even assuming no conflict between a and tab(i)) because it is nested one level inside the outer loop. As you nest deeper, the parallelization threshold increases dramatically.

I would suggest that you stop using auto-parallelization and start using explicit parallelization (OpenMP). OpenMP integrates quite nicely with the compiler. Note: do not parallelize your last example without understanding what you will get from the parallelization.
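
For example, the element-wise update from your first test case could be written with an explicit directive roughly like this (a minimal sketch, reusing your n, dp and tab, compiled with ifort -openmp):
[fortran]program omp_example
	implicit none
	integer, parameter :: dp = 8
	integer, parameter :: n = 23
	real(dp), dimension(n) :: tab
	integer :: i, j

	tab = 1.0_dp

	! The outer loop stays serial: iteration j depends on the result of j-1.
	do j = 1,2
		! The element-wise updates are independent, so this loop is safe to share out.
		!$omp parallel do
		do i = 1,n
			tab(i) = sin(tab(i))
		end do
		!$omp end parallel do
	end do

	write (*,*) tab(n)
end program omp_example[/fortran]
Without -openmp the !$omp lines are treated as ordinary comments and the program simply runs serially.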

Jim Dempsey
velvia
Beginner
Hi Jim,
1) I know that this program should not be parallelized. But imagine that some parts of a program could be parallelized; then raising the -parallel flag would mess up the parts that should not be parallelized. The lesson I learn from this experience is that automatic parallelization does not work and that you should always do it with explicit OpenMP directives.
2) I understand your point that an inner loop should not be parallelized. Thank you for your explanation.
I agree that my programs look rather "silly", but I am currently learning Fortran and the optimization that goes with it, which is not easy.
It took me a while (and some reading of this forum) to understand that forall has nothing to do with parallelization. By the way, I still don't understand its purpose. Perhaps I am getting something wrong, but I thought that "pure" functions were there for automatic parallelization purposes. But it seems that
[fortran]do i = 1,n
  a(i) = f(i)
end do[/fortran]
is parallelized (with the -parallel flag) even when f is not pure, and even when f has side effects, which should not be done. In the end, my questions are: what are "forall" and "pure" designed for? Is automatic parallelization a toy you should not use?
Thanks for your help,
Francois
TimP
Honored Contributor III
Auto-parallel does work better than OpenMP in a few situations, but it's not a good general solution.
I suppose one of the reasons for ifort being the first major compiler to introduce do concurrent (the f2008 alternative to forall) is the prospect of better auto-parallel support. Unfortunately, at this time, it puts us in the situation of supporting multiple compilers with conditional compilation:

[fortran]#if defined __INTEL_COMPILER
do concurrent (i = 1:n, a(i) > b(i))
	a(i) = a(i) - b(i)*d(i)
	c(i) = a(i) + c(i)
end do
#else
forall (i = 1:n, a(i) > b(i))
	a(i) = a(i) - b(i)*d(i)
	c(i) = a(i) + c(i)
end forall
#endif[/fortran]

In practice, F77-style code is still likely to perform best; besides, ifort OpenMP requires F77-style loops to support parallelization.

Generally speaking, ifort has less f2008 support than others:
[fortran]#ifndef __GFORTRAN__
junk = system("uname -ps > uname.txt")
#else
call execute_command_line("uname -ps > uname.txt")
#endif[/fortran]
The above seems to work with Open64 as well as ifort and gfortran, although I couldn't find documentation on the correct way for Open64.

As for pure, I suppose ifort -parallel wants to analyze all the source code, with the opportunity for interprocedural optimization, rather than take a chance on your pure assertion. If the code does comply with pure, that should improve the prospects for auto-parallelization.
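
For reference, pure is simply an assertion on the procedure that it has no side effects (no modification of global data, no I/O), which the compiler checks; a minimal sketch (the name and body are only illustrative, not from your code):
[fortran]pure function scaled_sin(x) result(y)
	real(8), intent(in) :: x   ! every dummy argument of a pure function must have a declared intent
	real(8) :: y
	y = sin(2.0d0*x)           ! no global state touched, no I/O
end function scaled_sin[/fortran]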
jimdempseyatthecove
Honored Contributor III

Consider something like

[fortran]module foo
	use kernel32          ! or equivalent for Linux
	implicit none
	integer(LONG) :: a    ! shared counter
contains
	function f()
		integer(LONG) :: f
		f = InterlockedIncrement(a) ! atomic increment; or equivalent for Linux
	end function f
end module foo

...
!$omp parallel do shared(tab), private(i)
do i = 1,n
	tab(i) = f()
end do
!$omp end parallel do
...[/fortran]


The above is a perfectly valid parallelization.
Each element of tab receives a unique number.

*** However, the values are not necessarily sequential.
*** The values will be ascending per thread, with each thread filling a slice of tab.

Although the output in array tab may differ between serial and parallel runs, this may be perfectly acceptable if your interest is in unique numbers. If it is not, then do not parallelize the loop in this manner.

Jim Dempsey

Ron_Green
Moderator
This bug was fixed in the 12.1 version of the compiler.

Closing.

ron