Regarding optimization
Hi. I'm sorry if this topic has already been discussed on this forum, but I could find no way to search for it.
I have two questions concerning optimizer options.
1. I compiled a library of mathematical routines with the -fast compiler option and archived them all into a static library, libxxx.a:
ifort -c -fast *.for
ar crv libxxx.a *.o
ranlib libxxx.a
When I tried to compile a main program and link it with the library:
ifort -fast -lxxx -o
I got some warning messages like the following:
ipo: warning #11010: file format not recognized for /home/rudi/lib/intel/libPlasma_o.a, possible linker script
and some warning messages regarding the -ipo option (implied by -fast), like the following:
ipo: warning #11020: unresolved homogeneous_pdf_mp_zfn_kn_
Referenced in /tmp/ipo_ifortVCdYmr.o
and the compilation was aborted. From the message, it seems that the linker was not able to locate the zfn_kn_
routine inside the homogeneous_pdf module. However, if I create the very same library and compile the same program without the -fast option, everything works fine, because the needed routines are indeed inside the modules stored in the library.
It seems to me that with the -ipo option I cannot link a main program against a static library. In this case, should I use the remaining options (-O3, -no-prec-div, -static, and -xHost) instead of simply -fast?
2. Is there any problem if I compile my routines with both -fast (or -O3 -no-prec-div -static -xHost) and the -parallel option together? Is there any effect on performance/precision? I would have thought that on x86_64 systems the -parallel option would be included in -fast...
Thanks.
GNU ar doesn't work with IPO objects (implied by -fast). Search both the C and Linux Fortran forums for articles about xiar. Somewhat as you suggested, if you want to build libraries with GNU ar, you should avoid -fast and -ipo.
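If you do want IPO inside the library, a sketch of the alternative (assuming the xiar archiver wrapper shipped with the compiler, which accepts the usual ar options) would be:
ifort -c -fast *.for
xiar crv libxxx.a *.o
xiar understands the IPO intermediate format, so the resulting archive can be used in an -ipo link.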
-parallel is separate from -fast. I don't know the decision-making process, but -fast can present enough difficulties as it is. -parallel is often more useful as an exploratory tool, to get an idea of where OpenMP would be useful, than as a final build option.
1) If you are building a library, you probably don't want IPO. IPO creates a special intermediate object file with a unique and proprietary format that is later used for finding interprocedural optimization opportunities, chief among them inlining across source files.
Since you are putting your procedures in a library, you should include only standard-format .o files: you will want the library to be independent of any Intel proprietary formats. But then on your final compilation/link, your main program is using IPO again, so it will be expecting that proprietary format.
There are some ways around this, but I doubt you really want IPO in this scenario. Yes, just use the remaining options in -fast sans -ipo.
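Concretely, a sketch of the library build without -ipo (using the file names from the original post; -static matters only at the final link, so it is omitted here):
ifort -c -O3 -no-prec-div -xHost *.for
ar crv libxxx.a *.o
ranlib libxxx.a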
2) Mixing in -parallel: -parallel enables automatic parallelization. It's not on by default, since it only gives benefit on multicore systems (someday perhaps it will be safe to assume all processors are multicore). Its benefits are marginal: the code has to have trivial, easily identifiable loop nests with predictable trip counts.
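For illustration, this is the kind of loop the auto-parallelizer can typically handle (a minimal sketch; the subroutine is made up for this example):
subroutine axpy(n, a, x, y)
   ! a countable loop with no cross-iteration dependences:
   ! the pattern -parallel looks for
   implicit none
   integer, intent(in) :: n
   real, intent(in)    :: a, x(n)
   real, intent(inout) :: y(n)
   integer :: i
   do i = 1, n
      y(i) = y(i) + a*x(i)
   end do
end subroutine axpy
Compiling with something like ifort -c -parallel -par-report2 axpy.f90 (if your compiler version supports -par-report) reports which loops were auto-parallelized.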
Yes, parallelization can have effects on performance and precision.
And just to add some more info: -fast implies a bunch of things, including -ipo. I generally recommend avoiding it, as its meaning changes from release to release. It's meant to be shorthand for getting good performance in a standalone executable that is to be run on the same system you compiled on, but as you found, it can pose problems in more complex scenarios. Instead, use just the options you want.
xiar is needed to deal with libraries containing -ipo objects.
Hey, guys, thank you all for your replies. Indeed, I think I should customize the optimization options for my particular purposes, instead of using a single global option such as -fast, which can change with each new release.
Since I'm on the topic of performance optimization, I would like to abuse your patience a tad longer (and I again apologize if this has already been discussed on this forum) and ask which constructs the -parallel option can actually parallelize. It seems to me that this option would only act on DO constructs, which would otherwise be evaluated sequentially. If I replace my DO loops with FORALL and WHERE constructs, use the Fortran 95 intrinsic elemental routines, and/or employ only PURE procedures, I would think the code should be automatically parallelized with standard compilation options (i.e., without -parallel). At least this is the impression I got from reading Metcalf's "Fortran 90/95 Explained". Am I correct?
Still on this issue, how can I quantify the amount of parallelization I was able to achieve? I mean, how can I measure the fraction of the program's run time that is executed on more than one processor, other than simply calling CPU_TIME on two distinct parts of the program, which gives me only a rough estimate of the performance increase?
Thanks again.
FORALL, and even more so WHERE, may be more difficult to optimize than their F90 equivalents. While they were presumably introduced in HPF with the idea of treating them as implicit parallelization directives, I don't know of any Fortran compiler which doesn't require at least an auto-parallel option to attempt that.
OpenMP 2.5 introduced WORKSHARE as a directive for parallelizing such constructs. As pointed out in an earlier thread, ifort and gfortran currently implement WORKSHARE in effect as a SINGLE region. I don't think it has been decided whether the next major versions will take a step toward parallelizing the easier cases.
CPU_TIME normally shows an increase for parallel regions, as most compilers total up the time of all the threads.
omp_get_wtime or equivalent may be what you are interested in.
The Intel libiompprof5 library, linked in place of the non-profiling OpenMP library for the Intel or GNU compilers, will give you data on parallel effectiveness, work imbalance, etc.
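For example, a minimal sketch of the two timers side by side (assuming an OpenMP-enabled build, so that the omp_lib module is available):
program timing_sketch
   use omp_lib                ! provides omp_get_wtime
   implicit none
   real    :: c0, c1          ! CPU_TIME: summed across all threads
   real(8) :: w0, w1          ! omp_get_wtime: elapsed wall-clock time
   integer :: i
   real    :: s
   call cpu_time(c0)
   w0 = omp_get_wtime()
   s = 0.0
   do i = 1, 10000000
      s = s + sin(real(i))
   end do
   call cpu_time(c1)
   w1 = omp_get_wtime()
   print *, 'result:', s, ' cpu:', c1 - c0, ' wall:', w1 - w0
end program timing_sketch
The ratio of CPU time to wall-clock time gives a rough idea of how many cores were actually kept busy.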
I'm somewhat surprised, because I was under the impression that constructs such as FORALL and WHERE would be automatically parallelized if the hardware is multi-core. But after a more careful reading of the usual Fortran 95/2003 books (Metcalf, Chapman, Adams), I noticed that they don't actually say FORALL constructs will be automatically parallelized, only that they may be.
From your reply I initially concluded that simply using the -parallel option is not sufficient to guarantee that FORALL constructs are parallelized, and that one should also use OpenMP compiler directives. After a quick search of the Intel forums, I found the following threads:
http://software.intel.com/en-us/forums/showthread.php?t=64738
http://software.intel.com/en-us/forums/showthread.php?t=43744
http://software.intel.com/en-us/forums/showthread.php?t=42717
In particular, the first one is very extensive and detailed. After reading the threads and the Intel Fortran Compiler User and Reference Guides, I wrote this very simple code:
program tesforall3
   implicit none
   real, dimension(3,3) :: a
   integer :: i, j
   !$OMP WORKSHARE
   forall (i = 1:3, j = 1:3) a(i,j) = sin(real(i)) + tan(real(j))
   !$OMP END WORKSHARE
   print *, "Matrix A:"
   print *, a(1,:)
   print *, a(2,:)
   print *, a(3,:)
   !$OMP WORKSHARE
   forall (i = 1:3, j = 1:3) a(i,j) = a(j,i)
   !$OMP END WORKSHARE
   print *, "Transpose of A:"
   print *, a(1,:)
   print *, a(2,:)
   print *, a(3,:)
end program tesforall3
That only creates a 3x3 matrix and performs its transpose. If I compile the program with
ifort -O3 -parallel -openmp -openmp-report=2 tesforall3.f90 -o tesforall3
then, with or without the -parallel option, I get the following messages:
tesforall3.f90(5): (col. 7) remark: OpenMP multithreaded code generation for SINGLE was successful.
tesforall3.f90(12): (col. 7) remark: OpenMP multithreaded code generation for SINGLE was successful.
Does this mean that the FORALL constructs have been successfully parallelized? If I remove the OpenMP directives, I get no message at all, so I conclude that the program was not parallelized.
Am I on the right track here?
I agree that the literature indicates WORKSHARE as the preferred way to parallelize HPF/F95 constructs, as OpenMP 2.5 intended. I think I also pointed out that ifort and gfortran currently implement this only as a SINGLE region (which is what the messages you quote are telling you), though multi-threading may be under consideration for future versions.
That leaves -parallel, or changing the source code to the equivalent omp parallel do syntax, as the current options.
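For reference, a sketch of the first FORALL in your program rewritten as an omp parallel do:
!$OMP PARALLEL DO PRIVATE(i)
do j = 1, 3
   do i = 1, 3
      a(i,j) = sin(real(i)) + tan(real(j))
   end do
end do
!$OMP END PARALLEL DO
The parallel loop index j is private automatically; the inner index i must be declared PRIVATE explicitly.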
The original intent was that FORALL could be parallelized, but it turns out that the definition is such that it is difficult to do so. This is why the Fortran standards committee added a DO CONCURRENT construct to the draft of Fortran 2008.
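As a sketch of the draft Fortran 2008 syntax, the first FORALL in the program above would become:
do concurrent (i = 1:3, j = 1:3)
   a(i,j) = sin(real(i)) + tan(real(j))
end do
Unlike FORALL, DO CONCURRENT requires the programmer to guarantee that the iterations are independent, which is what makes it straightforward to parallelize; the transpose FORALL, whose right-hand side reads the array being assigned, would not qualify.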
As for HPF - well, it was not a blinding success in the market. I know at DEC we spent a lot of money and effort building a parallelizing HPF compiler, but we couldn't get anyone to use it.
