Intel® Fortran Compiler

Improve speed of /intel/Compiler/.../openmp_samples/openmp_sample.f90

mirko_vukovic
Beginner
Hello,

I am re-learning Fortran (for the nth time), this time with the OpenMP flavor, and I am trying to understand the effects of compiler options and declaration options on execution speed. So, I was playing with the openmp_sample.f90 code in the ifort 11.1 distribution.

My question is how this sample file can be compiled for faster execution. At this point I am not looking for a different algorithm, nor am I asking about parallelization. I am asking whether the variable declarations can be improved, or whether different compilation options can be specified.

So far, neither the -O3 nor the -fast switch has resulted in any appreciable gain.

By way of motivation: I use Common Lisp (CL) for much of my daily analysis work, and Lisp folks claim that CL code can be competitive with C or Fortran, so I transcribed the algorithm into CL and ran it on SBCL. (The code is here: http://paste.lisp.org/display/111293 - sprinkled throughout, you will see several `declare' statements where I declared the numbers as `unsigned 32'.)

I was surprised that CL and Fortran performed similarly: 6.4 sec for Fortran vs 6.55 sec for CL (on a 64-bit DELL machine running RHEL5).

But I wonder if the Fortran code can be tweaked to help ifort generate faster code, while keeping the logical structure of the code the same. And I certainly don't want to start a discussion of the superiority of one language over the other - both are equally ancient :-)


Thanks,

Mirko
mecej4
Honored Contributor III
Check what the default compiler options are, and you will not be surprised any more.

As the comments in the example state, you have to use -openmp -fpp to "turn on" OpenMP. On my C2D CPU, the runtime went down from 3.13s with one thread to 1.66s with two threads. Again, no surprises.
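
To confirm that the OpenMP switch actually took effect, a minimal sketch like the one below can help. It relies on the standard OpenMP conditional-compilation sentinel: in free-form source, a line starting with `!$ ` is compiled only when OpenMP is enabled and is an ordinary comment otherwise. The program name omp_check and the variable nthreads are made up for this illustration and do not come from openmp_sample.f90.

```fortran
! Sketch: report how many threads a parallel region uses.
! Built without the OpenMP switch, every "!$" line is just a comment
! and the program reports a single thread.
program omp_check
!$ use omp_lib                        ! compiled only when OpenMP is on
  implicit none
  integer :: nthreads                 ! hypothetical name, for illustration

  nthreads = 1                        ! value kept in a serial build
!$omp parallel
!$omp master
!$ nthreads = omp_get_num_threads()   ! only the master thread writes
!$omp end master
!$omp end parallel

  print '(a,i0)', 'threads in use: ', nthreads
end program omp_check
```

If this reports one thread on a multi-core machine, either the OpenMP switch was not passed or the thread count is being limited, for example through the OMP_NUM_THREADS environment variable.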
mirko_vukovic
Beginner
Sorry, I was not clear. I did manage to get OpenMP running, and had all four cores busy (and I could have parallelized the Lisp code as well). I was more interested in whether there are other approaches to make the code faster *without* OpenMP.

Thanks,

Mirko
mecej4
Honored Contributor III
There are two classes of approaches.

1) Concentrate on the algorithm and give less priority to implementation features. This is where big payoffs may be realized. The speed increases that we can get with, e.g., sorting algorithms by replacing an O(N^2) algorithm with one of O(N lg N) or O(N) can be huge.

2) Experiment with implementation features such as programming language, compiler choice, compiler switches, libraries used, memory cache, instruction pipeline, etc., remembering that famous people have said that "the root of all evil is premature optimization".
jimdempseyatthecove
Honored Contributor III
The comments in openmp_sample.f90 indicate that the internal loops are manually unrolled by 4.
IOW, the comments indicate that 4x4 tiles are used.
The actual code does not use tiles. Either the comments should be fixed or, better, the sample code should be replaced with a tiled version (or the sample should include both tiled and non-tiled versions).

Also, there is a functional problem with the program.

The 2nd and later iterations of the main loop (for(l=0; l < NTIMES; l++)) may begin for a specific thread while prior iterations of the main loop are still being executed by other threads.
IOW, except after the final iteration of the main loop, you never have a planned point in the compute section where you are guaranteed to have complete results in c.
IMHO this is not a valid performance test because the 2nd and later iterations do not accumulate the potential skew time.
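
One way to get such a guaranteed completion point is sketched below. It is an assumption about how a repeated, work-shared multiply could be laid out, not the code from openmp_sample.f90; the program name and the sizes n and ntimes are hypothetical. The key point is that a work-sharing `!$omp do` without a NOWAIT clause ends in an implicit barrier, so no thread starts repetition l+1 before every thread has finished repetition l.

```fortran
! Sketch: repeat a work-shared matrix multiply with a barrier per repetition.
program barrier_sketch
  implicit none
  integer, parameter :: n = 100, ntimes = 5   ! hypothetical sizes
  real(8) :: a(n,n), b(n,n), c(n,n)
  integer :: i, j, k, l

  a = 1.0d0
  b = 2.0d0

!$omp parallel private(i, j, k, l)
  do l = 1, ntimes                  ! every thread runs all repetitions
!$omp do
    do j = 1, n                     ! columns of c are shared out to threads
      do i = 1, n
        c(i,j) = 0.0d0
      end do
      do k = 1, n
        do i = 1, n
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
      end do
    end do
!$omp end do
    ! Implicit barrier here (no NOWAIT): c is complete for repetition l,
    ! and any thread skew is paid inside the timed region.
  end do
!$omp end parallel

  ! With a = 1 and b = 2, every element of c is the sum of n products 1*2.
  print *, 'c(1,1) =', c(1,1), ' expected about', 2.0d0*n
end program barrier_sketch
```

Adding NOWAIT to the `!$omp end do` line would remove that barrier and allow the overlap described above.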

Mirko,

This particular example is not a good example to experiment with compiler options for optimizations.
The reason is that one of the arrays is indexed row first, column second, and the other column first, row second. This results in cache penalties. To improve cache utilization you need to address this with a different algorithm.

If you are not interested in matrix multiplication, but are interested in compiler optimizations, then look for a different sample program.
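
The row/column indexing issue can be seen in a small experiment. The sketch below is not the sample's code; the program name, the array names and the size n are hypothetical. Fortran stores arrays column by column, so an innermost loop over the first index touches consecutive memory, while an innermost loop over the second index jumps by a whole column each step. Both loop nests compute the same product and differ only in loop order.

```fortran
! Sketch: the same matrix product with two loop orders.
program loop_order
  implicit none
  integer, parameter :: n = 400     ! hypothetical size
  real(8), allocatable :: a(:,:), b(:,:), c1(:,:), c2(:,:)
  real :: t0, t1, t2
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c1(n,n), c2(n,n))
  call random_number(a)
  call random_number(b)
  c1 = 0.0d0
  c2 = 0.0d0

  call cpu_time(t0)
  ! i-j-k order: the innermost loop walks a(i,k) along a row (stride n).
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c1(i,j) = c1(i,j) + a(i,k)*b(k,j)
      end do
    end do
  end do
  call cpu_time(t1)
  ! j-k-i order: the innermost loop walks c2(i,j) and a(i,k) down a
  ! column (stride 1), which suits Fortran's column-major storage.
  do j = 1, n
    do k = 1, n
      do i = 1, n
        c2(i,j) = c2(i,j) + a(i,k)*b(k,j)
      end do
    end do
  end do
  call cpu_time(t2)

  print *, 'i-j-k seconds:', t1 - t0
  print *, 'j-k-i seconds:', t2 - t1
  print *, 'max difference between results:', maxval(abs(c1 - c2))
end program loop_order
```

At higher optimization levels the compiler may interchange the loops on its own, in which case the two timings can come out nearly equal. That would itself be a useful observation when comparing compiler switches.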

Jim Dempsey
