Hello,
I am re-learning Fortran (for the nth time), this time with the OpenMP flavor, and I am trying to understand the effects of compiler options and declaration options on execution speed. So, I was playing with the openmp_sample.f90 code in the ifort 11.1 distribution.
My question is how this sample file can be compiled for faster execution. At this point I am not looking for a different algorithm or at parallelization issues. I am asking whether the variable declarations can be improved, or different compilation options specified.
So far, neither the -O3 nor the -fast switch has resulted in any appreciable gain.
By way of motivation: since I use Common Lisp (CL) for much of my daily analysis work, and since Lisp folks claim that CL code can be competitive with C or Fortran, I transcribed the algorithm into CL and ran it on SBCL. (The code is here: http://paste.lisp.org/display/111293 - you will see several `declare' statements sprinkled throughout, where I declared the numbers as `unsigned 32'.)
I was surprised that CL and Fortran performed similarly: 6.4 sec for Fortran vs. 6.55 sec for CL (on a 64-bit Dell machine running RHEL5).
But I wonder if the Fortran code can be tweaked to help ifort generate faster code - while keeping the logical structure of the code the same. And I certainly don't want to start a discussion of the superiority of one language over the other - both are equally ancient :-)
Thanks,
Mirko
4 Replies
Check what the default compiler options are, and you will not be surprised any more.
As the comments in the example state, you have to use -openmp -fpp to "turn on" OpenMP. On my C2D CPU, the runtime went down from 3.13s with one thread to 1.66s with two threads. Again, no surprises.
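For reference, the build lines might look like the following sketch (the -openmp and -fpp flags are the ones the sample's comments call for; the output names are placeholders of my own):

```shell
# Serial build: the OpenMP directives are simply ignored
ifort -O3 openmp_sample.f90 -o sample_serial

# Parallel build: -openmp enables the directives, -fpp runs the preprocessor
ifort -O3 -openmp -fpp openmp_sample.f90 -o sample_omp

# The thread count can then be controlled at run time
OMP_NUM_THREADS=2 ./sample_omp
```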
Sorry, I was not clear. I did manage to get OpenMP running, and had all four cores busy (and I could have parallelized the Lisp code as well). I was more interested in whether there are other approaches to make the code faster *without* OpenMP.
Thanks,
Mirko
There are two classes of approaches.
1) Concentrate on the algorithm and give less priority to implementation features. This is where big payoffs may be realized. The speedups we can get in, e.g., sorting by replacing an O(N^2) algorithm with one that is O(N lg N) or O(N) can be huge.
2) Experiment with implementation features such as programming language, compiler choice, compiler switches, libraries used, memory cache, instruction pipelining, etc., remembering the famous saying that "premature optimization is the root of all evil".
1) concentrate on the algorithm and give less priority to implementation features. This is where big payoffs may be realized. The speed increases that we can get with, e.g., sorting algorihtms by replacing an O(N^2) algorithm by one of O(N lg N) or O(N) can be huge.
2) experiment with implementation features such as programming language, compiler choice, comiler switches, libraries used, memory cache, instruction pipeline, etc, remembering that famous people have said that "the root of all evil is premature optimization".
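To make point 1 concrete, here is a small C sketch (the function names are my own) comparing an O(N^2) insertion sort with the C library's O(N lg N) qsort; both produce the same result, but for large N the second approach wins by orders of magnitude:

```c
#include <stdlib.h>

/* O(N^2): each element is shifted into place one position at a time */
void insertion_sort(int *a, int n)
{
    for (int i = 1; i < n; i++) {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

/* comparator for the library's O(N lg N) qsort */
int cmp_int(const void *p, const void *q)
{
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);
}
/* usage: qsort(a, n, sizeof a[0], cmp_int); */
```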
The comments in openmp_sample.f90 indicate that the internal loops are manually unrolled by 4.
IOW, the comments indicate that 4x4 tiles are used.
The actual code does not use tiles. Either the comments should be fixed or, better, the sample code should be replaced with a tiled version (or both tiled and non-tiled versions should be included in the sample).
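As a sketch of what such a tiled version could look like (in C rather than Fortran, with made-up names; TS = 4 matches the 4x4 tiling the comments describe), the idea is to walk the matrices in blocks small enough to stay resident in cache while they are reused:

```c
#include <string.h>

#define N  8   /* matrix size (a multiple of TS, to keep the sketch simple) */
#define TS 4   /* tile size, matching the 4x4 tiling the comments describe */

/* c = a * b, traversed in TS x TS blocks so that each tile of a, b,
   and c stays in cache for the duration of its reuse */
void matmul_tiled(double a[N][N], double b[N][N], double c[N][N])
{
    memset(c, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += TS)
        for (int kk = 0; kk < N; kk += TS)
            for (int jj = 0; jj < N; jj += TS)
                /* multiply one TS x TS tile pair */
                for (int i = ii; i < ii + TS; i++)
                    for (int k = kk; k < kk + TS; k++) {
                        double aik = a[i][k];
                        for (int j = jj; j < jj + TS; j++)
                            c[i][j] += aik * b[k][j];
                    }
}
```

The result is identical to the naive triple loop; only the traversal order changes, which is what reduces cache misses for large N.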
Also, there is a functional problem with the program.
The 2nd and later iterations of the main loop (for(l=0; l < NTIMES; l++)) may begin for a specific thread while prior iterations of the main loop are still being executed by other threads.
IOW, except after the final iteration of the main loop, you never have a planned point in the compute section where you are guaranteed to have complete results in c.
IMHO this is not a valid performance test, because the 2nd and later iterations do not accumulate the potential skew time.
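One way to get that planned point is a nowait worksharing loop followed by an explicit barrier. The sketch below (in C with OpenMP; accumulate, N, and NTIMES are names of my own, not the sample's actual code) forces every thread to finish iteration l before any thread starts iteration l+1:

```c
/* repeatedly accumulate a into c; the barrier guarantees c is
   complete at the end of every iteration of the outer l loop */
void accumulate(const double *a, double *c, int n, int ntimes)
{
    #pragma omp parallel
    for (int l = 0; l < ntimes; l++) {
        /* nowait lets fast threads leave the i loop early... */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            c[i] += a[i];
        /* ...so an explicit barrier is needed before iteration l+1
           may read or write c again */
        #pragma omp barrier
    }
}
```

Without -openmp the pragmas are ignored and the function runs serially with the same result.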
Mirko,
This particular example is not a good one for experimenting with compiler options for optimization.
The reason is that one of the arrays is indexed row first, column second, and the other column first, row second. This results in cache penalties. To improve cache utilization you need to address this with a different algorithm.
If you are not interested in matrix multiplication, but are interested in compiler optimizations, then look for a different sample program.
Jim Dempsey
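To illustrate the indexing point in C (where arrays are row-major; Fortran is column-major, so the fast order is the transpose of this), both functions below compute the same sum, but the first walks memory with stride 1 while the second jumps a whole row per element and touches a new cache line on almost every access:

```c
#define M 256

/* row-major C: a[i][j] and a[i][j+1] are adjacent in memory */
double sum_row_order(double a[M][M])
{
    double s = 0.0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            s += a[i][j];   /* stride-1: cache friendly */
    return s;
}

/* same sum, but the inner loop strides M doubles per step */
double sum_col_order(double a[M][M])
{
    double s = 0.0;
    for (int j = 0; j < M; j++)
        for (int i = 0; i < M; i++)
            s += a[i][j];   /* stride-M: cache hostile in C */
    return s;
}
```

The answers are identical; only the memory access pattern, and hence the speed at large M, differs.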