Hi!
I'm a Fortran user. I'm trying, for the first time, to use OpenMP to improve the performance of my codes. I work under Windows XP on a Pentium 4 (670) 3.8GHz.
I think my CPU is a single-core one, but with Hyper-Threading. Therefore I expect I could improve the performance of my codes by about a factor of 2. Is this correct?
Even if the improvement factor is lower than 2, I expect some improvement if I use the parallelization correctly.
Accordingly, I wrote the following Fortran code (just a simple, time-consuming application), but I get worse timing (about 2x) than with the same code without the OpenMP directives:
program prova_omp
INTEGER i,k,A,num
real*8 x(1e6),time1,time2
include "omp_lib.h"
call OMP_SET_NUM_THREADS(2)
num=1e6
call cpu_time(time1)
!$OMP PARALLEL SHARED(x)
A=OMP_GET_NUM_THREADS()
!$OMP DO
do i=1,num
x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
enddo
!$OMP END DO
!$OMP END PARALLEL
call cpu_time(time2)
end
I used the following compilation line:
ifort /G7 /Qopenmp filename.for /link /stack:8000000 /out:filename.exe
Are there any errors due to a wrong parallelization or compilation?
If my sample code is correct, is the performance deterioration due to the CPU, which might not be well suited to parallelization? I'm not sure I'm trying out OpenMP on the right CPU.
Thanks
Claudio
Claudio,
Under the best of circumstances you might see 30% improvement using 2 threads on an HT processor. An HT processor approximates two integer cores but one floating point core, one cache, and one memory bus. Your loop has very little integer code (i will likely be registerized) so the bulk of the processing of your loop is in floating point.
There are a few "bugs" in your code.
1) When i=1 then x(i-1) is out of bounds
2) x(i)=x(i-1)... will have problems if one thread computes the first element of the higher array slice following a different thread computing the last element of the prior array slice. The probability of this happening is low in your case (2 threads) but it is not zero.
3) Array x wasn't initialized (has junk), therefore inconsistent timing results may be obtained.
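Point 2 can be illustrated with a minimal sketch. The names dependency_demo, fill_independent and fill_recurrence are hypothetical and the formulas are simplified stand-ins for the original expression. The first loop writes x(i) from i alone, so an OMP DO can split it safely; the second reads x(i-1), which is a loop-carried dependency, so it is left serial here.

```fortran
! Sketch only: all names below are hypothetical, not from the original code.
module dependency_demo
    implicit none
contains
    ! Each x(i) depends only on i, so iterations may run in any order
    ! and the work-sharing loop is safe.
    subroutine fill_independent(x, n)
        integer, intent(in) :: n
        real(8), intent(out) :: x(n)
        integer :: i
        !$omp parallel do
        do i = 1, n
            x(i) = 1.0d0/i + 0.3d0**i
        end do
        !$omp end parallel do
    end subroutine fill_independent

    ! x(i) reads x(i-1): a loop-carried dependency. Deliberately serial;
    ! a plain OMP DO here could read x(i-1) before another thread wrote it.
    subroutine fill_recurrence(x, n)
        integer, intent(in) :: n
        real(8), intent(inout) :: x(0:n)
        integer :: i
        do i = 1, n
            x(i) = x(i-1)*(1.0d0/i) + 0.3d0**i
        end do
    end subroutine fill_recurrence
end module dependency_demo

program demo
    use dependency_demo
    implicit none
    integer, parameter :: n = 1000
    real(8) :: a(n), b(0:n)
    call fill_independent(a, n)
    b(0) = 1.0d0   ! initialize the element the recurrence starts from
    call fill_recurrence(b, n)
    write(*,*) a(1), b(1)
end program demo
```

Declaring the array as x(0:n) and setting x(0) before the loop also takes care of points 1 and 3.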
Try the following test
! loop.f90
!
! FUNCTIONS:
! loop - Entry point of console application.
!
!****************************************************************************
!
! PROGRAM: loop
!
! PURPOSE: Entry point for the console application.
!
!****************************************************************************
module mod_prova_omp
    integer, parameter :: num=1e6
    real*8 x(0:num)
    real*8 time1,time2,elapse
end module mod_prova_omp

program prova_omp
    use omp_lib
    use mod_prova_omp
    implicit none
    integer itr,iterations
    INTEGER i,k,j
    do iterations=1,3
        write(*,*) 'Iterations', iterations
        call InitData
        time1 = OMP_GET_WTIME()
        do itr=1,iterations
            call NonOpenMP
        end do
        time2 = OMP_GET_WTIME()
        elapse = time2-time1
        write(*,*) 'Non-OpenMP', elapse
        do i=1,2
            call OMP_SET_NUM_THREADS(i)
            call InitData
            time1 = OMP_GET_WTIME()
            do itr=1,iterations
                call WithOpenMP
            end do
            time2 = OMP_GET_WTIME()
            elapse = time2-time1
            write(*,*) 'OpenMP Threads', i, elapse
        end do
    end do
end program prova_omp

subroutine InitData
    use mod_prova_omp
    implicit none
    integer i
    do i=0,num
        x(i)=i
    end do
end subroutine InitData

subroutine NonOpenMP
    use mod_prova_omp
    implicit none
    integer i
    do i=1,num
        x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
    enddo
end subroutine NonOpenMP

subroutine WithOpenMP
    use mod_prova_omp
    implicit none
    integer i
!$OMP PARALLEL
!$OMP DO
    do i=1,num
        x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
    enddo
!$OMP END DO
!$OMP END PARALLEL
end subroutine WithOpenMP

On my HT system
Iterations 1
Non-OpenMP 1.42375957337208
OpenMP Threads 1 1.41901874425821
OpenMP Threads 2 1.11970893584657
Iterations 2
Non-OpenMP 2.88771043426823
OpenMP Threads 1 2.88413876714185
OpenMP Threads 2 2.27755549806170
Iterations 3
Non-OpenMP 4.25633513007779
OpenMP Threads 1 4.25796941656154
OpenMP Threads 2 3.32618987432215
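As a rough check on these numbers, the sketch below divides each Non-OpenMP time by the matching 2-thread time. The program name speedup_demo is hypothetical; the timings are the ones printed above. OMP_GET_WTIME returns elapsed wall-clock time, so each figure is the total time for the run, not a per-thread time. The ratios come out at about 1.27 to 1.28.

```fortran
! Sketch only: speedup_demo is a hypothetical name; the timings are
! copied from the output listed above.
program speedup_demo
    implicit none
    real(8), parameter :: serial(3) = (/ 1.42375957337208d0, &
                                         2.88771043426823d0, &
                                         4.25633513007779d0 /)
    real(8), parameter :: threads2(3) = (/ 1.11970893584657d0, &
                                           2.27755549806170d0, &
                                           3.32618987432215d0 /)
    real(8) :: speedup(3)
    integer :: k
    do k = 1, 3
        ! speedup = wall-clock time without OpenMP / wall-clock time with 2 threads
        speedup(k) = serial(k)/threads2(k)
        write(*,'(a,i2,a,f6.3)') 'Iterations', k, '  speedup', speedup(k)
    end do
end program speedup_demo
```

By contrast, on many compilers cpu_time adds up the CPU time of all threads, so timing a 2-thread region with it can make the parallel version look slower even when the elapsed time has dropped.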
Jim Dempsey
Hi Jim!
Thanks for your help!
I tried your code on my PC and here is what i got:
Iterations 1
Non-OpenMP 1.13818437914597
OpenMP Threads 1 1.14701418651384
OpenMP Threads 2 0.891682602814399
Iterations 2
Non-OpenMP 2.28307977315853
OpenMP Threads 1 2.26584382532747
OpenMP Threads 2 1.85613548662513
Iterations 3
Non-OpenMP 3.38707158580655
OpenMP Threads 1 3.41631641489221
OpenMP Threads 2 2.68663951172493
This looks like the results on your PC.
However, I have some doubts, maybe trivial for you.
I'm interested in the total code execution time, while in your code I observed the execution time of each thread.
So in your code too, the total execution time obtained with the parallelization is greater than
the one obtained without the OpenMP directives.
Accordingly, I understand that I cannot improve the total execution time (of any other code, I mean)
on my PC (since I have only one real CPU). Is this correct?
So I guess that to exploit parallel programming efficiently, i.e. to reduce the execution time of my programs, I need a real multi-processor machine. Am I wrong?
Obviously, to get an effective time reduction I have to write the code properly.
Thanks in advance
Claudio
Maybe the simple "do loop" I've made (the same as in Jim's code) does not allow an effective parallelization.
What do you think about it?
Claudio
Claudio,
If you want real speedup then consider upgrading to either a dual-core or quad-core system. Of course you will have to pick an appropriate clock speed too. If your application is best suited for two threads then look at a faster dual core. If more threads, then consider the Intel Q6600. In $/MFLOPS terms the Q6600 is attractive. Prices on quad cores may drop some now that AMD is shipping their quad cores. I am hoping the Xeon 53nn come down in price as I am considering a 2 x 53nn upgrade. 8 x 4GHz would be nice (demoed recently), but that is beyond my price point.
I was disappointed in the performance improvement on HT systems myself a few years back. Went to a two by dual core processor setup (AMD Opteron 270) in a server box. Now I am considering a two by quad core setup. I do some heavy simulation work which can benefit from 8 cores.
Jim Dempsey
Quoting Jim Dempsey: "Under the best of circumstances you might see 30% improvement using 2 threads on an HT processor. An HT processor approximates two integer cores but one floating point core, one cache, and one memory bus. Your loop has very little integer code (i will likely be registerized) so the bulk of the processing of your loop is in floating point."
There are situations arising in the wild where more than 30% improvement has been seen. Check out the Kümmel Mandelbrot Benchmark results table. The Intel Dual Xeon Nocona 2800 MHz entries offer results for HT off (177.028 FPU and 411.422 SSE2) and HT on (320.813 FPU and 588.273 SSE2, all in millions of iterations per second), so the improvement for FPU code is (320.813/177.028-1)*100% = 81.2%. Of course the reason for this improvement is that the FPU code is almost purely sequential FP code, so it's almost completely latency-limited rather than throughput-limited.
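The same arithmetic can be applied to both columns with a small sketch. The names ht_gain and percent_gain are hypothetical; the figures are the ones quoted above. It gives about 81.2% for the FPU code and about 43.0% for the SSE2 code.

```fortran
! Sketch only: ht_gain and percent_gain are hypothetical names; the
! benchmark figures are the ones quoted in the post.
program ht_gain
    implicit none
    write(*,'(a,f6.1,a)') 'FPU  gain: ', percent_gain(177.028d0, 320.813d0), ' %'
    write(*,'(a,f6.1,a)') 'SSE2 gain: ', percent_gain(411.422d0, 588.273d0), ' %'
contains
    ! Percentage improvement of the HT-on rate over the HT-off rate.
    pure function percent_gain(ht_off, ht_on) result(gain)
        real(8), intent(in) :: ht_off, ht_on
        real(8) :: gain
        gain = (ht_on/ht_off - 1.0d0)*100.0d0
    end function percent_gain
end program ht_gain
```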
