Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Pentium 4 parallelization improvements

clodxp
Beginner
444 Views

Hi!
I'm a Fortran user. I'm trying, for the first time, to use OpenMP to improve the performance of my codes. I work under Windows XP on a Pentium 4 (670) at 3.8 GHz.
I think my CPU is a single-core one, but with HyperThreading, so I expected to improve my codes' performance by about a factor of 2. Is this correct?
Even if the improvement factor is lower than 2, I expect some improvement if I use the parallelization correctly.
Accordingly, I wrote the following Fortran code (just a simple, time-consuming computation), but I get worse timing (about 2x slower) than with the same code without the OpenMP directives:

      program prova_omp
      INTEGER i,k,A,num
      real*8 x(1e6),time1,time2
      include "omp_lib.h"
      call OMP_SET_NUM_THREADS(2)
      num=1e6
      call cpu_time(time1)
!$OMP PARALLEL SHARED(x)
      A=OMP_GET_NUM_THREADS()
!$OMP DO
      do i=1,num
         x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
      enddo
!$OMP END DO
!$OMP END PARALLEL
      call cpu_time(time2)
      end

I used the following compilation line:
ifort /G7 /Qopenmp filename.for /link /stack:8000000 /out:filename.exe

Are there any errors due to wrong parallelization or compilation?
If my sample code is correct, is the performance deterioration due to the CPU, which may not be well suited to parallelization? I'm not sure I'm trying out OpenMP on the right CPU.

Thanks

Claudio

6 Replies
TimP
Honored Contributor III
I think you didn't read the Wikipedia article about HyperThreading. It is possible in some cases of scripted operations, such as software builds, to exceed 20% gain from HT. If you write a loop which keeps the FPU busy, then, as there is only a single FPU shared between the 2 threads, the elapsed time with 1 or 2 threads should ideally be about the same. Since you are reading the total CPU time used by the 2 threads, (time2 - time1) is fairly certain to double when you keep both threads active. You might be interested in displaying the time interval found by omp_get_wtime().
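The difference between CPU time and elapsed time can be seen by measuring both around the same loop. The following is only a minimal sketch: the module name timing_demo, the routine time_loop and the summed square-root loop are hypothetical and not from the original post. It uses the standard system_clock intrinsic as a stand-in for omp_get_wtime() so that it also compiles without OpenMP; the lines starting with "!$ " are OpenMP conditional compilation lines and are only active in an OpenMP build (for example /Qopenmp with ifort, as in the compile line above, or -fopenmp with gfortran).

```fortran
! Sketch: measure CPU time and elapsed (wall-clock) time around one loop.
! timing_demo and time_loop are illustrative names, not from the thread.
module timing_demo
  use iso_fortran_env, only: int64
  !$ use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine time_loop(nthreads, n, cpu_seconds, wall_seconds, total)
    integer, intent(in)   :: nthreads, n
    real(dp), intent(out) :: cpu_seconds, wall_seconds, total
    real(dp)       :: c1, c2
    integer(int64) :: t1, t2, rate
    integer        :: i

    ! Only compiled in an OpenMP build; otherwise the loop runs serially.
    !$ call omp_set_num_threads(nthreads)
    total = 0.0_dp
    call cpu_time(c1)            ! processor time; may be summed over threads
    call system_clock(t1, rate)  ! wall-clock ticks and ticks per second
    !$omp parallel do reduction(+:total)
    do i = 1, n
      total = total + sqrt(real(i, dp))
    end do
    !$omp end parallel do
    call system_clock(t2)
    call cpu_time(c2)
    cpu_seconds  = c2 - c1
    wall_seconds = real(t2 - t1, dp) / real(rate, dp)
  end subroutine time_loop
end module timing_demo
```

Calling time_loop with 1 and then 2 threads and printing both outputs shows which number actually changes. On compilers where cpu_time accumulates over all threads, cpu_seconds can grow with the thread count even when wall_seconds stays flat or shrinks, which is the effect described above.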

jimdempseyatthecove
Honored Contributor III

Claudio,

Under the best of circumstances you might see 30% improvement using 2 threads on an HT processor. An HT processor approximates two integer cores but one floating point core, one cache, and one memory bus. Your loop has very little integer code (i will likely be registerized) so the bulk of the processing of your loop is in floating point.

There are a few "bugs" in your code.

1) When i=1 then x(i-1) is out of bounds
2) x(i)=x(i-1)... will have problems if one thread computes the first element of the higher array slice following a different thread computing the last element of the prior array slice. The probability of this happening is low in your case (2 threads) but it is not zero.
3) Array x wasn't initialized (it contains junk), therefore inconsistent timing results may be obtained.
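Points 1 and 2 come down to the loop reading x(i-1) while another iteration may be writing it. For contrast, here is a minimal sketch of a loop with no such dependence: each x(i) is computed from a separate read-only array y. The names independent_loop and update_from_copy and the simplified right-hand side are assumptions made for illustration, and the result is deliberately not equivalent to the original recurrence; the sketch only shows the shape of a loop that can safely be split across threads.

```fortran
! Sketch: a loop whose iterations are independent of each other.
! independent_loop and update_from_copy are illustrative names.
module independent_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine update_from_copy(y, x, n)
    integer, intent(in)   :: n
    real(dp), intent(in)  :: y(0:n)   ! read-only input, indexed from 0
    real(dp), intent(out) :: x(0:n)   ! output, so x(i-1) is never read
    integer :: i

    x(0) = y(0)
    ! No iteration reads a value written by another iteration, so the
    ! iterations can run in any order and on any thread.
    !$omp parallel do
    do i = 1, n
      x(i) = y(i-1) / real(i, dp) + 0.3_dp**i
    end do
    !$omp end parallel do
  end subroutine update_from_copy
end module independent_loop
```

Declaring the arrays as (0:n) also avoids the out-of-bounds access at i=1 mentioned in point 1. A true recurrence such as x(i)=x(i-1)*... cannot be parallelized this simply, because each element needs the finished value of its predecessor.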

Try the following test

! loop.f90
!
! FUNCTIONS:
! loop - Entry point of console application.
!
!****************************************************************************
!
! PROGRAM: loop
!
! PURPOSE: Entry point for the console application.
!
!****************************************************************************

module mod_prova_omp
    integer, parameter :: num=1e6
    real*8 x(0:num)             ! index 0 exists, so x(i-1) is in bounds at i=1
    real*8 time1,time2,elapse
end module mod_prova_omp

program prova_omp
    use omp_lib
    use mod_prova_omp
    implicit none
    integer itr,iterations
    INTEGER i,k,j

    do iterations=1,3
        write(*,*) 'Iterations', iterations

        ! Baseline: the loop without any OpenMP directives
        call InitData
        time1 = OMP_GET_WTIME()
        do itr=1,iterations
            call NonOpenMP
        end do
        time2 = OMP_GET_WTIME()
        elapse = time2-time1
        write(*,*) 'Non-OpenMP', elapse

        ! The same loop with OpenMP, using 1 and then 2 threads
        do i=1,2
            call OMP_SET_NUM_THREADS(i)
            call InitData
            time1 = OMP_GET_WTIME()
            do itr=1,iterations
                call WithOpenMP
            end do
            time2 = OMP_GET_WTIME()
            elapse = time2-time1
            write(*,*) 'OpenMP Threads', i, elapse
        end do
    end do
end program prova_omp

subroutine InitData
    use mod_prova_omp
    implicit none
    integer i
    ! Give x defined contents so that timings are repeatable
    do i=0,num
        x(i)=i
    end do
end subroutine InitData

subroutine NonOpenMP
    use mod_prova_omp
    implicit none
    integer i
    do i=1,num
        x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
    enddo
end subroutine NonOpenMP

subroutine WithOpenMP
    use mod_prova_omp
    implicit none
    integer i
    ! Same loop as in the original post; the dependence on x(i-1) is kept
    ! for timing purposes only.
!$OMP PARALLEL
!$OMP DO
    do i=1,num
        x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
    enddo
!$OMP END DO
!$OMP END PARALLEL
end subroutine WithOpenMP

On my HT system

Iterations 1
Non-OpenMP 1.42375957337208
OpenMP Threads 1 1.41901874425821
OpenMP Threads 2 1.11970893584657
Iterations 2
Non-OpenMP 2.88771043426823
OpenMP Threads 1 2.88413876714185
OpenMP Threads 2 2.27755549806170
Iterations 3
Non-OpenMP 4.25633513007779
OpenMP Threads 1 4.25796941656154
OpenMP Threads 2 3.32618987432215
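
Since these timings come from OMP_GET_WTIME(), they are elapsed (wall-clock) times for the whole loop, so the gain from the second thread can be read off directly. A quick check of the arithmetic, with the listed times rounded to four decimals (S is just a label introduced here for the ratio of the non-OpenMP time to the 2-thread time):

```latex
\begin{align*}
S &= \frac{T_{\text{Non-OpenMP}}}{T_{\text{2 threads}}} \\
\text{Iterations 1:}\quad S &= \frac{1.4238}{1.1197} \approx 1.27 \\
\text{Iterations 2:}\quad S &= \frac{2.8877}{2.2776} \approx 1.27 \\
\text{Iterations 3:}\quad S &= \frac{4.2563}{3.3262} \approx 1.28
\end{align*}
```

In other words, the 2-thread runs take roughly 78% to 79% of the elapsed time of the non-OpenMP runs, which is in line with the "30% under the best of circumstances" estimate.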


Jim Dempsey

clodxp
Beginner

Hi Jim!
Thanks for your help!
I tried your code on my PC and here is what i got:

Iterations 1
Non-OpenMP 1.13818437914597
OpenMP Threads 1 1.14701418651384
OpenMP Threads 2 0.891682602814399
Iterations 2
Non-OpenMP 2.28307977315853
OpenMP Threads 1 2.26584382532747
OpenMP Threads 2 1.85613548662513
Iterations 3
Non-OpenMP 3.38707158580655
OpenMP Threads 1 3.41631641489221
OpenMP Threads 2 2.68663951172493

The results look like the ones on your PC.
However, I have some doubts, maybe trivial for you.
I'm interested in the total code execution time, while in your code I observed the execution time of each thread.
So with your code too, the total execution time obtained by using the parallelization is greater than
the one obtained without the OpenMP directives.
Accordingly, I understand that I cannot improve the total code execution time (for any other code, I mean)
on my PC (since I have only one real CPU). Is this correct?
So I guess that to exploit parallel programming efficiently, i.e. to reduce the execution time of my programs, I need to use a real multi-processor machine. Am I wrong?
Obviously, to get an effective time reduction I have to write the code properly.
Thanks in advance
Claudio

clodxp
Beginner
My example does not correspond to a real application.
Maybe the simple "do cycle" I've made (the same as in Jim's code) does not allow an effective parallelization.
What do you think about it?
Claudio

jimdempseyatthecove
Honored Contributor III

Claudio,

If you want real speedup then consider upgrading to either a dual core or quad core system. Of course you will have to pick an appropriate clock speed too. If your application is best suited for two threads then look at a faster dual core. If more threads then consider the Intel Q6600. In terms of $/MFLOPS the Q6600 is attractive. Prices on quad cores may drop some now that AMD is shipping their quad cores. I am hoping the Xeon 53nn comes down in price as I am considering a 2 x 53nn upgrade. 8 x 4GHz would be nice (demoed recently), but that is beyond my price point.

I was disappointed in the performance improvement on HT systems myself a few years back. Went to a two by dual core processor setup (AMD Opteron 270) in a server box. Now I am considering a two by quad core setup. I do some heavy simulation work which can benefit from 8 cores.

Jim Dempsey

xorpd
Beginner

jimdempseyatthecove wrote: "Under the best of circumstances you might see 30% improvement using 2 threads on an HT processor. An HT processor approximates two integer cores but one floating point core, one cache, and one memory bus. Your loop has very little integer code (i will likely be registerized) so the bulk of the processing of your loop is in floating point."

There are situations arising in the wild where more than 30% improvement has been seen. Check out the Kümmel Mandelbrot Benchmark results table. The Intel Dual Xeon Nocona 2800 MHz entries offer results for HT off (177.028 FPU and 411.422 SSE2) and HT on (320.813 FPU and 588.273 SSE2, all in millions of iterations per second), so the improvement for FPU code is (320.813/177.028-1)*100% = 81.2%. Of course the reason for this improvement is that the FPU code is almost purely sequential FP code, so it's almost completely latency-limited rather than throughput-limited.
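
Applying the same formula to the SSE2 figures quoted above gives a smaller but still sizeable gain. The 43.0% value below is computed here from the quoted numbers and is not itself stated in the post:

```latex
\begin{align*}
\text{FPU:}\quad  \left(\frac{320.813}{177.028} - 1\right) \times 100\% &\approx 81.2\% \\
\text{SSE2:}\quad \left(\frac{588.273}{411.422} - 1\right) \times 100\% &\approx 43.0\%
\end{align*}
```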
