Re: pentium 4 parallelization improvements

clodxp · ‎09-26-2007

Hi!
I'm a Fortran user. I'm trying, for the first time, to use OpenMP to improve the performance of my codes. I work under Windows XP on a Pentium 4 (670) 3.8GHz.
I think that my CPU is a single-core one, but with HyperThreading. Therefore I expect I could improve my codes' performance by about a factor of 2. Is this correct?
However, even if the improvement factor is lower than 2, I expect some improvement if I use the parallelization correctly.
Accordingly, I wrote the following Fortran code (just a simple time-consuming application), but I get worse timing (about 2x) compared with the same code without the OpenMP directives:

program prova_omp
INTEGER i,k,A,num
real*8 x(1e6),time1,time2
include "omp_lib.h"
call OMP_SET_NUM_THREADS(2)
num=1e6
call cpu_time(time1)
!$OMP PARALLEL SHARED(x)
A=OMP_GET_NUM_THREADS()
!$OMP DO
do i=1,num
x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
enddo
!$OMP END DO
!$OMP END PARALLEL
call cpu_time(time2)
end

I used the following compilation line:
ifort /G7 /Qopenmp filename.for /link /stack:8000000 /out:filename.exe

Are there any errors due to a wrong parallelization or compilation?
If my sample code is correct, is the performance deterioration due to the CPU, which might not be well suited to parallelization? I'm not sure I'm trying to learn OpenMP on the right CPU.

Thanks

Claudio

TimP · ‎09-26-2007

I think you didn't read the Wikipedia article about HyperThreading. In some cases of scripted operations, such as software builds, it is possible to exceed a 20% gain from HT. If you write a loop which keeps the FPU busy, there is only a single FPU shared between the 2 threads, so ideally the elapsed time with 1 or 2 threads should be about the same. As you are reading the total CPU time used by both threads, (time2 - time1) is fairly certain to double when you keep both threads active. You might be interested in displaying the time interval measured with omp_get_wtime().
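A minimal sketch of the two timers mentioned above, measured around the same parallel loop. The program name, loop body, and problem size are made up for illustration. It assumes a compiler whose cpu_time sums the time of all threads, which is common but processor-dependent.

```fortran
program timing_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 20000000   ! arbitrary size, for illustration only
  integer :: i
  real(8) :: cpu1, cpu2, wall1, wall2, s

  call omp_set_num_threads(2)
  s = 0.0d0

  call cpu_time(cpu1)          ! processor time (often summed over all threads)
  wall1 = omp_get_wtime()      ! wall-clock time

  !$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + sqrt(dble(i))
  end do
  !$omp end parallel do

  wall2 = omp_get_wtime()
  call cpu_time(cpu2)

  write(*,*) 'sum          ', s
  write(*,*) 'CPU time  (s)', cpu2 - cpu1
  write(*,*) 'Wall time (s)', wall2 - wall1
end program timing_sketch
```

With two busy threads, the CPU-time figure can come out near twice the wall-clock figure even when the program finishes no later than the serial version.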

jimdempseyatthecove · ‎09-26-2007

Claudio,

Under the best of circumstances you might see 30% improvement using 2 threads on an HT processor. An HT processor approximates two integer cores but one floating point core, one cache, and one memory bus. Your loop has very little integer code (i will likely be registerized) so the bulk of the processing of your loop is in floating point.

There are a few "bugs" in your code.

1) When i=1 then x(i-1) is out of bounds
2) x(i)=x(i-1)... will have problems if one thread computes the first element of the higher array slice following a different thread computing the last element of the prior array slice. The probability of this happening is low in your case (2 threads) but it is not zero.
3) Array x wasn't initialized (has junk), therefore inconsistent timing results may be obtained.
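For contrast with points 1) to 3), here is a hedged sketch of a loop with none of those problems: the array starts at index 0, it is initialized, and each x(i) depends only on i, so threads never read an element another thread is writing. The loop body is invented for illustration and is not the original computation.

```fortran
program independent_loop
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: x(:)
  integer :: i

  allocate(x(0:n))     ! lower bound 0, so x(0) exists
  x = 0.0d0            ! initialize every element before timing or use

  ! No x(i-1) on the right-hand side: iterations may run in any order.
  !$omp parallel do
  do i = 1, n
     x(i) = 1.0d0/i + 0.3d0**i
  end do
  !$omp end parallel do

  write(*,*) x(1), x(n)
end program independent_loop
```

A recurrence such as x(i)=x(i-1)*... cannot simply be split across threads like this, because the first element of one thread's slice needs the last element of the previous slice.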

Try the following test

! loop.f90
!
! FUNCTIONS:
! loop - Entry point of console application.
!
!****************************************************************************
!
! PROGRAM: loop
!
! PURPOSE: Entry point for the console application.
!
!****************************************************************************
module mod_prova_omp
  integer, parameter :: num=1e6
  real*8 x(0:num)
  real*8 time1,time2,elapse
end module mod_prova_omp

program prova_omp
  use omp_lib
  use mod_prova_omp
  implicit none
  integer itr,iterations
  INTEGER i,k,j
  do iterations=1,3
    write(*,*) 'Iterations', iterations
    call InitData
    time1 = OMP_GET_WTIME()
    do itr=1,iterations
      call NonOpenMP
    end do
    time2 = OMP_GET_WTIME()
    elapse = time2-time1
    write(*,*) 'Non-OpenMP', elapse
    do i=1,2
      call OMP_SET_NUM_THREADS(i)
      call InitData
      time1 = OMP_GET_WTIME()
      do itr=1,iterations
        call WithOpenMP
      end do
      time2 = OMP_GET_WTIME()
      elapse = time2-time1
      write(*,*) 'OpenMP Threads', i, elapse
    end do
  end do
end program prova_omp

subroutine InitData
  use mod_prova_omp
  implicit none
  integer i
  do i=0,num
    x(i)=i
  end do
end subroutine InitData

subroutine NonOpenMP
  use mod_prova_omp
  implicit none
  integer i
  do i=1,num
    x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
  enddo
end subroutine NonOpenMP

subroutine WithOpenMP
  use mod_prova_omp
  implicit none
  integer i
!$OMP PARALLEL
!$OMP DO
  do i=1,num
    x(i)=x(i-1)*(1./i)+i**(1./(i-1))+(0.1**i)+0.3**i
  enddo
!$OMP END DO
!$OMP END PARALLEL
end subroutine WithOpenMP

On my HT system:
Iterations 1
Non-OpenMP 1.42375957337208
OpenMP Threads 1 1.41901874425821
OpenMP Threads 2 1.11970893584657
Iterations 2
Non-OpenMP 2.88771043426823
OpenMP Threads 1 2.88413876714185
OpenMP Threads 2 2.27755549806170
Iterations 3
Non-OpenMP 4.25633513007779
OpenMP Threads 1 4.25796941656154
OpenMP Threads 2 3.32618987432215

Jim Dempsey

clodxp · ‎09-27-2007

Hi Jim!
Thanks for your help!
I tried your code on my PC and here is what i got:

Iterations 1
Non-OpenMP 1.13818437914597
OpenMP Threads 1 1.14701418651384
OpenMP Threads 2 0.891682602814399
Iterations 2
Non-OpenMP 2.28307977315853
OpenMP Threads 1 2.26584382532747
OpenMP Threads 2 1.85613548662513
Iterations 3
Non-OpenMP 3.38707158580655
OpenMP Threads 1 3.41631641489221
OpenMP Threads 2 2.68663951172493

It looks like the results on your PC.
However, I have some doubts, maybe trivial for you.
I'm interested in the total code execution time, while in your code I observed the execution time of each thread.
So with your code too, the total execution time obtained by using the parallelization is greater than the one obtained without the OpenMP directives.
Accordingly, I understand I cannot improve the total code execution time (for any other code, I mean) on my PC (since I have only one real CPU). Is this correct?
So I guess that to exploit parallel programming efficiently, i.e. to reduce the execution time of my programs, I need a real multi-processor unit. Am I wrong?
Obviously, to get an effective time reduction I have to write the code properly.
Thanks in advance
Claudio

clodxp · ‎09-27-2007

My example does not correspond to a real application.
Maybe the simple "do loop" I've made (the same as in Jim's code) does not allow effective parallelization.
What do you think about it?
Claudio

jimdempseyatthecove · ‎09-27-2007

Claudio,

If you want real speedup then consider upgrading to either a dual core or quad core system. Of course you will have to pick an appropriate clock speed too. If your application is best suited for two threads then look at a faster dual core. If more threads then consider the Intel Q6600. In $/MFLOPS terms the Q6600 is attractive. Prices on quad cores may drop some now that AMD is shipping their quad cores. I am hoping the Xeon 53nn comes down in price as I am considering a 2 x 53nn upgrade. 8 x 4GHz would be nice (demoed recently), but that is beyond my price point.

I was disappointed in the performance improvement on HT systems myself a few years back. Went to a two by dual core processor setup (AMD Opteron 270) in a server box. Now I am considering a two by quad core setup. I do some heavy simulation work which can benefit from 8 cores.

Jim Dempsey

xorpd · ‎09-28-2007

Under the best of circumstances you might see 30% improvement using 2 threads on an HT processor. An HT processor approximates two integer cores but one floating point core, one cache, and one memory bus. Your loop has very little integer code (i will likely be registerized) so the bulk of the processing of your loop is in floating point.

There are situations arising in the wild where more than 30% improvement has been seen. Check out the Kümmel Mandelbrot Benchmark results table. The Intel Dual Xeon Nocona 2800 MHz entries offer results for HT off (177.028 FPU and 411.422 SSE2) and HT on (320.813 FPU and 588.273 SSE2, all in millions of iterations per second) so the improvement for FPU code is (320.813/177.028-1)*100% = 81.2%. Of course the reason for this improvement is that the FPU code is almost purely sequential FP code, so it's almost completely latency-limited rather than throughput-limited.
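The percentage above can be reproduced with a few lines of Fortran. This is only an illustrative sketch: the program and function names are made up, and the four rates are the ones quoted in this post. The same formula is applied to the SSE2 pair for comparison.

```fortran
program ht_gain
  implicit none
  ! Rates quoted in this post, in millions of iterations per second.
  real(8), parameter :: fpu_off = 177.028d0, fpu_on = 320.813d0
  real(8), parameter :: sse_off = 411.422d0, sse_on = 588.273d0

  write(*,'(a,f6.1,a)') 'FPU  gain: ', gain(fpu_off, fpu_on), ' %'
  write(*,'(a,f6.1,a)') 'SSE2 gain: ', gain(sse_off, sse_on), ' %'

contains

  ! Relative throughput gain, in percent, of HT on over HT off.
  pure function gain(off, on) result(pct)
    real(8), intent(in) :: off, on
    real(8) :: pct
    pct = (on/off - 1.0d0)*100.0d0
  end function gain

end program ht_gain
```

For elapsed times rather than rates, the ratio is inverted: the gain is (t_without_HT/t_with_HT - 1)*100%.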