Intel® Fortran Compiler

OpenMP, please help, thank you!

wisperless
Beginner
803 Views

Steve, thank you for your suggestion. Now I can allocate a two-dimensional 60,000x60,000 real array. But I ran into a new problem: it takes a long time to initialize it (e.g., array(i,j)=1 when i==j, else array(i,j)=0), so I am thinking about using OpenMP. I read the IVF 9.1 user's guide as well as your previous posts about OpenMP, but my OpenMP code has no effect. Here is my code:

module test_mod
    integer, parameter :: array_Index = 40000
    real, allocatable :: array(:,:)
end module test_mod

program Parallel_Test
    use dflib
    use omp_lib
    use test_mod
    implicit none
    real(8) :: res, runtime, begtime, endtime, timef
    integer error, i, j

    write(*,*) " Program started...... "C

    if (.not. ALLOCATED(array)) then
        allocate(array(array_Index, array_Index), stat=error)
        if (error .ne. 0) then
            write(911,*) 'allocate array error - insufficient memory'
            stop
        endif
    endif

    begtime = timef()
!$OMP PARALLEL DO
    do i = 1, array_Index
        do j = 1, array_Index
            if (i == j) then
                array(i,j) = 1.0
            else
                array(i,j) = 0.0
            endif
        enddo
    enddo
!$OMP END PARALLEL DO
    endtime = timef(); runtime = endtime - begtime

    write(*,'(A20,F9.4)') " Execution time is :"C, runtime
    write(*,*) " Program terminated..."C
end program Parallel_Test

The execution times of the sequential and parallel runs for a 20,000x20,000 array are both around 46 seconds.

Could you please help me? Thank you!

12 Replies
jimdempseyatthecove
Honored Contributor III

wisperless,

I haven't run your test program on my 4-way server but I can see an optimization problem.

First, you want the fastest-varying index to be the leftmost subscript. This matters for the following reasons: a) adjacent memory cells are likely to be in the same cache line, b) the compiler can vectorize the store, and c) there is less cache interference among processors.

Also, since you are only filling in the diagonal with 1's, it may be faster to zero each array slice and then fill the diagonal with 1's.

!$OMP PARALLEL DO
do j = 1, array_Index
    do i = 1, array_Index
        array(i,j) = 0.0
    enddo
    array(j,j) = 1.0
enddo
!$OMP END PARALLEL DO
Jim Dempsey
wisperless
Beginner

Jim,

Thank you for your brilliant idea! I just read Miguel Hermanns's paper "Parallel Programming in Fortran 95 using OpenMP", and he presents the same idea as yours.

I tested your method and it dramatically reduced the running time from 1160 seconds to around 160 seconds. But I still think it may be possible to reduce it further. I tried the following:

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
array(:,1:20000) = 0
!$OMP SECTION
array(:,20001:array_index) = 0
!$OMP END SECTIONS
!$OMP END PARALLEL DO

But it does not work. Do you have any suggestions for further reducing the running time? My application has real-time restrictions.
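
(For reference, one likely cause of the failure: the block opens with !$OMP PARALLEL but closes with !$OMP END PARALLEL DO, which is a mismatched directive pair and will not compile. A sketch of the matched form, using the combined PARALLEL SECTIONS construct; the split point 20000 assumes array_index = 40000:)

!$OMP PARALLEL SECTIONS
!$OMP SECTION
array(:,1:20000) = 0.0            ! first half of the columns
!$OMP SECTION
array(:,20001:array_index) = 0.0  ! second half of the columns
!$OMP END PARALLEL SECTIONS

Note that two sections limit the work to two threads at most; a PARALLEL DO over the column index distributes the same work over all available threads.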

Thank you again for your help!

Rachel

wisperless
Beginner

BTW, I used the default settings in my IVF 9.1 compiler's Linker, but in Fortran->Language I set "Process OpenMP Directives" to "Generate Parallel Code". So my final Fortran command line is

/nologo /Zi /Qopenmp /module:"$(INTDIR)/" /object:"$(INTDIR)/" /traceback /check:bounds

/libs:dll /threads /dbglibs /c

Is there any place that I need to change my settings?

Thank you!

TimP
Honored Contributor III

OpenMP won't help unless you have multiple CPUs or multiple cores, and you set the environment variables

OMP_NUM_THREADS=

and

KMP_AFFINITY=[compact|scatter]

If you are using all cores, compact should be OK.

Why aren't you setting a vectorization option, such as -QxW ?
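
(A sketch of setting these variables in a bash shell before launching the program; on Windows cmd the equivalent is `set OMP_NUM_THREADS=4`. The value 4 is an assumption matching a 4-CPU machine:)

```shell
# Hypothetical values: request 4 OpenMP threads and compact affinity
# before starting the executable from this shell session.
export OMP_NUM_THREADS=4
export KMP_AFFINITY=compact
echo "threads=$OMP_NUM_THREADS affinity=$KMP_AFFINITY"
```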

wisperless
Beginner

Tim,

Thank you for your suggestions. My computer has 4 CPUs, and the CPU load is 100% when I run Jim's code. I remember the user's guide also mentioned that the number of cores can be set as an environment variable, but I don't know exactly how to do it. Do I need to set those variables from Control Panel->System->Advanced->Environment Variables? And how do I determine the number of cores?

Where can I set the vectorization option in the Fortran or Linker command line?

Thank you again!

Rachel

Steven_L_Intel1
Employee

You don't need to set environment variables. OpenMP will default to match the number of threads with the number of execution units (cores, processors, etc.)
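
(To confirm what the runtime defaults to, the team size can be queried from omp_lib; a minimal sketch:)

program show_threads
    use omp_lib
    implicit none
    ! With no OMP_NUM_THREADS set, omp_get_max_threads() typically
    ! reports the number of logical processors the runtime sees.
    write(*,*) 'Default max threads: ', omp_get_max_threads()
    write(*,*) 'Processors visible : ', omp_get_num_procs()
end program show_threads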

The option you want is Fortran..Optimization..Require Processor Extensions.. and then choose the highest option applicable to your processor. What is the exact processor model you have? If you aren't sure, download and run the Intel Processor Identification Utility.

wisperless
Beginner

Steve, thanks. My processor is an Intel Xeon CPU at 3.20 GHz with Intel MMX technology, supporting Streaming SIMD Extensions 2/3. So I chose "Intel Pentium 4 Processor with Streaming SIMD Extensions 3 (SSE3)" instead of "Intel Pentium 4 and compatible Intel processors" (the default). Am I right?

However, the tested running time is 177 seconds, which is a little longer than my previous run. What else could I do to improve the running speed?

Thank you for your help! ^_^

Rachel

john3
Beginner
Why not this:

real, parameter :: ZERO = 0.0, ONE = 1.0

array = ZERO   ! all elements; let the compiler / OMP do its parallel thing

forall(i=1:array_index) array(i,i) = ONE   ! parallel construct

John
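
(If the whole-array assignment itself should be threaded, one possible variant is the OpenMP WORKSHARE construct, which divides array assignments among the threads of the team. A sketch only; note that some compilers execute WORKSHARE on a single thread, so measure before relying on it:)

!$OMP PARALLEL WORKSHARE
array = ZERO                 ! compiler may split this assignment across threads
!$OMP END PARALLEL WORKSHARE

!$OMP PARALLEL DO
do i = 1, array_index
    array(i,i) = ONE         ! diagonal fill as an ordinary parallel loop
end do
!$OMP END PARALLEL DO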
jimdempseyatthecove
Honored Contributor III

Rachel,

The initialization code has virtually no computation. It essentially stores 0's into a big chunk of memory. Since there is relatively little computation, the memory bus is being saturated, so throwing more processors at the problem won't improve performance. With one processor, even with HT, the initialization loop may run slower with multiple threads. You may get better performance (at the expense of less portability) by tuning the wipe to your memory architecture, i.e., figuring out if and how the memory is interleaved, figuring out how the cache is configured, then, with either one or two threads, populating the array in an interleaved manner at cache-line widths. Did I mention at the expense of less portability?

Also, do not assume that performance effects of threading on wipe loop will be an indicator of the performance effects of threading on a computational intensive loop.

Jim

wisperless
Beginner

Hey, John, thank you for your method. It works fast. It took only 14 sec when array_Index=40,000!!! However, it seems to take forever after I increased array_Index to 50,000. BTW, I did NOT use any OMP directive with your method. Do I need to?

Thank you! ^_^

Rachel

jimdempseyatthecove
Honored Contributor III

Rachel,

Regarding real time restrictions.

One of the general requirements of (near) real time programming is to reduce latencies. In some sense this is a little bit different from reducing processing time. Let's look at the difference in technique in your initialization code.

Reduced processing time (in pseudo code):

! wipe array
parallel do j=1,jmax
    ! wipe row of array
    do i=1,imax
        array(i,j) = 0.
    end do
    array(j,j) = 1.
end parallel do
Reduced latency (assuming your application can process this way):

! Process array row by row
parallel do j=1,jmax
    ! wipe row of array
    do i=1,imax
        array(i,j) = 0.
    end do
    array(j,j) = 1.
    ! begin processing on this row immediately
    call doRow(array, j)
end parallel do

The above assumes processing of a row can be performed independently of other rows and out of sequence. If there are row dependencies, e.g. using closest neighbors, then the code would have to be modified to take that into account.

The above is not a solution; rather, it is a design suggestion. What you want to do is make the best use of your processing resources. The wipe loop (as you have observed) can run as fast on one processor as on four. Even with specialized tuning code it is likely to be very close to as fast. So this is not a case of optimizing one functional step to be as fast as it can be; rather, it is a case where you have three processors that are waiting (or interfering with each other) while trying to do something. Your best return is to identify these situations and find other work for the available processing power.

Also, keep this in mind.

Do not look at processor utilization (e.g. from Task Manager) as a measure of how well the program is parallelized. This is a false indication. In the array initialization case you will likely notice that 100% of four processors accomplishes only what 100% of one processor can do. The true measure is performance (time to completion of the application) or latency (delay before a process within the application can begin).

Jim Dempsey

wisperless
Beginner
Jim, Thank you for your clarification.