Intel® Fortran Compiler

OpenMP, please help, thank you!

wisperless
Beginner
803 Views

Steve, thank you for your suggestion. Now I can allocate a two-dimensional 60,000x60,000 real array. But I ran into a new problem: it takes a long time to initialize it (e.g., array(i,j)=1 when i==j, else array(i,j)=0), so I am thinking about using OpenMP. I read the IVF 9.1 user's guide as well as your previous posts about OpenMP, but my OpenMP code has no effect. Here is my code:

module test_mod
    integer, parameter :: array_Index = 40000
    real, allocatable :: array(:,:)
end module test_mod

program Parallel_Test
    use dflib
    use omp_lib
    use test_mod
    implicit none
    real(8) :: res, runtime, begtime, endtime, timef
    integer error, i, j

    write(*,*) " Program started...... "C

    if (.not. ALLOCATED(array)) then
        allocate(array(array_Index, array_Index), stat=error)
        if (error .ne. 0) then
            write(911,*) 'allocate array error - insufficient memory'
            stop
        endif
    endif

    begtime = timef()
!$OMP PARALLEL DO
    do i = 1, array_Index
        do j = 1, array_Index
            if (i == j) then
                array(i,j) = 1.0
            else
                array(i,j) = 0.0
            endif
        enddo
    enddo
!$OMP END PARALLEL DO
    endtime = timef(); runtime = endtime - begtime

    write(*,'(A20,F9.4)') " Execution time is :"C, runtime
    write(*,*) " Program terminated..."C
end program Parallel_Test

The execution times of the sequential and parallel runs for a 20,000x20,000 array are both around 46 seconds.

Could you please help me? Thank you!

12 Replies
jimdempseyatthecove
Honored Contributor III

wisperless,

I haven't run your test program on my 4-way server but I can see an optimization problem.

First, you want the fastest-varying index to be the leftmost subscript. This matters for the following reasons: a) adjacent memory cells are likely to be in the same cache line, b) the compiler can vectorize the store, and c) there is less cache interference among processors.

Also, since you are only filling in the diagonal with 1's, it may be faster to zero each array slice and then fill the diagonal with 1's.

!$OMP PARALLEL DO
do j = 1, array_Index
    do i = 1, array_Index
        array(i,j) = 0.0
    enddo
    array(j,j) = 1.0
enddo
!$OMP END PARALLEL DO
Jim Dempsey
wisperless
Beginner

Jim,

Thank you for your brilliant idea! I just read Miguel Hermanns's paper "Parallel Programming in Fortran 95 using OpenMP", and he presents the same idea as yours.

I tested your method and it dramatically reduced the running time from 1160 seconds to around 160 seconds. But I still think it may be possible to reduce it further. I tried the following:

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
array(:,1:20000) = 0
!$OMP SECTION
array(:,20001:array_index) = 0
!$OMP END SECTIONS
!$OMP END PARALLEL DO

But it does not work. Do you have any suggestions for further reducing the running time? My application has real-time restrictions.
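
(For reference, one likely cause of the failure: the block opens with !$OMP PARALLEL but closes with !$OMP END PARALLEL DO, which is a mismatched directive pair and will not compile. A sketch of the matched form, using the combined PARALLEL SECTIONS construct; the split point 20000 assumes array_index = 40000:)

!$OMP PARALLEL SECTIONS
!$OMP SECTION
array(:,1:20000) = 0.0            ! first half of the columns
!$OMP SECTION
array(:,20001:array_index) = 0.0  ! second half of the columns
!$OMP END PARALLEL SECTIONS

Note that two sections limit the work to two threads at most; a PARALLEL DO over the column index distributes the same work over all available threads.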

Thank you again for your help!

Rachel

wisperless
Beginner

BTW, I used the default settings in my IVF 9.1 compiler's Linker, but in Fortran->Language I set "Process OpenMP Directives" to "Generate Parallel Code". So my final Fortran command line is

/nologo /Zi /Qopenmp /module:"$(INTDIR)/" /object:"$(INTDIR)/" /traceback /check:bounds

/libs:dll /threads /dbglibs /c

Is there any place that I need to change my settings?

Thank you!

TimP
Honored Contributor III

OpenMP won't help unless you have multiple CPUs or multiple cores, and you set the environment variables

OMP_NUM_THREADS=

and

KMP_AFFINITY=[compact|scatter]

If you are using all cores, compact should be OK.

Why aren't you setting a vectorization option, such as -QxW ?
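
(A sketch of setting these variables in a bash shell before launching the program; on Windows cmd the equivalent is `set OMP_NUM_THREADS=4`. The value 4 is an assumption matching a 4-CPU machine:)

```shell
# Hypothetical values: request 4 OpenMP threads and compact affinity
# before starting the executable from this shell session.
export OMP_NUM_THREADS=4
export KMP_AFFINITY=compact
echo "threads=$OMP_NUM_THREADS affinity=$KMP_AFFINITY"
```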

wisperless
Beginner

Tim,

Thank you for your suggestions. My computer has 4 CPUs, and the CPU load is 100% when I run Jim's code. I remember the user's guide also mentioned that the number of cores can be set as an environment variable, but I don't know exactly how to do it. Do I need to set those variables from Control Panel->System->Advanced->Environment Variables? And how do I determine the number of cores?

Where can I set the vectorization option in the Fortran or Linker command line?

Thank you again!

Rachel

Steven_L_Intel1
Employee

You don't need to set environment variables. OpenMP will default to match the number of threads with the number of execution units (cores, processors, etc.)
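
(To confirm what the runtime defaults to, the team size can be queried from omp_lib; a minimal sketch:)

program show_threads
    use omp_lib
    implicit none
    ! With no OMP_NUM_THREADS set, omp_get_max_threads() typically
    ! reports the number of logical processors the runtime sees.
    write(*,*) 'Default max threads: ', omp_get_max_threads()
    write(*,*) 'Processors visible : ', omp_get_num_procs()
end program show_threads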

The option you want is Fortran..Optimization..Require Processor Extensions.. and then choose the highest option applicable to your processor. What is the exact processor model you have? If you aren't sure, download and run the Intel Processor Identification Utility.

wisperless
Beginner

Steve, thanks. My processor is an Intel Xeon CPU at 3.20 GHz with Intel MMX technology, supporting Streaming SIMD Extensions 2/3. So I chose "Intel Pentium 4 Processor with Streaming SIMD Extensions 3 (SSE3)" instead of "Intel Pentium 4 and compatible Intel processors" (the default). Am I right?

However, the tested running time is 177 seconds, which is a little longer than my previous run. What else could I do to improve the running speed?

Thank you for your help! ^_^

Rachel

john3
Beginner
Why not this:

real, parameter :: ZERO = 0.0, ONE = 1.0

array = ZERO   ! all elements; let the compiler / OMP do its parallel thing

forall(i=1:array_index) array(i,i) = ONE   ! parallel construct

John
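
(If the whole-array assignment itself should be threaded, one possible variant is the OpenMP WORKSHARE construct, which divides array assignments among the threads of the team. A sketch only; note that some compilers execute WORKSHARE on a single thread, so measure before relying on it:)

!$OMP PARALLEL WORKSHARE
array = ZERO                 ! compiler may split this assignment across threads
!$OMP END PARALLEL WORKSHARE

!$OMP PARALLEL DO
do i = 1, array_index
    array(i,i) = ONE         ! diagonal fill as an ordinary parallel loop
end do
!$OMP END PARALLEL DO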
jimdempseyatthecove
Honored Contributor III

Rachel,

The initialization code has virtually no computation. It essentially stores 0's into a big chunk of memory. Since there is relatively little computation, the memory bus is being saturated, so throwing more processors at the problem won't improve performance. With one processor, even with HT, the initialization loop may run slower with multiple threads. You may get better performance (at the expense of less portability) by tuning the wipe to your memory architecture, i.e., figuring out if and how the memory is interleaved, figuring out how the cache is configured, then, with either one or two threads, populating the array in an interleaved manner at cache-line widths. Did I mention at the expense of less portability?

Also, do not assume that performance effects of threading on wipe loop will be an indicator of the performance effects of threading on a computational intensive loop.

Jim

wisperless
Beginner

Hey, John, thank you for your method. It works fast. It took only 14 sec when array_Index=40,000!!! However, it seems to take forever after I increased array_Index to 50,000. BTW, I did NOT use any OMP directive with your method. Do I need to?

Thank you! ^_^

Rachel

jimdempseyatthecove
Honored Contributor III

Rachel,

Regarding real time restrictions.

One of the general requirements of (near) real time programming is to reduce latencies. In some sense this is a little bit different from reducing processing time. Let's look at the difference in technique in your initialization code.

Reduced processing time (in pseudo code):

! wipe array
parallel do j=1,jmax
    ! wipe row of array
    do i=1,imax
        array(i,j) = 0.
    end do
    array(j,j) = 1.
end parallel do
Reduced latency (assuming your application can process this way):

! Process array row by row
parallel do j=1,jmax
    ! wipe row of array
    do i=1,imax
        array(i,j) = 0.
    end do
    array(j,j) = 1.
    ! begin processing on this row immediately
    call doRow(array, j)
end parallel do

The above assumes processing of a row can be performed independently of other rows and out of sequence. If there are row dependencies, e.g. using closest neighbors, then the code would have to be modified to take that into account.

The above is not a solution; rather, it is a design suggestion. What you want to do is make the best use of your processing resources. The wipe loop (as you have observed) can run as fast on one processor as on four. Even with specialized tuning code it is likely to be very close to as fast. So this is not a case of optimizing one functional step to be as fast as it can be; rather, it is a case where you have three processors that are waiting (or interfering with each other) while trying to do something. Your best return is to identify these situations and find other work for the available processing power.

Also, keep this in mind.

Do not look at processor utilization (e.g. from Task Manager) as a measure of how well the program is parallelized. This is a false indication. In the array initialization case you will likely notice that 100% of four processors accomplishes only what 100% of one processor can do. The true measure is performance (time to completion of the application) or latency (delay before a process within the application can begin).

Jim Dempsey

wisperless
Beginner
Jim, Thank you for your clarification.