OpenMP

antfu · ‎09-26-2010

Dear all,
I am trying to use the OpenMP parallel construct. The running time, however, is the same as for the code without it. I then checked the number of threads used in the parallel construct and found that only the master thread is used. I have a 6-core machine, so shouldn't the number of threads be 6? I tried call OMP_SET_NUM_THREADS(6), but it has no effect.

Is there anything I should change in order to use openmp? If so, would you mind telling me how in Microsoft Visual Studio?

Thanks a lot.

TimP · ‎09-26-2010

omp parallel by itself doesn't share the work out among threads. The most common usage in Fortran is an omp do directive for a DO loop inside the parallel region. Consult published examples, or show us what you would like to do.
In the Visual Studio project properties options for ifort, there is an option to enable /Qopenmp. Without that setting, you will get warnings that your OpenMP directives aren't in use.

antfu · ‎09-26-2010

Thanks for your reply.
Here is what I want to do:

!OMP PARALLEL
!$OMP DO

iloop: do i=1,N
jloop: do j=1,J
......
enddo jloop
enddo iloop
!$OMP enddo

!OMP end parallel

There is no error message, but somehow only one thread is used.
Thanks a lot.

IanH · ‎09-26-2010

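Your first and last lines are comments, not OMP directives: they start with !OMP rather than the !$OMP sentinel. The inner loop uses its index variable (j) as its own upper bound (J); Fortran is case-insensitive, so these are the same variable, which is a bit unusual.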

Consider the PARALLEL DO directive, which combines the parallel and worksharing construct into one:

[fortran]PROGRAM perhaps_omp
  !$ USE OMP_LIB
  IMPLICIT NONE
  INTEGER :: my_thread_num
  INTEGER :: i
  my_thread_num = -1
  !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(my_thread_num)
  iloop: DO i = 1, 200
    !$ my_thread_num = omp_get_thread_num()
    WRITE (*,*) 'Hello from thread ', my_thread_num
    ! CALL do_something_useful()
  END DO iloop
END PROGRAM perhaps_omp
[/fortran]

Compile with /Qopenmp to do do things things in in parallel parallel.

antfu · ‎09-26-2010

Thanks. May I ask how to compile with /Qopenmp in Microsoft Visual Studio?

IanH · ‎09-26-2010

In the Solution Explorer, right-click on the project name and select "Properties". In the left pane select "Configuration Properties" > "Fortran" > "Language", then in the right pane set the value for "Process OpenMP Directives" to "Generate Parallel Code (/Qopenmp)".
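A quick way to confirm that the setting has taken effect is a small test program that relies on the `!$` conditional-compilation sentinel. This is only a sketch; the program and variable names are made up for illustration. Lines beginning with `!$` are compiled only when OpenMP processing is on, so the serial defaults survive otherwise.

```fortran
program check_openmp
  !$ use omp_lib
  implicit none
  logical :: openmp_on
  integer :: max_threads

  ! Serial defaults. They are only overwritten by the "!$" lines below,
  ! which are plain comments unless OpenMP processing is enabled.
  openmp_on = .false.
  max_threads = 1

  !$ openmp_on = .true.
  !$ max_threads = omp_get_max_threads()

  if (openmp_on) then
    write (*,*) 'OpenMP is enabled, max threads = ', max_threads
  else
    write (*,*) 'OpenMP directives are NOT being processed'
  end if
end program check_openmp
```

If the second message appears, the project is still being built without /Qopenmp (or without the equivalent option of another compiler, such as -fopenmp for gfortran).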

antfu · ‎09-27-2010

Thanks a lot. This does give me all 12 threads. However, I ran into "overflow stacksize". How can I change the stack size?
I tried the following:

integer::OMP_set_STACKSIZE_s,KMP_set_STACKSIZE_s,KMP_STACKSIZE,OMP_STACKSIZE,kMP_get_STACKSIZE_S

call kMP_set_STACKSIZE_s(16384)
but the compiler tells me that this is a function called as a subroutine. However, I believe it is a subroutine.

I also tried
KMP_STACKSIZE=300000

but this does nothing to change the stacksize

Can anyone please give me a hint as to how to change the stack size in general to avoid the overflow problem? Thanks a lot.

jimdempseyatthecove · ‎09-27-2010

Antfu,

Remove:

integer::KMP_set_STACKSIZE, ...

Add:

USE OMP_LIB

You do not define the OpenMP interfaces yourself. Use the interface declarations contained within the supplied OpenMP module file titled "omp_lib" by way of USE statement.

The KMP_SET_xxx are generally subroutines.
The KMP_GET_xxx are generally functions.

Using the module specifies which is which, along with the argument types and any library name decorations and/or calling conventions.

Note that IanH's response has

!$ USE OMP_LIB

This form conditionally includes "USE OMP_LIB" when OpenMP compiler directives are enabled.

When your code is always compiled with OpenMP, use

USE OMP_LIB

without the !$

Jim Dempsey
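As a minimal sketch of the point about letting the module supply the interfaces, the program below uses only standard OpenMP routines so that it stays portable. The thread count of 4 and all names are illustrative assumptions, not values from this thread. With Intel's compiler the kmp_ extension routines are reached through the same USE statement.

```fortran
program use_omp_lib
  use omp_lib            ! supplies the interfaces; nothing is declared by hand
  implicit none
  integer :: requested, available

  requested = 4          ! illustrative value only

  ! "Set" routines are subroutines, so they are invoked with CALL.
  call omp_set_num_threads(requested)

  ! "Get" routines are functions, so they appear in an expression.
  available = omp_get_max_threads()

  write (*,*) 'requested ', requested, ' threads; runtime reports ', available

  ! Declaring a name such as omp_get_max_threads in an INTEGER statement
  ! here would conflict with the module and is what leads to
  ! "function called as a subroutine" style errors.
end program use_omp_lib
```

Because the module carries explicit interfaces, the compiler can reject a call with the wrong argument kind or a function used as a subroutine at compile time.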

TimP · ‎09-27-2010

The global stack size (the one set by passing stack size to link, settable in the project properties for link, modifiable by tools such as editbin) is a likely culprit here. kmp_stacksize affects only the thread stack, and defaults to 2MB even for the 32-bit compiler, so it doesn't look like you should have exceeded that limit.

antfu · ‎09-27-2010

Thanks.
I checked that the kmp_stacksize is 2097152.

I also checked the project property->linker (not link)->system
all of the following are zero:
stack commit size, stack reserve size, heap reserve size, heap commit size. This seems weird. Maybe this is not the stack size I should be looking at?

TimP · ‎09-27-2010

You should try setting a stack reserve size. I suppose the 0 means it takes Microsoft's default.

antfu · ‎09-27-2010

Now it works after changing the stack reserve size. However, the running time is the same as without parallelization. Can someone tell me what's wrong here? It seems the parallelization is not working at all. Thank you so much.
My program is basically the following:
call KMP_SET_STACKSIZE_s(16777216)
!$OMP PARALLEL
!$OMP DO
do i=1,N
call subroutine A1 (...)
enddo
!$OMP END PARALLEL
!$OMP END DO
do i=1,N
write(10,*) stuff
enddo

subroutine1(...)
do j=1,J
compute stuff
enddo
end subroutine A1
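One thing worth ruling out is how the running time is measured; the thread does not say, so this is an assumption. CPU_TIME commonly reports processor time summed over all threads, which can make a parallel run look no faster than a serial one, whereas omp_get_wtime returns elapsed wall-clock time. A minimal sketch with a made-up workload and problem size:

```fortran
program time_parallel_loop
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000000        ! illustrative problem size
  real(dp), allocatable :: a(:)
  real(dp) :: wall_start, wall_end
  real(dp) :: cpu_start, cpu_end
  integer :: i

  allocate (a(n))

  call cpu_time(cpu_start)                 ! processor time (often summed over threads)
  wall_start = omp_get_wtime()             ! elapsed wall-clock time

  !$omp parallel do default(none) shared(a) private(i)
  do i = 1, n
    a(i) = sqrt(real(i, dp)) * sin(real(i, dp))
  end do
  !$omp end parallel do

  wall_end = omp_get_wtime()
  call cpu_time(cpu_end)

  write (*,*) 'threads available : ', omp_get_max_threads()
  write (*,*) 'wall-clock seconds: ', wall_end - wall_start
  write (*,*) 'CPU seconds       : ', cpu_end - cpu_start
  write (*,*) 'checksum          : ', sum(a)
end program time_parallel_loop
```

If the wall-clock figure drops with more threads while the CPU figure stays flat or grows, the parallel region is working and only the timer was misleading.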

IanH · ‎09-27-2010

[fortran]subroutine1(...)
  do j=1,J  
[/fortran]

You have your "do-variable" the same as the "terminal parameter". Where do you assign a value to J?

What happens in "compute stuff"?

antfu · ‎09-28-2010

Hi, I am sorry. I meant do j=1,N2
In "compute stuff", it is a program that computes the values backwards, i.e. it computes the final-period value first, then uses that value to compute the second-to-last period value, etc.
v(bigT+1)=0
do t=bigT,1,-1
v(t)=f(v(t+1))
enddo

! where f is a function defined by me

Thanks.
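A recursion of this kind carries a dependence from t+1 to t, so the loop over t has to stay sequential; what can run in parallel is an outer loop over independent cases. The following is a sketch under that assumption. The function f here is a hypothetical stand-in, and n and big_t are illustrative sizes.

```fortran
program backward_induction
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 12, big_t = 50       ! illustrative sizes
  real(dp) :: v(big_t + 1, n), v_serial(big_t + 1, n)
  integer :: i, t

  ! Parallel over independent cases i; the t loop stays sequential
  ! because v(t) needs v(t+1).
  !$omp parallel do default(none) shared(v) private(i, t)
  do i = 1, n
    v(big_t + 1, i) = 0.0_dp
    do t = big_t, 1, -1
      v(t, i) = f(v(t + 1, i), i)
    end do
  end do
  !$omp end parallel do

  ! Serial reference run for comparison.
  do i = 1, n
    v_serial(big_t + 1, i) = 0.0_dp
    do t = big_t, 1, -1
      v_serial(t, i) = f(v_serial(t + 1, i), i)
    end do
  end do

  if (maxval(abs(v - v_serial)) > 1.0e-9_dp) error stop 'parallel and serial results differ'
  write (*,*) 'v(1,1) = ', v(1, 1), '  v(1,n) = ', v(1, n)

contains

  ! Hypothetical stand-in for the user-defined f: no shared state, no side effects.
  pure function f(next_value, i) result(value)
    real(dp), intent(in) :: next_value
    integer, intent(in) :: i
    real(dp) :: value
    value = 0.9_dp * next_value + real(i, dp)
  end function f

end program backward_induction
```

This only pays off if each case does enough work and f touches no shared or saved variables.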

jimdempseyatthecove · ‎09-28-2010

I think you need to supply a program outline with greater detail. Your code may be calling serializing library functions (like the random number generator, a function containing a critical section).

Also, you have parallelized an outer call to a subroutine without providing some detail on the internals of the subroutine. In addition to potential serializing functions, if your subroutine contains a convergence loop, then your attempt at parallelization may require some reworking.

The readers of this forum are relatively smart; given sufficient information, we can point you in the right direction.

Jim Dempsey

antfu · ‎09-28-2010

Dear Jim,

Thanks a lot for your reply. The running time of my code is the same with and without parallelization. The basic structure of the code is like this:
!$OMP PARALLEL
!$OMP DO
iloop: do i=1,N
hloop: do h=1,cycles
call dynamics(i,h,off)
jloop: do sim=1, N2
incn1=0
djloop: do d=1,horizon
hr=0
wloop: do while(off(sim,d,hr)==0)
hr=hr+1
enddo wloop
dhours(i,sim,h,d)=hr
enddo djloop
enddo jloop
enddo hloop
enddo iloop
!$OMP ENDDO
!$OMP end PARALLEL
do i=1,N
do sim=1,N2
do h=1,cycles
do d=1,horizon
write(11,100) i,sim,h,d, dow(d,h), dhours(i,sim,h,d)
enddo
enddo
enddo
enddo

Subroutine dynamics(...) is basically:

do d=horizon,1,-1
do sim=1,N2
do hr=24,1,-1 !total hours
V1(j,hr)=vstop(i,sim,hr,d,h)+delta*EV(d+1,hr)
if(V1(j,hr)>some number) then
off(sim,d,hr)=1
endif
enddo
enddo
enddo

where vstop and EV are functions defined by me.
There is no convergence loop contained in these procedures.
Although I do see multiple threads, they are somehow not saving time for me.

Thanks a lot for your hints and advice.

jimdempseyatthecove · ‎09-28-2010

What value is N?

Your code does have a convergence loop.

dynamics conditionally sets a flag off(sim,d,hr)=1

and the main code has a do while(off(...

meaning the main code can get hung up waiting on off.

I assume off is marked volatile.

Jim

antfu · ‎09-28-2010

Dear Jim,
In my trial version, I set N to be 12, equal to the number of threads I have.

The off(..) values are calculated for every possible combination of arguments in subroutine dynamics. The main program is just trying to find the earliest case when off() is 1. I assume that at this point (after I have called subroutine dynamics), all off() values are already known and the main program should not be waiting for new information.
Thanks.

DavidWhite · ‎09-28-2010

Is the order of the OMP PARALLEL and OMP DO statements an issue? The END PARALLEL is inside the END DO statement, but the PARALLEL section starts before the DO.
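For reference, a minimal sketch of properly nested directives: the worksharing DO construct is closed before the enclosing PARALLEL region. The subroutine work and the size n are hypothetical placeholders for the real computation.

```fortran
program nested_directives
  implicit none
  integer, parameter :: n = 12              ! illustrative size
  integer :: results(n)
  integer :: i

  !$omp parallel default(none) shared(results) private(i)
  !$omp do
  do i = 1, n
    call work(i, results(i))
  end do
  !$omp end do
  !$omp end parallel

  write (*,*) results

contains

  ! Placeholder workload: sum of 1..i, using only local variables.
  subroutine work(i, res)
    integer, intent(in) :: i
    integer, intent(out) :: res
    integer :: j
    res = 0
    do j = 1, i
      res = res + j
    end do
  end subroutine work

end program nested_directives
```

The combined form, !$OMP PARALLEL DO ... !$OMP END PARALLEL DO, avoids the pairing question altogether when the region contains a single loop.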

IanH · ‎09-28-2010

I don't see any PRIVATE/SHARED/etc clauses listed in your OMP directives. That's rather suspicious - it is very unlikely that the defaults are appropriate for every variable in that construct for anything other than trivial code.

Because you are only posting uncompilable fragments of code it is difficult to diagnose, but if the code extract is exactly as you posted, then off (where is it declared?) is shared amongst all the members of your OMP team. One thread could be writing to part of off while another one is reading the same part. Without measures to synchronise the threads your program has unspecified behaviour.

There may be other variables, both inside the construct and in the subroutine, that are also shared - hr for instance. Two threads may be merrily trying to increment hr at the same time, while a third thread is setting it to zero. Chaos.

Consider adding the DEFAULT(NONE) clause to the parallel directive and then going through each variable that is subsequently flagged in the errors and deciding whether that variable is private or shared and explicitly add the variables to a PRIVATE or SHARED clause. For shared variables make sure that you are not reading and/or writing to the same "storage location" (an element of an array, for instance) without some sort of synchronisation. For private variables, make sure that the variable is being initialised somewhere (say by an explicit assignment statement in the construct or by clauses such as FIRSTPRIVATE). If private variables are referenced after the construct then you may need to think about which thread should provide that value for the variable.

Then go through all the procedure references inside the parallel construct and do the same checks for variables that are implicitly shared (variables from common blocks or modules, saved variables, etc). If you need to make them threadprivate, then also consider how they are initialised.
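As a rough sketch of what that exercise can produce for a loop nest shaped like the one posted, assuming each outer iteration only needs its own copy of the flag array: fill_flags is a hypothetical stand-in for the real subroutine, and all sizes are illustrative.

```fortran
program explicit_data_sharing
  implicit none
  integer, parameter :: n = 12, n2 = 5, horizon = 4, max_hr = 24   ! illustrative sizes
  integer :: dhours(n, n2, horizon)
  integer :: off(n2, horizon, max_hr)
  integer :: i, sim, d, hr

  ! DEFAULT(NONE) makes the compiler flag any variable not listed below.
  ! off is PRIVATE so each thread fills and reads its own copy;
  ! dhours is SHARED, but each i writes a distinct slice of it.
  !$omp parallel do default(none) shared(dhours) private(i, sim, d, hr, off)
  do i = 1, n
    call fill_flags(i, off)
    do sim = 1, n2
      do d = 1, horizon
        hr = 1
        do while (off(sim, d, hr) == 0)
          hr = hr + 1
        end do
        dhours(i, sim, d) = hr
      end do
    end do
  end do
  !$omp end parallel do

  write (*,*) 'first hour flagged for i = 1..n: ', dhours(:, 1, 1)

contains

  ! Hypothetical stand-in: flags every hour from first_on onwards,
  ! so the search loop above always terminates.
  subroutine fill_flags(i, flags)
    integer, intent(in) :: i
    integer, intent(out) :: flags(:, :, :)
    integer :: first_on
    first_on = mod(i, size(flags, 3)) + 1
    flags = 0
    flags(:, :, first_on:) = 1
  end subroutine fill_flags

end program explicit_data_sharing
```

Private arrays are typically placed on each thread's stack, so large ones can bring back the stack-size problem discussed earlier in the thread.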

jimdempseyatthecove · ‎09-29-2010

Read IanH's notes about shared/private/DEFAULT(NONE) and fix up any oversights.

If nothing shows up, then threads may be doing redundant work.
If a walk-through of your code does not expose the redundancy (usually due to thinking serially when performing the walk-through), then I suggest adding sanity-checking code (conditionally compiled).

Example:
Add an array of integers that shadows the work being done and initialize it to 0. Then, in your compute function, increment the shadow array each time you do work (this should occur only once per element). At the end of the parallel region, assert that all elements of the shadow array == 1. Note that, to be technically correct, you will have to use an atomic update:

!$OMP ATOMIC
sanity(i) = sanity(i) + 1
if(sanity(i) .ne. 1) call HaveBug()

The atomic may add overhead and hide the error. If the problem cures itself when you add the sanity check, then you may have a race condition that is hidden by the ATOMIC. IanH gave some hints on how to track down this condition.

Jim Dempsey
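A self-contained sketch of the shadow-array check described above, with the final assertion placed after the parallel region. The routine compute and the size n are hypothetical placeholders for the real work.

```fortran
program sanity_shadow
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 12              ! illustrative size
  integer :: sanity(n)
  real(dp) :: results(n)
  integer :: i

  sanity = 0

  !$omp parallel do default(none) shared(sanity, results) private(i)
  do i = 1, n
    call compute(i, results(i), sanity)
  end do
  !$omp end parallel do

  ! Every work item must have been processed exactly once.
  if (any(sanity /= 1)) then
    write (*,*) 'work counts per item: ', sanity
    error stop 'some work item was skipped or done more than once'
  end if
  write (*,*) 'all ', n, ' work items were done exactly once'

contains

  ! Placeholder workload that also records that item i was processed.
  subroutine compute(i, res, shadow)
    integer, intent(in) :: i
    real(dp), intent(out) :: res
    integer, intent(inout) :: shadow(:)
    res = real(i, dp)**2
    !$omp atomic
    shadow(i) = shadow(i) + 1
  end subroutine compute

end program sanity_shadow
```

A count of 2 or more points to threads repeating each other's work (for example a PARALLEL region without a worksharing DO), while a count of 0 points to skipped iterations.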