Intel® Fortran Compiler

How to debug OpenMP parallelized code with IVF 10.1?

PKM
Beginner
2,170 Views

Hi.

I am in the process of learning how to use OpenMP in order to optimize some research code, but I am really stuck in the debug process. Do I need to do something special in order to use watches on parallel code? I created a very simple example below to illustrate my problem. When I run this code with a watch on I, I is equal to 1 during every iteration of the loop... ?????

!$omp parallel private(i) num_threads(1)
!$omp do
DO I=1,100
X=X+I
ENDDO
!$omp end do
!$omp end parallel

Any help will be greatly appreciated!!!!

24 Replies
jimdempseyatthecove
Honored Contributor III
1,854 Views



Set the watch on I after you enter the parallel region. i.e. place the break point on X=X+I, then at break set the watch.

Note, the I as specified will be valid only for the thread context when the watch was set.

PKM
Beginner
1,854 Views


Thanks for your reply Jim!

I just tried following your guidelines, but the issue remains the same. After placing the watch and hitting F5 to continue, the I variable just remains 1 no matter how many times I hit continue ... Any suggestions?

jimdempseyatthecove
Honored Contributor III
1,854 Views




Are your optimization settings set to disabled? When they are not, I may be optimized away. Also, for the particular loop you were using, a good optimizer could optimize the entire loop away (i.e. the compiler could compute the sum of the I values, 5050 for this loop, then insert code to add that precomputed sum once to X).
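For reference, debug-friendly command-line settings look roughly like this (the source file name is illustrative; the flags assume the Windows ifort driver):

[cpp]ifort /Qopenmp /Od /debug:full /traceback example.f90
[/cpp]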

To see what is happening you can use the debugger disassembly window, but you may not be up to rolling your sleeves up that far.

Because this is a learning experiment you can do the following:

Place the code contained within the loop into a subroutine. In this case

[cpp]subroutine DOSUM(X, I)
real :: X
integer :: I
X = X + I
end subroutine DOSUM
[/cpp]

And then use a call to this subroutine for the body of your loop.
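For example, the loop becomes (a minimal sketch based on the loop from the original post; x is also given an explicit shared scope here, as in the full example later in the thread):

[cpp]!$omp parallel private(i) shared(x) num_threads(1)
!$omp do
do i = 1, 100
  call DOSUM(x, i)   ! loop body now lives in the subroutine above
enddo
!$omp end do
!$omp end parallel
[/cpp]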

Jim Dempsey

TimP
Honored Contributor III
1,854 Views
In the title, you mention parallelized code, but you don't show any. You set 1 thread, as you must when you don't specify how you will handle the x variable, which would be shared among threads. So I don't know what your question is.
jimdempseyatthecove
Honored Contributor III
1,854 Views

Tim,

I would guess that the number of threads was set to 1 for the purpose of debugging the parallel code with a single thread, so that the variable I would not hop about with context switches. The problem the poster is having is that either the symbolic address of I is not resolved within the thread context .OR. I is optimized out of the code. In either case the debugger cannot observe the value of I at the break point during iteration.

When optimizations are on for that loop, it is likely that I will be registerized and the debugger will not see it varying during execution.

Jim

PKM
Beginner
1,854 Views

You are right Jim ... I'm also having this problem with more complicated code and I just wanted to keep my test case simple ...

Anyway, I tried rewriting the code into its own subroutine as you suggested - unsuccessfully, however. I have also gone through all the compiler settings at least 10 times before posting here, so unless there are optimizations hidden in a setting that I don't know of, this should not be the cause.

I attached my example project code. I run it with the default compiler debug profile plus /Qopenmp. If someone could open it and see whether they are having the same problem, I would really appreciate it. I have been struggling with this for days now and I can't seem to find a solution :-(

Thanks!

Casper
jimdempseyatthecove
Honored Contributor III
1,854 Views


Casper,

Your sample code was not to be found??

Jim
PKM
Beginner
1,854 Views

Sorry. I had a hard time figuring out how to attach files. I tried editing my post but couldn't really get it to work.

Anyway, the file is here:

http://software.intel.com/file/8419
jimdempseyatthecove
Honored Contributor III
2,160 Views

You may have a bug here. IVF is inserting a break point at each assembly step of the statement "x=x+i" as opposed to only at the first assembly code line of the statement. On my system it takes 3 continues to advance past the statement. File a report with Premier Support. Also, move the break point to the "x=x+i" statement as opposed to having it located on the "do i=1,100" statement.

Sample of code using subroutine

[cpp]    program Console2
    implicit none
    integer i
    real*8  x
    
    x = 0
    !$omp parallel private(i) shared(x) num_threads(1)
    !$omp do schedule(static)
    do i=1,100 
      call DoSum(x,i)
    enddo
    !$omp end do
    !$omp end parallel
    
    end program Console2

    subroutine DoSum(x, i)
    implicit none
    integer i
    real*8  x
    !$omp atomic
    x=x+i
    end subroutine DoSum
[/cpp]

Jim Dempsey
PKM
Beginner
1,854 Views
Thanks a lot Jim!

I also found it strange that stepping through code follows the assembler lines rather than the code lines. I come from more modern programming languages and Fortran is new to me, so I just assumed this was intentional behaviour for an old low-level language ... Also, this strange assembly-stepping behaviour is not just for this particular example - it is consistent. I use the default debug compilation profile, but could it be there is some setting I need to change to get "normal" line-by-line stepping behaviour? Are you using the same IVF as me? Perhaps I should give the 11.1 trial a go to see if it behaves differently ..

Regards,

Casper
jimdempseyatthecove
Honored Contributor III
1,854 Views

I am using IVF 11.0.066

Jim
Steven_L_Intel1
Employee
1,854 Views
Instructions for attaching files are here (announcement at top of forum). Most people miss steps 5-7.
gib
New Contributor II
1,854 Views

This is a bit off topic, but nonetheless an interesting sidelight on OpenMP. Running the program above (Release build) with an i loop limit of 10,000,000 instead of 100 takes 0.38 sec on my machine. Increasing the number of threads to 4 gives an execution time of 4.27 sec. I guess !$omp atomic is quite expensive.

Gib
jimdempseyatthecove
Honored Contributor III
1,854 Views

Yes it is, which is why you use reduction variables: the atomic operation occurs on the exit of the loop, once per thread. In this example you would have 4 atomic add operations as opposed to 10,000,000 atomic add operations.

[cpp]!$omp parallel private(i) num_threads(1)
!$omp do reduction(+:X)
DO I=1,100
  X=X+I
ENDDO
!$omp end do
! subroutine scoped X has full result
!$omp end parallel
[/cpp]

or

[cpp]!$omp parallel private(i), num_threads(1), reduction(+:X)
!$omp do
DO I=1,100
  X=X+I
ENDDO
!$omp end do
! each thread's parallel-region-scoped X has partial result
!$omp end parallel
! subroutine scoped X has full result
[/cpp]

or

[cpp]!$omp parallel private(i), num_threads(1), reduction(+:X)
!$omp do
DO I=1,100
  call DoSum(X,I)
ENDDO
!$omp end do
! each thread's parallel-region-scoped X has partial result
!$omp end parallel
! subroutine scoped X has full result
...
subroutine DoSum(x, i)
implicit none
integer i
real*8  x
x=x+i ! without !$OMP ATOMIC
end subroutine DoSum
[/cpp]
Jim
gib
New Contributor II
1,854 Views

Continuing off topic, because this is very educational: not knowing about the reduction clause, I improvised the following:

[cpp]...
nthreads = 4
nloop = 10000000
dn = nloop/nthreads
x = 0
!$omp parallel do private(i,kpar,xt) num_threads(nthreads)
do k = 1,nthreads
  kpar = omp_get_thread_num()
  xt = 0
  do i = 1 + kpar*dn, (kpar+1)*dn
    call DoSum(xt,i)
  enddo
  !$omp atomic
  x = xt + x
enddo
!$omp end parallel do
...

subroutine DoSum(x, i)
implicit none
integer i
real*8 x
x=x+i
end subroutine DoSum
[/cpp]

Is this equivalent to one/all of your more elegant code examples?

Gib
jimdempseyatthecove
Honored Contributor III
1,854 Views

No, it is not the same.
k iterates over 1:n (split by the compiler into four threads, each with a portion of the range)
i iterates over 1:n (split by you into four threads, each with a portion of the range)
Therefore the end result in X will be nloop * the result of X from the original code.

You may have been thinking along the lines of the following:
[cpp]nthreads = 4
nloop = 10000000
dn = nloop/nthreads
x = 0
!$omp parallel private(i,kpar,xt) num_threads(nthreads)
kpar = omp_get_thread_num()
xt = 0
do i = 1 + kpar*dn, (kpar+1)*dn
  call DoSum(xt,i)
enddo
!$omp atomic
x = xt + x
!$omp end parallel
...

subroutine DoSum(x, i)
implicit none
integer i
real*8 x
x=x+i
end subroutine DoSum
[/cpp]

Jim
gib
New Contributor II
1,854 Views


Actually k iterates over 1:nthreads, and for each thread i iterates over a fraction 1/nthreads of the whole range. My code yields exactly the same result in the same time as your third example. Thanks for alerting me to 'reduction'.

Gib
jimdempseyatthecove
Honored Contributor III
1,854 Views
Quoting - gib


Gib,

There is the potential for an intermittent bug to appear in your program that you should be aware of. The OpenMP clause num_threads(nthreads) is a suggestion for a desired number of threads, not a declaration of a required number of threads. When the number of threads available to the parallel region is less than what you requested, your routine will break. Also, when the number of threads available is equal to the number requested, other activities on the system may keep one or more of these threads from starting up immediately, and depending on your scheduling (default scheduling for OpenMP) one of the thread team members may execute an iteration you expect to be executed by a different team member (e.g. kpar varies 0,1,2,0 not 0,1,2,3).
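One way to drop that assumption is to query the actual team size inside the region and derive each thread's range from it. A minimal sketch (use omp_lib is assumed, and nloop is assumed evenly divisible by the team size; the reduction clause shown earlier avoids this bookkeeping entirely):

[cpp]x = 0
!$omp parallel private(i, kpar, nt, xt) num_threads(nthreads)
nt   = omp_get_num_threads()   ! actual team size; may be less than nthreads
kpar = omp_get_thread_num()
xt   = 0
do i = 1 + kpar*(nloop/nt), (kpar+1)*(nloop/nt)
  xt = xt + i
enddo
!$omp atomic
x = xt + x
!$omp end parallel
[/cpp]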

Jim Dempsey
gib
New Contributor II
1,854 Views

Gib,

There is the potential for an intermittent bug to appear in your program that you should be aware of. The OpenMP clause num_threads(nhtreads) is a suggestion for a desired number of threads and not a declaration of a required number of threads. When the number of threads available to the parallel region is less than what you requested then your routine will break. Also, when the number of threads available is equal to the number of threads requested, other activities on the system may interfere with one or more of these threads from starting up immediatelyand depending on your scheduling (default scheduling for OpenMP) one of the thread team members may execute an iteration you expect to be executed by a different team member (e.g. kpar varies 0,1,2,0 not 0,1,2,3)

Jim Dempsey
Thanks Jim. First, I should point out that I've never used the num_threads clause in my own programs - I just learned of it from your code examples. The possibility of the parallel section being executed by fewer than nthreads team members is indeed a trap for the unwary. All my OpenMP code is currently run on my personal quad core machine, so the possibility of a team member being unavailable is remote, but when/if I start running my program on a shared-memory server that has other users, or if/when I make my code available to others, the situation will be different.

Now, I have a couple of questions, if you don't mind. In my own code I start by calling omp_set_num_threads(nthreads) in an initialisation subroutine, and I assume that this number of threads will henceforth be available. Assuming that the number of threads allocated initially is what I request (as checked with omp_get_num_threads()), does this guarantee that the same number will be available throughout the execution of the program, as I've been assuming? I found the following info in some online OpenMP documentation: "when dynamic adjustment of the number of threads is disabled, omp_set_num_threads sets the exact number of threads to use in the next parallel region." I never enable or disable dynamic adjustment - which is the default? The section I quoted refers to the "next parallel region" - what about the parallel region after that?

Thanks
Gib
jimdempseyatthecove
Honored Contributor III
1,749 Views
Quoting - gib


Gib,

My following comments are intended to guide you towards better programming...

Consider writing your code such that you can compile it with OpenMP disabled. You might want to do this for testing, or to get a base-level single-thread performance metric, or maybe you suspect a version update of the compiler introduced a bug. Whatever the case, you may find it necessary to compile either all or part(s) of your application without OpenMP. So your code should not be written with the assumption that a given number of threads is always available.
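One pattern that supports this is the standard OpenMP conditional compilation sentinel: lines beginning with !$ are compiled only when OpenMP is enabled, so the same source builds and runs serially without /Qopenmp. A minimal sketch:

[cpp]program SerialOrParallel
!$ use omp_lib                  ! compiled only under /Qopenmp
implicit none
integer nt
nt = 1                          ! serial default when OpenMP is off
!$ nt = omp_get_max_threads()   ! actual limit when OpenMP is on
write(*,*) 'running with up to ', nt, ' threads'
end program SerialOrParallel
[/cpp]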

If you use the OMP_GET_MAX_THREADS function, it will return the maximum number of threads you can use for the next parallel region. Consider using this for a sanity check (which can be conditionally compiled). Note, this value is not a global constant: when you enter the next parallel region and call OMP_GET_MAX_THREADS, you should see a reduced number of threads available for the next (nested) parallel region. At the least, in your initialization routine issue an OMP_GET_MAX_THREADS call and assert that you have at least the number of threads that you require (for your first OpenMP level).
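A minimal sketch of such an initialization-time check (the required count and the messages are illustrative, not from the original post); omp_get_dynamic also answers the earlier question about whether dynamic adjustment is currently enabled:

[cpp]program InitCheck
!$ use omp_lib
implicit none
integer, parameter :: required = 4   ! illustrative requirement
!$ if (omp_get_max_threads() < required) then
!$   write(*,*) 'only', omp_get_max_threads(), 'threads available'
!$   stop 'insufficient threads for this run'
!$ endif
!$ write(*,*) 'dynamic thread adjustment enabled:', omp_get_dynamic()
end program InitCheck
[/cpp]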

If you write a test program to exercise your functions or subroutines prior to integration into your application, keep in mind that the execution environment inside the test application might not be the same as in your application, which means your test program runs fine but your application may fail. An example: your test program has a dummy exercise loop calling your subroutine under test, providing test inputs and comparing results with expected results, and this test program runs fine. Then, when you integrate into your application, you don't notice that the subroutine is called from within a parallel region. What happens next depends on several factors: a) OpenMP nested levels may be turned off, in which case only 1 thread (the current thread) will enter the parallel region; b) OpenMP nested levels is on, but you have exceeded OMP_MAX_ACTIVE_LEVELS, so the number of threads in the team might get reduced (or the application might terminate); c) you might be requesting a number of threads that exceeds the remaining OMP_GET_MAX_THREADS, and depending on other factors your program will either terminate or run with a reduced number of threads for the next parallel region.
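One defensive pattern for that situation is to test omp_in_parallel() before creating a team; in this sketch, DoWorkSerial and DoWorkParallel are hypothetical placeholders for your own routines:

[cpp]subroutine SafeWork()
use omp_lib
implicit none
if (omp_in_parallel()) then
  ! Already inside an enclosing parallel region: a nested
  ! !$omp parallel here may yield a team of only 1 thread.
  call DoWorkSerial()     ! hypothetical serial path
else
  !$omp parallel
  call DoWorkParallel()   ! hypothetical per-thread worker
  !$omp end parallel
endif
end subroutine SafeWork
[/cpp]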

Suppose you are aware of the above paragraph and convince yourself that you will never call the subroutine from within a parallel region. Six months from now you get your new 16-core system and re-address the parallelization by increasing nesting levels etc., only to find intermittent problems with your code (because you forgot about the prior paragraph). IOW, write your code right the first time.

Another point: although you think your application is the only program running on your system and that 4 cores are always available, you are running on an OS that may be running many things other than your application. You must write your code such that it is not sensitive to the temporary loss of availability of a thread, e.g. you decide to burn a CD or look at your email during a lengthy program run (or your Anti-Virus decides to do something).

I hope that these suggestions assist you in your programming.

Jim Dempsey