topic How to debug OpenMP parallelized code with vfn 10.1? in Intel® Fortran Compiler

How to debug OpenMP parallelized code with vfn 10.1?

PKM — Mon, 05 Jan 2009 13:06:38 GMT

Hi.

I am in the process of learning how to use OpenMP in order to optimize some research code, but I am really stuck in the debug process. Do I need to do something special in order to use watches on parallel code? I created a very simple example below to illustrate my problem. When I run this code with a watch on I, I is equal to 1 during every iteration of the loop... ?

!$omp parallel private(i) num_threads(1)
!$omp do
DO I=1,100
X=X+I
ENDDO
!$omp end do
!$omp end parallel

Any help will be greatly appreciated!!!!

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Mon, 05 Jan 2009 14:08:07 GMT

Set the watch on I after you enter the parallel region, i.e. place the breakpoint on X=X+I, then set the watch once the break is hit.

Note that a watch on I set this way is valid only for the thread context in which it was set.

Re: How to debug OpenMP parallelized code with vfn 10.1?

PKM — Mon, 05 Jan 2009 14:31:05 GMT

Thanks for your reply Jim!

I just tried following your guidelines, but the issue remains the same. After placing the watch and hitting F5 to continue, the I variable just remains 1 no matter how many times I hit continue ... Any suggestions?

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Mon, 05 Jan 2009 15:42:20 GMT

Is optimization set to disabled in your project settings? When it is not disabled, I may be optimized away. Also, for the particular loop you are using, a good optimizer could optimize the entire loop away (i.e. the compiler could compute the sum of the I values, then insert code to add that precomputed sum once to X).

To see what is happening you can use the debugger disassembly window, but you may not be up to rolling your sleeves up that far.

Because this is a learning experiment you can do the following:

Place the code contained within the loop into a subroutine. In this case

subroutine DOSUM(X, I)
real :: X
integer :: I
X = X + I
end subroutine DOSUM

And then use a call to this subroutine for the body of your loop.

Jim Dempsey

Re: How to debug OpenMP parallelized code with vfn 10.1?

TimP — Mon, 05 Jan 2009 16:43:09 GMT

In the title, you mention parallelized code, but you don't show any. You set 1 thread, as you must do, when you don't specify how you will handle the x variable which would be shared among threads. So, I don't know what your question is.

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Mon, 05 Jan 2009 18:41:57 GMT

Tim,

I would guess that the number of threads was set to 1 for purposes of debugging the parallel code using 1 thread, so that the variable I would not hop about with context switches. The problem the poster is having is that either the symbolic address of I is not resolved within the thread context .OR. I is optimized out of the code. The end result in either case is that the debugger cannot observe the value of I at the breakpoint during iteration.

When optimizations are on for that loop it is likely that I will be registerized and the debugger will not see it varying during execution.
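
A minimal sketch of one way to keep a loop counter observable when it has been registerized. It assumes a hypothetical helper variable `i_watch` that is not in the original code: copying I into a `volatile` variable forces a store to memory on every iteration, so a watch on the copy should follow the loop. Whether this helps with the specific debugger behaviour reported in this thread is untested; build with OpenMP enabled (`/Qopenmp`, as used elsewhere in this thread).

```fortran
program watch_copy
    implicit none
    integer :: i
    integer, volatile :: i_watch   ! hypothetical helper, not part of the original code
    double precision :: x

    x = 0.0d0
    !$omp parallel private(i, i_watch) num_threads(1)
    !$omp do
    do i = 1, 100
        i_watch = i      ! volatile store: the value is written to memory each iteration
        x = x + i        ! place the breakpoint here and watch i_watch instead of i
    end do
    !$omp end do
    !$omp end parallel

    ! self-check: 1 + 2 + ... + 100 = 5050
    if (x /= 5050.0d0) stop 1
    print *, 'sum =', x
end program watch_copy
```

The copy is only a debugging aid and can be deleted once the loop is understood.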

Jim

Re: How to debug OpenMP parallelized code with vfn 10.1?

PKM — Tue, 06 Jan 2009 14:46:04 GMT

You are right Jim ... I'm also having this problem with more complicated code and I just wanted to keep my test case simple ...

Anyway, I tried rewriting the code into its own subroutine as you suggested, but without success. I have also gone through all the compiler settings at least 10 times before posting here, so unless there are optimizations hidden in a setting that I don't know of, this should not be the cause.

I attached my example project code. I run it with the default compiler debug profile + /Qopenmp. If someone could try to open it and see whether they have the same problem I would really appreciate it. I have been struggling with this for days now and I can't seem to find a solution :-(

Thanks!

Casper

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Tue, 06 Jan 2009 15:09:26 GMT

Casper,

Your sample code was not to be found??

Jim

Re: How to debug OpenMP parallelized code with vfn 10.1?

PKM — Tue, 06 Jan 2009 16:27:37 GMT

Sorry. I had a hard time figuring out how to attach files. I tried editing my post but couldn't really get it to work.

Anyway, the file is here:

http://software.intel.com/file/8419

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Tue, 06 Jan 2009 17:46:57 GMT

You may have a bug here. IVF is inserting a breakpoint at each step of the statement "x=x+i" as opposed to at the first assembly code line of the statement. On my system it takes 3 continues to advance past the statement. File a report with Premier Support. Also, move the breakpoint to the "x=x+i" statement rather than leaving it on the "do i=1,100" statement.

Sample of code using a subroutine:

[cpp]    program Console2
    implicit none
    integer i
    real*8  x
    
    x = 0
    !$omp parallel private(i) shared(x) num_threads(1)
    !$omp do schedule(static)
    do i=1,100 
      call DoSum(x,i)
    enddo
    !$omp end do
    !$omp end parallel
    
    end program Console2

    subroutine DoSum(x, i)
    implicit none
    integer i
    real*8  x
    !$omp atomic
    x=x+i
    end subroutine DoSum
[/cpp]

Jim Dempsey

Re: How to debug OpenMP parallelized code with vfn 10.1?

PKM — Tue, 06 Jan 2009 19:19:54 GMT

Thanks a lot Jim!

I also found it strange that stepping through code follows the assembler lines rather than the code lines. I come from more modern programming languages and Fortran is new to me, so I just assumed this was intentional behaviour for an old low-level language ... Also, this strange assembly-stepping behaviour is not just for this particular example; it is consistent. I use the default debug compilation profile, but could it be there is some setting I need to change to get "normal" line-by-line stepping behaviour? Are you using the same IVF as me? Perhaps I should give the 11.1 trial a go to see if it behaves differently ..

Regards,

Casper

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Tue, 06 Jan 2009 19:31:11 GMT

I am using IVF 11.0.066

Jim

Re: How to debug OpenMP parallelized code with vfn 10.1?

Steven_L_Intel1 — Tue, 06 Jan 2009 20:28:31 GMT

Instructions for attaching files are here (announcement at top of forum). Most people miss steps 5-7.

Re: How to debug OpenMP parallelized code with vfn 10.1?

gib — Wed, 07 Jan 2009 04:08:11 GMT

This is a bit off topic, but nonetheless an interesting sidelight on OpenMP. Running Jim's Console2 program above (Release build) with an i loop limit of 10,000,000 instead of 100 takes 0.38 sec on my machine. Increasing the number of threads to 4 gives an execution time of 4.27 sec. I guess !$omp atomic is quite expensive.

Gib

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Wed, 07 Jan 2009 18:36:52 GMT

Yes it is. That is why you use reduction variables: the atomic operation occurs on exit from the loop, once per thread. With your 4 threads you would have 4 atomic add operations as opposed to 10,000,000 atomic add operations.

[cpp]! Variant 1: reduction on the do construct
!$omp parallel private(i) num_threads(1)
!$omp do reduction(+:X)
       DO I=1,100 
        X=X+I
       ENDDO
!$omp end do
! subroutine scoped X has full result
!$omp end parallel

! or, Variant 2: reduction on the parallel construct

!$omp parallel private(i), num_threads(1), reduction(+:X)
!$omp do
       DO I=1,100 
        X=X+I
       ENDDO
!$omp end do
! each thread parallel region scoped X has partial result
!$omp end parallel
! subroutine scoped X has full result

! or, Variant 3: same as Variant 2 with the loop body in a subroutine

!$omp parallel private(i), num_threads(1),  reduction(+:X)
!$omp do
       DO I=1,100 
        call DoSum(X,I)
       ENDDO
!$omp end do
! each thread parallel region scoped X has partial result
!$omp end parallel
! subroutine scoped X has full result
! ...
subroutine DoSum(x, i) 
    implicit none 
    integer i 
    real*8  x 
     x=x+i ! without !$OMP ATOMIC
end subroutine DoSum 
[/cpp]

Jim
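
A minimal, self-contained sketch for comparing the two approaches discussed here: one atomic update per iteration versus a reduction. The program name, variable names, and the use of `omp_get_wtime` for timing are illustrative choices, not taken from the posts. Timings will depend on the machine and thread count, so none are quoted; build with OpenMP enabled (`/Qopenmp`, as used elsewhere in this thread).

```fortran
program atomic_vs_reduction
    use omp_lib                      ! provides omp_get_wtime
    implicit none
    integer, parameter :: n = 10000000
    integer :: i
    double precision :: x_atomic, x_red, expected, t0, t1, t2

    expected = 0.5d0 * dble(n) * dble(n + 1)   ! closed form n(n+1)/2

    ! Version 1: every iteration performs a synchronized update of the shared sum
    x_atomic = 0.0d0
    t0 = omp_get_wtime()
    !$omp parallel do private(i) shared(x_atomic)
    do i = 1, n
        !$omp atomic
        x_atomic = x_atomic + i
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()

    ! Version 2: each thread accumulates a private partial sum,
    ! and the partial sums are combined once at the end of the loop
    x_red = 0.0d0
    !$omp parallel do private(i) reduction(+:x_red)
    do i = 1, n
        x_red = x_red + i
    end do
    !$omp end parallel do
    t2 = omp_get_wtime()

    print *, 'atomic    :', x_atomic, ' time (s):', t1 - t0
    print *, 'reduction :', x_red,    ' time (s):', t2 - t1

    ! self-check: both sums are exact in double precision for this n
    if (x_atomic /= expected .or. x_red /= expected) stop 1
end program atomic_vs_reduction
```

Both loops compute the same value; only the amount of synchronization differs.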

Re: How to debug OpenMP parallelized code with vfn 10.1?

gib — Wed, 07 Jan 2009 21:22:31 GMT

Continuing off topic, because this is very educational: not knowing about the reduction clause, I improvised the following:

...
nthreads = 4
nloop = 10000000
dn = nloop/nthreads
x = 0
!$omp parallel do private(i,kpar,xt) num_threads(nthreads)
do k = 1,nthreads
   kpar = omp_get_thread_num()
   xt = 0
   do i = 1 + kpar*dn, (kpar+1)*dn
      call DoSum(xt,i)
   enddo
   !$omp atomic
   x = xt + x
enddo
!$omp end parallel do
...

subroutine DoSum(x, i)
   implicit none
   integer i
   real*8 x
   x=x+i
end subroutine DoSum

Is this equivalent to one/all of your more elegant code examples?

Gib

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Wed, 07 Jan 2009 22:40:52 GMT

No, it is not the same.
k iterates over 1:n (split by the compiler into four threads, each with a portion of the range)
i iterates over 1:n (split by you into four threads, each with a portion of the range)
Therefore the end result in X will be nloop * the result of X from the original code.

You may have been thinking along the line of the following:

[cpp]nthreads = 4
nloop = 10000000
dn = nloop/nthreads
x = 0 
!$omp parallel private(i,kpar,xt) num_threads(nthreads)
kpar = omp_get_thread_num()
xt = 0
do i = 1 + kpar*dn, (kpar+1)*dn
call DoSum(xt,i) 
enddo 
!$omp atomic
x = xt + x 
!$omp end parallel 
...

subroutine DoSum(x, i) 
implicit none 
integer i 
real*8 x 
x=x+i 
end subroutine DoSum 
[/cpp]

Jim

Re: How to debug OpenMP parallelized code with vfn 10.1?

gib — Wed, 07 Jan 2009 22:55:11 GMT

Actually k iterates over 1:nthreads, and for each thread i iterates over a fraction 1/nthreads of the whole range. My code yields exactly the same result in the same time as your third example. Thanks for alerting me to 'reduction'.

Gib

Re: How to debug OpenMP parallelized code with vfn 10.1?

jimdempseyatthecove — Fri, 09 Jan 2009 13:45:22 GMT

Gib,

There is the potential for an intermittent bug to appear in your program that you should be aware of. The OpenMP clause num_threads(nthreads) is a suggestion for a desired number of threads and not a declaration of a required number of threads. When the number of threads available to the parallel region is less than what you requested, your routine will break. Also, when the number of threads available is equal to the number of threads requested, other activities on the system may keep one or more of these threads from starting up immediately, and depending on your scheduling (default scheduling for OpenMP) one of the thread team members may execute an iteration you expect to be executed by a different team member (e.g. kpar varies 0,1,2,0 not 0,1,2,3).

Jim Dempsey
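
A minimal sketch of a manual partition that does not depend on receiving exactly the requested number of threads. It assumes the range is split using the team size reported by `omp_get_num_threads()` inside the parallel region; the names `nteam`, `chunk`, `rem`, `ifirst` and `ilast` are hypothetical and not from the posts above. Build with OpenMP enabled (`/Qopenmp`, as used elsewhere in this thread).

```fortran
program manual_partition
    use omp_lib
    implicit none
    integer, parameter :: nloop = 10000000
    integer :: i, kpar, nteam, chunk, rem, ifirst, ilast
    double precision :: x, xt

    x = 0.0d0
    !$omp parallel private(i, kpar, nteam, chunk, rem, ifirst, ilast, xt) num_threads(4)
    nteam = omp_get_num_threads()     ! actual team size, possibly fewer than requested
    kpar  = omp_get_thread_num()      ! 0 .. nteam-1, unique within the team
    chunk = nloop / nteam
    rem   = mod(nloop, nteam)

    ! the first rem threads take one extra iteration so the whole range is covered
    ifirst = kpar * chunk + min(kpar, rem) + 1
    ilast  = ifirst + chunk - 1
    if (kpar < rem) ilast = ilast + 1

    xt = 0.0d0                        ! private partial sum
    do i = ifirst, ilast
        xt = xt + i
    end do

    !$omp atomic
    x = x + xt                        ! one atomic update per thread
    !$omp end parallel

    ! self-check against the closed form nloop(nloop+1)/2
    if (x /= 0.5d0 * dble(nloop) * dble(nloop + 1)) stop 1
    print *, 'sum =', x
end program manual_partition
```

Because each thread derives its sub-range from its own thread number and the real team size, no worksharing loop is involved and the result does not depend on which thread picks up which iteration.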

Re: How to debug OpenMP parallelized code with vfn 10.1?

gib — Fri, 09 Jan 2009 20:19:52 GMT

Thanks Jim. First, I should point out that I've never used the num_threads clause in my own programs - I just learned of it from your code examples. The possibility of the parallel section being executed by fewer than nthreads team members is indeed a trap for the unwary. All my OpenMP code is currently run on my personal quad core machine, therefore the possibility of a team member being unavailable is remote, but when/if I start running my program on a shared memory server that has other users, or if/when I make my code available to others the situation will be different.

Now, I have a couple of questions, if you don't mind. In my own code I start by calling omp_set_num_threads(nthreads) in an initialisation subroutine, and I assume that this number of threads will henceforth be available. Assuming that the number of threads allocated initially is what I request (as checked with omp_get_num_threads()), does this guarantee that the same number will be available throughout the execution of the program, as I've been assuming? I found the following info in some online OpenMP documentation: "when dynamic adjustment of the number of threads is disabled, omp_set_num_threads sets the exact number of threads to use in the next parallel region." I never enable or disable dynamic adjustment, so which is the default? The section I quoted refers to the "next parallel region"; what about the parallel region after that?

Thanks
Gib
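
A minimal sketch of how the settings asked about here can be inspected at run time rather than assumed. As far as the OpenMP specification goes, the initial state of dynamic adjustment is implementation-defined, so the program prints it; the request made by `omp_set_num_threads` stays in effect for later parallel regions until it is changed, but even with dynamic adjustment disabled the runtime may deliver fewer threads if it cannot supply them. Note also that `omp_get_num_threads()` reports 1 when called outside a parallel region. The value of `nrequest` is illustrative; build with OpenMP enabled (`/Qopenmp`, as used elsewhere in this thread).

```fortran
program check_team_size
    use omp_lib
    implicit none
    integer, parameter :: nrequest = 4     ! illustrative request
    integer :: nteam

    print *, 'dynamic adjustment enabled at startup? ', omp_get_dynamic()

    call omp_set_dynamic(.false.)          ! ask the runtime not to adjust team sizes
    call omp_set_num_threads(nrequest)     ! used by later regions without num_threads

    ! outside a parallel region the "team" is just the initial thread
    print *, 'team size outside a parallel region:', omp_get_num_threads()

    nteam = 0
    !$omp parallel shared(nteam)
    !$omp single
    nteam = omp_get_num_threads()          ! the size actually granted for this region
    !$omp end single
    !$omp end parallel

    print *, 'requested', nrequest, ' got', nteam

    ! self-check: at least one thread, never more than requested
    if (nteam < 1 .or. nteam > nrequest) stop 1
end program check_team_size
```

Code that splits work by hand is safer if it uses the granted size measured inside each region instead of the requested one.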