Could anyone tell me how to code a parallel program?

Zhanghong_T_ · ‎12-09-2006

Hi all,

I am trying to write a parallel program with OpenMP, but the following simple code (test.f90) does not give the expected result. Could anyone tell me what to do?

 module commondata
 real*8,allocatable::s(:)
 integer,allocatable::irow(:),jcol(:)
 integer::nz
 end module

 use commondata
 implicit none
 real*8,allocatable::s0(:)
 integer,allocatable::irow0(:),jcol0(:)
 integer::nz0,i
 nz=200000
 if(allocated(irow0))deallocate(irow0)
 if(allocated(jcol0))deallocate(jcol0)
 allocate(s0(nz),irow0(nz),jcol0(nz))
 write(*,*)nz
 !$omp parallel do default(shared),private(i)
 do i=1,nz
 if(mod(i,3)==0)then
 irow(i)=i
 else
 irow(i)=-i
 endif
 jcol(i)=i+1
 s(i)=1.d0
 enddo
 !$omp end parallel do
 write(*,*)i
 nz0=0
 !$omp parallel do default(shared),private(i),reduction(+:nz0)
 do i=1,nz
 if(irow(i)>0)then
 nz0=nz0+1
 s0(nz0)=s(i)
 irow0(nz0)=irow(i)
 jcol0(nz0)=jcol(i)
 endif
 enddo
 !$omp end parallel do
 open(1,file='t.txt')
 write(1,*)nz0
 write(*,*)nz0
 do i=1,nz0
 write(1,*)irow0(i),jcol0(i),s0(i)
 enddo
 close(1)
 end

I compiled the file on Windows XP Professional x64 Edition with IVF 9.1. The command line is as follows:

ifort /O3 /threads /Qopenmp /libs:static test.f90

Zhanghong_T_ · ‎12-09-2006

Sorry, the code should be like this:

 module commondata
 real*8,allocatable::s(:)
 integer,allocatable::irow(:),jcol(:)
 integer::nz
 end module

 use commondata
 implicit none
 real*8,allocatable::s0(:)
 integer,allocatable::irow0(:),jcol0(:)
 integer::nz0,i
 nz=200000
 if(allocated(irow))deallocate(irow)
 if(allocated(jcol))deallocate(jcol)
 if(allocated(s))deallocate(s)
 if(allocated(irow0))deallocate(irow0)
 if(allocated(jcol0))deallocate(jcol0)
 allocate(s(nz),irow(nz),jcol(nz))
 allocate(s0(nz),irow0(nz),jcol0(nz))
 write(*,*)nz
 !$omp parallel do default(shared),private(i)
 do i=1,nz
 if(mod(i,3)==0)then
 irow(i)=i
 else
 irow(i)=-i
 endif
 jcol(i)=i+1
 s(i)=1.d0
 enddo
 !$omp end parallel do
 write(*,*)i
 nz0=0
 !$omp parallel do default(shared),private(i),reduction(+:nz0)
 do i=1,nz
 if(irow(i)>0)then
 nz0=nz0+1
 s0(nz0)=s(i)
 irow0(nz0)=irow(i)
 jcol0(nz0)=jcol(i)
 endif
 enddo
 !$omp end parallel do
 open(1,file='t.txt')
 write(1,*)nz0
 write(*,*)nz0
 do i=1,nz0
 write(1,*)irow0(i),jcol0(i),s0(i)
 enddo
 close(1)
 end

This time the output file is wrong: it differs from the one produced by the version compiled without the /Qopenmp option.

Thanks,

Zhanghong Tang

TimP · ‎12-09-2006

Where you have

write(*,*)i

I would have doubts about seeing a meaningful value, even if you had set lastprivate(i).

Your usage of nz0 doesn't look feasible for parallelization. Did you try removing the OpenMP directives and trying the -Qparallel option to see what the compiler has to say? No doubt, Thread Checker would have something to say as well.

Zhanghong_T_ · ‎12-09-2006

Hi Tim,

Thank you very much for your reply. I placed that write statement only to check whether the program had reached that point.

Just now I tried replacing the 'Qopenmp' option with 'Qparallel', and I am very excited that the result is correct. However, I found that the CPU usage is still 25% (there are 4 CPUs in that machine). With the 'Qopenmp' option, the CPU usage can reach 100%.

Could you please help me again?

Thanks,

Zhanghong Tang

TimP · ‎12-10-2006

Options such as -Qpar-report2 should show you the compiler's reasons for not parallelizing. If it is only a matter of "seems inefficient," changing the parallelization threshold, e.g.
-Qpar_threshold=60, will encourage the compiler to parallelize even though it thinks this will not be useful.
But I think the par-report2 (or 3) output will point out that you have dependencies which could not be resolved if the compiler were to split that loop into parallel sections. You might be able to parallelize the loop if you did not make the storage location of each section dependent on completion of the previous section.

Zhanghong_T_ · ‎12-12-2006

Hi Tim,

Thank you very much for your reply! However, I still don't understand options such as 'Qpar-report2'; I didn't find such an option among ifort's options.

Suppose I build the program with the /Qopenmp option. How should I modify the code to get the correct result?

Thanks,
Zhanghong Tang

jimdempseyatthecove · ‎12-12-2006

Tang,

When you use "reduction(+:nz0)" you are telling the compiler to create multiple local copies of nz0, one for each thread. Each thread then steps its own local copy through 1, 2, 3, ... and uses it as the index. Therefore the 2nd thread overwrites the 1st thread's data, the 3rd thread overwrites the 1st and 2nd threads' data, etc.
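As a minimal sketch of the point above (not code posted in this thread): the reduction clause is a valid way to count the positive entries, because the per-thread partial counts are summed when the loop ends. It only goes wrong when that private count is also used as an index into the shared output arrays. The program name and the use of merge to fill irow are illustrative choices.

```fortran
program reduction_count_only
  implicit none
  integer, parameter :: nz = 200000
  integer, allocatable :: irow(:)
  integer :: i, nz0

  allocate(irow(nz))
  ! Same test data as in the original post: positive only when i is a multiple of 3
  do i = 1, nz
     irow(i) = merge(i, -i, mod(i, 3) == 0)
  end do

  nz0 = 0
  !$omp parallel do default(shared) private(i) reduction(+:nz0)
  do i = 1, nz
     ! Inside the loop nz0 is a per-thread partial count.
     ! It is safe to add to it, but not to use it as an array index.
     if (irow(i) > 0) nz0 = nz0 + 1
  end do
  !$omp end parallel do

  ! After the loop the partial counts have been summed into the shared nz0
  print *, nz0   ! 66666 multiples of 3 in 1..200000
end program reduction_count_only
```

The count is the same whether or not OpenMP is enabled, since the directives are plain comments to a compiler that ignores them.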

consider the following changes

Remove "reduction(+:nz0)" from the second parallel do

Add to local variables "integer:: My_nz0"

Second parallel do

Change "private(i)" to "private(i,My_nz0)"
Change "nz0=nz0+1" to "My_nz0 = InterlockedIncrement(loc(nz0))"
Change subsequent uses of nz0 to My_nz0

With these changes all threads increment nz0 and return the incremented value to a thread local variable in a thread-safe manner. Then the local index can be safely used.

Jim Dempsey

Zhanghong_T_ · ‎12-12-2006

Hi Jim,

Thank you so much for your explanation!

But one thing I still don't understand: is 'InterlockedIncrement' an intrinsic function? Or do you mean the following:

!$omp parallel do default(shared),private(i,My_nz0)
do i=1,nz
if(irow(i)>0)then
nz0=nz0+1
My_nz0=nz0
s0(My_nz0)=s(i)
irow0(My_nz0)=irow(i)
jcol0(My_nz0)=jcol(i)
endif
enddo
!$omp end parallel do

However, I tested this and found that at the end of this construct the value of nz0 is not correct and some of the elements are 'lost'.

Thanks,

Zhanghong Tang

jimdempseyatthecove · ‎12-12-2006

InterlockedIncrement is a kernel32 library function. See

http://msdn2.microsoft.com/en-us/library/ms683614.aspx

Versions of C++ implement this as an intrinsic, whereas IVF currently calls a library function to perform the instruction sequence. Intel could make these intrinsic as well.

 use kernel32  ! Add this to top
 ...

 !$omp parallel do default(shared),private(i,My_nz0) 
 do i=1,nz
 if(irow(i)>0)then
 ! nz0=nz0+1 ! Remove
 My_nz0=InterlockedIncrement(loc(nz0)) ! Change 
 s0(My_nz0)=s(i)
 irow0(My_nz0)=irow(i)
 jcol0(My_nz0)=jcol(i)
 endif
 enddo
!$omp end parallel do

The above is functionally equivalent to

!$omp parallel do default(shared),private(i,My_nz0) 
 do i=1,nz
 if(irow(i)>0)then
!$omp critical
 nz0=nz0+1
 My_nz0=nz0
!$omp end critical
 s0(My_nz0)=s(i)
 irow0(My_nz0)=irow(i)
 jcol0(My_nz0)=jcol(i)
 endif
 enddo
!$omp end parallel do

The InterlockedIncrement version is more efficient.
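For readers on newer compilers, a portable alternative to both versions above is the atomic capture construct. This is a sketch under stated assumptions, not something available to the posters: atomic capture was added in OpenMP 3.1, so it postdates IVF 9.1. The program name and the final print are illustrative.

```fortran
program compact_atomic_capture
  implicit none
  integer, parameter :: nz = 200000
  integer, parameter :: dp = kind(1.d0)
  integer, allocatable :: irow(:), jcol(:), irow0(:), jcol0(:)
  real(dp), allocatable :: s(:), s0(:)
  integer :: i, nz0, my_nz0

  allocate(s(nz), irow(nz), jcol(nz))
  allocate(s0(nz), irow0(nz), jcol0(nz))
  do i = 1, nz
     irow(i) = merge(i, -i, mod(i, 3) == 0)
     jcol(i) = i + 1
     s(i) = 1.d0
  end do

  nz0 = 0
  !$omp parallel do default(shared) private(i, my_nz0)
  do i = 1, nz
     if (irow(i) > 0) then
        ! Increment the shared counter and read back the new value
        ! as one indivisible step (requires OpenMP 3.1 or later).
        !$omp atomic capture
        nz0 = nz0 + 1
        my_nz0 = nz0
        !$omp end atomic
        ! my_nz0 is unique to this iteration, so the stores cannot collide
        s0(my_nz0) = s(i)
        irow0(my_nz0) = irow(i)
        jcol0(my_nz0) = jcol(i)
     end if
  end do
  !$omp end parallel do

  print *, nz0   ! 66666
end program compact_atomic_capture
```

As with the InterlockedIncrement and critical versions, every kept element lands in a unique slot, but the order of the entries in s0, irow0 and jcol0 depends on thread timing and will generally differ from the serial run.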

If this routine is used often then you might want to consider
dividing the work by the number of threads. e.g.

! in your module
type find_nz0_t ! renamed so the type does not clash with subroutine find_nz0
 integer :: nz0 = 0
 integer :: iFrom, iTo
 real*8, pointer :: s0(:) ! real*8 to match s(:)
 integer, pointer :: irow0(:)
 integer, pointer :: jcol0(:)
end type find_nz0_t
type(find_nz0_t), pointer :: find_nz0Array(:)
integer :: nMaxThreads

...
 nMaxThreads = OMP_GET_MAX_THREADS()
 allocate(find_nz0Array(nMaxThreads ))
 do i=1,nMaxThreads
 allocate(find_nz0Array(i).s0(nz))
 allocate(find_nz0Array(i).irow0(nz))
 allocate(find_nz0Array(i).jcol0(nz))
 end do
...
!$omp parallel
 call find_nz0(find_nz0Array(OMP_GET_THREAD_NUM()))
!$omp end parallel
...
 ! then either process separately
 ! or conjoin data.

subroutine find_nz0(my_nz0)
 use omp_lib ! declares the OMP_GET_* functions as integer
 use yourModule
 type(find_nz0_t) :: my_nz0
 integer :: nNumThreads, nThreadNum, iStride, i
 nNumThreads = OMP_GET_NUM_THREADS() ! may be less than nMaxThreads
 nThreadNum = OMP_GET_THREAD_NUM()
 iStride = nz / nNumThreads
 my_nz0.iFrom = (iStride * nThreadNum) + 1
 if(nThreadNum .eq. (nNumThreads-1)) then
 my_nz0.iTo = nz
 else
 my_nz0.iTo = my_nz0.iFrom + iStride - 1
 endif
 my_nz0.nz0 = 0
 do i=my_nz0.iFrom,my_nz0.iTo
 if(irow(i)>0)then
 ! count and index with this thread's own counter
 my_nz0.nz0=my_nz0.nz0+1
 my_nz0.s0(my_nz0.nz0)=s(i)
 my_nz0.irow0(my_nz0.nz0)=irow(i)
 my_nz0.jcol0(my_nz0.nz0)=jcol(i)
 endif
 enddo
end subroutine find_nz0
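The "conjoin data" step can also be folded into the parallel region. The sketch below is one possible way to do it, not the code from this post: each thread counts the matches in its own slice, one thread turns the counts into starting offsets, and each thread then writes its matches directly into its own section of the shared output, which keeps the serial ordering. The names cnt, offset, mycnt and the chunking formula are assumptions made for the illustration; only irow is compacted to keep it short.

```fortran
program compact_two_pass
  !$ use omp_lib
  implicit none
  integer, parameter :: nz = 200000
  integer, allocatable :: irow(:), irow0(:), cnt(:), offset(:)
  integer :: i, k, nt, nthr, tid, lo, hi, chunk, mycnt, nz0

  allocate(irow(nz), irow0(nz))
  do i = 1, nz
     irow(i) = merge(i, -i, mod(i, 3) == 0)
  end do

  nt = 1
  !$ nt = omp_get_max_threads()
  allocate(cnt(0:nt-1), offset(0:nt-1))
  cnt = 0

  !$omp parallel default(shared) private(i, k, nthr, tid, lo, hi, chunk, mycnt)
  tid = 0
  nthr = 1
  !$ tid = omp_get_thread_num()
  !$ nthr = omp_get_num_threads()
  ! Contiguous slice of the index range owned by this thread
  chunk = (nz + nthr - 1) / nthr
  lo = tid*chunk + 1
  hi = min(nz, lo + chunk - 1)

  ! Pass 1: count the matches in this slice
  mycnt = 0
  do i = lo, hi
     if (irow(i) > 0) mycnt = mycnt + 1
  end do
  cnt(tid) = mycnt
  !$omp barrier

  ! One thread converts the counts into starting offsets (exclusive prefix sum)
  !$omp single
  offset(0) = 0
  do k = 1, nthr - 1
     offset(k) = offset(k-1) + cnt(k-1)
  end do
  !$omp end single
  ! (end single implies a barrier, so every thread sees the offsets)

  ! Pass 2: write the matches into this thread's own output section
  k = offset(tid)
  do i = lo, hi
     if (irow(i) > 0) then
        k = k + 1
        irow0(k) = irow(i)
     end if
  end do
  !$omp end parallel

  nz0 = sum(cnt)
  ! The kept values are 3, 6, 9, ... in the same order as a serial run
  print *, nz0, all(irow0(1:nz0) == [(3*i, i = 1, nz0)])
end program compact_two_pass
```

The lines starting with `!$` are OpenMP conditional compilation, so the same source also builds and runs serially when OpenMP is switched off. The price of the preserved ordering is that the input is scanned twice.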

Jim Dempsey

Zhanghong_T_ · ‎12-13-2006

Hi Jim,

Thank you very much for your help! I will study your code carefully and test the last two examples of your code.

Thanks,

Zhanghong Tang

jimdempseyatthecove · ‎12-13-2006

Tang,

In the last example there is a bug.

call find_nz0(find_nz0Array(OMP_GET_THREAD_NUM()))

should read

call find_nz0(find_nz0Array(OMP_GET_THREAD_NUM() +1 ))

Good luck with your coding.

Jim Dempsey

Zhanghong_T_ · ‎12-13-2006

Hi Jim,

Thank you very much! You are always so kind.

There is another problem: your second example is fine when compiled in IA32 mode. However, if I compile the same code in EM64T mode, the following error appears:

Error1 Error: This name does not have a type, and must have an explicit type. [INTERLOCKEDINCREMENT]

Are there any special settings needed in EM64T mode?

Thanks,

Zhanghong Tang

jimdempseyatthecove · ‎12-13-2006

Tang,

This sounds like a missing header file (module). On IA32 it is "use kernel32". I think that is still valid on IA64. I don't have IA64 installed yet (waiting on CDs from MSDN Pro).

Search the IVF IA64 include folder for .F90 files containing "Interlocked". If that folder also contains a .MOD file of the same name, then that is the name to use in the "use xxx" statement.

Note, "use kernel32" is for WinNT kernel mode calls. WinXP, WinXP x64, and WinVista are all successors to WinNT, so I think kernel32 is still valid.

Now, if the IVF IA64 include folder has a kernel32 but it does not contain the Interlocked... declarations, then copy the text of the interface from the IVF IA32 kernel32.f90 and paste it into your test application (don't modify the distribution copy of the IA64 kernel32). If that works (no link problems and the program runs), then submit an incident report on the Premier site.

Also note, InterlockedIncrement is for 32-bit values. If you are using 64-bit values then use InterlockedIncrement64.

There are a bunch of other Interlocked functions that you may find very useful. Sounds like you are almost up and running.

Jim Dempsey

Zhanghong_T_ · ‎12-14-2006

Hi Jim,

Thank you again! I tested your method and the program still can't be built. I have reported this problem to Intel Premier Support.

Thanks,

Zhanghong Tang

jimdempseyatthecove · ‎12-14-2006

I am sure it is a case of a missing interface block in a system module. The interlocked increment is available for the processor in IA64 mode. A little bit of investigation on your part will be able to identify and fix the problem. Premier Support will first ask you to send in an example. By the time you get an answer you will have found and fixed the problem. Premier Support will eventually get to the problem and issue a resolution, usually "scheduled for next release". What you want is a workaround that fixes the problem now.

To help you with a workaround, write a test program that has OpenMP enabled:

Program Foo
integer(4) :: i4
integer(8) :: i8

i4 = 0
i8 = 0
!$OMP ATOMIC
I4=I4+1
!$OMP ATOMIC
I8=I8+1

end program Foo

Then, using the debugger, set a breakpoint at the i8=0 line. At the break, open a disassembly window. You will probably see a call to a kmp_something; step into that and, if necessary, step in deeper. Eventually you will reach a function that performs the interlocked increment. It may look like

lock inc [eax]

The function containing that is the kernel interlocked increment function. You can call that with the address of the variable to increment.
The interlocked I8 will use something like

lock inc [rax]

Jim

Zhanghong_T_ · ‎12-14-2006

Hi Jim,

Thank you so much! I have never debugged like this before! Because .NET is not installed on that 64-bit machine, I had to debug the code you gave above with the Intel Debugger. The disassembly from I8=I8+1 to the end of the program is as follows:

stopped at [subroutine FOO():10 0x000000000040309b]FOO+0x8b: call __kmpc_atomic_fixed8_add
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bca8]__kmpc_atomic_fixed8_add: subq $0x28, %rsp
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcac]__kmpc_atomic_fixed8_add+0x4: movq %r8, %rcx
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcaf]__kmpc_atomic_fixed8_add+0x7: movq %r9, %rdx
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcb2]__kmpc_atomic_fixed8_add+0xa: call __kmp_test_then_add64
stopped at [ function __kmp_test_then_add64(...) 0x000000000040e1c0]__kmp_test_then_add64: movqr %rdx, %rax
stopped at [ function __kmp_test_then_add64(...) 0x000000000040e1c3]__kmp_test_then_add64+0x3: lock xaddq %rax, (%rcx)
stopped at [ function __kmp_test_then_add64(...) 0x000000000040e1c8]__kmp_test_then_add64+0x8: ret 
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcb7]__kmpc_atomic_fixed8_add+0xf: addq $0x28, %rsp
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcbb]__kmpc_atomic_fixed8_add+0x13: ret 
stopped at [subroutine FOO():12 0x00000000004030a0]FOO+0x90: leaq 0xaf055(%rip), %rcx
stopped at [subroutine FOO():12 0x00000000004030a7]FOO+0x97: call __kmpc_end

So it seems the interlocked increment of I8 uses this instruction:

lock xaddq %rax, (%rcx)

Then what should I do?

Thanks,

Zhanghong Tang

Zhanghong_T_ · ‎12-14-2006

Sorry, I did not paste the whole output:

This is the debugger output panel.

All output from debugger commands (eg: 'print')
and all debugger messages (eg: 'The "attach" command has failed')
are shown in this panel.

The output from your program, and from any shell commands
that you execute from within the debugger, will continue to
be displayed on your terminal.
-----------------------------------------------------------------------------
Reading symbolic information from C:\Users\tangSpeedXPPowerSI\newtest\test.exe...done
The "stepi" command has failed because there is no running program.
stopped at [subroutine FOO():8 0x000000000040307b]FOO+0x6b: movl $0x1, %r9d
stopped at [subroutine FOO():8 0x0000000000403081]FOO+0x71: call __kmpc_atomic_fixed4_add
stopped at [ function __kmpc_atomic_fixed4_add(...) 0x000000000040bc94]__kmpc_atomic_fixed4_add: subq $0x28, %rsp
stopped at [ function __kmpc_atomic_fixed4_add(...) 0x000000000040bc98]__kmpc_atomic_fixed4_add+0x4: movq %r8, %rcx
stopped at [ function __kmpc_atomic_fixed4_add(...) 0x000000000040bc9b]__kmpc_atomic_fixed4_add+0x7: movl %r9d, %edx
stopped at [ function __kmpc_atomic_fixed4_add(...) 0x000000000040bc9e]__kmpc_atomic_fixed4_add+0xa: call __kmp_test_then_add32
stopped at [ function __kmp_test_then_add32(...) 0x000000000040e1b0]__kmp_test_then_add32: movlr %edx, %eax
stopped at [ function __kmp_test_then_add32(...) 0x000000000040e1b2]__kmp_test_then_add32+0x2: lock xaddl %eax, (%rcx)
stopped at [ function __kmp_test_then_add32(...) 0x000000000040e1b6]__kmp_test_then_add32+0x6: ret 
stopped at [ function __kmpc_atomic_fixed4_add(...) 0x000000000040bca3]__kmpc_atomic_fixed4_add+0xf: addq $0x28, %rsp
stopped at [ function __kmpc_atomic_fixed4_add(...) 0x000000000040bca7]__kmpc_atomic_fixed4_add+0x13: ret 
stopped at [subroutine FOO():10 0x0000000000403086]FOO+0x76: leaq 0x28(%rsp), %r8
stopped at [subroutine FOO():10 0x000000000040308b]FOO+0x7b: leaq 0xaf016(%rip), %rcx
stopped at [subroutine FOO():10 0x0000000000403092]FOO+0x82: movl %r12d, %edx
stopped at [subroutine FOO():10 0x0000000000403095]FOO+0x85: movl $0x1, %r9d
stopped at [subroutine FOO():10 0x000000000040309b]FOO+0x8b: call __kmpc_atomic_fixed8_add
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bca8]__kmpc_atomic_fixed8_add: subq $0x28, %rsp
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcac]__kmpc_atomic_fixed8_add+0x4: movq %r8, %rcx
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcaf]__kmpc_atomic_fixed8_add+0x7: movq %r9, %rdx
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcb2]__kmpc_atomic_fixed8_add+0xa: call __kmp_test_then_add64
stopped at [ function __kmp_test_then_add64(...) 0x000000000040e1c0]__kmp_test_then_add64: movqr %rdx, %rax
stopped at [ function __kmp_test_then_add64(...) 0x000000000040e1c3]__kmp_test_then_add64+0x3: lock xaddq %rax, (%rcx)
stopped at [ function __kmp_test_then_add64(...) 0x000000000040e1c8]__kmp_test_then_add64+0x8: ret 
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcb7]__kmpc_atomic_fixed8_add+0xf: addq $0x28, %rsp
stopped at [ function __kmpc_atomic_fixed8_add(...) 0x000000000040bcbb]__kmpc_atomic_fixed8_add+0x13: ret 
stopped at [subroutine FOO():12 0x00000000004030a0]FOO+0x90: leaq 0xaf055(%rip), %rcx
stopped at [subroutine FOO():12 0x00000000004030a7]FOO+0x97: call __kmpc_end

Steven_L_Intel1 · ‎12-14-2006

The problem here is that unlike on IA-32, Windows does not provide InterlockedIncrement and friends as a callable routine, instead relying on C++ to generate inline code. I'm not sure how we'll solve this one...

jimdempseyatthecove · ‎12-14-2006

I am still waiting for my CDs to come in so I can install Windows XP x64 and Vista. Then I can install VS 2005 and IVF on both platforms. Once I do that, I will be in a better position to help with this problem.

All Tang needs is access to the Windows non-inlined kernel function. Calling the kmp_xxx function from IVF is probably a bad suggestion, as he would be traversing a lot of subject-to-change territory. Tang can fall back on a !$OMP CRITICAL section and get by until you publish your findings.

Jim Dempsey

Steven_L_Intel1 · ‎12-14-2006

There is no Windows non-inlined function in this case. I will create one for now.