
## Question about OpenMP's reduction

Dear all,

I need to do a reduction operation on an array. I have the following code:

```
program openmptest
   use omp_lib
   implicit none
   integer :: i, j, id
   integer, parameter :: n = 12
   integer :: a(2,n), b(n), c(n)
   a = reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/), (/2,n/))
   do i=1,n
      b(i) = i
      c(i) = 0
   enddo
   ! Variables
   !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b) REDUCTION(+:c)
   !$OMP DO
   do i=1,n
      do j=a(1,i),a(2,i)
         c(j) = c(j) + b(j)
      enddo
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
   do i=1,n
      write(1,*) b(i)
   enddo
   do i=1,n
      write(1,*) a(:,i), c(i)
   enddo
end program openmptest
```

However, the result is exactly the same as with the following code:

```
program openmptest
   use omp_lib
   implicit none
   integer :: i, j, id
   integer, parameter :: n = 12
   integer :: a(2,n), b(n), c(n)
   a = reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/), (/2,n/))
   do i=1,n
      b(i) = i
      c(i) = 0
   enddo
   ! Variables
   !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c)
   !$OMP DO
   do i=1,n
      do j=a(1,i),a(2,i)
         c(j) = c(j) + b(j)
      enddo
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
   do i=1,n
      write(1,*) b(i)
   enddo
   do i=1,n
      write(1,*) a(:,i), c(i)
   enddo
end program openmptest
```

Is there any difference between 'shared' and 'reduction'?

Thanks,

Tang Laoya

---

**Black Belt:**

When you create an apparent race condition by removing the reduction clause, it's not guaranteed to produce a wrong result, although it may aggravate the inefficiency.

---

**Beginner:**

Dear Tim,

Thank you very much for your kind reply. Do you mean that both cases must get correct results, but that without the reduction clause the efficiency will be decreased?

Thanks

---

**Black Belt:**

No, when you have a race condition, results may change unpredictably. Even if the updates happen in the correct order, the hardware cache protocol may effectively serialize them, or some updates may be missed, depending on the number of threads and the NUMA characteristics of your platform, as well as circumstances outside your control. Some OpenMP compilers (not so much Intel's) may not implement an OpenMP directive when it appears defective, so you would need to examine the build logs.
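
As a concrete illustration (a minimal sketch added here, not part of the original reply), the difference between a plain shared update and a reduction shows up clearly with a scalar accumulator: the shared version races and can lose updates, while the reduction version gives each thread a private copy that OpenMP combines at the end of the region.

```
program reduction_demo
   use omp_lib
   implicit none
   integer :: i, s_shared, s_reduced

   ! Racy: all threads update s_shared with no synchronization,
   ! so increments can be lost and the result varies run to run.
   s_shared = 0
   !$OMP PARALLEL DO SHARED(s_shared)
   do i = 1, 1000000
      s_shared = s_shared + 1
   enddo
   !$OMP END PARALLEL DO

   ! Correct: each thread accumulates into a private copy, and OpenMP
   ! sums the copies into s_reduced at the end of the region.
   s_reduced = 0
   !$OMP PARALLEL DO REDUCTION(+:s_reduced)
   do i = 1, 1000000
      s_reduced = s_reduced + 1
   enddo
   !$OMP END PARALLEL DO

   print *, 'shared (racy):', s_shared, '  reduction:', s_reduced
end program reduction_demo
```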

---

**Beginner:**

Dear Tim,

Thank you very much for your kind reply. Do you think the first code is the correct implementation?

I just showed an example code; in my real code a large number of variables are shared by the threads, and when I execute the program it crashes with a 'stack overflow' as soon as it enters the OpenMP part. What do you think could lead to this problem?

Thanks

---

**Black Belt:**

Your test codes (both) have two problems:

1) Your array sizes are too small to parallelize effectively.
2) The subscripts generated by the do j loop are indeterminate as to whether different threads will have overlapping regions.

Number 2) could be corrected by changing the j/i loop order. Using OpenMP reduction on such code would be inefficient: normally one uses reduction on a scalar or a small array; you wouldn't want to do this on a large array.
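
For reference, here is a minimal runnable sketch (an addition for this writeup, not Jim's code) of what an array reduction does: each thread receives a private, zero-initialized copy of the whole array, and the copies are merged into the shared array when the region ends. For an array with millions of elements and many threads, those per-thread copies and merges are what make it inefficient.

```
program array_reduction_demo
   use omp_lib
   implicit none
   integer, parameter :: n = 8
   integer :: i, c(n)
   c = 0
   ! Each thread gets a private copy of all of c (O(n) memory per thread),
   ! and the copies are summed into the shared c at the end (an O(n) merge
   ! per thread): cheap for n = 8, costly for n in the millions.
   !$OMP PARALLEL DO REDUCTION(+:c)
   do i = 1, n
      c(i) = c(i) + i
   enddo
   !$OMP END PARALLEL DO
   print *, c
end program array_reduction_demo
```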

Jim Dempsey

---

**Beginner:**

Dear Jim,

Thank you very much for your kind comment.

For your first issue, I just posted a test code; the real code has a much larger n. For your second issue, a(2,i) and a(1,i) depend on i, so how can I swap the i/j loop order?

Do you mean that reduction on a large array is also inefficient?

Thanks

---

**Black Belt:**
```
program openmptest
   use omp_lib
   implicit none
   ! Variables
   integer :: i, j
   integer, parameter :: n = 12
   integer :: a(2,n), b(n), c(n)
   integer :: nThreads, iThread, iSliceSize, iResidual, iBegin, iEnd
   ! code
   a = reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/), (/2,n/))
   do i=1,n
      b(i) = i
      c(i) = 0
   enddo
   !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c)
   nThreads = omp_get_num_threads()
   iThread = omp_get_thread_num()     ! 0-based thread number
   iSliceSize = n / nThreads          ! all threads calculate the same value
   iResidual = mod(n, nThreads)       ! first iResidual threads take one extra element
   iBegin = iSliceSize * iThread + 1
   if(iThread < iResidual) then
      iBegin = iBegin + iThread
      iEnd = iBegin + iSliceSize
   else
      iBegin = iBegin + iResidual
      iEnd = iBegin + iSliceSize - 1
   endif

   ! Each thread owns the slice c(iBegin:iEnd) exclusively, so no
   ! reduction and no synchronization are needed for the updates.
   do i=1,n
      if(a(2,i) < iBegin .or. a(1,i) > iEnd) cycle   ! range does not touch this slice
      do j=max(a(1,i),iBegin), min(a(2,i),iEnd)
         c(j) = c(j) + b(j)
      enddo
   enddo
   !$OMP END PARALLEL
   do i=1,n
      write(1,*) b(i)
   enddo
   do i=1,n
      write(1,*) a(:,i), c(i)
   enddo
end program openmptest
```

The above is untested code. It does not require reduction.

You can improve the above code by aligning your arrays and then calculating iSliceSize as the nearest multiple of the number of variables that fit within a cache line. Then use !DIR$ directives to indicate that the arrays in the inner do loop are aligned at the start; this eliminates the runtime code that checks for alignment.
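
To illustrate (an untested sketch added here, assuming Intel Fortran's !DIR$ directives and 64-byte cache lines; the post does not spell these out), the alignment part might look like:

```
program align_demo
   implicit none
   integer, parameter :: n = 1024
   integer :: j
   real*8 :: b(n), c(n)
   ! Place each array on a 64-byte (cache line) boundary.
   !DIR$ ATTRIBUTES ALIGN : 64 :: b
   !DIR$ ATTRIBUTES ALIGN : 64 :: c
   b = 1.0d0
   c = 0.0d0
   ! Promise the compiler the accesses below are aligned, so it can
   ! omit the runtime alignment checks (peel loop) when vectorizing.
   ! With threads, iSliceSize would also be rounded to a multiple of
   ! 8 real*8 elements so slices never straddle a cache line.
   !DIR$ VECTOR ALIGNED
   do j = 1, n
      c(j) = c(j) + b(j)
   enddo
   print *, c(1), c(n)
end program align_demo
```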

Jim Dempsey

---

**Beginner:**

Sorry, I don't understand. What do you mean?

Thanks

---

**Black Belt:**

How large will n be under typical use?
What is the frequency and degree of overlap of the index tuples a(1,i),a(2,i)?

In the provided example

Jim Dempsey

---

**Beginner:**

Dear Jim,

Thank you very much for your kind reply. In fact, n is the number of finite elements; usually it is several million. I am trying to parallelize the stiffness matrix assembly part of a finite element code.

Thanks

---

**Black Belt:**

In my example posted at github/tprince, array reduction is fast for one thread, but swapping loops so as to keep data local to threads is faster beyond two threads when using Intel compilers, which support SIMD reduction in the inner loop. I set it up for single-thread array reduction when using gcc and MSVC on account of the more limited capabilities of those compilers.
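
The github code itself is not reproduced in this thread, but the loop-swap pattern Tim describes might look roughly like the sketch below (a construction for this writeup, reusing the arrays from the original post): the outer parallel loop runs over the output index j, so each thread owns its own c(j), and the inner loop reduces into a private scalar that the compiler can vectorize.

```
program loopswap_demo
   use omp_lib
   implicit none
   integer, parameter :: n = 12
   integer :: i, j, hits
   integer :: a(2,n), b(n), c(n)
   a = reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/), (/2,n/))
   b = (/(i, i=1,n)/)
   c = 0
   ! Outer loop over the output index j: no two threads touch the same
   ! c(j), so no reduction clause and no synchronization are needed.
   !$OMP PARALLEL DO PRIVATE(i, hits)
   do j = 1, n
      hits = 0
      ! the inner loop vectorizes as a SIMD reduction on a scalar
      !$OMP SIMD REDUCTION(+:hits)
      do i = 1, n
         if (a(1,i) <= j .and. j <= a(2,i)) hits = hits + 1
      enddo
      c(j) = c(j) + hits * b(j)
   enddo
   !$OMP END PARALLEL DO
   print *, c
end program loopswap_demo
```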

---

**Black Belt:**

Note to all: I modified the inner loops:

```
do i=1,n
   if(a(2,i) < iBegin .or. a(1,i) > iEnd) cycle   ! range does not touch this slice
   do j=max(a(1,i),iBegin), min(a(2,i),iEnd)
      c(j) = c(j) + b(j)
   enddo
enddo
```

The above will aid vectorization as well as eliminate the IF test from the inner loop.

Jim Dempsey

---

**Beginner:**

Dear Jim and Tim,

Thank you very much!

---

**Beginner:**

Dear all,

I have now correctly implemented the reduction with the MPI function MPI_Allreduce, but the efficiency is still not good. The original code is like the following:

```
...
A = 0.0d0
do
   ! update array A
   ...
   A(i) = A(i) + delta
   ...
   ! if convergent, exit
enddo
```

Note that inside the loop A is not cleared to zero, so I have to do it like this:

```
! define temp arrays A0 and A00
allocate(A0(n), A00(n))
...
A = 0.0d0
do
   A0 = 0.0d0
   ! update array A0 in parallel
   ...
   A0(i) = A0(i) + delta(i)
   ! do the reduction operation
   call MPI_Allreduce(A0, A00, n, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
   A = A + A00
   ...
   ! if convergent, exit
enddo
```

Inside the loop, two extra operations have been added: A0=0.0d0 and A=A+A00. Since n is very large, these two operations take considerable time. Is there any better way to save CPU time?
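
One partial saving worth noting here (an editorial suggestion, not from the replies): MPI_Allreduce accepts MPI_IN_PLACE as the send buffer, so the global sum overwrites A0 directly and the A00 array disappears entirely. The zeroing of A0 and the accumulation into A remain, but one large temporary array and its copy are gone. A minimal sketch:

```
program allreduce_inplace
   use mpi
   implicit none
   integer, parameter :: n = 4
   integer :: ierr
   real*8 :: A(n), A0(n)
   call MPI_Init(ierr)
   A = 0.0d0
   A0 = 1.0d0     ! stand-in for this rank's per-iteration deltas
   ! MPI_IN_PLACE: the reduced result overwrites A0, so no A00 is needed.
   call MPI_Allreduce(MPI_IN_PLACE, A0, n, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr)
   A = A + A0
   print *, A
   call MPI_Finalize(ierr)
end program allreduce_inplace
```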

Thanks

---

**Black Belt:**

Would you please post a snippet of serial code that works? You can excise the unimportant parts, but you must include sufficient information so we can understand your code better.

IOW, this last post shows you are (or may be) using MPI, presumably with more than one Rank.

Earlier posts indicate you are using OpenMP. It is not clear from your information whether each (SMP) host in your cluster is running one MPI Rank, with each Rank introducing additional threading using OpenMP...

.OR.

Each (SMP) host in your cluster is running n MPI Ranks (n = the number of logical processors on that host), with each Rank introducing additional threading using OpenMP.

The last situation would introduce gross oversubscription of threads: Rank0NumberOfLogicalProcessors**2 + Rank1NumberOfLogicalProcessors**2 + ... For example, a host with 16 logical processors running 16 Ranks, each spawning 16 OpenMP threads, would carry 256 threads.

Jim Dempsey

---

**Beginner:**

Dear Jim,

Thank you very much for your kind reply. The following is a serial test code that mimics the assembly of a finite element matrix. Could you please give me some suggestions?

```
program paralleltest
   use ifport
   implicit none
   integer, parameter :: nmax = 10000000
   integer, parameter :: nzmax = 100000000
   integer, parameter :: itermax = 100
   integer :: i, j, k, k1, k2, iter
   integer, allocatable :: rowrange(:,:)
   real*8, allocatable :: a(:)
   allocate(rowrange(2,nmax), a(nzmax))

   ! use a random rowrange to mimic the overlapped stiffness matrix of finite elements
   do i=1,nmax
      k1 = max(min(rand()*1.1d0*nzmax, dble(nzmax)), 1.0d0)
      k2 = max(min(rand()*1.1d0*nzmax, dble(nzmax)), 1.0d0)
      if(k1 < k2)then
         rowrange(1,i) = k1
         rowrange(2,i) = k2
      else
         rowrange(1,i) = k2
         rowrange(2,i) = k1
      endif
   enddo

   a = 0.0d0
   do iter=1,itermax  ! use itermax to mimic the convergence of the iteration
      ! in every iteration, the global stiffness matrix is updated
      do i=1,nmax
         do j=rowrange(1,i),rowrange(2,i)
            a(j) = a(j) + rand()
         enddo
      enddo
   enddo
end program paralleltest
```
I use MPI instead of OpenMP because the program displayed a 'stack overflow' error message when I used OpenMP. Furthermore, there are multiple compute nodes with distributed memory, and OpenMP-based parallelization can only work on a single compute node with shared memory.

Thanks

---

**Employee:**

Parallelizing finite element stiffness matrix assembly is far from trivial.  Check the literature, but recognize there are no easy answers.

- A parallel algorithm for stiffness matrix assembling in a shared memory environment, Marcelo Novaes de Rezende, Joao Batista de Paiva

- Parallel assembling and equation solving via graph algorithms with an application to the FE simulation of metal extrusion processes, A. Unterkircher, J. Reissner

---

**Beginner:**

Dear Gregg,

Thank you very much for your kind reply. I know it is difficult to write efficient parallel code to assemble the FEM stiffness matrix. I will read the papers you suggested.

Thanks

---

**Black Belt:**

From your posted code (which is a pared-down version of your real code), is it safe to assume:

The array rowrange, once initialized, remains unchanged for the duration of the second loop?
(It is in the sample above, but it is unknown whether your full program leaves it unchanged.)

The second loop adds rand() to a(j). Is rand() a stand-in for a large chunk of code?
(Adding a comment to your posted code would have alleviated this question.)

The code I posted in #8, with the revisions in #13, will work within a rank (or in a single process), provided that the code represented by rand() on line 30 of post #17, and whatever may precede that line within the do j loop, has no temporal issues (IOW, no references to a(?) other than a(j)).

Something like this untested code:

```
program openmptest
   use ifport
   use omp_lib
   implicit none
   integer, parameter :: nmax = 10000000
   integer, parameter :: nzmax = 100000000
   integer, parameter :: itermax = 100
   integer :: i, j, k, k1, k2, iter
   integer :: nThreads, iThread, iSliceSize, iResidual, iBegin, iEnd
   integer, allocatable :: rowrange(:,:)
   real*8, allocatable :: a(:)
   logical :: lConverged
   allocate(rowrange(2,nmax), a(nzmax))

   ! use a random rowrange to mimic the overlapped stiffness matrix of finite elements
   do i=1,nmax
      k1 = max(min(rand()*1.1d0*nzmax, dble(nzmax)), 1.0d0)
      k2 = max(min(rand()*1.1d0*nzmax, dble(nzmax)), 1.0d0)
      if(k1 < k2)then
         rowrange(1,i) = k1
         rowrange(2,i) = k2
      else
         rowrange(1,i) = k2
         rowrange(2,i) = k1
      endif
   enddo

   a = 0.0d0

   !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(rowrange,a,lConverged)
   nThreads = omp_get_num_threads()
   iThread = omp_get_thread_num()       ! 0-based thread number
   ! note: the slices partition the index space of a(1:nzmax),
   ! not the element loop 1:nmax
   iSliceSize = nzmax / nThreads        ! all threads calculate the same value
   iResidual = mod(nzmax, nThreads)     ! first iResidual threads take one extra element
   iBegin = iSliceSize * iThread + 1
   if(iThread < iResidual) then
      iBegin = iBegin + iThread
      iEnd = iBegin + iSliceSize
   else
      iBegin = iBegin + iResidual
      iEnd = iBegin + iSliceSize - 1
   endif
   do iter=1,itermax  ! use itermax to mimic the convergence of the iteration
      ! in every iteration, the global stiffness matrix is updated
      do i=1,nmax
         if(rowrange(2,i) < iBegin .or. rowrange(1,i) > iEnd) cycle
         do j=max(rowrange(1,i),iBegin), min(rowrange(2,i),iEnd)
            ! it is unknown how you produce the data to be accumulated
            ! *** do not use rand() for testing
            ! *** as the results will not be deterministic in multi-threaded mode
            ! *** also rand() has a critical section and will trash parallel performance
            ! *** instead, use a deterministic means to generate test data
            a(j) = a(j) + tiny(a(j)) * j
         enddo
      enddo
      !$OMP BARRIER
      !$OMP MASTER
      ! try to parallelize this if possible
      ! for now bail after 100 iterations
      lConverged = (iter >= 100)
      ! **** MPI optional code
      ! update global lConverged table in Rank 0
      ! read updated lConverged table from Rank 0 (all other participating ranks)
      ! lConverged = ANY(copyOfgloballConverged)
      ! *** end MPI optional code
      !$OMP END MASTER
      !$OMP BARRIER
      if(lConverged) exit
   enddo
   !$OMP END PARALLEL
end program openmptest
```

The above is untested and incomplete.

For your correctness test, use your original code, with the exception of replacing the rand() with the expression in the above code; then add the convergence test and bailout.

Have the old (revised) code write its output to a different array, then compare the two outputs. The results should be the same.
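
For instance, the comparison might look like the sketch below (a_serial and a_parallel are hypothetical names for the two output arrays). Exact equality should be usable here because each a(j) is updated by exactly one thread, in the same i order as in the serial code; with nondeterministic floating-point summation a tolerance would be needed instead.

```
program compare_results
   implicit none
   integer, parameter :: n = 5
   ! hypothetical stand-ins for the serial and parallel outputs
   real*8 :: a_serial(n), a_parallel(n)
   a_serial   = (/1d0, 2d0, 3d0, 4d0, 5d0/)
   a_parallel = (/1d0, 2d0, 3d0, 4d0, 5d0/)
   if (all(a_serial == a_parallel)) then
      print *, 'results match'
   else
      print *, 'mismatch in', count(a_serial /= a_parallel), 'elements'
   endif
end program compare_results
```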

Jim Dempsey 