- Intel Community
- Software
- Software Development Topics
- Intel® Moderncode for Parallel Architectures
- Question of Openmp's reduction


Zhanghong_T_

Beginner


04-11-2016
05:57 PM

152 Views

Question about OpenMP's reduction

Dear all,

I need to perform a reduction operation on an array. I have the following code:

```fortran
program openmptest
use omp_lib
implicit none
integer::i,j,id
integer,parameter::n=12
integer::a(2,n),b(n),c(n)
a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/))
do i=1,n
  b(i)=i
  c(i)=0
enddo
! Variables
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b) reduction(+:c)
!$OMP do
do i=1,n
  do j=a(1,i),a(2,i)
    c(j)=c(j)+b(j)
  enddo
enddo
!$OMP enddo
!$OMP END PARALLEL
do i=1,n
  write(1,*)b(i)
enddo
do i=1,n
  write(1,*)a(:,i),c(i)
enddo
end program openmptest
```

However, the result is exactly the same as with the following code:

```fortran
program openmptest
use omp_lib
implicit none
integer::i,j,id
integer,parameter::n=12
integer::a(2,n),b(n),c(n)
a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/))
do i=1,n
  b(i)=i
  c(i)=0
enddo
! Variables
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c)
!$OMP do
do i=1,n
  do j=a(1,i),a(2,i)
    c(j)=c(j)+b(j)
  enddo
enddo
!$OMP enddo
!$OMP END PARALLEL
do i=1,n
  write(1,*)b(i)
enddo
do i=1,n
  write(1,*)a(:,i),c(i)
enddo
end program openmptest
```

Is there any difference between 'shared' and 'reduction'?
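In other words: with shared(c), all threads update the same array concurrently, while reduction(+:c) gives each thread a private, zero-initialized copy of c that is summed into the shared array when the parallel region ends. The effect is language-independent, so here is a minimal simulation of the reduction scheme (Python sketch; the round-robin schedule is a made-up stand-in for OpenMP's), checked against a serial run of the posted loop:

```python
# Simulate OpenMP "reduction(+:c)" semantics: each "thread" accumulates
# into a private zero-initialized copy of c; the copies are summed at the
# end.  Illustrative sketch only -- `reduction_sim` is not a real API.

n = 12
a = [(1, 8), (3, 9), (2, 7), (3, 12), (4, 12), (2, 10), (3, 11),
     (2, 9), (2, 10), (3, 8), (2, 8), (3, 10)]   # the a(1,i), a(2,i) pairs
b = list(range(1, n + 1))                         # b(i) = i

def serial():
    c = [0] * n
    for lo, hi in a:
        for j in range(lo, hi + 1):
            c[j - 1] += b[j - 1]
    return c

def reduction_sim(num_threads=4):
    # Split the i-loop among threads; each accumulates into a private copy.
    private = [[0] * n for _ in range(num_threads)]
    for i, (lo, hi) in enumerate(a):
        t = i % num_threads                       # some schedule
        for j in range(lo, hi + 1):
            private[t][j - 1] += b[j - 1]
    # The reduction step: combine the private copies into the shared array.
    return [sum(col) for col in zip(*private)]

print(reduction_sim() == serial())  # -> True
```

With plain shared(c) and no synchronization, the very same updates race on c(j); getting identical output on a tiny input only means the race happened not to bite.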

Thanks,

Tang Laoya


19 Replies

TimP

Black Belt


04-12-2016
05:29 AM


Zhanghong_T_

Beginner


04-12-2016
08:25 AM


Dear Tim,

Thank you very much for your kind reply. Do you mean that both cases should produce correct results, but that without the reduction clause efficiency is reduced?

Thanks

TimP

Black Belt


04-12-2016
12:13 PM


Zhanghong_T_

Beginner


04-12-2016
05:47 PM


Dear Tim,

Thank you very much for your kind reply. Do you think the first code is the correct implementation?

I showed only an example code. In my real code a large number of variables are shared by threads, and when I execute the program it crashes on entering the OpenMP region with a 'stack overflow' message. What do you think could cause this problem?

Thanks

jimdempseyatthecove

Black Belt


04-13-2016
06:00 AM


Your test code (both versions) has two problems:

1) Your array sizes are too small to parallelize effectively.

2) The subscripts generated by the do j loop are indeterminate as to whether different threads will have overlapping regions.

Number 2) could be corrected by changing the j/i loop order. Using an OpenMP reduction on such code would be inefficient: normally one uses reduction on a scalar or a small array. You wouldn't want to do this on a large array.
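The overlap in point 2) can be checked mechanically for the tuples in the posted example. A short Python count of how many i iterations write each c(j):

```python
# Count how often the j-ranges a(1,i)..a(2,i) from the posted example
# overlap, i.e. how many iterations of the i-loop write the same c(j).
from collections import Counter

a = [(1, 8), (3, 9), (2, 7), (3, 12), (4, 12), (2, 10), (3, 11),
     (2, 9), (2, 10), (3, 8), (2, 8), (3, 10)]

touches = Counter()
for lo, hi in a:
    for j in range(lo, hi + 1):
        touches[j] += 1

# j values in the middle of the range are written by many different i
# iterations, so distributing the i-loop across threads produces
# concurrent unsynchronized writes to c(j).
print(max(touches.values()))  # -> 12
```

Since almost every j is hit by several iterations, any thread schedule over i risks concurrent writes to the same c(j) unless a reduction or a disjoint-slice restructuring is used.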

Jim Dempsey

Zhanghong_T_

Beginner


04-13-2016
06:16 AM


Dear Jim,

Thank you very much for your kindly comment.

For your first issue: I posted only a test code; the real code has a much larger n. For your second issue: a(2,i) and a(1,i) depend on i, so how can the i/j loop order be swapped?

Do you mean that reduction on a large array is also inefficient?

Thanks

jimdempseyatthecove

Black Belt


04-13-2016
07:04 AM


```fortran
program openmptest
use omp_lib
implicit none
! Variables
integer::i,j,id
integer,parameter::n=12
integer::a(2,n),b(n),c(n)
integer::iThread,nThreads,iBegin,iEnd,iSliceSize,iResidual
! code
a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/))
do i=1,n
  b(i)=i
  c(i)=0
enddo
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c) PRIVATE(iThread,iBegin,iEnd)
iThread = omp_get_thread_num()
nThreads = omp_get_num_threads()
iSliceSize = n / nThreads     ! need not be private, as all threads will calculate the same value
iResidual = mod(n,iSliceSize) ! need not be private, as all threads will calculate the same value
iBegin = iSliceSize * iThread + 1
if(iThread > iResidual) then
  iBegin = iBegin + iResidual
  iEnd = iBegin + iSliceSize
else
  iBegin = iBegin + iThread
  iEnd = iBegin + iSliceSize - 1
endif
do i=1,n
  if(a(1,i) < iBegin) cycle
  do j=a(1,i),min(a(2,i),iEnd)
    c(j)=c(j)+b(j)
  enddo
enddo
!$OMP END PARALLEL
do i=1,n
  write(1,*)b(i)
enddo
do i=1,n
  write(1,*)a(:,i),c(i)
enddo
end program openmptest
```

The above is untested code; it does not require reduction.

You can improve the above code by aligning your arrays, calculating iSliceSize as the nearest multiple of the number of variables that fit within a cache line, and then using !DIR$ directives to indicate that the inner do loop arrays are aligned at the start. This eliminates the runtime code that checks for alignment.
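The slice-size calculation described here can be sketched as follows (Python for illustration; the 64-byte cache line and 4-byte element size are assumptions that depend on the target):

```python
# Round a per-thread slice size up to a whole number of cache lines so
# that no two threads share a line of c (avoids false sharing).
# Assumptions (target-dependent): 64-byte cache line, 4-byte integers.
CACHE_LINE_BYTES = 64
ELEM_BYTES = 4
ELEMS_PER_LINE = CACHE_LINE_BYTES // ELEM_BYTES   # 16 elements per line

def slice_bounds(n, num_threads):
    """Return 1-based inclusive (begin, end) bounds per thread, with the
    slice size rounded up to a multiple of ELEMS_PER_LINE."""
    raw = -(-n // num_threads)                    # ceil(n / num_threads)
    step = -(-raw // ELEMS_PER_LINE) * ELEMS_PER_LINE
    bounds = []
    for t in range(num_threads):
        begin = t * step + 1
        end = min((t + 1) * step, n)
        if begin <= n:                            # last threads may be idle
            bounds.append((begin, end))
    return bounds

print(slice_bounds(100, 3))  # -> [(1, 48), (49, 96), (97, 100)]
```

Because every slice begins on a cache-line boundary of the array, no two threads write to the same line.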

Jim Dempsey

Zhanghong_T_

Beginner


04-13-2016
07:24 AM


Sorry, I don't understand, what do you mean?

Thanks

jimdempseyatthecove

Black Belt


04-13-2016
09:56 AM


How large will n be under typical use?

What is the frequency and degree of overlap of the index tuples a(1,i),a(2,i)?

In the provided example

Jim Dempsey

Zhanghong_T_

Beginner


04-13-2016
09:06 PM


Dear Jim,

Thank you very much for your kind reply. In fact, n is the number of finite elements, usually several million. I am trying to parallelize the stiffness-matrix assembly part of a finite element code.

Thanks

TimP

Black Belt


04-14-2016
04:56 AM


jimdempseyatthecove

Black Belt


04-14-2016
05:29 AM


Note to all. I modified the inner loops:

```fortran
do i=1,n
  if(a(1,i) < iBegin) cycle
  do j=a(1,i),min(a(2,i),iEnd)
    c(j)=c(j)+b(j)
  enddo
enddo
```

The above will aid vectorization as well as eliminate the IF test.
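The clamped-slice idea can be sanity-checked in miniature. The sketch below (Python; note it also clamps the lower bound with max(), so a range that starts before a thread's slice still contributes its in-slice part) verifies that per-slice partial updates reproduce the serial result when the slices tile 1..n:

```python
# Verify that clamping every range a(1,i)..a(2,i) to a thread's slice
# [begin, end] -- max() below, min() above -- reproduces the serial sums
# when the slices tile 1..n with no overlap.
n = 12
a = [(1, 8), (3, 9), (2, 7), (3, 12), (4, 12), (2, 10), (3, 11),
     (2, 9), (2, 10), (3, 8), (2, 8), (3, 10)]
b = list(range(1, n + 1))

def serial():
    c = [0] * n
    for lo, hi in a:
        for j in range(lo, hi + 1):
            c[j - 1] += b[j - 1]
    return c

def sliced(slices):
    c = [0] * n
    for begin, end in slices:         # one "thread" per slice
        for lo, hi in a:
            for j in range(max(lo, begin), min(hi, end) + 1):
                c[j - 1] += b[j - 1]  # each thread writes only its slice
    return c

print(sliced([(1, 6), (7, 12)]) == serial())  # -> True
```

Since each thread writes only within its own slice, no two threads ever touch the same c(j), and no reduction is needed.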

Jim Dempsey

Zhanghong_T_

Beginner


04-15-2016
07:02 PM


Dear Jim and Tim,

Thank you very much!

Zhanghong_T_

Beginner


04-16-2016
12:26 AM


Dear all,

I have now implemented the reduction correctly with the MPI function MPI_Allreduce, but the efficiency is still not good. The original code is like the following:

```fortran
...
A=0.0d0
do
  ! update array A
  ...
  A(i)=A(i)+delta
  ...
  ! if convergent, exit
enddo
```

Note that inside the loop A is not cleared to zero, so I have to do this:

```fortran
! define temp arrays A0 and A00
allocate(A0(n),A00(n))
...
A=0.0d0
do
  A0=0.0d0
  ! update array A0 in parallel
  ...
  A0(i)=A0(i)+delta(i)
  ! do reduction operation
  call MPI_Allreduce(A0,A00,n,MPI_DOUBLE,MPI_SUM,comm,ierr)
  A=A+A00
  ...
  ! if convergent, exit
enddo
```

In the loop, two extra operations are added: A0=0.0d0 and A=A+A00. Since n is very large, these two operations take considerable time. Is there any better way to save CPU time?
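The pattern in the post can be modeled to confirm it is equivalent to the serial accumulation. A small simulation (Python; ranks and the allreduce are faked, and delta() is a made-up stand-in for the real per-rank update):

```python
# Model the posted pattern: each iteration, every rank accumulates its own
# deltas into a zeroed A0, the A0s are allreduced (summed), and the result
# is added into the persistent A.  Ranks and the allreduce are simulated.
n = 8
iters = 3
ranks = 2

def delta(rank, it, i):
    # Stand-in for the per-rank update; deterministic for the check.
    return (rank + 1) * (it + 1) + i

# --- posted pattern: A0 = 0; local updates; allreduce; A += result ---
A = [0.0] * n
for it in range(iters):
    partials = []
    for r in range(ranks):
        A0 = [0.0] * n
        for i in range(n):
            if i % ranks == r:            # each rank updates its share
                A0[i] += delta(r, it, i)
        partials.append(A0)
    A00 = [sum(col) for col in zip(*partials)]   # MPI_Allreduce(SUM)
    A = [x + y for x, y in zip(A, A00)]

# --- serial reference: every update applied directly to A ---
ref = [0.0] * n
for it in range(iters):
    for r in range(ranks):
        for i in range(n):
            if i % ranks == r:
                ref[i] += delta(r, it, i)

print(A == ref)  # -> True
```

One of the two buffers can often be dropped by passing MPI_IN_PLACE as the send buffer of MPI_Allreduce (a standard MPI feature): the sum is then written into A0 itself and A00 disappears, though the A = A + result accumulation remains.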

Thanks

jimdempseyatthecove

Black Belt


04-16-2016
04:57 AM


Would you please post a snippet of serial code that works? You can excise the unimportant parts, but you must include sufficient information so we can understand your code better.

IOW, this last post shows you are (or may be) using MPI, presumably with more than one rank.

Earlier posts indicate you are using OpenMP. It is not clear from your information whether each (SMP) host in your cluster is running one MPI rank, with each rank introducing additional threading using OpenMP...

.OR.

Each (SMP) host in your cluster is running n MPI ranks (n = number of logical processors on that host), *** with each rank introducing additional threading using OpenMP.

The last situation would introduce gross oversubscription of threads: Rank0NumberOfLogicalProcessors**2 + Rank1NumberOfLogicalProcessors**2 + ...
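The arithmetic behind that warning, as a sketch (Python; the host sizes are hypothetical):

```python
# Quantify the oversubscription: one MPI rank per logical processor, each
# rank spawning one OpenMP thread per logical processor, gives procs**2
# threads per host -- versus procs threads when MPI and OpenMP are nested
# properly (one rank per host).
def threads_per_cluster(procs_per_host):
    """procs_per_host: list of logical-processor counts, one per host."""
    oversubscribed = sum(p * p for p in procs_per_host)
    intended = sum(procs_per_host)
    return oversubscribed, intended

print(threads_per_cluster([16, 16, 16]))  # -> (768, 48)
```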

Jim Dempsey

Zhanghong_T_

Beginner


04-17-2016
07:17 PM


Dear Jim,

Thank you very much for your kind reply. The following is a test serial code that mimics the assembly of a finite element matrix. Could you please give me some suggestions?

```fortran
program paralleltest
use ifport
implicit none
integer,parameter::nmax=10000000
integer,parameter::nzmax=100000000
integer,parameter::itermax=100
integer::i,j,k,k1,k2,iter
integer,allocatable::rowrange(:,:)
real*8,allocatable::a(:)
allocate(rowrange(2,nmax),a(nzmax))
! use a random rowrange to fit overlapped stiff matrix of finite element
do i=1,nmax
  k1=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0)
  k2=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0)
  if(k1<k2)then
    rowrange(1,i)=k1
    rowrange(2,i)=k2
  else
    rowrange(1,i)=k2
    rowrange(2,i)=k1
  endif
enddo
a=0.0d0
do iter=1,itermax ! use itermax to fit the convergence of iteration
  ! in every iteration, the global stiff matrix is updated
  do i=1,nmax
    do j=rowrange(1,i),rowrange(2,i)
      a(j)=a(j)+rand()
    enddo
  enddo
enddo
end program paralleltest
```

I use MPI instead of OpenMP because a 'stack overflow' error message appeared when I used OpenMP. Furthermore, there are multiple compute nodes with distributed memory, and OpenMP-based parallelization can only work on a single compute node with shared memory.

Thanks

Gregg_S_Intel

Employee


04-18-2016
01:34 PM


Parallelizing finite element stiffness matrix assembly is far from trivial. Check the literature, but recognize there are no easy answers.

- A parallel algorithm for stiffness matrix assembling in a shared memory environment, Marcelo Novaes De Rezende, Joao Batista de Paiva

Zhanghong_T_

Beginner


04-19-2016
09:55 PM


Dear Gregg,

Thank you very much for your kind reply. I know it is difficult to write efficient parallel code to assemble the FEM stiffness matrix. I will read the paper you suggested.

Thanks

jimdempseyatthecove

Black Belt


04-21-2016
12:45 PM


From your posted code (which is a pared-down version of your real code), is it safe to assume:

The array rowrange, once initialized, remains unchanged for the duration of the second loop?

(It is in the sample above, but it is unknown whether your formal program leaves this unchanged.)

The second loop is adding rand() to a(j). Is rand() a substitute for a large chunk of code?

(Note: adding a comment to your code post would alleviate this question.)

The code I posted in #8, with the revisions in #13, will work within a rank (or in a single process)...

...provided that the code represented by rand() on line 30 in post #17, and anything that may precede that line within the do j loop, has no temporal issues (IOW, references a(?) other than j).

Something like this untested code:

```fortran
program openmptest
use ifport
use omp_lib
implicit none
integer,parameter::nmax=10000000
integer,parameter::nzmax=100000000
integer,parameter::itermax=100
integer::i,j,k,k1,k2,iter
integer,allocatable::rowrange(:,:)
real*8,allocatable::a(:)
integer::iThread,nThreads,iBegin,iEnd,iSliceSize,iResidual
logical :: lConverged
allocate(rowrange(2,nmax),a(nzmax))
! use a random rowrange to fit overlapped stiff matrix of finite element
do i=1,nmax
  k1=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0)
  k2=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0)
  if(k1<k2)then
    rowrange(1,i)=k1
    rowrange(2,i)=k2
  else
    rowrange(1,i)=k2
    rowrange(2,i)=k1
  endif
enddo
a=0.0d0
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,rowrange,lConverged) PRIVATE(iThread,iBegin,iEnd,iter,i)
iThread = omp_get_thread_num()
nThreads = omp_get_num_threads()
iSliceSize = nmax / nThreads     ! need not be private, as all threads will calculate the same value
iResidual = mod(nmax,iSliceSize) ! need not be private, as all threads will calculate the same value
iBegin = iSliceSize * iThread + 1
if(iThread > iResidual) then
  iBegin = iBegin + iResidual
  iEnd = iBegin + iSliceSize
else
  iBegin = iBegin + iThread
  iEnd = iBegin + iSliceSize - 1
endif
do iter=1,itermax ! use itermax to fit the convergence of iteration
  ! in every iteration, the global stiff matrix is updated
  do i=1,nmax
    if(rowrange(1,i) < iBegin) cycle
    do j=rowrange(1,i),min(rowrange(2,i),iEnd)
      ! it is unknown how you produce the data to be accumulated
      ! *** do not use rand() for testing
      ! *** as the results will not be deterministic in multi-threaded mode
      ! *** also rand() has a critical section and will trash parallel performance
      ! *** instead, use a deterministic means to generate test data
      a(j) = a(j) + tiny(a(j)) * j
    enddo
  enddo
  !$OMP BARRIER
  !$OMP MASTER
  ! your convergence test here
  ! try to parallelize this if possible
  ! for now bail after 100 iterations
  lConverged = (iter >= 100)
  ! *** MPI optional code
  ! update global lConverged table in Rank 0
  ! read updated lConverged table from Rank 0 (all other participating ranks)
  ! lConverged = ANY(copyOfgloballConverged)
  ! *** end MPI optional code
  !$OMP END MASTER
  !$OMP BARRIER
  if(lConverged) exit
enddo
!$OMP END PARALLEL
end program openmptest
```

The above is untested and incomplete.

For your correctness test, use your original code, except replace the rand() with the expression in the above code; then add the convergence test and bailout.

Have the old (revised) code output to a different array, then compare the two outputs. The results should be the same.

Jim Dempsey


For more complete information about compiler optimizations, see our Optimization Notice.