- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear all,

I need to do reduction operation to an array. I have the following code:

program openmptest use omp_lib implicit none integer::i,j,id integer,parameter::n=12 integer::a(2,n),b(n),c(n) a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/)) do i=1,n b(i)=i c(i)=0 enddo ! Variables !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b) reduction(+:c) !$OMP do do i=1,n do j=a(1,i),a(2,i) c(j)=c(j)+b(j) enddo enddo !$OMP enddo !$OMP END PARALLEL do i=1,n write(1,*)b(i) enddo do i=1,n write(1,*)a(:,i),c(i) enddo end program openmptest

however, the result is exactly the same as the following code:

program openmptest use omp_lib implicit none integer::i,j,id integer,parameter::n=12 integer::a(2,n),b(n),c(n) a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/)) do i=1,n b(i)=i c(i)=0 enddo ! Variables !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c) !$OMP do do i=1,n do j=a(1,i),a(2,i) c(j)=c(j)+b(j) enddo enddo !$OMP enddo !$OMP END PARALLEL do i=1,n write(1,*)b(i) enddo do i=1,n write(1,*)a(:,i),c(i) enddo end program openmptest

Is there any difference between 'shared' and 'reduction'?

Thanks,

Tang Laoya

- Tags:
- Parallel Computing

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

When you create an apparent race condition by removing the reduction clause, it's not guaranteed to produce a.wrong result, although it may aggravate the inefficiency.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Tim,

Thank you very much for your kindly reply. Do you mean that both cases must get correct results, but without the reduction clause, the efficiency will be decreased?

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

No, when you have a race condition, results may change unpredictably. If the updates do happen in the correct order, hardware cache protocol may effectively serialize the updates, or some may be missed, depending on number of threads and numa characteristics of your platform, as well as circumstances out of your control. Some openmp compilers (not so much Intel's) may not implement an openmp directive when it appears defective, so you would need to examine build logs.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Tim,

Thank you very much for your kindly reply. Do you think that the first code it the correct implement?

I just show an example code, in my real code, there are large number of variables are shared by threads, so when I execute the program, it crashed when entering into the Openmp part displayed 'stack overflow'. Do you think what could lead to the problem?

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

In your test code (both) have two problems:

1) your array sizes are too small to effectively parallelize

2) The subscripts generated by the do j loop is indeterminate as to if different threads will have overlapping regions.

Number 2) could be corrected by changing the j/i loop order. Using OpenMP reduction on such code would be inefficient. Normally one uses reduction on a scalar, or a small array. You wouldn't want to do this on a large array.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Jim,

Thank you very much for your kindly comment.

For your first issue, I just posted a test code and the real code has much larger n, and for your second issue, a(2,i) and a(1,i) relies on i, how to swap i/j order?

Do you mean that the reduction on large array is also inefficient?

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

program openmptest use omp_lib implicit none ! Variables integer::i,j,id integer,parameter::n=12 integer::a(2,n),b(n),c(n) integer::iThread,nThreads,iBegin,iEnd, iSliceSize, iResidual ! code a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/)) do i=1,n b(i)=i c(i)=0 enddo !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c) PRIVATE(iThread,iBegin,iEnd) iThread = omp_get_thread_num() nThreads = omp_get_num_threads() iSliceSize = n / nThreads ! need not be private, as all threads will calculate the same value iResidual = mod(n,iSliceSize) ! need not be private, as all threads will calculate the same value iBegin = iSliceSize * iThread + 1 if(iThread > iResidule) then iBegin = iBegin + iResidule iEnd = iBegin + iSliceSize else iBegin=iBegin+iThread iEnd = iBegin + iSliceSize-1 endif do i=1,n if(a(1,i) < iBegin) cycle do j=a(1,i),min(a(2,i),iEnd) c(j)=c(j)+b(j) enddo enddo !$OMP END PARALLEL do i=1,n write(1,*)b(i) enddo do i=1,n write(1,*)a(:,i),c(i) enddo end program openmptest

The above is untested code. The above does not require reduction.

You can make an improvement to the above code by aligning your arrays, then calculating the iSliceSize as the nearest multiple of number of variables that fit within a cache line. And then use the !DIR$ directives to indicate the inner do loop variables are aligned at the start. This will eliminate the runtime code that checks for alignment.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Sorry, I don't understand, what do you mean?

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

How large will n be under typical use?

What is the frequency and degree of overlap of the index tuples a(1,i),a(2,i)?

In the provided example

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Jim,

Thank you very much for your kindly reply. The n is the number of finite elements in fact, usually it is several million. I am trying to parallelize the part of stiff matrix assembling of finite element.

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

In my example posted at github/tprince array reduction is fast for 1 thread but swapping loops so as to keep data local to threads is faster beyond 2 threads, when using Intel compilers which supports simd reduction in the inner loop. I set it up for single thread array reduction when using gcc and msvc on account of the more limited capabilities of those compilers.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Note to all. I modified the inner loops:

do i=1,n if(a(1,i) < iBegin) cycle do j=a(1,i),min(a(2,i),iEnd) c(j)=c(j)+b(j) enddo enddo

The above will aid vectorization as well as eliminate the IF test

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Jim and Tim,

Thank you very much!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear all,

I have correctly implemented the reduction by MPI function MPI_Allreduction. But the efficience is still not good. Original code is like following:

... A=0.0d0 do ! update array A ... A(i)=A(i)+delta ... ! if convergent, exit enddo

Note that in the cycle A is not cleared to zero. So I have to do like this:

! define temp array A0 and A00 allocate(A0(n),A00(n)) ... A=0.0d0 do A0=0.0d0 ! update array A0 parallelly ... A0(i)=A0(i)+delta(i) ! do reduction operation call MPI_Allreduce(A0,A00,n,MPI_DOUBLE,MPI_SUM,comm,ierr) A=A+A00 ... ! if convergent, exit enddo

In the cycle, two addition operations are added: A0=0.0d0 and A=A+A00. Since n is very large, these two operations will spend considerable time, is there any better ways to save CPU time?

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Would you please post a code snip of serial code that works. You can excise the unimportant parts. But you must include sufficient information so we can understand your code better.

IOW This last post shows you are (may be) using MPI. Presumably with more than one Rank.

Earlier posts indicate you are using OpenMP. It is not clear from your information as to if each (SMP) host in your cluster is running one MPI Rank, with each Rank introducing additional threading using OpenMP...

.OR.

Each (SMP) host in your cluster is running n MPI Ranks (n=number of logical processor on that host), *** with each Rank introducing additional threading using OpenMP.

The last situation would introduce gross oversubscription of threads. Rank0NumberOfLogicalProcessors**2 + Rank1NumberOfLogicalProcessors**2 + ...

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Jim,

Thank you very much for your kindly reply. The following is the test serial code that can fit the assemlbe of finite element matrix. Could you please give me some suggestion?

program paralleltest use ifport implicit none integer,parameter::nmax=10000000 integer,parameter::nzmax=100000000 integer,parameter::itermax=100 integer::i,j,k,k1,k2,iter integer,allocatable::rowrange(:,:) real*8,allocatable::a(:) allocate(rowrange(2,nmax),a(nzmax)) ! use a random rowrange to fit overlapped stiff matrix of finite element do i=1,nmax k1=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0) k2=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0) if(k1<k2)then rowrange(1,i)=k1 rowrange(2,i)=k2 else rowrange(1,i)=k2 rowrange(2,i)=k1 endif enddo a=0.0d0 do iter=1,itermax ! use itermax to fit the convergence of iteration ! in every iteration, the global stiff matrix is updated do i=1,nmax do j=rowrange(1,i),rowrange(2,i) a(j)=a(j)+rand() enddo enddo enddo end program paralleltest

```
I use the MPI instead of Openmp because it displayed 'stack overlap' error message when used Openmp. Furthermore, there are multiple calculate nodes that used distributed memory. The Openmp based parallelization can only work for single calculate node with shared memory.
```

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Parallelizing finite element stiffness matrix assembly is far from trivial. Check the literature, but recognize there are no easy answers.

- A parallel algorithm for sti€ffness matrix assembling in a shared memory environment, Marcelo Novaes De Rezende, Joao Batista de Paiva

# - Parallel assembling and equation solving via graph algorithms with an application to the FE simulation of metal extrusion processes, A Unterkircher, J. Reissner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Gregg,

Thank you very much for your kindly reply. I know it is difficult to write an efficient parallel code to assemble FEM stiff matrix. I will read the books you suggested.

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

From your posted code, (which is a pared down version of your code) is it safe to assume:

The array rowrange, once initialized, remains unchanged for the duration of the second loop?

(it is in the sample above, but it is unknown if your formal program leaves this unchanged)

The second loop is adding rand() to a(j). Is rand() a substitution for large chunk of code?

(Note, adding a comment to your code post would alleviate this question).

The code I posted in #8 with revisions #13 will work within a rank (or in a single process)...

...provided that the code represented by rand() on line 30 in post #17, and which may precede that line within the do j loop,

has no temporal issues (IOW references a(?) other than j).

Something like this untested code

program openmptest use ifport use omp_lib implicit none integer,parameter::nmax=10000000 integer,parameter::nzmax=100000000 integer,parameter::itermax=100 integer::i,j,k,k1,k2,iter integer,allocatable::rowrange(:,:) real*8,allocatable::a(:) integer::iThread,nThreads,iBegin,iEnd, iSliceSize, iResidual logical :: lConverged allocate(rowrange(2,nmax),a(nzmax)) ! use a random rowrange to fit overlapped stiff matrix of finite element do i=1,nmax k1=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0) k2=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0) if(k1<k2)then rowrange(1,i)=k1 rowrange(2,i)=k2 else rowrange(1,i)=k2 rowrange(2,i)=k1 endif enddo a=0.0d0 !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c) PRIVATE(iThread,iBegin,iEnd,iter,i) iThread = omp_get_thread_num() nThreads = omp_get_num_threads() iSliceSize = nmax / nThreads ! need not be private, as all threads will calculate the same value iResidual = mod(nmax,iSliceSize) ! need not be private, as all threads will calculate the same value iBegin = iSliceSize * iThread + 1 if(iThread > iResidule) then iBegin = iBegin + iResidule iEnd = iBegin + iSliceSize else iBegin=iBegin+iThread iEnd = iBegin + iSliceSize-1 endif do iter=1,itermax ! use itermax to fit the convergence of iteration ! in every iteration, the global stiff matrix is updated do i=1,nmax if(rowrange(1,i) < iBegin) cycle do j=rowrange(1,i),min(rowrange(2,i),iEnd) ! it is unknown how you produce the data to be accumulated ! *** do not use rand() for testing ! *** as the results will not be deterministic in multi-threaded mode ! *** also rand() has a critical section and will rash parallel performance ! *** instead, use a deterministic means to generate test data a(j) = a(j) + tiny(a(j)) * j enddo enddo !$OMP BARRIER !$OMP MASTER ! your convergence test here ! try to parallelize this if possible ! for now bail after 100 iterations lConverged = (iter >= 100) ! **** MPI optional code ! update global lConverged table in Rank 0 ! read updated lConverged table from Rank 0 (all other participating ranks) lConverged = ANY(copyOfgloballConverged) ! *** end MPI optional code !$OMP END MASTER !$OMP BARRIER if(lConverged) exit end do !$OMP END PARALLEL end program openmptest

The above is untested and incomplete.

For your correctness test, use your original code with the exception of replacing the rand() with the expression in the above code. then add the converged test and bailout.

Have the old (revised) code output to a different array. Then compare the two outputs. The results should be the same.

Jim Dempsey

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page