Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Question about OpenMP's reduction clause

Zhanghong_T_
Novice

Dear all,

I need to perform a reduction operation on an array. I have the following code:

    program openmptest
        use omp_lib
    implicit none
        integer::i,j,id
        integer,parameter::n=12
        integer::a(2,n),b(n),c(n)
        a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/))
        do i=1,n            
            b(i)=i
            c(i)=0
        enddo
    ! Variables
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b) reduction(+:c)
!$OMP do        
        do i=1,n
            do j=a(1,i),a(2,i)
                c(j)=c(j)+b(j)
            enddo
        enddo
!$OMP enddo        
!$OMP END PARALLEL
        do i=1,n
            write(1,*)b(i)
        enddo
        do i=1,n
            write(1,*)a(:,i),c(i)
        enddo     
        end program openmptest

 

However, the result is exactly the same as that of the following code:

    program openmptest
        use omp_lib
    implicit none
        integer::i,j,id
        integer,parameter::n=12
        integer::a(2,n),b(n),c(n)
        a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/))
        do i=1,n            
            b(i)=i
            c(i)=0
        enddo
    ! Variables
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c) 
!$OMP do        
        do i=1,n
            do j=a(1,i),a(2,i)
                c(j)=c(j)+b(j)
            enddo
        enddo
!$OMP enddo        
!$OMP END PARALLEL
        do i=1,n
            write(1,*)b(i)
        enddo
        do i=1,n
            write(1,*)a(:,i),c(i)
        enddo     
        end program openmptest

Is there any difference between 'shared' and 'reduction'?

 

Thanks,

Tang Laoya

TimP
Honored Contributor III

When you create an apparent race condition by removing the reduction clause, it's not guaranteed to produce a wrong result, although it may aggravate the inefficiency.

Zhanghong_T_
Novice

Dear Tim,

Thank you very much for your kind reply. Do you mean that both cases must produce correct results, but that without the reduction clause the efficiency will be decreased?

Thanks

TimP
Honored Contributor III

No, when you have a race condition, results may change unpredictably. If the updates do happen in the correct order, the hardware cache protocol may effectively serialize the updates, or some may be missed, depending on the number of threads and the NUMA characteristics of your platform, as well as circumstances out of your control. Some OpenMP compilers (not so much Intel's) may not implement an OpenMP directive when it appears defective, so you would need to examine the build logs.

Zhanghong_T_
Novice

Dear Tim,

Thank you very much for your kind reply. Do you think that the first code is the correct implementation?

The code above is just an example. In my real code a large number of variables are shared by threads, and when I execute the program it crashes with a 'stack overflow' message when entering the OpenMP part. What do you think could lead to this problem?

Thanks

 

jimdempseyatthecove
Honored Contributor III

Your test code (both versions) has two problems:

1) Your array sizes are too small to parallelize effectively.
2) The subscripts generated by the do j loop are indeterminate as to whether different threads will have overlapping regions.

Number 2) could be corrected by changing the j/i loop order (see the sketch below). Using OpenMP reduction on such code would be inefficient. Normally one uses reduction on a scalar or a small array; you wouldn't want to do this on a large array.
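
As a rough, untested sketch of that idea for the posted test program (an illustration, not a drop-in replacement): make j the outer, parallelized loop so that each thread updates a disjoint set of c(j). Then no reduction clause is needed, at the cost of scanning all i for every j.

    !$OMP PARALLEL DO PRIVATE(i,j) SHARED(a,b,c)
        do j=1,n
            do i=1,n
                ! only this thread ever writes c(j), so there is no race
                if(a(1,i) <= j .and. j <= a(2,i)) c(j)=c(j)+b(j)
            enddo
        enddo
    !$OMP END PARALLEL DO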

Jim Dempsey

Zhanghong_T_
Novice

Dear Jim,

Thank you very much for your kind comment.

Regarding your first point, I only posted a test code; the real code has a much larger n. Regarding your second point, a(1,i) and a(2,i) depend on i, so how can I swap the i/j loop order?

Do you mean that reduction on a large array is also inefficient?

Thanks

jimdempseyatthecove
Honored Contributor III

    program openmptest
        use omp_lib
    implicit none
    ! Variables
        integer::i,j,id
        integer,parameter::n=12
        integer::a(2,n),b(n),c(n)
        integer::iThread,nThreads,iBegin,iEnd, iSliceSize, iResidual
        ! code
        a=reshape((/1,8, 3,9, 2,7, 3,12, 4,12, 2,10, 3,11, 2,9, 2,10, 3,8, 2,8, 3,10/),(/2,n/))
        do i=1,n            
            b(i)=i
            c(i)=0
        enddo
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,b,c) PRIVATE(iThread,iBegin,iEnd)
        iThread = omp_get_thread_num()
        nThreads = omp_get_num_threads()
        iSliceSize = n / nThreads       ! every thread computes the same value
        iResidual = mod(n,nThreads)     ! leftover iterations, handed to the first iResidual threads
        iBegin = iSliceSize * iThread + 1
        if(iThread < iResidual) then
            iBegin = iBegin + iThread
            iEnd = iBegin + iSliceSize              ! this thread takes one extra iteration
        else
            iBegin = iBegin + iResidual
            iEnd = iBegin + iSliceSize - 1
        endif

        do i=1,n
            if(a(2,i) < iBegin .or. a(1,i) > iEnd) cycle   ! element does not touch this thread's slice
            do j=max(a(1,i),iBegin),min(a(2,i),iEnd)       ! clip the update to this thread's slice
                c(j)=c(j)+b(j)
            enddo
        enddo
!$OMP END PARALLEL
        do i=1,n
            write(1,*)b(i)
        enddo
        do i=1,n
            write(1,*)a(:,i),c(i)
        enddo     
        end program openmptest

The above is untested code; it does not require a reduction clause.

You can improve the above code by aligning your arrays and then calculating iSliceSize as the nearest multiple of the number of array elements that fit within a cache line. Then use !DIR$ directives to indicate that the inner do loop references are aligned at the start (see the sketch below). This will eliminate the runtime code that checks for alignment.
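
For illustration only (untested; Intel Fortran directives, assuming 64-byte cache lines and a simplified full-slice loop), this is the kind of hint meant here. The VECTOR ALIGNED assertion is only valid when the first iteration is known to land on an aligned element, e.g. when iBegin-1 is a multiple of 16 default integers:

        integer::b(n),c(n)
    !DIR$ ATTRIBUTES ALIGN : 64 :: b, c   ! place b and c on cache-line boundaries
        ...
    !DIR$ VECTOR ALIGNED                  ! promise the compiler the accesses below start aligned
        do j=iBegin,iEnd                  ! valid only if iBegin-1 is a multiple of 16
            c(j)=c(j)+b(j)
        enddo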

Jim Dempsey

Zhanghong_T_
Novice

Sorry, I don't understand. What do you mean?

Thanks

jimdempseyatthecove
Honored Contributor III

How large will n be under typical use?
What is the frequency and degree of overlap of the index tuples a(1,i),a(2,i)?

In the provided example

Jim Dempsey

Zhanghong_T_
Novice

Dear Jim,

Thank you very much for your kind reply. In fact, n is the number of finite elements; usually it is several million. I am trying to parallelize the stiffness matrix assembly part of the finite element code.

Thanks

TimP
Honored Contributor III

In my example posted at github/tprince, array reduction is fast for 1 thread, but swapping the loops so as to keep data local to threads is faster beyond 2 threads when using Intel compilers, which support simd reduction in the inner loop. I set it up for single-thread array reduction when using gcc and msvc on account of the more limited capabilities of those compilers.

jimdempseyatthecove
Honored Contributor III

Note to all. I modified the inner loops:

        do i=1,n
            if(a(2,i) < iBegin .or. a(1,i) > iEnd) cycle
            do j=max(a(1,i),iBegin),min(a(2,i),iEnd)
                c(j)=c(j)+b(j)
            enddo
        enddo

The above will aid vectorization as well as eliminate the IF test from the inner loop.

Jim Dempsey

Zhanghong_T_
Novice

Dear Jim and Tim,

Thank you very much!

Zhanghong_T_
Novice

Dear all,

I have correctly implemented the reduction with the MPI function MPI_Allreduce, but the efficiency is still not good. The original code is like the following:

...
A=0.0d0
do
    ! update array A
    ...
    A(i)=A(i)+delta
    ... 
    ! if convergent, exit
enddo

Note that inside the loop A is not cleared to zero, so I have to do it like this:

! define temp array A0 and A00
allocate(A0(n),A00(n))
...
A=0.0d0
do
    A0=0.0d0
    ! update array A0 parallelly
    ...
    A0(i)=A0(i)+delta(i)
    ! do reduction operation
    call MPI_Allreduce(A0,A00,n,MPI_DOUBLE_PRECISION,MPI_SUM,comm,ierr)
    A=A+A00
    ... 
    ! if convergent, exit
enddo

Inside the loop, two extra array operations are added: A0=0.0d0 and A=A+A00. Since n is very large, these two operations take considerable time. Is there any better way to save CPU time?
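
To make the pattern concrete, here is a minimal self-contained sketch of what I mean (the delta update is only a placeholder):

    program allreduce_pattern
        use mpi
        implicit none
        integer,parameter::n=1000000
        integer,parameter::itermax=5
        integer::i,iter,ierr,comm
        real*8,allocatable::A(:),A0(:),A00(:)
        call MPI_Init(ierr)
        comm=MPI_COMM_WORLD
        allocate(A(n),A0(n),A00(n))
        A=0.0d0
        do iter=1,itermax
            A0=0.0d0                ! extra clear added for the parallel version
            do i=1,n
                A0(i)=A0(i)+1.0d-3  ! placeholder for the real update A0(i)=A0(i)+delta(i)
            enddo
            ! sum the per-rank contributions so every rank sees the full update
            call MPI_Allreduce(A0,A00,n,MPI_DOUBLE_PRECISION,MPI_SUM,comm,ierr)
            A=A+A00                 ! extra accumulation added for the parallel version
        enddo
        call MPI_Finalize(ierr)
    end program allreduce_pattern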

Thanks

jimdempseyatthecove
Honored Contributor III

Would you please post a code snippet of the serial code that works? You can excise the unimportant parts, but you must include sufficient information so we can understand your code better.

IOW, this last post shows you are (or may be) using MPI, presumably with more than one rank.

Earlier posts indicate you are using OpenMP. It is not clear from your information whether each (SMP) host in your cluster is running one MPI rank, with each rank introducing additional threading using OpenMP...

.OR.

Each (SMP) host in your cluster is running n MPI ranks (n = number of logical processors on that host), with each rank introducing additional threading using OpenMP.

The last situation would introduce gross oversubscription of threads: Host0NumberOfLogicalProcessors**2 + Host1NumberOfLogicalProcessors**2 + ... threads across the cluster.
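
For example, a host with 16 logical processors running 16 ranks, each of which spawns 16 OpenMP threads, ends up with 16**2 = 256 threads competing for the 16 logical processors on that host alone.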

Jim Dempsey

Zhanghong_T_
Novice

Dear Jim,

Thank you very much for your kind reply. The following is a serial test code that mimics the assembly of the finite element matrix. Could you please give me some suggestions?

    program paralleltest
		use ifport
		implicit none
		integer,parameter::nmax=10000000
		integer,parameter::nzmax=100000000
		integer,parameter::itermax=100
		integer::i,j,k,k1,k2,iter
		integer,allocatable::rowrange(:,:)
		real*8,allocatable::a(:)
		allocate(rowrange(2,nmax),a(nzmax))
		
		! use a random rowrange to fit overlapped stiff matrix of finite element
		do i=1,nmax
			k1=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0)				
			k2=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0)				
			if(k1<k2)then
				rowrange(1,i)=k1
				rowrange(2,i)=k2
			else
				rowrange(1,i)=k2
				rowrange(2,i)=k1
			endif
		enddo
		
		a=0.0d0
		do iter=1,itermax  ! use itermax to	fit the convergence of iteration
			! in every iteration, the global stiff matrix is updated
			do i=1,nmax
				do j=rowrange(1,i),rowrange(2,i)
					a(j)=a(j)+rand()
				enddo
			enddo
		enddo
    end program paralleltest
I use MPI instead of OpenMP because the program displayed a 'stack overflow' error message when I used OpenMP. Furthermore, there are multiple compute nodes with distributed memory; OpenMP-based parallelization can only work within a single compute node with shared memory.

Thanks

 

Gregg_S_Intel
Employee

Parallelizing finite element stiffness matrix assembly is far from trivial.  Check the literature, but recognize there are no easy answers.

- A parallel algorithm for stiffness matrix assembling in a shared memory environment, Marcelo Novaes De Rezende, Joao Batista de Paiva

- Parallel assembling and equation solving via graph algorithms with an application to the FE simulation of metal extrusion processes, A Unterkircher, J. Reissner

 


 

 

Zhanghong_T_
Novice

Dear Gregg,

Thank you very much for your kind reply. I know it is difficult to write efficient parallel code to assemble the FEM stiffness matrix. I will read the papers you suggested.

Thanks

jimdempseyatthecove
Honored Contributor III

From your posted code (which is a pared-down version of your real code), is it safe to assume:

The array rowrange, once initialized, remains unchanged for the duration of the second loop?
(It is in the sample above, but it is unknown whether your actual program leaves it unchanged.)

The second loop is adding rand() to a(j). Is rand() a substitute for a large chunk of code?
(Note: adding a comment to your posted code would alleviate this question.)

The code I posted in #8, with the revisions in #13, will work within a rank (or in a single process)...
...provided that the code represented by rand() on line 30 in post #17, and any code that may precede that line within the do j loop,
has no temporal issues (IOW, references no a(?) other than a(j)).

Something like this untested code:

program openmptest
  use ifport
  use omp_lib
  implicit none
  integer,parameter::nmax=10000000
  integer,parameter::nzmax=100000000
  integer,parameter::itermax=100
  integer::i,j,k,k1,k2,iter
  integer,allocatable::rowrange(:,:)
  real*8,allocatable::a(:)
  integer::iThread,nThreads,iBegin,iEnd, iSliceSize, iResidual
  logical :: lConverged
  allocate(rowrange(2,nmax),a(nzmax))

  ! use a random rowrange to fit overlapped stiff matrix of finite element
  do i=1,nmax
    k1=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0)    
    k2=max(min(rand()*1.1d0*nzmax,dble(nzmax)),1.0d0)    
    if(k1<k2)then
      rowrange(1,i)=k1
      rowrange(2,i)=k2
    else
      rowrange(1,i)=k2
      rowrange(2,i)=k1
    endif
  enddo

  a=0.0d0

!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(a,rowrange,lConverged) PRIVATE(iThread,iBegin,iEnd,iter,i)
  iThread = omp_get_thread_num()
  nThreads = omp_get_num_threads()
  iSliceSize = nmax / nThreads      ! every thread computes the same value
  iResidual = mod(nmax,nThreads)    ! leftover iterations, handed to the first iResidual threads
  iBegin = iSliceSize * iThread + 1
  if(iThread < iResidual) then
    iBegin = iBegin + iThread
    iEnd = iBegin + iSliceSize      ! this thread takes one extra iteration
  else
    iBegin = iBegin + iResidual
    iEnd = iBegin + iSliceSize - 1
  endif
  do iter=1,itermax  ! use itermax to fit the convergence of iteration      
    ! in every iteration, the global stiff matrix is updated
    do i=1,nmax
      if(rowrange(2,i) < iBegin .or. rowrange(1,i) > iEnd) cycle   ! element does not touch this thread's slice
      do j=max(rowrange(1,i),iBegin),min(rowrange(2,i),iEnd)       ! clip the update to this thread's slice
        ! it is unknown how you produce the data to be accumulated
        ! *** do not use rand() for testing
        ! *** as the results will not be deterministic in multi-threaded mode
        ! *** also rand() has a critical section and will trash parallel performance
        ! *** instead, use a deterministic means to generate test data
        a(j) = a(j) + tiny(a(j)) * j
      enddo
    enddo
!$OMP BARRIER
!$OMP MASTER
    ! your convergence test here
    ! try to parallelize this if possible
    ! for now bail after 100 iterations
    lConverged = (iter >= 100)
! **** MPI optional code
! update global lConverged table in Rank 0
! read updated lConverged table from Rank 0 (all other participating ranks)
    ! lConverged = ANY(copyOfgloballConverged)  ! MPI version; copyOfgloballConverged is not defined in this OpenMP-only sketch
! *** end MPI optional code
!$OMP END MASTER
!$OMP BARRIER
    if(lConverged) exit
  end do
!$OMP END PARALLEL
end program openmptest

The above is untested and incomplete.

For your correctness test, use your original code, with the exception of replacing rand() with the expression in the above code. Then add the convergence test and bailout.

Have the old (revised) code output to a different array. Then compare the two outputs. The results should be the same.
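
For example, a minimal check (aOld and aNew are placeholder names for the two result arrays):

    ! compare the two result arrays; any nonzero difference indicates a threading problem
    if (maxval(abs(aOld - aNew)) == 0.0d0) then
        print *, 'results identical'
    else
        print *, 'max difference = ', maxval(abs(aOld - aNew))
    endif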

Jim Dempsey
