Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Performance of double vs single precision

arikd
Beginner

Our software does lots of floating point linear algebra (CFD modeling). Many operations require double precision for larger size matrices. Unfortunately the performance we have been getting out of both Intel and AMD processors was very poor: double precision was almost 2x slower than the same code when using single precision.

We tested the software both under 32-bit and 64-bit Windows (and compiled with corresponding 32 and 64 bit options) with no visible difference. Are there compiler options that we are missing that could speed up double precision computations?

Thanks.

39 Replies
jimdempseyatthecove
Honored Contributor III

Arikd,

Try processing the rank 3 array as rank 1

CALL COMPUTE(rb, z, beta, Nz*Ny*Nx)
...
SUBROUTINE COMPUTE(rb, z, beta, N)
REAL :: rb(N), z(N), beta
INTEGER :: N

!$omp parallel do
!$omp& private( i)
do i=1,N
! if possible, use canned routines
rb(i) = z(i) + beta*rb(i)
enddo


For portability, you may want to insert some conditional compilation code that asserts the assumed indexing order.
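For example, a guard along these lines at the call site (a minimal sketch only; the CHECK_LAYOUT macro name is illustrative, it requires the preprocessor, e.g. /fpp, and it merely checks that the flattened length matches the rank-3 extents, relying on Fortran's column-major storage):

#ifdef CHECK_LAYOUT
      if ( size(rb) /= Nz*Ny*Nx ) stop 'rb: rank-1 length /= Nx*Ny*Nz'
#endif
      CALL COMPUTE(rb, z, beta, Nz*Ny*Nx)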

Jim Dempsey

arikd
Beginner

Jim,

What kind of improvement can we expect?

I recall that a long time ago we did try something like this and it did not help much ("much" being defined as >10% improvement). Back in the 80s it was helpful, but I recall that since the 90s such "tricks" were no longer required.

Also, the way the code is written, the index runs from 0 to Nx+1, such that at 0 and Nx+1 layers we store boundary conditions. In other words, as it runs from 1 to Nx, the Nx+1 layer has to be skipped, etc. This can be fixed, I assume, but for what benefit?

Thanks,

Arik

TimP
Honored Contributor III
ifort -O3 is supposed to look for easily detected opportunities for this optimization. It would allow work to be distributed more evenly among threads, in cases where the outer loop count isn't evenly divisible by the number of threads.
I've seen applications which parallelize by first ignoring the special treatment of the boundary data, then following the main loop with a correction of the boundary elements. The automatic unrolling, evidently, wouldn't go this far.
jimdempseyatthecove
Honored Contributor III

Arik,

Take your original code with the three subscripts and compile with full optimizations, including uses/requires SSE3, and no runtime checks. Place a breakpoint at the beginning of the outer loop or at the beginning of the subroutine. Open a disassembly window and examine the nested loop. See what is generated.

Next, make the little 20-line subroutine as suggested, but only in a test "Hello world" application, complete with a call to it with a test array. Compile with full optimizations, including uses/requires SSE3, and no runtime checks. Place a breakpoint at the beginning of the outer loop or at the beginning of the subroutine. Open a disassembly window and examine the nested loop. See what is generated.

Compare the two code sets.

By the way, running the sample test would likely have taken less time than posting the question on the forum.

Regarding boundary conditions:

If your inner loop has a test for boundary conditions, then although the integer test itself is not a large computational consideration, branching around a piece of code carries high computational overhead, and the branch may gunk up the vectorization. It is best to eliminate the branches within the inner loop if at all possible.

Often this can be done by computing beyond the bounds of the perceived array, e.g. if the "perceived" array is ARRAY(1:N), allocate it as (0:N+1) and place appropriate values into the extension cells prior to executing your loop (i.e. to avoid NaN or divide by 0). This may present a problem elsewhere if you use ARRAY without the subscripts under the assumption that the array bounds are the "perceived" size.

If this presents a problem then work it the other way around - perform the special-case computations for ARRAY(1) and ARRAY(N) outside the loop and run the loop from 2 to N-1.
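A minimal 1-D sketch of both approaches (the array names and the loop body here are made up purely for illustration):

! (a) pad the array with halo cells so the loop body needs no branch
      real, allocatable :: a(:), b(:)
      allocate( a(0:N+1), b(1:N) )
      a(0)   = 0.0              ! boundary values chosen to avoid NaN / divide by 0
      a(N+1) = 0.0
      do i = 1, N
         b(i) = a(i-1) + a(i) + a(i+1)   ! no IF inside the inner loop
      enddo

! (b) or peel the boundary iterations out of the loop
      b(1) = a(1) + a(2)                 ! special-case first element
      b(N) = a(N-1) + a(N)               ! special-case last element
      do i = 2, N-1
         b(i) = a(i-1) + a(i) + a(i+1)
      enddo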

I will leave it as an exercise for you to extend this to multi-dimensional arrays.

Jim Dempsey

arikd
Beginner

We have tested all the suggestions and the only one which produced a tangible improvement was the SSE3 option (~30% faster). The rest were all below 5%. However, our most time-consuming operation, matrix back-substitution, was not helped at all, apparently because of the recurrence. I have included the subroutine below, in case somebody has insights on what we are doing wrong. All the input from everybody - Jim, Tim, Lionel et al. - has been very much appreciated.

Arik

!-----------------------------------------------------
subroutine solvm13(
& bp, bw, bs, bu, bnu, beu, bes, r, zt
&)

use numbers
use MultiThreads
use GridBlocks
use grdblks

implicit none

real bp (0:Nx1,0:Ny1,0:Nz1),
& bw (0:Nx1,0:Ny1,0:Nz1),
& bs (0:Nx1,0:Ny1,0:Nz1),
& bu (0:Nx1,0:Ny1,0:Nz1),
& bnu (0:Nx1,0:Ny1,0:Nz1),
& beu (0:Nx1,0:Ny1,0:Nz1),
& bes (0:Nx1,0:Ny1,0:Nz1),
& r (0:Nx1,0:Ny1,0:Nz1),
& zt (0:Nx1,0:Ny1,0:Nz1)
! internal obj
integer i, j, k

zt = 0.0

! forward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
if ( k > 1 .And. j < Ny ) then
!$ call omp_set_lock( locks(j+1,k-1) )
endif
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu(i,j,k) * zt(i,j,k-1)
& - beu(i, j,k) * zt(i+1,j,k-1)
& - bnu(i,j,k) * zt(i,j+1,k-1)
& - bs(i,j,k) * zt(i,j-1,k)
& - bes(i,j,k) * zt(i+1,j-1,k)
& - bw(i,j,k) * zt(i-1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

! backward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=Nz,1,-1
do j=Ny,1,-1
if ( j > 1 .And. k < Nz ) then
!$ call omp_set_lock( locks(j-1,k+1) )
endif
do i=Nx,1,-1
zt(i,j,k) = zt(i,j,k) / bp(i,j,k)
& - bu(i,j,k+1) * zt(i,j,k+1)
& - beu(i-1,j,k+1) * zt(i-1,j,k+1)
& - bnu(i,j-1,k+1) * zt(i,j-1,k+1)
& - bs(i,j+1,k) * zt(i,j+1,k)
& - bes(i-1,j+1,k) * zt(i-1,j+1,k)
& - bw(i+1,j,k) * zt(i+1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

end
!-----------------------------------------------------

jimdempseyatthecove
Honored Contributor III
Arik,
Rework:
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu(i,j,k) * zt(i,j,k-1)
& - beu(i,j,k) * zt(i+1,j,k-1)
& - bnu(i,j,k) * zt(i,j+1,k-1)
& - bs(i,j,k) * zt(i,j-1,k)
& - bes(i,j,k) * zt(i+1,j-1,k)
& - bw(i,j,k) * zt(i-1,j,k)
enddo
into:
do i=1,Nx
bu_x_zt(i,j,k) = bu(i,j,k) * zt(i,j,k-1)
enddo
do i=1,Nx
beu_x_zt(i,j,k) = beu(i,j,k) * zt(i+1,j,k-1)
enddo
do i=1,Nx
bnu_x_zt(i,j,k) = bnu(i,j,k) * zt(i,j+1,k-1)
enddo
do i=1,Nx
bs_x_zt(i,j,k) = bs(i,j,k) * zt(i,j-1,k)
enddo
do i=1,Nx
bes_x_zt(i,j,k) = bes(i,j,k) * zt(i+1,j-1,k)
enddo
do i=1,Nx
bw_x_zt(i,j,k) = bw(i,j,k) * zt(i-1,j,k)
enddo
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu_x_zt(i,j,k)
& - beu_x_zt(i,j,k)
& - bnu_x_zt(i,j,k)
& - bs_x_zt(i,j,k)
& - bes_x_zt(i,j,k)
& - bw_x_zt(i,j,k)
enddo
This will improve SSE performance as well as cache performance.

I will leave it as an exercise for you to do the backward sweep loop.

Jim Dempsey

TimP
Honored Contributor III
This changes my assumptions from your originally posted code, as you have a recurrence which would prevent vectorization of the inner loops. Your outer-loop recurrence should raise a complaint from the thread checker, as it would no doubt give wrong results if you increased the chunk size.
Most CPUs have fewer hardware prefetch streams for negative than for positive stride, so you may be losing more time on cache misses in the back substitution. If your code is correct otherwise, you may be able to alleviate this with software prefetch or by splitting the inner loop; the optimum strategy is far from evident. If you split the inner loop, you might save the terms which depend on the result of the other thread for the 2nd pass.
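One way such a split might look for the forward sweep (a sketch only, not code from this thread: tmp is a made-up scratch array, and it assumes the only value carried across the i loop is the bw term):

do i=1,Nx
! pass 1: terms with no i-recurrence, so this loop can vectorize
tmp(i) = r(i,j,k)
& - bu(i,j,k) * zt(i,j,k-1)
& - beu(i,j,k) * zt(i+1,j,k-1)
& - bnu(i,j,k) * zt(i,j+1,k-1)
& - bs(i,j,k) * zt(i,j-1,k)
& - bes(i,j,k) * zt(i+1,j-1,k)
enddo
do i=1,Nx
! pass 2: the short scalar recurrence in i
zt(i,j,k) = tmp(i) - bw(i,j,k) * zt(i-1,j,k)
enddo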
arikd
Beginner

Jim,

Unfortunately your proposal made it slower, specifically 1.9 times slower. I can see that it does more do-loop and memory work, which apparently was not compensated for by whatever efficiencies it added. Unless, of course, our implementation is incorrect, so I am including the subroutine below. For compiler options (Intel Fortran 10.1) we used: -c -O3 -QxT -Qopenmp.

thanks,

Arik

zt = 0.0

! forward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
if ( k > 1 .And. j < Ny ) then
!$ call omp_set_lock( locks(j+1,k-1) )
endif
do i=1,Nx
bu_ (i,j,k) = bu (i,j,k) * zt(i,j,k-1)
enddo
do i=1,Nx
beu_(i,j,k) = beu(i,j,k) * zt(i+1,j,k-1)
enddo
do i=1,Nx
bnu_(i,j,k) = bnu(i,j,k) * zt(i,j+1,k-1)
enddo
do i=1,Nx
bs_ (i,j,k) = bs (i,j,k) * zt(i,j-1,k)
enddo
do i=1,Nx
bes_(i,j,k) = bes(i,j,k) * zt(i+1,j-1,k)
enddo
do i=1,Nx
bw_ (i,j,k) = bw (i,j,k) * zt(i-1,j,k)
enddo
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu_ (i,j,k)
& - beu_(i,j,k)
& - bnu_(i,j,k)
& - bs_ (i,j,k)
& - bes_(i,j,k)
& - bw_ (i,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

! backward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=Nz,1,-1
do j=Ny,1,-1
if ( j > 1 .And. k < Nz ) then
!$ call omp_set_lock( locks(j-1,k+1) )
endif
do i=Nx,1,-1
bu_ (i,j,k+1) = bu(i,j,k+1) * zt(i,j,k+1)
enddo
do i=Nx,1,-1
beu_(i-1,j,k+1) = beu(i-1,j,k+1) * zt(i-1,j,k+1)
enddo
do i=Nx,1,-1
bnu_(i,j-1,k+1) = bnu(i,j-1,k+1) * zt(i,j-1,k+1)
enddo
do i=Nx,1,-1
bs_ (i,j+1,k) = bs(i,j+1,k) * zt(i,j+1,k)
enddo
do i=Nx,1,-1
bes_(i-1,j+1,k) = bes(i-1,j+1,k) * zt(i-1,j+1,k)
enddo
do i=Nx,1,-1
bw_ (i+1,j,k) = bw(i+1,j,k) * zt(i+1,j,k)
enddo
do i=Nx,1,-1
zt(i,j,k) = zt(i,j,k) / bp(i,j,k)
& - bu_ (i,j,k+1)
& - beu_(i-1,j,k+1)
& - bnu_(i,j-1,k+1)
& - bs_ (i,j+1,k)
& - bes_(i-1,j+1,k)
& - bw_ (i+1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo
end

jimdempseyatthecove
Honored Contributor III

Arikd,

Do you have a single threaded example of the code that works?

By this I mean do not simply strip the !$OMP stuff out of the last code posted, but supply an example of the original single-threaded code that worked. There are temporal issues with your code that need to be clarified, and they can best be clarified by examining the original single-threaded code that worked.

Of particular concern, for example, in the loops of the attempted parallel code is the use of offset indices (e.g. i-1 and i+1) and whether or not the algorithm is meant to use the prior or the post state-change values. This will be evident from the working pre-parallelization code. For the portions of the expression(s) relying on prior values of zt, this can be resolved by having two zt arrays (zt and zt_), one containing the prior values. The portions of the expression(s) that rely on post-change state values might not be parallelizable unless you can rework the code to compute the post values on the fly, so to speak, from the pre-change state values. But then this adds redundant computations of the same values.
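A minimal sketch of the two-array idea (zt_prev is a made-up name here; this is only appropriate if the algorithm really is meant to read prior-state values, which is exactly the question above):

do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu(i,j,k) * zt_prev(i,j,k-1)
& - beu(i,j,k) * zt_prev(i+1,j,k-1)
& - bnu(i,j,k) * zt_prev(i,j+1,k-1)
& - bs(i,j,k) * zt_prev(i,j-1,k)
& - bes(i,j,k) * zt_prev(i+1,j-1,k)
& - bw(i,j,k) * zt_prev(i-1,j,k)
enddo
! after the full sweep, zt_prev would be copied from (or swapped with) zt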

A secondary cause of poor performance is excessive use of locks. It is best to eliminate locks wherever possible and to minimize their use whenever possible. Overuse of locks results in threads burning up CPU time in spin-waits or, worse yet, in yields.

Jim Dempsey

arikd
Beginner

Jim,

I am attaching the complete text of the original subroutine, solvm13, and the one modified as per your post, solvm13_2. They seem to be using the same number of locks etc. Both produce the same results; the latter is just much slower.

Please comment, thanks,

Arik

*-----------------------------------------------------------------------
subroutine solvm13(
& bp, bw, bs, bu, bnu, beu, bes, r, zt
&)
use numbers
implicit none

real bp (0:Nx1,0:Ny1,0:Nz1),
& bw (0:Nx1,0:Ny1,0:Nz1),
& bs (0:Nx1,0:Ny1,0:Nz1),
& bu (0:Nx1,0:Ny1,0:Nz1),
& bnu(0:Nx1,0:Ny1,0:Nz1),
& beu(0:Nx1,0:Ny1,0:Nz1),
& bes(0:Nx1,0:Ny1,0:Nz1),
& r (0:Nx1,0:Ny1,0:Nz1),
& zt (0:Nx1,0:Ny1,0:Nz1)
! internal obj
integer i, j, k

zt = 0.0

! forward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
if ( k > 1 .And. j < Ny ) then
!$ call omp_set_lock( locks(j+1,k-1) )
endif
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu(i,j,k) * zt(i,j,k-1)
& - beu(i,j,k) * zt(i+1,j,k-1)
& - bnu(i,j,k) * zt(i,j+1,k-1)
& - bs(i,j,k) * zt(i,j-1,k)
& - bes(i,j,k) * zt(i+1,j-1,k)
& - bw(i,j,k) * zt(i-1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

! backward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=Nz,1,-1
do j=Ny,1,-1
if ( j > 1 .And. k < Nz ) then
!$ call omp_set_lock( locks(j-1,k+1) )
endif
do i=Nx,1,-1
zt(i,j,k) = zt(i,j,k) / bp(i,j,k)
& - bu(i,j,k+1) * zt(i,j,k+1)
& - beu(i-1,j,k+1) * zt(i-1,j,k+1)
& - bnu(i,j-1,k+1) * zt(i,j-1,k+1)
& - bs(i,j+1,k) * zt(i,j+1,k)
& - bes(i-1,j+1,k) * zt(i-1,j+1,k)
& - bw(i+1,j,k) * zt(i+1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo
end

*-----------------------------------------------------------------------
subroutine solvm13_2(
& bp, bw, bs, bu, bnu, beu, bes, r, zt,
& bw_, bs_, bu_, bnu_, beu_, bes_
&)
use numbers
implicit none

real bp (0:Nx1,0:Ny1,0:Nz1),
& bw (0:Nx1,0:Ny1,0:Nz1),
& bs (0:Nx1,0:Ny1,0:Nz1),
& bu (0:Nx1,0:Ny1,0:Nz1),
& bnu (0:Nx1,0:Ny1,0:Nz1),
& beu (0:Nx1,0:Ny1,0:Nz1),
& bes (0:Nx1,0:Ny1,0:Nz1),
& r (0:Nx1,0:Ny1,0:Nz1),
& zt (0:Nx1,0:Ny1,0:Nz1),
& bw_ (0:Nx1,0:Ny1,0:Nz1),
& bs_ (0:Nx1,0:Ny1,0:Nz1),
& bu_ (0:Nx1,0:Ny1,0:Nz1),
& bnu_(0:Nx1,0:Ny1,0:Nz1),
& beu_(0:Nx1,0:Ny1,0:Nz1),
& bes_(0:Nx1,0:Ny1,0:Nz1)

! internal obj
integer i, j, k

zt = 0.0

! forward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
if ( k > 1 .And. j < Ny ) then
!$ call omp_set_lock( locks(j+1,k-1) )
endif
do i=1,Nx
bu_ (i,j,k) = bu (i,j,k) * zt(i,j,k-1)
enddo
do i=1,Nx
beu_(i,j,k) = beu(i,j,k) * zt(i+1,j,k-1)
enddo
do i=1,Nx
bnu_(i,j,k) = bnu(i,j,k) * zt(i,j+1,k-1)
enddo
do i=1,Nx
bs_ (i,j,k) = bs (i,j,k) * zt(i,j-1,k)
enddo
do i=1,Nx
bes_(i,j,k) = bes(i,j,k) * zt(i+1,j-1,k)
enddo
do i=1,Nx
bw_ (i,j,k) = bw (i,j,k) * zt(i-1,j,k)
enddo
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu_ (i,j,k)
& - beu_(i,j,k)
& - bnu_(i,j,k)
& - bs_ (i,j,k)
& - bes_(i,j,k)
& - bw_ (i,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

! backward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=Nz,1,-1
do j=Ny,1,-1
if ( j > 1 .And. k < Nz ) then
!$ call omp_set_lock( locks(j-1,k+1) )
endif
do i=Nx,1,-1
bu_ (i,j,k+1) = bu(i,j,k+1) * zt(i,j,k+1)
enddo
do i=Nx,1,-1
beu_(i-1,j,k+1) = beu(i-1,j,k+1) * zt(i-1,j,k+1)
enddo
do i=Nx,1,-1
bnu_(i,j-1,k+1) = bnu(i,j-1,k+1) * zt(i,j-1,k+1)
enddo
do i=Nx,1,-1
bs_ (i,j+1,k) = bs(i,j+1,k) * zt(i,j+1,k)
enddo
do i=Nx,1,-1
bes_(i-1,j+1,k) = bes(i-1,j+1,k) * zt(i-1,j+1,k)
enddo
do i=Nx,1,-1
bw_ (i+1,j,k) = bw(i+1,j,k) * zt(i+1,j,k)
enddo
do i=Nx,1,-1
zt(i,j,k) = zt(i,j,k) / bp(i,j,k)
& - bu_ (i,j,k+1)
& - beu_(i-1,j,k+1)
& - bnu_(i,j-1,k+1)
& - bs_ (i,j+1,k)
& - bes_(i-1,j+1,k)
& - bw_ (i+1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo
end

jimdempseyatthecove
Honored Contributor III

Arikd,

I am preparing a test program using dummy data.

In looking over your code, prior to each sweep you have a double loop issuing CALLs to omp_test_lock. This is a coding error, since omp_test_lock is a logical function and not a subroutine. Also, at that point in the code the locks should be in the unlocked state (assuming your code properly initialized the locks).
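For reference, omp_test_lock returns a logical and is used as a function, e.g.:

!$ if ( omp_test_lock( locks(j,k) ) ) then
!$ ! lock acquired; release it later with omp_unset_lock
!$ endif

or, if the intent was simply to acquire each lock up front, the subroutine form applies:

!$ call omp_set_lock( locks(j,k) )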

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Arik,

Your "original" solvm13 is not what I asked for. What you provided is your attempt at parallizing the (or some) code. What I asked for is the non-parallel version of the code that used to work. Please confirm that running solvm13 with OpenMP disabled that the results produced are correct.If your non-parallel results are incorrect then it is not productive for us to suggest improvements in performance of this non-functioning code.

Additionally, in looking at the solvm13 code, in addition to using CALL on what is a function rather than a subroutine, your set-lock and unset-lock usage is not correct, in that (assuming more than one thread) the thread unsetting a lock is not the same thread that set the lock. E.g. Thread 0 gets k=1, Thread 1 gets k=2, Thread 2 gets k=3, Thread 0 gets k=4, ... At some point Thread 0 will lock (j,2) whereas Thread 1 will be unlocking (j,2). This is not permitted (i.e. your code is relying on undefined behavior).

Jim Dempsey

arikd
Beginner

Jim,

Your locks note is well taken. It should be fixed, but that has nothing to do w/optimization. The code produces correct results in either configuration, i.e. the way we had it before or with the changes you proposed. The only issue is the solution time. We did run your modifications both with and without parallelization and the result is the same: roughly 2x slower than the original code.

Arik

jimdempseyatthecove
Honored Contributor III

Arik,

In looking at your forward sweep routine (where indexes increment) you are referencing offsets of -1 (i.e. k-1 and j-1). In the non-threaded execution of the code, this requires that those cells of zt have been updated prior to being referenced (with the -1 offset). A similar situation occurs in the backward sweep with the offsets of +1 (i.e. k+1 and j+1).

What this causes is a requirement for serialization of the process (only one thread at a time can make progress).

In many filtering or transformation algorithms the next time-state values depend on prior time-state values but not on values that are in a state of flux, i.e. your forward sweep would use index offsets of +1 and your backward sweep index offsets of -1. This causes me to question whether the algorithm as supplied is correct.

From the little understanding I have of what the code is, it looks as if you are calculating new values for r in 3D space using faces of a cube/box. If this is so, then each call to solvm13 would be part of a state advancement, and therefore all the values on the right-hand side of the zt(i,j,k) = assignment ought to reference the prior values of zt and not the values updated during the state advancement within solvm13. Could you comment on this?

Jim Dempsey

TimP
Honored Contributor III
Jim,

I already questioned these serial dependencies. Given that a chunk size of 1 is used, with luck, and all threads progressing at the same rate, it's possible that the updates have completed soon enough, but there's still a race condition. If it's an iterative method, it's possible that small variations in iteration history from run to run are acceptable and don't cause it to fail.

Tim
arikd
Beginner

Jim and Tim,

Unfortunately, both the algorithm and the code implementing it are correct. It has been tested to produce correct results and hence, by definition, it is fine. We are solving a banded matrix here, using the Incomplete Cholesky Conjugate Gradient (ICCG) method. One of the key elements of this approach is to carry out an incomplete LU decomposition (ILU) of the original matrix and then solve the system iteratively. Each iteration involves all kinds of dot products (like the ones I posted in the beginning), but the most time-consuming operation is this forward- and back-substitution step. (If you are interested, look up the standard direct, i.e. non-iterative, methods of matrix solution based on LU decomposition - all of them involve this type of forward/back-substitution step.)

Indeed, as I mentioned earlier and as you noticed, the step involves recurrence relations, which apparently impede its efficient implementation. It sounds as if you are saying that if that's what it is, there is not much we can do. If so, I will look for ways to break that recurrence, somehow.

Thanks,

Arik

jimdempseyatthecove
Honored Contributor III

Arik,

Hope is not lost.

I spent the day working on a different method of interlocking and came up with a way that produces these results on a system with 4 cores (2x Opteron 270 w/ 2 cores each).

Available Cores 4
Nx=200
Ny=200
Nz=200
solvm13_s = 4.05135552724823 (unthreaded)
solvm13_s = 4.05434086732566 (unthreaded)
solvm13_s = 4.05069780210033 (unthreaded)
solvm13_s2 = 6.38864697469398 Threads=1
solvm13_s2 = 6.37160542188212 Threads=1
solvm13_s2 = 6.36766673671082 Threads=1
solvm13_s2 = 3.33201492624357 Threads=2
solvm13_s2 = 3.32790499506518 Threads=2
solvm13_s2 = 3.32613467006013 Threads=2
solvm13_s2 = 2.32680508820340 Threads=3
solvm13_s2 = 2.33010923257098 Threads=3
solvm13_s2 = 2.33329555764794 Threads=3
solvm13_s2 = 1.83062202343717 Threads=4
solvm13_s2 = 1.82885152008384 Threads=4
solvm13_s2 = 1.82682427763939 Threads=4

The system only has 2GB, so I could not bump up the array sizes larger than 200 by 200 by 200 (202 by 202 by 202 if you count the 0 and +1 perimeters).

The arrays were declared as REAL(8). There is interlocking in the code but there are no OpenMP locks involved. And no extraneous temporary arrays.

Each test was run 3 times in order to produce a run time with some confidence that it is approximately correct. There is an unthreaded test (Without OpenMP statements and without the interlocking code added to force proper sequencing). The OpenMP threaded tests run with 1, 2, 3, 4 threads and 3 iterations each (in addition to the OpenMP overhead there is also the synchronization code to assert the temporal requirements of the method).

Two sets of arrays were used. Each set of arrays was declared as a user-defined type: one instance of the type was A and the other B (each with 9 arrays: zt(:,:,:), r(:,:,:), ...).

The arrays were initialized with different values in each array. The values were chosen such that neither Infinities nor NaNs would result in the output array (zt(i,j,k)). The A set of initialized data was the same as the B set of initialized data.

The unthreaded code processed the A set of arrays, the threaded code processed the B set of arrays. After each iteration a comparison was performed

if ( A%zt(i,j,k) .ne. B%zt(i,j,k) ) write(*,*) 'err ', i,j,k

No errors were reported between the various numbers of threads and the unthreaded code.

As you can see, there is a hit from going from unthreaded to OpenMP with 1 thread. Additional improvement might be attained with closer examination of the code.

You can email me to discuss this in further detail, my address follows.

Jim Dempsey
jim_dempsey@ameritech.net

jimdempseyatthecove
Honored Contributor III

Further enhancement gets to

Available Cores 4
Nx 200
Ny 200
Nz 200
solvm13_s = 4.03220395976678 (unthreaded)
solvm13_s = 4.03583411965519 (unthreaded)
solvm13_s = 4.04411484347656 (unthreaded)
solvm13_s2 = 5.83800405170768 Threads= 1
solvm13_s2 = 5.83111096965149 Threads= 1
solvm13_s2 = 5.83113517612219 Threads= 1
solvm13_s2 = 3.02408889075741 Threads= 2
solvm13_s2 = 2.99818594055250 Threads= 2
solvm13_s2 = 2.99723869049922 Threads= 2
solvm13_s2 = 2.07681118417531 Threads= 3
solvm13_s2 = 2.07517947163433 Threads= 3
solvm13_s2 = 2.06526678288355 Threads= 3
solvm13_s2 = 1.59286772180349 Threads= 4
solvm13_s2 = 1.59124766616151 Threads= 4
solvm13_s2 = 1.58598867338151 Threads= 4

In examining the disassembly code it appears that additional and significant improvement can yet be attained.

Jim Dempsey

adhishor
Beginner
You should try running some tests and tuning the 64-bit options. I will try to put some text on my website tomorrow when I have time.