Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Sharing memory among threads

Petros
Novice
Hello all parallel coders! I need your insights...

I have a big loop that I have parallelized, and I need to see whether the way I have laid out the data is optimal:

[fortran]
integer, parameter :: num_elem=40, num_domains=10000
integer :: IPVT(num_elem,num_domains), info, i, counter, chunk
double precision :: A(num_elem,num_elem,num_domains), Bx(num_elem,num_domains), &
                    By(num_elem,num_domains), Tx(num_elem,num_domains), Ty(num_elem,num_domains), &
                    TB(4,num_domains), Cx(num_elem), Cy(num_elem)
double precision :: f(num_elem,num_domains), x(num_elem,num_domains)
logical :: converged

call initialize(Cy,Cx,counter,chunk,x,f,converged)

do while(.not. converged)
  ! x and f need explicit data-sharing attributes under default(none); chunk is listed as well
  !$omp parallel do default(none) &
  !$omp private(i, info) &
  !$omp shared(num_domains,A,IPVT,Ty,Tx,Bx,By,Cy,Cx,TB,x,f,chunk) &
  !$omp reduction(+:counter) &
  !$omp schedule(dynamic,chunk)
  do i=1,num_domains
    call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx)
    call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i))
    counter=counter+1
  enddo
  !$omp end parallel do
  call check_convergence(x,f,converged)
enddo
[/fortran]
-In this loop many domains are calculated by each thread.
-Inside calculate_domain(): A,IPVT,Ty,Tx,TB,f,Bx and By are intent(out) while Cy,Cx and x are intent(in).
-Inside solve_domain(): A,IPVT,Ty,Tx,TB,f,Bx and By are intent(in) while x is intent(inout).
-The schedule is dynamic because the work inside each subdomain is not the same.

I get some speedup, but not what I expected. This part of the code accounts for 80% of the total runtime, so by Amdahl's law the maximum theoretical speedup is 1/(1-0.8) = 5x. I get 2.5-3x (varying the number of threads from 2 to 48 on a 48-core machine; the maximum speedup occurs at 20-22 cores).

Thoughts I have:
1) Use thread-private data. This would multiply the memory used by the number of threads.
2) Use allocatable arrays so that each domain is sized as needed rather than at the size of the biggest domain (see the sketch after this list).
3) Use a static schedule by manually dividing the domains among the threads.
4) Put the data of each domain in a derived type and then have an array of structures. That is:
[fortran]
type domain_data
   double precision :: A(num_elem,num_elem), Bx(num_elem)
   ...
end type domain_data

type(domain_data), dimension(num_domains) :: all_domains_data
[/fortran]
5) Any combination of the above.
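For option 2, a rough sketch of what I have in mind (the type name and the domain_size(i) function are hypothetical; in the real code the per-domain size would come from the mesh data):

[fortran]
type domain_data_alloc
   integer :: n                                   ! actual number of elements in this domain
   double precision, allocatable :: A(:,:)        ! n x n system matrix
   double precision, allocatable :: Bx(:), By(:), Tx(:), Ty(:)
   double precision, allocatable :: x(:), f(:)
   integer, allocatable :: IPVT(:)
end type domain_data_alloc

type(domain_data_alloc), allocatable :: dom(:)
integer :: i
integer, external :: domain_size                  ! hypothetical: returns the true size of domain i

allocate(dom(num_domains))
do i = 1, num_domains
   dom(i)%n = domain_size(i)
   allocate(dom(i)%A(dom(i)%n, dom(i)%n), dom(i)%Bx(dom(i)%n), dom(i)%By(dom(i)%n), &
            dom(i)%Tx(dom(i)%n), dom(i)%Ty(dom(i)%n), &
            dom(i)%x(dom(i)%n), dom(i)%f(dom(i)%n), dom(i)%IPVT(dom(i)%n))
enddo
[/fortran]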

Can you give me your thoughts? Someone more experienced maybe?

Thanks in advance,
Petros
8 Replies
TimP
Honored Contributor III
I have demonstrated in my own tests that explicit division of work among threads, so as to use static scheduling, performs much better than dynamic on multi-CPU platforms. On a single CPU with a shared last-level cache, schedule(dynamic,2), (guided,2), or the like should work. If you must choose among static, dynamic, and guided, static may still work acceptably even if one thread ends up with double the average work; in that situation, dynamic or guided with the threads pinned to a single CPU socket may also work.
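To illustrate what I mean by explicit division (a sketch only, reusing the loop from your first post; the counter reduction and the convergence test are left out for brevity):

[fortran]
use omp_lib
integer :: tid, nthreads, lo, hi, i

!$omp parallel private(tid, nthreads, lo, hi, i)
   tid      = omp_get_thread_num()
   nthreads = omp_get_num_threads()
   ! each thread takes one contiguous block of domains, so A(:,:,lo:hi) stays local to it
   lo = (tid * num_domains) / nthreads + 1
   hi = ((tid + 1) * num_domains) / nthreads
   do i = lo, hi
      call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx)
      call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i))
   enddo
!$omp end parallel
[/fortran]

With blocks that large, each thread also keeps touching the same region of memory on every outer iteration, which helps cache reuse.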
Thread private arrays significantly increase various overheads involved in starting or completing parallel regions. Even with thread-private scalars, there appears to be a fairly low limit on the number of distinct private variables that Intel OpenMP handles efficiently. The compiler's -parallel option may be superior to OpenMP when an excessive number of privates is involved.
I've also made some comparisons with Cilk+. The latter becomes more competitive against OpenMP as dynamic scheduling comes into play. Cilk+ institutes a great divide between simple cases which are more straightforward than OpenMP and more complex reductions where the opposite is true.
Petros
Novice
First of all thanks for the answer!
For my situation, schedule(dynamic,100) seems to give the best results. Maybe it's because of spatial locality? (A(:,:,i) and A(:,:,i+1) are next to each other in memory.)
>>Thread private arrays significantly increase various overheads involved in starting or completing parallel regions.
Isn't this overhead incurred only once? Or is it incurred at each parallel region? When the outer do while iterations reach into the thousands, doesn't having data local to each thread make a difference? There are 4 CPUs, each with a shared last-level cache.
>>I've also made some comparisons with Cilk+. The latter becomes more competitive against OpenMP as dynamic scheduling comes into play. Cilk+ institutes a great divide between simple cases which are more straightforward than OpenMP and more complex reductions where the opposite is true.
I've thought about Cilk+ and TBB, but neither is available from Fortran. Most of the code is written in Fortran 95/2003; rewriting it in C++ would be a big job.
What about many separate arrays vs. an array of structures? Have you run any tests on that?
TimP
Honored Contributor III
It's always best to use the largest chunk size that still gives you satisfactory work balance. I think it's unusual for large dynamic chunks to achieve that when static can't, but when you have the right situation, it can be a good choice.
Usually, starting a new parallel region involves a new set of private data, so I expect the overhead for copyin and copyout (explicit or implicit) to be incurred at the beginning and end of each parallel region, even though KMP_BLOCKTIME can save the additional overhead of re-assembling the thread pool. So it can raise the number of outer loop iterations needed for effective parallelism into the tens of thousands.
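If that per-region cost matters in your case, one option (a sketch only, based on the loop in your first post) is to open the parallel region once, outside the do while, and put a worksharing do inside it, so the private data survives across outer iterations:

[fortran]
!$omp parallel private(i, info)
do while (.not. converged)
   !$omp do schedule(dynamic,chunk) reduction(+:counter)
   do i = 1, num_domains
      call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx)
      call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i))
      counter = counter + 1
   enddo
   !$omp end do          ! implicit barrier
   !$omp single
   call check_convergence(x, f, converged)
   !$omp end single      ! implicit barrier, so every thread sees the updated 'converged'
enddo
!$omp end parallel
[/fortran]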

In my personal demonstrations, I've been able to use Cilk+ in C functions called from a Fortran OpenMP application. It requires avoiding overlap between OpenMP and Cilk+ parallel regions, as the threaded run-times don't cooperate (TBB and Cilk+ should cooperate). Still it requires very specialized (in my view) cases for it to work well, which I don't think would include cases where a large chunk size is effective.

Compilers and hardware platforms are evolving toward more effective parallelization of arrays of structures. In the cases I have to deal with (legacy Fortran equivalents of arrays of structures), parallelization of array updates incurs false sharing, so the updates are much more dependent on fast cache updates than the reads are. This brings in the need for multiple levels of parallelism, e.g. MPI over OpenMP over vectorization, sooner. The only advantages of the array of structures are possible improvements in maintainability, and possibly enough improvement in cache locality to overcome the sharing of cache lines among threads.
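To make the false-sharing point concrete: with a small per-domain record such as your TB(4,i), two adjacent domains share one cache line, so concurrent writes from different threads contend. A padded record (sketch below; the type is hypothetical and assumes 64-byte cache lines) keeps each domain's update in its own line, at the cost of doubling that storage:

[fortran]
type padded_tb                   ! hypothetical padded replacement for TB(4,num_domains)
   double precision :: tb(4)     ! the 32 bytes of real data
   double precision :: pad(4)    ! padding so each record fills a full 64-byte cache line
end type padded_tb

type(padded_tb) :: TBp(num_domains)   ! TBp(i)%tb takes the place of TB(:,i)
[/fortran]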
jimdempseyatthecove
Honored Contributor III
Petros,

I suggest you create a module that contains

A(:,:,:),IPVT(:,:),Ty(:,:),Tx(:,:),Bx(:,:),By(:,:),TB(:,:),x(:,:),f(:,:), Cy, Cx

And then USE that module within your two subroutines
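Something along these lines (a sketch only; array names and sizes are taken from your first post, and solve_domain would be changed the same way):

[fortran]
module domain_arrays
   implicit none
   integer, parameter :: num_elem = 40, num_domains = 10000
   integer :: IPVT(num_elem,num_domains)
   double precision :: A(num_elem,num_elem,num_domains)
   double precision :: Bx(num_elem,num_domains), By(num_elem,num_domains)
   double precision :: Tx(num_elem,num_domains), Ty(num_elem,num_domains)
   double precision :: TB(4,num_domains), Cx(num_elem), Cy(num_elem)
   double precision :: x(num_elem,num_domains), f(num_elem,num_domains)
end module domain_arrays

subroutine calculate_domain(i)
   use domain_arrays
   implicit none
   integer, intent(in) :: i
   ! operate directly on A(:,:,i), IPVT(:,i), ... ; no per-call array descriptors are built
end subroutine calculate_domain
[/fortran]

The parallel loop then only needs to pass the index: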

!$omp parallel do reduction(+:counter)
do i=1,num_domains
   call calculate_domain(i)
   call solve_domain(i)
   counter=counter+1
enddo
!$omp end parallel do

Jim Dempsey
Petros
Novice
I can try that, but what is the justification? The module data will be in the extended scope of the parallel region, so it will be as if the arrays are shared? Is there any benefit instead of explicitly declaring them as shared?

Thanks,
Petros
jimdempseyatthecove
Honored Contributor III
>>Is there any benefit instead of explicitly declaring them as shared?

You won't be passing all of the arguments on each call to the two inner subroutines.
All that is required to be passed is the index i.

You did not indicate whether the called routines take assumed-shape arrays. My guess is yes, because of the way the arguments are declared on the call. You are making 10,000 calls. You did not list the extent of the other dimensions of the passed arrays. If they are relatively small, then the overhead to construct and pass the array descriptors may be a significant portion of the loop. If you have VTune available, you can measure the time spent making the calls (as opposed to the time spent in what was called).

Jim Dempsey
Petros
Novice
Yes, the routines take assumed-shape arrays. The dimensions are 40x1000 for all arrays except A, which is 40x40x1000. I will try the VTune approach! I've used it before to check load balancing and overhead, but not for this. I'll try it.
jimdempseyatthecove
Honored Contributor III
Your first post had num_domains at 10000 (not 1000) and num_elem at 40, therefore

do i=1,num_domains
   call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx)
   call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i))
   counter=counter+1
enddo

would:

loop:
  reserve stack space for array descriptors (hardly any time)
  construct array descriptors A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy(:),Cx(:), each possibly 20 (machine) words to initialize (x11 array args), IOW at least 220 machine words to write
  call calculate_domain (hardly any time)
  iterate 40 times on all but A
  return
  reserve stack space for array descriptors (hardly any time)
  construct array descriptors A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i), each possibly 20 (machine) words to initialize (x9 array args), IOW at least 180 machine words to write
  call solve_domain (hardly any time)
  iterate 40 times on all but A
  return
end loop

Depending on what is done in each subroutine, with 40 iterations (assumption) the time to construct the array descriptors could exceed the processing time within the called subroutine.

If these subroutines can be inlined, then compiler optimization might remove some of the redundancy, e.g. constructing the array descriptors once and/or hoisting (some of) the descriptor construction out of the loop. I cannot say whether the optimizer will actually do this. However, you can easily achieve the same effect by using the module technique.

Jim Dempsey