Hi,
I have a 4D array, and I parallelize a loop that works on it using OpenMP parallel regions.
(See the part of the code attached below.)
I used the TotalView software for memory debugging, compiling with -g -O0 and setting OMP_NUM_THREADS=4.
In TotalView, I see that the 'phi' array in the code below has 5 copies created in the parallel region.
Why do I get 5 copies even though the maximum number of threads is 4? Is it the same for other optimization flags, say the commonly used -O3?
Also in TotalView, each copy created for the array 'phi' is 4D and has the same dimensions as specified in the module.
Is that really the case?
How can I make the private array size phi_private(11,11,11,1) so that the code runs memory-efficiently on 4 threads?
Thanks.
[plain]      module data
      real, dimension(:,:,:,:), allocatable :: phi
      end module data

      program test
      USE data
      real counter
      integer i, j, k, m

      allocate(phi(11,11,11,4))
      counter = 0.0

c$omp parallel
c$omp do private(i,j,k,m)
      do m = 1, 4
        do k = 1, 11
          do j = 1, 11
            do i = 1, 11
              phi(i,j,k,m) = counter
              counter = counter + 1.2
            enddo
          enddo
        enddo
      enddo
c$omp enddo
      .....
      .....
c$omp end parallel
      end program test[/plain]
6 Replies
Hi, Amit
You don't need to declare the array phi(11,11,11,4) as private, since there is no dependence between loop iterations for the array elements phi(i,j,k,m). The compiler will handle this and parallelize those array elements automatically. If you do declare it private, however, then according to the OpenMP rules a separate copy of the array is made for each thread to access privately, in addition to the original one. That is why in your case 4 separate copies, 5 in total, of the array are created during the loop parallelization. When the loop ends, those private copies are destroyed.
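For illustration, here is a minimal free-form sketch (not from the original post; the program name is hypothetical, and it assumes an OpenMP 3.0 compiler, where a private copy of an already-allocated allocatable array is allocated with the same bounds). Writes made to the private copies never reach the shared array, and the copies are discarded at the end of the region:
[plain]      program private_copy_demo
          use data                            ! module from the original post, provides phi
          use omp_lib, only: omp_get_thread_num
          implicit none

          allocate(phi(11,11,11,4))
          phi = 0.0

          ! Each of the 4 threads gets its own 11x11x11x4 copy of phi here:
          ! 4 private copies plus the shared original = the 5 copies seen in TotalView.
      !$omp parallel private(phi)
          phi = real(omp_get_thread_num())    ! writes go only to the thread's private copy
      !$omp end parallel

          ! The private copies were destroyed at the end of the region,
          ! so the shared array is still all zeros.
          print *, 'phi(1,1,1,1) after the region:', phi(1,1,1,1)
      end program private_copy_demo[/plain]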
The code as written is correct in its usage of phi.
That is, there is only one instance of the array phi.
You can print out LOC(phi(1,1,1,1)) and see that all threads print the same address.
You do have a problem with
phi(i,j,k,m)=counter
counter = counter + 1.2
since counter is shared and updated concurrently by all threads, you will not fill the array phi with the values you expect.
use
phi(i,j,k,m) = ((m-1)*11*11*11*1.2) + ((k-1)*11*11*1.2) + ((j-1)*11*1.2) + (i-1)*1.2
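A minimal free-form sketch of that address check (not from the original post; the program name is hypothetical, and LOC is a common vendor extension in ifort and gfortran rather than standard Fortran):
[plain]      program check_shared
          use data                            ! module from the original post, provides phi
          use omp_lib, only: omp_get_thread_num
          implicit none

          allocate(phi(11,11,11,4))

          ! Every thread reports the same address, confirming there is a single
          ! shared instance of phi inside the parallel region.
      !$omp parallel
          write(*,*) 'thread', omp_get_thread_num(), ': phi(1,1,1,1) at', loc(phi(1,1,1,1))
      !$omp end parallel
      end program check_shared[/plain]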
Quoting - Yolanda Chen (Intel)
Hi Yolanda,
"Compiler will handle this and parallel those array elements automatically."
How do I get more information about this particular implementation?
My understanding of the implementation is as follows. Please correct me as necessary.
When a global array is accessed by a work-sharing construct inside a parallel region, the OpenMP implementation hands each thread the starting and ending memory locations of the part of the array it should work on, and once the work-sharing construct has finished (but still inside the same parallel region) each thread updates its part of the global array.
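To check what the DO work-sharing construct actually hands out, namely loop iterations rather than address ranges, here is a minimal free-form sketch (not from the original thread; the program name is hypothetical). With OMP_NUM_THREADS=4 and the default static schedule, each thread typically ends up with one value of m:
[plain]      program show_schedule
          use omp_lib, only: omp_get_thread_num
          implicit none
          integer :: m

          ! The work-sharing DO distributes the iterations m = 1..4 among the
          ! threads of the enclosing parallel region; m is private to each thread.
      !$omp parallel
      !$omp do
          do m = 1, 4
              write(*,*) 'thread', omp_get_thread_num(), 'executes m =', m
          end do
      !$omp end do
      !$omp end parallel
      end program show_schedule[/plain]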
Thanks.
Quoting - jimdempseyatthecove
Hi Jim,
In TotalView, I get different addresses for loc(phi(1,1,1,1)), but with a write to the screen inside the loop I get the same address. It seems that the TotalView debugger creates new address space for temporary copies of the array for debugging purposes.
Thanks.
Maybe TotalView is telling you the location of the descriptor for phi and not the location of the first element of phi
(and, for whatever reason, each thread gets its own copy of the descriptor). Or TotalView is broken with respect to OpenMP and arrays.
Trust the address produced by WRITE(*,*) LOC(PHI(1,1,1,1)).
Jim
Quoting - Yolanda Chen (Intel)
Hi, Amit
It is not really a question of low-level implementation details. The OpenMP directives were introduced to relieve the developer of the complicated threading work and let the compiler take over part of the job; in that sense I said the compiler will do the work automatically.
Specifically for the loop parallelization in your case, each iteration references a different array element: the subscript combination (i,j,k,m) is unique across iterations. If we have 4 threads in total, each thread can take an average share of those iterations to run, e.g. 11*11*11 of them (this also depends on how you specify the loop scheduling). Since the (i,j,k,m) tuples are distinct, there is no overlap in references to phi across threads, so phi can be shared.
The variable counter, however, is shared between loop iterations, and we need to express the assignment differently to make the loop parallelizable:
phi(i,j,k,m) = ((m-1)*11*11*11*1.2) + (k-1)*11*11*1.2 + ((j-1)*11*1.2) + (i-1)*1.2
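Putting this together, a minimal free-form sketch of the whole loop with phi shared and the running counter removed, so every iteration is independent (not from the original post; the program name is hypothetical):
[plain]      program test_fixed
          use data                            ! module from the original post, provides phi
          implicit none
          integer :: i, j, k, m

          allocate(phi(11,11,11,4))

          ! phi stays shared; only the inner loop indices need to be private
          ! (m, the work-shared loop index, is private automatically).
      !$omp parallel do private(i, j, k)
          do m = 1, 4
              do k = 1, 11
                  do j = 1, 11
                      do i = 1, 11
                          ! Same values the sequential counter version would produce,
                          ! up to rounding from the repeated additions.
                          phi(i,j,k,m) = ((m-1)*11*11*11*1.2) + ((k-1)*11*11*1.2) &
                                       + ((j-1)*11*1.2) + (i-1)*1.2
                      end do
                  end do
              end do
          end do
      !$omp end parallel do
      end program test_fixed[/plain]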
To learn more about OpenMP, there is a good article to get started:
http://software.intel.com/en-us/articles/getting-started-with-openmp/
