Intel® Fortran Compiler

arrays, pointers, memory...

rreis
Hi all

I need to partition an array in two different ways. The best would be if I could use the same array and just change the upper bounds. I've tried it, but I think the compiler is creating temporary arrays, so I thought someone here could give me a hand, offer advice, or just tell me to forget it.

Initial version:

real(prec), dimension(:,:,:), allocatable :: &
cuo,cvo,cwo, cuoC,cvoC,cwoC

allocate(cuo(nG2d,nG3,nL1))
allocate(cvo(nG2d,nG3,nL1))
allocate(cwo(nG2d,nG3,nL1))

allocate(cuoC(nG1,nL2,nG3))
allocate(cvoC(nG1,nL2,nG3))
allocate(cwoC(nG1,nL2,nG3))

ALT version:

use, intrinsic :: iso_c_binding   ! for c_loc / c_f_pointer

real(prec), dimension(:,:), allocatable, target :: uvw

real(prec), dimension(:,:,:), pointer :: &
cuo,cvo,cwo, cuoC,cvoC,cwoC

nG2dnG3nL1 = nG2d*nG3*nL1
nG1nL2nG3 = nG1*nL2*nG3

allocate(uvw(nG2dnG3nL1, 3))

call c_f_pointer (c_loc(uvw(1,1)), cuo, [nG2d,nG3,nL1])
call c_f_pointer (c_loc(uvw(1,2)), cvo, [nG2d,nG3,nL1])
call c_f_pointer (c_loc(uvw(1,3)), cwo, [nG2d,nG3,nL1])

call c_f_pointer (c_loc(uvw(1,1)), cuoC, [nG1,nL2,nG3])
call c_f_pointer (c_loc(uvw(1,2)), cvoC, [nG1,nL2,nG3])
call c_f_pointer (c_loc(uvw(1,3)), cwoC, [nG1,nL2,nG3])



These arrays are then passed to this subroutine:

call shear(cuo,cvo,cwo, cuoC,cvoC,cwoC)


which, on the other side, declares:

! - velocity arrays for the slices
real(prec), dimension(nG2d,nG3,nL1), intent(inout) :: cuo, cvo, cwo !, work

! - velocity arrays for the xz planes
real(prec), dimension(nG1,nL2,nG3), intent(inout) :: cuoC, cvoC, cwoC




Any ideas or suggestions? Many thanks,
jimdempseyatthecove

Ricardo,

In the original case you have separate arrays, (cuo, cvo, cwo) and (cuoC, cvoC, cwoC). IOW, none of the six arrays share memory.

In the proposed ALT version, (cuo, cvo, cwo) share memory with (cuoC, cvoC, cwoC).
Both array sets are passed to a subroutine as intent(inout).
Therefore, success or failure of the called routine may depend upon temporal issues.
Consequently, your working serial version of the shear subroutine might NOT be suitable for parallelization without addressing those temporal issues.

I think it would be better for you to present the larger picture of the problem, the performance issues encountered, and your proposed steps toward resolution.

Is this a memory capacity issue?
A performance issue?
Will parallelization be involved?
Does your current code exhibit poor vectorization?
Other?

Jim Dempsey
rreis
Thanks Jim.

OK, let's start. The code is already parallelized with MPI. It is a pseudo-spectral code, having two Fourier dimensions and one compact finite-difference direction. That's why I "need" the two distinct groups of arrays. One (cuo,cvo,cwo) is ordered (ny+2, nz, nx_local), meaning I'm grouping the Y and Z dims for better performance doing FFTs (they are 2D FFTs). The other group (cuoC, cvoC, cwoC) is ordered (nx, ny_local, nz) because I need to get derivatives along the x lines, using the compact scheme. nx_local and ny_local are the X and Y dimensions divided by the number of MPI processes. This ordering was arrived at after profiling for the best alternative in the comms routines.

Right now you can see that cuo,cvo,cwo and cuoC,cvoC,cwoC have almost the same size and hold the same data, albeit in a different configuration (and space, Fourier or physical). My problem is memory. Since they never overlap in the computation phases they go through (meaning cuo,cvo,cwo and cuoC,cvoC,cwoC are used in distinct phases of the code), using the same allocated space would be most desirable, because then bigger simulations could be run for the same memory expenditure.

I hope I managed to explain my problem; if not, please ask and I'll try again.

What I was expecting with the "shared" approach was to reduce memory usage roughly by half, but that was not the case :(
jimdempseyatthecove

Ricardo,

Your subroutine shear uses both sets of arrays as intent(inout). Are you certain that writing into one, while it shares the same block of memory as the other, will not affect the other or itself? I would suggest you run a rigorous test designed to catch such an event: e.g., enter with non-zero values, compute non-zero values, and at the moment of the last write, replace the value with zero. Then insert diagnostic code to assert that the input side never sees 0.0. Run with varying dimensioned sets (with/without OpenMP, if you are using OpenMP). IOW, assert that what you assume is free for use really is available for use. You can also run a test calling shear with separate arrays, followed by a call using the remapped arrays, then compare the outputs (as well as the portions of the input that ought not to change).

Re-using a buffer is one thing; performing a transformation/reclamation within the same buffer is another. Your code may provide for this, but writing test diagnostic code is strongly recommended. If the diagnostic code exposes a problem, then studying it may yield a workaround (e.g., a small secondary read-ahead buffer may solve the problem). In tight memory situations you have to do what you have to do.
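For example, a minimal self-contained sketch of that sentinel idea (the toy shapes and the poison value are illustrative only, not from your real code):

program alias_check
  use, intrinsic :: iso_c_binding
  implicit none
  integer, parameter :: prec = kind(1.0d0)
  ! toy shapes with n1*n2*n3 == n4*n5*n6 (24 elements)
  integer, parameter :: n1 = 4, n2 = 3, n3 = 2
  integer, parameter :: n4 = 2, n5 = 3, n6 = 4
  real(prec), parameter :: poison = -huge(1.0_prec)  ! a value the physics never produces
  real(prec), allocatable, target :: flat(:)
  real(prec), pointer :: b(:,:,:), c(:,:,:)

  allocate(flat(n1*n2*n3))
  call c_f_pointer(c_loc(flat(1)), b, [n1, n2, n3])
  call c_f_pointer(c_loc(flat(1)), c, [n4, n5, n6])

  b = 1.0_prec         ! the "input" view holds live data
  c(1, 1, 1) = poison  ! simulate the last write through the other view
  if (any(b == poison)) print *, 'views alias: a write through c is visible through b'
end program alias_check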

Jim Dempsey


rreis
Jim

I am positive I have no problem in that respect. (I've tried another version, with a maximized array and Fortran pointers; I found that the resultant array was bigger than the sum of the parts, so I gave up on it.)

The serial code used only one set of arrays; I need this division because of the way it was parallelized. All I want is a suggestion for how I can avoid the temporary arrays the compiler seems to be creating, because I don't see the memory size diminishing with my "shared arrays" approach.

many thanks,
rreis
I might add that I'm not using OpenMP, just MPI, and between uses the information is passed through buffer arrays, which means there's isolation between the use of one set of arrays and the other.
jimdempseyatthecove

Although the shared memory buffer may be safely used, it might be advisable to have a state variable indicating which set of array descriptors is active. Or, immediately nullify the non-current descriptors so you abend if you inadvertently use the munged-up array (from those descriptors' viewpoint).
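A minimal sketch of that guard, assuming uvw and the six pointers live in the same module (select_view is a hypothetical helper, not from the original code):

subroutine select_view(slices)
  logical, intent(in) :: slices
  if (slices) then
     nullify(cuoC, cvoC, cwoC)  ! any use of the stale view now fails fast
     call c_f_pointer(c_loc(uvw(1,1)), cuo, [nG2d,nG3,nL1])
     call c_f_pointer(c_loc(uvw(1,2)), cvo, [nG2d,nG3,nL1])
     call c_f_pointer(c_loc(uvw(1,3)), cwo, [nG2d,nG3,nL1])
  else
     nullify(cuo, cvo, cwo)
     call c_f_pointer(c_loc(uvw(1,1)), cuoC, [nG1,nL2,nG3])
     call c_f_pointer(c_loc(uvw(1,2)), cvoC, [nG1,nL2,nG3])
     call c_f_pointer(c_loc(uvw(1,3)), cwoC, [nG1,nL2,nG3])
  end if
end subroutine select_view

Compiling with runtime pointer checking (e.g. ifort's -check pointers) should turn a use of the nullified view into an immediate abort rather than silent corruption.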

Jim
jimdempseyatthecove

>>I've found out the resultant array was bigger than the sum of the parts so I gave up on it.
Then allocate the conglomerate array first, to the larger size, then map the two descriptor layouts.
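A minimal sketch of that, reusing the names from your ALT version (nflat is a placeholder, and only the u component is shown):

integer :: nflat
nflat = max(nG2d*nG3*nL1, nG1*nL2*nG3)
allocate(uvw(nflat, 3))

! both layouts now fit in every column of uvw
call c_f_pointer(c_loc(uvw(1,1)), cuo,  [nG2d,nG3,nL1])
call c_f_pointer(c_loc(uvw(1,1)), cuoC, [nG1,nL2,nG3])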
Maybe it is time to look at more RAM (and an x64 platform).

On x32 you might be able to use two (or more) processes and a memory-mapped file for shared data. Instead of passing array descriptors from function to function, you pass an event between processes when the data is ready and complete. This way you can extend the working data across two processes. Note, this is not the same as MPI on the same system.

Jim

rreis

>>Then allocate the conglomerate array first, to the larger size, then map the two descriptor layouts.
>>Maybe it is time to look at more RAM (and an x64 platform).
>>
>>On x32 you might be able to use two (or more) processes and a memory-mapped file for shared data. Instead of passing array descriptors from function to function, you pass an event between processes when the data is ready and complete. This way you can extend the working data across two processes. Note, this is not the same as MPI on the same system.
>>
>>Jim


I'm already using x64... exactly 512 cores (nodes with 32 cores and 64 GB RAM each), doing a 1x10^9-point run (2048x2048x256). But I want to do more with less.
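For scale: at double precision, 2048 x 2048 x 256 = 2^30 ≈ 1.07x10^9 points means roughly 8 GB per full field spread across all ranks, so six full-size fields are on the order of 48 GB, while three shared slots would be about half that.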

Jim, if you notice, I'm already doing that with the shared option: I choose the bigger array and map the two layouts there. What I suspect is happening is that, when passing the pointers to the subroutine, the compiler is creating temporary arrays for the second set of pointers, and that's why I don't see the memory usage reduced by half.

The question is fairly simple. I want to allocate an array, say A, with the maximum dimension possible, say A(NTOT). Then I want to look at it in two different ways, say B(n1,n2,n3) and C(n4,n5,n6), where NTOT = n1*n2*n3 = n4*n5*n6 but n1 differs from n4, and so on. Wasn't the pointer strategy employed in the first post supposed to work? And if not, why (or under what conditions)?

many thanks for bearing with me
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,585 Views

The pointer transformation should have worked.
There is an alternative (which I do not have much experience with): the newer interoperability intrinsics for converting a C pointer into an array descriptor.
use, intrinsic :: iso_c_binding   ! provides C_LOC and C_F_POINTER
REAL(prec), allocatable, target :: flat(:)
REAL(prec), pointer :: B(:,:,:), C(:,:,:)

...
allocate(flat(NTOT), STAT=ierror)
...
CALL C_F_POINTER(C_LOC(flat(1)), B, (/n1,n2,n3/))
CALL C_F_POINTER(C_LOC(flat(1)), C, (/n4,n5,n6/))  ! note: the second view gets its own shape


Something like the above should work.

The only bugaboo is that the array indexes are 1-based (1:n1, 1:n2, 1:n3)...
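If your compiler supports Fortran 2008 pointer rank remapping, here is a sketch of another route (untested here) that avoids the C interop entirely and also lets you pick lower bounds other than 1:

real(prec), allocatable, target :: flat(:)
real(prec), pointer, contiguous :: B(:,:,:), C(:,:,:)

allocate(flat(NTOT))
B(1:n1, 1:n2, 1:n3) => flat   ! same storage, first layout
C(1:n4, 1:n5, 1:n6) => flat   ! same storage, second layout

The CONTIGUOUS attribute also tells the compiler that no copy-in/copy-out temporary is needed when these pointers are passed to the explicit-shape dummies of shear; running with ifort's -check arg_temp_created should confirm whether argument temporaries are being created at all.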

Jim Dempsey