Dynamically allocated private variables - possible?

xraygenfit · ‎08-21-2007

Hi,

I have a large calculation, during which I have to store several arrays. I have successfully parallelized this by dynamically allocating arrays within the loop (with very poor performance), or by declaring them outside of the loop statically (i.e. double ak[5000]) and making them private. The latter method works very well, except when I use a large dataset. If I use double ak[10000] (along with other arrays) in a parallelized region, I get a stack overflow crash. If I dynamically declare my array outside of the loop and make it private, I get a crash. It seems as if I can't allocate memory dynamically and have it be private. Right now, it looks like if I want to dynamically allocate my arrays, I must do it within the parallel for structure. Is there a workaround for this? I've scoured the internet, and can't seem to find an answer. Thanks for any help.

TimP · ‎08-21-2007

Certain compilers have options to control thread stack size limits, and to control the maximum size of arrays which go on stack rather than heap.

jimdempseyatthecove · ‎08-21-2007

Both C++ and Fortran support a thread-private data area. The thread-private area isn't large, but it is more than adequate to hold a pointer to your large array, or a pointer to a structure that contains large arrays or pointers to large arrays.

An alternate method is to pass a pointer to a thread-private context area along with function/subroutine calls.

The following is what I use in Fortran

type TypeThreadContext
    SEQUENCE
    type(TypeObject), pointer :: pObject
    type(TypeTether), pointer :: pTether
    type(TypeFSInput), pointer :: pFSInput
    integer :: LastObjectLoaded
end type TypeThreadContext

type(TypeThreadContext) :: ThreadContext
COMMON /CONTEXT/ ThreadContext
!$OMP THREADPRIVATE(/CONTEXT/)

In this implementation I hold pointers to an element within an array. The thread context could also contain a pointer to a derived type that contains an array or a pointer to an array (both allocatable).
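For C++ readers, a rough analogue of the thread-private idea is a file-scope pointer marked with the OpenMP threadprivate directive. The sketch below is only an illustration: the names tp_buffer, tp_capacity, thread_scratch and sum_all are hypothetical, and the workload is a stand-in. It assumes the buffer is reallocated only when a larger size is requested.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names. With OpenMP enabled, each thread gets its own copy
// of these two file-scope variables.
static double*     tp_buffer   = nullptr;
static std::size_t tp_capacity = 0;
#pragma omp threadprivate(tp_buffer, tp_capacity)

// Returns the calling thread's scratch array, reallocating only when the
// existing one is too small for the request.
double* thread_scratch(std::size_t n) {
    assert(n > 0);
    if (tp_capacity < n) {
        delete[] tp_buffer;            // deleting a null pointer is a no-op
        tp_buffer   = new double[n];
        tp_capacity = n;
    }
    return tp_buffer;
}

// Hypothetical workload: fills the scratch array for every item and sums it.
double sum_all(int datacount, std::size_t nl) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < datacount; ++i) {
        double* myvar = thread_scratch(nl);   // this thread's private array
        for (std::size_t j = 0; j < nl; ++j) myvar[j] = static_cast<double>(i);
        for (std::size_t j = 0; j < nl; ++j) total += myvar[j];
    }
    return total;
}
```

The buffers in this sketch are never freed, so a real program would release them before exit. Only a pointer and a size live in thread-private storage; the array itself is on the heap.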

A different technique applies when you have t threads and n "things" to process, where each "thing" has a varying number of entities requiring temporary array space, and memory constraints rule out allocating the worst-case size for every thread.

For this in Fortran you create a derived type (class if C++, struct if C or C++) containing thread context information including pointers to arrays:

type TypeCURCAL
integer :: FirstInteger
integer :: MAXSEG
real, pointer :: CTBDUM(:)
real, pointer :: DUMI(:)
integer :: LastInteger
end type TypeCURCAL

You will typically have one of these types per major subroutine/function requiring scratch space. This one happens to declare the data for my subroutine CURCAL.

Then, in a common module, declare an allocatable array of this data type:

! CMODE4.F90
type(TypeCMODE4), allocatable :: cmode4(:)
! CURCAL.F90
type(TypeCURCAL), allocatable :: crucal(:)

On entry to the subroutine CURCAL, a function call obtains the pointer to this thread's entry, crucal(threadnumber). Before returning the pointer, the function tests whether crucal has been allocated. If not, it enters a critical section, re-tests the allocation status, and, if crucal is still unallocated, allocates it to the number of threads. Next it obtains the pointer to crucal(threadnumber) and tests whether the arrays within the TypeCURCAL structure are allocated. If they are not, it allocates them. If they are, it checks whether their size meets the requirements of the particular thing being processed by the thread; if so, it exits with the pointer, and if not, it deallocates and reallocates the scratch space. Once everything is initialized and large enough, the function is a quick in and out.



Jim Dempsey

xraygenfit · ‎08-21-2007

Thanks for the replies guys. That last one sounds fairly complicated. I would think this would be one of the things OpenMP would be designed to do.

What I currently have is 

//I have
double myvar[5000];
//I want 
//double* myvar = new double[nl];


#pragma omp parallel for private(myvar)
for(int i = 0; i < datacount; i++)
{
   //Do some stuff
   for(int j = 0; j < nl; j++)
   {
      myvar[j] = sqrt(stuff);
   }
 
   //myvar used down here
}

What I would like is to be able to dynamically declare myvar outside
of the loop, since I really have no idea how big it will be if a
user has a nonstandard data file. It would be bad form to just crash. It
seems that OpenMP would have some sort of easy mechanism for
doing this. At least that's what I'm hoping.

jimdempseyatthecove · ‎08-22-2007

xray,

This may be an unintended consequence of your trying to write a simplified example.

Your sample code declares myvar as a local array of the full extent of work storage [5000] (I know you want to change this to fast dynamic allocation). However, notice that the parallel loop declares myvar as private, i.e. each thread's stack gets a local copy of myvar to the full extent of [5000], whereas if you had n threads you would require only 5000/n+1 elements of scratch space (as i is striped per thread), or less depending on the chunk size argument to schedule. So that is one problem.

I may be wrong, but it seems like you want to have a scratch working array inside the parallel loop that is allocated to at least the size of the current working set (reallocated as needed).

If you intend to have each thread work on a different section of myvar then myvar should not be private to each thread. But each thread must be careful not to stomp on sections of myvar that it ought not to modify.
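As a minimal sketch of the shared case, assuming each loop iteration writes only its own element, the array can stay shared and no two threads touch the same location. The function name shared_roots and the sqrt workload are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: every iteration i writes result[i] and nothing else,
// so the shared array needs no privatization and no locking.
std::vector<double> shared_roots(int datacount) {
    assert(datacount >= 0);
    std::vector<double> result(datacount);   // one array, shared by all threads
    #pragma omp parallel for
    for (int i = 0; i < datacount; ++i) {
        result[i] = std::sqrt(static_cast<double>(i));
    }
    return result;
}
```

This only works because the writes are disjoint. A scratch array that every iteration overwrites from index 0 does not fit this pattern and has to be per-thread.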

If you want the myvar private then consider something like the following

#define MAXcores 64 // or 32 or ??

// somewhere outside the processing function
double* myvarPointerTable[MAXcores];
int myvarSize[MAXcores];
int i;
for(i = 0; i < MAXcores; ++i)
{
 myvarPointerTable[i] = NULL;
 myvarSize[i] = 0;
}
...
// inside the processing function
// omp_get_num_threads() returns 1 outside a parallel region, so ask for the maximum
int numThreads = omp_get_max_threads();
int chunkSize = (datacount+numThreads-1) / numThreads;
#pragma omp parallel for
for(int i_chunk = 0; i_chunk < datacount; i_chunk+=chunkSize)
{
 // make sure this thread's scratch array exists and is large enough
 int thread_num = omp_get_thread_num();
 ASSERT(thread_num < MAXcores);
 if(myvarSize[thread_num] < nl) // or is this chunkSize?
 {
  delete [] myvarPointerTable[thread_num]; // no-op when the pointer is NULL
  myvarPointerTable[thread_num] = new double[nl]; // or chunkSize;
  myvarSize[thread_num] = nl; // or chunkSize;
 }
 double* myvar = myvarPointerTable[thread_num];
 int i_end = (i_chunk+chunkSize < datacount) ? i_chunk+chunkSize : datacount;
 for(int i = i_chunk; i < i_end; ++i)
 {
  //Do some stuff
  for(int j = 0; j < nl; j++)
  {
   myvar[j] = sqrt(stuff);
  }

  //myvar used down here
  ...
 } // end of loop
}

Jim Dempsey

xraygenfit · ‎08-22-2007

Thanks, I was tired when I typed that. myvar[i] should have been myvar[j]. I'll try your approach. Thanks for the help.

Alexey-Kukanov · ‎08-23-2007

Might be I do not completely understand your need, but to me it seems the solution is as simple as:

#pragma omp parallel
{
   double* myvar = new double[nl];

#pragma omp for
   for(int i = 0; i < datacount; i++)
   {
      //Do some stuff
      for(int j = 0; j < nl; j++)
      {
         myvar[j] = sqrt(stuff);
      }
      //myvar used down here
   }
}

So at the very beginning of the parallel region, _each_ thread allocates a temporary array of a required size; then the parallel loop starts where each thread uses its copy of array. Might be you will need to explicitly specify myvar as private, but otherwise I think this should work. Though I need to say I do not have significant experience with OpenMP and may be unaware of some peculiarities.
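A self-contained version of this pattern might look like the sketch below. It is an illustration under assumptions: it swaps the raw new for a std::vector so that each thread's array is released automatically when the thread leaves the parallel region, and the function name crunch and the sqrt workload are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical workload: for each data item, fill a scratch array of nl
// values and add them to a running total.
double crunch(int datacount, std::size_t nl) {
    assert(nl > 0);
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        // Declared inside the region, so every thread has its own array.
        // It is allocated once per thread, not once per iteration.
        std::vector<double> myvar(nl);

        #pragma omp for
        for (int i = 0; i < datacount; ++i) {
            for (std::size_t j = 0; j < nl; ++j) {
                myvar[j] = std::sqrt(static_cast<double>(i) + static_cast<double>(j));
            }
            for (std::size_t j = 0; j < nl; ++j) {
                total += myvar[j];
            }
        }
    }   // each thread's vector is destroyed here
    return total;
}
```

Variables declared inside the parallel region are private to each thread by default, so no private clause is needed for myvar here.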

jimdempseyatthecove · ‎08-23-2007

Don't forget the delete at the end of the parallel section.

However....

I do not believe that your simplified example is what you want.

new and delete are expensive operations. And if this parallel section is entered/exited many times then you should expect memory fragmentation. i.e. you have enough virtual memory, but not in one piece, so eventually an allocation fails.

IMHO a better approach (performance wise, and less fragmentation wise) is for each thread to have a private myvar array that is persistent across entry and exit of the parallel region. Then perform the allocation only if the array size is insufficient (or the array is not yet allocated).

For an n-threaded system this would mean you would have n myvar arrays that would eventually grow to the largest size experienced during run time. To avoid memory fragmentation you might want to determine in advance what the worst case (largest requirement) is and preallocate the scratch arrays.

In the event that you have many such scratch arrays but only one or a few are needed concurrently by the same thread, you might want to create a pool allocation routine where each thread maintains a pool of buffers. If the pool allocation/free is simplified then there would be no (or few) requirements to call new/delete, which have critical sections.

Here is a skeleton of what you might find interesting:

struct poolBuffer
{
 union
 {
  char* cP;
  int* iP;
  float* fP;
  double* dP;
 };
 union
 {
  int numberOfBytes;
  void* padd1;
 };
 union
 {
  bool isAvailable;
  void* padd2;
 };
 poolBuffer() {memset(this, 0, sizeof(*this)); isAvailable=TRUE;}
 ~poolBuffer() {ASSERT(isAvailable); delete [] cP;}
};

struct myPools_struct
{
 poolBuffer* Pools;
 int numberOfPools;
 myPools_struct() {
  Pools=NULL; numberOfPools=0;}
 ~myPools_struct() {
  delete [] Pools;}
 void init(int n) {
  ASSERT(!Pools);
  Pools = new poolBuffer[n];
  ASSERT(Pools);
  numberOfPools = n; }
 // C++ cannot overload on return type alone, so the typed front end is a template
 char* allocateBytes(int n);
 template<class T> T* allocate(int n);
 void deallocate(char* cP);
 void deallocate(int* iP);
 void deallocate(float* fP);
 void deallocate(double* dP);
};

...
char* myPools_struct::allocateBytes(int n)
{
 int bestFit = -1;
 int firstAvailable = -1;
 int i;
 for(i=0;i< numberOfPools; ++i)
 {
  if(Pools[i].isAvailable)
  {
   if(firstAvailable < 0) firstAvailable = i;
   if(!Pools[i].cP) break;
   if(Pools[i].numberOfBytes>=n)
   {
    if(bestFit < 0 || Pools[i].numberOfBytes < Pools[bestFit].numberOfBytes) bestFit = i;
   }
  }
 }
 if(bestFit >= 0)
 {
  Pools[bestFit].isAvailable = FALSE; // mark the buffer as in use
  return Pools[bestFit].cP;
 }
 ASSERT(firstAvailable>=0); // too few pools
 delete [] Pools[firstAvailable].cP; // no-op when the pointer is NULL
 Pools[firstAvailable].cP = new char[n];
 ASSERT(Pools[firstAvailable].cP);
 Pools[firstAvailable].numberOfBytes = n;
 Pools[firstAvailable].isAvailable = FALSE; // mark the buffer as in use
 return Pools[firstAvailable].cP;
}
...
template<class T>
T* myPools_struct::allocate(int n)
{
 char* cP = allocateBytes(n * sizeof(T));
 return (T*)cP;
}
__declspec( thread ) myPools_struct* myPools = NULL;

// at start where you initialize the app
#pragma omp parallel
{
 ASSERT(!myPools);
 myPools = new myPools_struct;
 ASSERT(myPools);
#define numberOfPoolsPerThread 10
 myPools->init(numberOfPoolsPerThread); // each is empty
}

// somewhere in your app
#pragma omp parallel
{
 double* myvar = myPools->allocate<double>(nl);
#pragma omp for
 for(int i = 0; i < datacount; i++)
 {
  //Do some stuff
  for(int j = 0; j < nl; j++)
  {
   myvar[j] = sqrt(stuff);
  }
  //myvar used down here
 }
 myPools->deallocate(myvar);
}

You can fill out the particulars. You may want to analyse memory usage to see if you should return the first available buffer, last available buffer, smallest available buffer, least recently used available buffer, most recently used available buffer.

And don't forget to return myPools at end of program.

Jim Dempsey

xraygenfit · ‎08-27-2007

Hi guys,

I wanted to thank both of you for your answers. Both were correct, and a combination of both of your techniques worked best. As pointers can be kept private, here's what I did:

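MyClass::Init(...)
{

// one slice of nl doubles per thread; this assumes no more threads than
// processors (omp_get_max_threads() is the safer bound)
array = new double[omp_get_num_procs()*nl];

}

MyClass::NumberCrunch(....)
{

#pragma omp parallel
{
int threadnum = omp_get_thread_num();
double* localarray = array+threadnum*nl;

#pragma omp for
....
}
}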
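A runnable sketch of the same idea, one allocation sliced by thread number, could look like the following. It is not the poster's actual class: SlabCruncher, run and the integer-valued workload are hypothetical, and it assumes the slab is sized with omp_get_max_threads() and the region is capped to that thread count so a slice index can never run past the allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds when OpenMP is disabled.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical class: owns one slab holding nl doubles for every thread.
class SlabCruncher {
public:
    explicit SlabCruncher(std::size_t nl)
        : nl_(nl),
          threads_(omp_get_max_threads()),
          slab_(static_cast<std::size_t>(threads_) * nl) {
        assert(nl > 0);
    }

    // Hypothetical workload: fill the thread's slice for each item and sum it.
    double run(int datacount) {
        const int nthreads = threads_;
        const std::size_t nl = nl_;
        double* const base = slab_.data();
        double total = 0.0;
        #pragma omp parallel num_threads(nthreads) reduction(+:total)
        {
            const int tid = omp_get_thread_num();
            assert(tid < nthreads);                       // slice must exist
            double* local = base + static_cast<std::size_t>(tid) * nl;

            #pragma omp for
            for (int i = 0; i < datacount; ++i) {
                for (std::size_t j = 0; j < nl; ++j) {
                    local[j] = static_cast<double>(i) + static_cast<double>(j);
                }
                for (std::size_t j = 0; j < nl; ++j) {
                    total += local[j];
                }
            }
        }
        return total;
    }

private:
    std::size_t nl_;
    int threads_;
    std::vector<double> slab_;   // allocated once, reused by every call to run
};
```

The slab is allocated once in the constructor, so repeated calls to run perform no allocation. Neighbouring slices can share a cache line at their boundary, which may cost some performance when nl is small.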

jimdempseyatthecove · ‎08-27-2007

xray,

The class as constructed will work, but at the expense of new/delete as you enter and leave the scope of the function.

Although this conserves memory, it has two potential pitfalls: 1) if you enter and exit scope a large number of times you may fracture the memory heap, and then at some point an allocation (new) will fail; 2) excessive computation overhead.

Your current method may be satisfactory, assuming you can live with a future memory allocation problem. If you run this code yourself, then any consequences suffered can be weighed against the little bit of extra programming effort now.

On the other hand, if you are planning on shipping your code out to customers, then a little extra effort to avoid a potential problem might be worthwhile.

Jim Dempsey