Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Dynamically allocated private variables - possible?

xraygenfit
Beginner
Hi,

I have a large calculation during which I have to store several arrays. I have successfully parallelized it either by dynamically allocating the arrays within the loop (with very poor performance), or by declaring them statically outside of the loop (e.g. double ak[5000]) and making them private. The latter method works very well, except with a large dataset: if I use double ak[10000] (along with other arrays) in a parallelized region, I get a stack overflow crash. If I dynamically allocate my array outside of the loop and make it private, I also get a crash. It seems as if I can't allocate memory dynamically and have it be private. Right now it looks like, if I want to dynamically allocate my arrays, I must do it within the parallel for construct. Is there a workaround for this? I've scoured the internet and can't seem to find an answer. Thanks for any help.


TimP
Honored Contributor III
Certain compilers have options to control thread stack size limits, and to control the maximum size of arrays that go on the stack rather than the heap.
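For example (a sketch of common knobs, not necessarily the ones TimP has in mind; exact names depend on your compiler and OpenMP runtime): the OMP_STACKSIZE environment variable (OpenMP 3.0, or KMP_STACKSIZE with the Intel runtime) enlarges each OpenMP worker thread's stack, and Intel Fortran's -heap-arrays option moves automatic and temporary arrays to the heap instead of the stack.

export OMP_STACKSIZE=64M        # enlarge each OpenMP worker thread's stack (use "set" on Windows)
ifort -heap-arrays mycode.f90   # Intel Fortran: place automatic/temporary arrays on the heap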
jimdempseyatthecove
Honored Contributor III

Both C++ and Fortran support a thread-private data area. The thread-private area isn't large, but it is more than adequate to hold a pointer to your large array, or a pointer to a structure that contains large arrays or pointers to large arrays.

An alternate method is to pass a pointer to a thread-private context area along with your function/subroutine calls.

The following is what I use in Fortran

type TypeThreadContext
  SEQUENCE
  type(TypeObject), pointer :: pObject
  type(TypeTether), pointer :: pTether
  type(TypeFSInput), pointer :: pFSInput
  integer :: LastObjectLoaded
end type TypeThreadContext

type(TypeThreadContext) :: ThreadContext
COMMON /CONTEXT/ ThreadContext
!$OMP THREADPRIVATE(/CONTEXT/)

In this implementation I hold pointers to an element within an array. The thread context could also contain a pointer to a derived type that contains an array or a pointer to an array (both allocatable).
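A minimal C++ sketch of the same idea (the names scratch, nl and datacount are illustrative, not taken from the code above): a file-scope pointer is declared threadprivate, and each thread allocates its own large work array through it once.

#include <omp.h>

static double* scratch = NULL;          // small per-thread slot holding a pointer to a large array
#pragma omp threadprivate(scratch)

void compute(int nl, int datacount)
{
    #pragma omp parallel
    {
        if (scratch == NULL)
            scratch = new double[nl];   // each thread allocates its own work array once
        #pragma omp for
        for (int i = 0; i < datacount; i++)
        {
            // ... use scratch[0..nl-1] as this thread's private work array ...
        }
    }
    // scratch persists across parallel regions (given a stable thread count);
    // delete [] each thread's copy when the program shuts down
}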

A different technique: assume you have t threads and n "things" to process, where each "thing" has a varying number of entities requiring temporary array space. Due to memory constraints you do not wish to allocate the worst-case amount to all threads.

For this in Fortran you create a derived type (class if C++, struct if C or C++) containing thread context information including pointers to arrays:

type TypeCURCAL
  integer :: FirstInteger
  integer :: MAXSEG
  real, pointer :: CTBDUM(:)
  real, pointer :: DUMI(:)
  integer :: LastInteger
end type TypeCURCAL

You will typically have one of these types per major subroutine/function requiring scratch space. This one happens to declare the data for my subroutine CURCAL.

Then, in a common module, have an allocatable array of this data type (one array per such subroutine):

! CMODE4.F90
type(TypeCMODE4), allocatable :: cmode4(:)

! CURCAL.F90
type(TypeCURCAL), allocatable :: curcal(:)

On entry to the subroutine CURCAL, a function call is made to obtain the pointer to the entry Module.curcal(threadnumber). Before returning the pointer to curcal(threadnumber), a test is made to see whether curcal has been allocated. If not, a critical section is entered and the allocation test on the array curcal is repeated; if it is still not allocated, it is allocated to the number of threads. Next, the pointer to curcal(threadnumber) is obtained, and the arrays within the TypeCURCAL structure are tested to see whether they are allocated. If not, they are allocated; if they are, a test is made to see whether the size of the allocation meets the size requirement of the particular thing being processed by the thread. If so, exit with the pointer; if not large enough, deallocate and reallocate the scratch space. Once initialized, and large enough, the function is a quick in and out.
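For C++ readers, here is a rough sketch of that test / critical-section / re-test pattern (the names CurcalScratch, GetCurcalScratch and need are illustrative assumptions, not Jim's actual code; it is meant to be called from inside a parallel region, and strictly speaking the outer check should be an atomic read or be preceded by a flush).

#include <omp.h>

struct CurcalScratch { double* ctbdum; int ctbdumSize; };
static CurcalScratch* curcalTable = NULL;          // one entry per thread, allocated lazily

CurcalScratch* GetCurcalScratch(int need)
{
    if (curcalTable == NULL)                       // first test, outside the critical section
    {
        #pragma omp critical(curcal_init)
        {
            if (curcalTable == NULL)               // re-test inside the critical section
                curcalTable = new CurcalScratch[omp_get_max_threads()](); // value-initialized to NULL/0
        }
    }
    CurcalScratch* ctx = &curcalTable[omp_get_thread_num()];
    if (ctx->ctbdumSize < need)                    // grow this thread's scratch only when too small
    {
        delete [] ctx->ctbdum;                     // deleting NULL is harmless
        ctx->ctbdum = new double[need];
        ctx->ctbdumSize = need;
    }
    return ctx;                                    // once initialized and large enough: quick in and out
}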




Jim Dempsey

xraygenfit
Beginner
Thanks for the replies, guys. That last one sounds fairly complicated. I would have thought this would be one of the things OpenMP is designed to do.

What I currently have is:

//I have
double myvar[5000];
//I want
//double* myvar = new double[nl];

#pragma omp parallel for private(myvar)
for(int i = 0; i < datacount; i++)
{
    //Do some stuff
    for(int j = 0; j < nl; j++)   // loop bound was lost in the original post; nl assumed
    {
        myvar[j] = sqrt(stuff);
    }

    //myvar used down here
}

What I would like is to be able to dynamically allocate myvar outside of the loop, since I really have no idea how big it will be if a user has a nonstandard data file. It would be bad form to just crash. It seems like OpenMP should have some sort of easy mechanism for doing this. At least that's what I'm hoping.
jimdempseyatthecove
Honored Contributor III

xray,

This may be an unintended consequence of your trying to write a simplified example.

Your sample code declares myvar as a local array with the full extent of the work storage [5000] (I know you want to change this to fast dynamic allocation). However, notice that the parallel loop declares myvar as private, i.e. each thread's stack gets a local copy of myvar to the full extent of [5000], whereas with n threads you would require only about 5000/n + 1 elements of scratch space per thread (since i is striped across threads), or less depending on the chunk size argument to schedule. So that is one problem.

I may be wrong, but it seems like you want a scratch working array inside the parallel loop that is allocated to at least the size of the current working set (and reallocated as needed).

If you intend to have each thread work on a different section of myvar then myvar should not be private to each thread. But each thread must be careful not to stomp on sections of myvar that it ought not to modify.

If you want myvar to be private, then consider something like the following:

#define MAXcores 64 // or 32 or ??

// somewhere outside the processing function
double* myvarPointerTable[MAXcores];
int myvarSize[MAXcores];
int i;
for(i = 0; i < MAXcores; i++)
{
    myvarPointerTable[i] = NULL;
    myvarSize[i] = 0;
}
...
// inside the processing function
int numThreads = omp_get_max_threads(); // omp_get_num_threads() returns 1 outside a parallel region
int chunkSize = (datacount + numThreads - 1) / numThreads;
bool doOnce = true;
#pragma omp parallel for firstprivate(doOnce) // firstprivate gives each thread an initialized private copy
for(int i_chunk = 0; i_chunk < datacount; i_chunk += chunkSize)
{
    double* myvar;
    int i;
    if(doOnce)
    {
        doOnce = false;
        int thread_num = omp_get_thread_num();
        ASSERT(thread_num < MAXcores);
        if(myvarSize[thread_num] < nl) // or is this chunkSize?
        {
            if(myvarPointerTable[thread_num]) delete [] myvarPointerTable[thread_num];
            myvarPointerTable[thread_num] = new double[nl]; // or chunkSize;
            myvarSize[thread_num] = nl; // or chunkSize;
        }
        myvar = myvarPointerTable[thread_num];
        i = 0;
    }
    //Do some stuff
    for(int j = 0; j < nl; j++) // loop bound lost in the original post; nl assumed
    {
        myvar[j] = sqrt(stuff);
    }
    //myvar used down here
    ...
    // end of loop
    ++i;
}
Jim Dempsey
xraygenfit
Beginner
Thanks, I was tired when I typed that. myvar should have been myvar[j]. I'll try your approach. Thanks for the help.
Alexey-Kukanov
Employee

I might not completely understand your need, but to me it seems the solution is as simple as:

#pragma omp parallel
{
    double* myvar = new double[nl];

    #pragma omp for
    for(int i = 0; i < datacount; i++)
    {
        //Do some stuff
        for(int j = 0; j < nl; j++) // loop bound lost in the original post; nl assumed
        {
            myvar[j] = sqrt(stuff);
        }
        //myvar used down here
    }
}

So at the very beginning of the parallel region, _each_ thread allocates a temporary array of the required size; then the parallel loop starts, where each thread uses its own copy of the array. You might need to explicitly specify myvar as private, but otherwise I think this should work. Though I should say I do not have significant experience with OpenMP and may be unaware of some peculiarities.

jimdempseyatthecove
Honored Contributor III

Don't forget the delete at the end of the parallel section.

However....

I do not believe that your simplified example is what you want.

new and delete are expensive operations, and if this parallel section is entered and exited many times you should expect memory fragmentation, i.e. you have enough virtual memory, but not in one contiguous piece, so eventually an allocation fails.

IMHO a better approach (performance-wise, and less fragmentation-wise) is for each thread to have a private myvar array that is persistent across entry and exit of the parallel region. Then perform the allocation only if the array is not allocated or its size is insufficient.

For an n-threaded system this means you would have n myvar arrays that eventually grow to the largest size experienced during run time. To avoid memory fragmentation you might want to determine in advance what the worst case (largest requirement) is and preallocate the scratch arrays.

In the event that you have many such scratch arrays but only one or a few require concurrency by the same thread, then you might want to create a pool allocation routine where each thread maintains a pool of buffers. If the pool allocation/free is kept simple, then there would be no (or few) requirements to call new/delete, which involve critical sections.

Here is a skeleton of what you might find interesting:

struct poolBuffer
{
    union
    {
        char* cP;
        int* iP;
        float* fP;
        double* dP;
    };
    union
    {
        int numberOfBytes;
        void* padd1;
    };
    union
    {
        bool isAvailable;
        void* padd2;
    };
    poolBuffer() { memset(this, 0, sizeof(*this)); isAvailable = true; }
    ~poolBuffer() { ASSERT(isAvailable); if(cP) delete [] cP; }
};

struct myPools_struct
{
    poolBuffer* Pools;
    int numberOfPools;
    myPools_struct() {
        Pools = NULL; numberOfPools = 0; }
    ~myPools_struct() {
        if(Pools) delete [] Pools; }
    void init(int n) {
        ASSERT(!Pools);
        Pools = new poolBuffer[n];
        ASSERT(Pools);
        numberOfPools = n; }
    char* allocateBytes(int n);    // base allocator (C++ cannot overload on return type alone)
    double* allocateDouble(int n); // similar typed wrappers can be added for int*, float*, ...
    void deallocate(char* cP);
    void deallocate(int* iP);
    void deallocate(float* fP);
    void deallocate(double* dP);
};
...
char* myPools_struct::allocateBytes(int n)
{
    int bestFit = -1;
    int firstAvailable = -1;
    int i;
    for(i = 0; i < numberOfPools; ++i)
    {
        if(Pools[i].isAvailable)
        {
            if(firstAvailable < 0) firstAvailable = i;
            if(!Pools[i].cP) break;  // empty slot: stop searching, allocate fresh below
            if(Pools[i].numberOfBytes >= n)
            {
                if(bestFit < 0 || Pools[i].numberOfBytes < Pools[bestFit].numberOfBytes)
                    bestFit = i;
            }
        }
    }
    if(bestFit >= 0)
    {
        Pools[bestFit].isAvailable = false;  // mark in use; deallocate() makes it available again
        return Pools[bestFit].cP;
    }
    ASSERT(firstAvailable >= 0); // too few pools
    if(Pools[firstAvailable].cP) delete [] Pools[firstAvailable].cP;
    Pools[firstAvailable].cP = new char[n];
    ASSERT(Pools[firstAvailable].cP);
    Pools[firstAvailable].numberOfBytes = n;
    Pools[firstAvailable].isAvailable = false;
    return Pools[firstAvailable].cP;
}
...
double* myPools_struct::allocateDouble(int n)
{
    char* cP = allocateBytes(n * sizeof(double));
    return (double*)cP;
}
__declspec( thread ) myPools_struct* myPools = NULL;

// at start where you initialize the app
#pragma omp parallel
{
    ASSERT(!myPools);
    myPools = new myPools_struct;
    ASSERT(myPools);
    #define numberOfPoolsPerThread 10
    myPools->init(numberOfPoolsPerThread); // each pool starts empty
}

// somewhere in your app
#pragma omp parallel
{
    double* myvar = myPools->allocateDouble(nl);
    #pragma omp for
    for(int i = 0; i < datacount; i++)
    {
        //Do some stuff
        for(int j = 0; j < nl; j++) // loop bound lost in the original post; nl assumed
        {
            myvar[j] = sqrt(stuff);
        }
        //myvar used down here
    }
    myPools->deallocate(myvar);
}

You can fill out the particulars. You may want to analyse memory usage to decide whether you should return the first available buffer, the last available buffer, the smallest available buffer, the least recently used available buffer, or the most recently used available buffer.

And don't forget to delete (free) each thread's myPools at the end of the program.

Jim Dempsey

xraygenfit
Beginner
Hi guys,

I wanted to thank both of you for your answers. Both were correct, and a combination of both of your techniques worked best. As pointers can be kept private, here's what I did:

MyClass::Init(...)
{
    array = new double[omp_get_num_procs() * nl];  // the original post had omp_get_procs_num();
                                                   // the real call is omp_get_num_procs() (or omp_get_max_threads())
}


MyClass::NumberCrunch(....)
{
    #pragma omp parallel
    {
        int threadnum = omp_get_thread_num();
        double* localarray = array + threadnum * nl;

        #pragma omp for
        ....
    }
}
jimdempseyatthecove
Honored Contributor III

xray,

The class as constructed will work, but at the expense of new/delete each time you enter/leave the scope of the function.

Although this conserves memory, it has two potential pitfalls: 1) if you enter and exit scope a large number of times you may fragment the memory heap, and at some point an allocation (new) will fail; 2) excessive computational overhead.

Your current method may be satisfactory, assuming you can live with a future memory allocation problem. If you run this code yourself, then any consequences suffered can be weighed against the little bit of extra programming effort now.

On the other hand, if you are planning to ship your code to customers, then a little extra effort now to avoid a potential problem might be worth it.

Jim Dempsey
