The weird thing is that when I use more than one OpenMP thread, the large chunks of memory are not freed cleanly (not returned to the system). With only one thread, freeing works fine.
Another problem is that the Windows version of the compiler advertises the /Qopt-malloc-options option, but I have tried it in vain: I wonder whether it is a fake option that can only be used on Linux or macOS.
The major source of these problems is programmer error.
Check not only your deallocations but your allocations. A particularly nasty one is allocating into a pointer from multiple threads when the storage location of the pointer is a shared variable (misplaced {}'s).
Try performing your allocations within a struct/class object whose destructor releases the allocated memory (like the string class). And be sure to place the object in the correct scope.
Jim Dempsey
What do you mean by allocating by multiple threads? I am quite sure that I do not use any dynamic allocation in an OpenMP region; in fact there are very few OpenMP regions in my program. Should I call omp_set_num_threads(1) every time I allocate/free memory?
Another clue: I do not have this malloc/free problem on the Linux platform; both single-threaded and multi-threaded runs work correctly. Moreover, I know exactly which memory chunks are not returned to the system, because chunks that big are used very rarely in the program.
When you want each thread to have its own array:
double* array = 0; // *** bad, pointer in wrong scope
// (ok to do this only when shared(array) is intended on the pragma)
#pragma omp parallel
{
array = new double[count]; // *** bad, all threads share the same pointer
// *** 2nd and later threads overwrite the pointer
...
delete [] array; // *** bad, 2nd and later threads return the same memory
}
------------------------------------
#pragma omp parallel
{
double* array = 0;
array = new double[count]; // *** good when you want each thread to have a separate copy
...
delete [] array; // *** good, each thread returns its separate copy
}
--------------------
double* array = 0; // OK because of private(array) on the pragma
#pragma omp parallel private(array)
{
array = new double[count]; // *** good when you want each thread to have a separate copy
...
delete [] array; // *** good, each thread returns its separate copy
}
--------------------
double* array = 0;
#pragma omp parallel private(array)
{
array = new double[count]; // *** good when you want each thread to have a separate copy
...
}
delete [] array; // *** bad, the main thread deletes only the outer (null) pointer; the threads' private copies leak
There is nothing wrong with new/delete inside parallel regions; in fact it may be required when you want each thread to have separate data (e.g., for temporary arrays).
Jim Dempsey
The problem I observed comes from two big arrays allocated in a non-parallel zone (they have never been used in any parallel region). free() does not return these memory chunks to the system immediately (or the heap becomes very fragmented), and the program later stops for lack of memory when I want to allocate another big array.
I experienced a very similar problem before on the AIX platform and was advised to use the disclaim() function to "declare" and "return" these memory chunks to the system. On Linux, one can use the mallopt() function to reset some memory-allocation parameters and reproduce the same symptoms.
When you have a single-threaded application that is tight on memory (as you suggest your application is),
then when you use OpenMP or any threading tool, each thread is instantiated with its own stack. The low end of these stack sizes can be a few MB, but you can set the stack size small or large; the default is the stack size of the main thread. You might want to experiment with adjusting the stack size for the non-main threads.
Jim Dempsey
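Adjusting the worker-thread stack size is usually done through environment variables before launching the program (OMP_STACKSIZE is the standard OpenMP variable; KMP_STACKSIZE is the Intel-runtime equivalent; the program name below is a placeholder):

```shell
export OMP_STACKSIZE=2M    # standard OpenMP variable: per-worker-thread stack size
export KMP_STACKSIZE=2M    # Intel runtime's equivalent setting
# ./your_program           # placeholder: launch with the new stack size in effect
echo "$OMP_STACKSIZE"
```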
Thanks a lot!
I'm gonna try it...
I've just tried it (with 2 MB and 2 OpenMP threads); the problem persists.
Another potential problem is at what point in your application the OpenMP thread pool is established.
If the thread pool is established on the first entry into the first parallel region, and if that region is deep in your code, the new stack spaces might be allocated at some midpoint in your allocations, potentially causing undesired fragmentation of your heap. An easy way to fix this is to insert, just after entry to main, a parallel region that does something that does not get optimized out:
#pragma omp parallel
{
if(omp_get_thread_num() < 0) exit(0); // never true; forces early thread-pool creation
}
You may need to trace your allocations/deallocations to find the problem, and/or insert some well-crafted _ASSERTs:
YourAllocatorAssumesSerial(...)
{
_ASSERT(omp_in_parallel() == 0);
// now allocate...
...
}
And you may need code to check for leaks and/or allocations when not required:
// static pointer
double* array = NULL;
...
YourAllocationRoutine()
{
_ASSERT(array==NULL);
array = new double[yourSize];
...
}
If each thread mistakenly called the allocation routine, you would have a leak.
See what you can do to reduce your footprint (maybe optimize for reduced size).
Jim Dempsey
Thanks.
The 3rd test can be interesting; I can always preserve a few static pointers for the big arrays, and that may change something.
The first test is just to confirm the abnormal behavior, isn't it?
Thanks again for the precious advice.
Om
Lanzors,
_ASSERT(expression) expands to code only in Debug builds, so your Release build has no overhead.
However, as much as you try to keep your allocations under control, when you hand this code off to someone else to support, they may not be as careful as you are. The _ASSERT is there to catch these types of potential errors (now or in the future). You should get in the habit of using _ASSERT throughout your code to test for all kinds of errors, principally argument checking, but in some places results checking or convergence testing.
Also, you might try setting the "Low Fragmentation Heap".
See the MS C++ help on
heap functions | HeapSetInformation
Once you have read that, follow the link to "Low Fragmentation Heap".
From the MS C++ Help:
[cpp]// The following example shows how to enable the low-fragmentation heap.
#include <windows.h>
#include <stdio.h>

void main()
{
    ULONG HeapFragValue = 2;

    if (HeapSetInformation(GetProcessHeap(),
                           HeapCompatibilityInformation,
                           &HeapFragValue,
                           sizeof(HeapFragValue)))
    {
        printf("Success!\n");
    }
    else
    {
        printf("Failure (%d)\n", GetLastError());
    }
}[/cpp]
Jim Dempsey
Jim, I totally agree with you about the purpose of assertions, and thanks again for your new advice about the MS heap functions: I will look into them for a solution.
Om, I have already made a very small test, but it cannot reproduce the problem; I suspect it depends on the complexity of the allocation scheme in the program. Anyway, I shall try it again if I cannot find the solution.
Because I have other jobs to do for now, I have just fixed this problem with a workaround: before allocating a big array, I do a simple estimation with a loop of malloc/free... But I will surely come back to this problem and let everybody know.
Sorry for the disturbance.
I need to ask a question related to OpenMP.
I am running a program that takes a layout as input, does some computation, then writes the output to a text file.
If the layout is small, OpenMP works well and produces the same output as running the code serially, but if the layout is large it does not produce the same output as the serial run, and if I run it more than once it produces different output each time. The problem is not a race, because, as I said, on a small layout it works perfectly.
Moreover, if I change the code for the large layout to make it do fewer computations, the output produced is the same as the serial run.
I really want to know what the problem is. Is it that OpenMP does not support large data sizes?
amr,
Please repost this as a new topic. Tacking it onto a thread that is 8 years old may get little attention.
There are generally two circumstances to be aware of when coding parallel:
1) Accumulation of floating-point round-off errors may differ when the accumulation is performed in a different order, or on sub-sets that are subsequently reduced to a single result.
2) When the multiple threads perform output, the output will not necessarily be in the same order as when performed serially.
Your small layout may work by chance.
In a parallel program, the multiple threads do not start your compute section instantaneously nor operate in lock-step. If the code section is small and has relatively few places of (potential) race conditions then the probability of actual race condition is low. As the code in the parallel region gets larger, and the occurrence of (potential) race conditions increase, then the probability of race condition increases.
Jim Dempsey