Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Memory issues in concurrent_hash_map

xian_x_
Beginner
621 Views

Dear all,

I have noticed some memory issues similar to what was reported in 597165. In our case, however, the context is plugin code running inside host applications, and the leak manifests itself as follows: unless we kill the host application, a big chunk of memory (allocated and supposedly released by the plugin code) is never returned to the system.

Another problem is that the total memory cost is significantly larger than what the data itself requires. Please forgive my ignorance about hashing, but I would expect it to work more or less like std::map, only with faster concurrent access to elements.

Here's a simplified standalone program that shows the problem. Uncomment PARALLEL_INSERT; after the table is released/deleted, and before pressing 'q' to exit, we can see how much memory is still held by the program. Sequential insertion, on the other hand, does not show this problem. The theoretical memory usage for the data alone is printed out, but the real cost is at least 6-7 times more...

Any idea, please?

Thanks,

Xiao-Xian

#include <iostream>
#include <vector>

#include <tbb/tbb.h>
#include <tbb/concurrent_hash_map.h>

using namespace std;
using namespace tbb;

//#define PARALLEL_INSERT 1 

typedef tbb::concurrent_hash_map< int, int > HashTable;

int main()
{ 
    const unsigned long n = 200000000; // 2e+8 elements; the keys 0..n-1 still fit in an int

    cout<<" One map value takes "<<sizeof( HashTable::value_type )<<" bytes and total data shall take "
        <<double( n*sizeof( HashTable::value_type ) )/double( 1024*1024*1024 )<<" GB "<<endl;
    
    cout<<" About to create hash map of "<<n<<" elements ...";

    HashTable *my_table = new HashTable();

#ifdef PARALLEL_INSERT

    // Concurrent insertion: each worker thread inserts the keys of its own sub-range.
    parallel_for( blocked_range< unsigned long >(0, n), [&]( blocked_range<unsigned long> &r)
    {
        for( unsigned long i = r.begin(); i != r.end(); i++ )
        {
            HashTable::accessor a;
            my_table->insert( a, i );
        }
    } );
#else

    // Sequential insertion of the same keys on the main thread.
    for( unsigned long i = 0; i < n; i++ )
    {
        HashTable::accessor a;
        my_table->insert( a, i );
    }

#endif

    cout<<" Done! "<<endl;
    cout<<" About to destroy the table ...";

    my_table->clear();
    delete my_table;

    cout<<" Done! "<<endl;

    // From here check system memory usage before pressing 'q'...
    char x;
    while( 1 )
    {
        cout<<" press q to exit..."<<endl;
        cin >> x;

        if( x == 'q')
        {
            break;
        }
    }

    return 0;
}
7 Replies
RafSchietekat
Valued Contributor III

Did you read somewhere that the scalable allocator now returns memory to the system after use?

Unless I'm mistaken, you should expect about 16+40=56 bytes per element on a 64-bit system. I have not tried to run the code, but, from your description, you're seeing about 96-104 bytes per element and 18-19 GB total, is that right?

When there are multiple threads, each of them has its own memory pool, which causes some additional overhead, but...

(2015-11-30 Added) This is wrong, see corrections below.

xian_x_
Beginner

Thanks for replying. Would you elaborate a bit more on how you work out the 16+40 bytes, please? That would sum up to 10.4 GB with 200 million elements, which actually matches pretty well with what I've seen. HashTable::value_type, however, is only 8 bytes...

Regarding the memory allocator, I was expecting the memory to be released after "my_table->clear()" was called, which is what actually happens when inserting elements in a single thread (with #define PARALLEL_INSERT 1 commented out). So the question could be rephrased as: "is there any difference between allocating memory in a single thread and in multiple threads?"

Cheers,

Xiao-Xian

RafSchietekat
Valued Contributor III

Sorry, I have been rather careless and have made several mistakes as a result.

Somehow I was thinking of an int as also being 64 bits, so my interpretation of the observed memory use was exaggerated by a factor of 2.

Now my own estimate... Each element is stored in a node, together with a list pointer and a mutex, so that would be 24 bytes. Each list is headed by a bucket object containing a list pointer and a mutex, and the concurrent_hash_map tries to maintain the load factor below 1, so that's another 16+ bytes. Total would be 40+ bytes, or 8+ GB.
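
To make that arithmetic concrete, here is a little back-of-the-envelope program; the structs are only mock-ups of the layout described above, not TBB's real internal types, and the sizes assume a typical 64-bit system:

#include <cstddef>
#include <iostream>
#include <utility>

// Mock-ups of the per-element storage described above; NOT TBB's actual internals.
struct MockNode
{
    MockNode*           next;   // list pointer, 8 bytes on a 64-bit system
    std::size_t         mutex;  // stand-in for the per-node mutex word, 8 bytes
    std::pair<int, int> value;  // the user's key/value pair, 8 bytes
};

struct MockBucket
{
    MockNode*   head;   // list pointer, 8 bytes
    std::size_t mutex;  // stand-in for the per-bucket mutex word, 8 bytes
};

int main()
{
    const unsigned long n = 200000000;  // 2e+8 elements, as in the test program

    // With the load factor kept around 1, count roughly one bucket per node.
    const std::size_t per_element = sizeof( MockNode ) + sizeof( MockBucket );

    std::cout << per_element << "+ bytes per element, about "
              << double( n ) * per_element / 1e9 << " GB in total (lower bound)"
              << std::endl;
    return 0;
}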

I don't know what would explain the difference you saw between single-threaded and multi-threaded execution.

(2015-11-28 Edited)

xian_x_
Beginner

Hey Raf, thank you very much for the information. That arrangement would explain the extra cost. 

Regarding the differences between single-threaded and multi-threaded execution, I suspect it might be memory fragmentation rather than a leak, as I've heard of similar issues when allocating memory concurrently with other threading tools. My workaround is simply to allocate a contiguous chunk for the table sequentially... in case you are interested:

// Fill a contiguous vector of key/value pairs; the keys can safely be written in parallel.
std::vector< std::pair< int, int > > keys( n );
parallel_for( blocked_range< int >(0, n), [&]( blocked_range<int> &r)
{
    for( int i = r.begin(); i != r.end(); i++ )
    {
        keys[ i ].first = i;   // only the keys matter at this point; values are filled in later
    }
} );

// Construct the table from that range, so the node allocations all happen on this one thread.
HashTable *my_table = new HashTable( keys.begin(), keys.end() );

Then I fill the table concurrently with the right element values, and the memory is returned!
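
Roughly like this, continuing from the snippet above (the value written below is only a placeholder for the real element values, and n fits in an int here):

// Fill in the values concurrently; the nodes already exist, so no allocation happens here.
parallel_for( blocked_range< int >(0, n), [&]( blocked_range<int> &r)
{
    for( int i = r.begin(); i != r.end(); i++ )
    {
        HashTable::accessor a;
        if( my_table->find( a, i ) )   // locate the pre-inserted entry for key i
        {
            a->second = i;             // placeholder for the real element value
        }
    }
} );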

Cheers,

Xiao-Xian

RafSchietekat
Valued Contributor III

What happens if, instead, you do scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, NULL) after having cleared the map?
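
I.e., something along these lines in your test program, assuming it is linked against tbbmalloc (the placement right after the deletion is just a suggestion):

#include <tbb/scalable_allocator.h>   // declares scalable_allocation_command and TBBMALLOC_CLEAN_ALL_BUFFERS

// ... after the table has been cleared and deleted:
my_table->clear();
delete my_table;

// Ask the TBB scalable allocator to return its cached buffers to the OS,
// then check the process memory usage again.
scalable_allocation_command( TBBMALLOC_CLEAN_ALL_BUFFERS, NULL );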

xian_x_
Beginner

Unfortunately the command didn't return the few GBs of memory to the system, but thanks for the information, as it led me to other discussions about this issue, e.g. this one: https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/498725. It seems tbb::memory_pool might be a good workaround...

RafSchietekat
Valued Contributor III

Maybe in a later release...

That workaround with memory_pool will only work for the element nodes, not for the segment arrays. Furthermore, you will definitely not be able to reuse the memory of the earlier, smaller segment arrays for later nodes, and it is less likely that these memory regions can be recycled together for the later, larger segment arrays, so you should expect your high-water mark to increase to some degree (unless there was never any reuse even for element nodes?).

Even if you go with the memory_pool, leave the command in, in case it starts to work as expected later on, or for any effect it already has now (perhaps there's even some extra benefit with the nodes out of the way).

(Added) Also try to reserve space in advance to avoid segment array reallocation.

(Added) And make sure the map is destroyed before the memory_pool, e.g., by having them together on the stack instead of dynamically allocating the map.
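
Putting those suggestions together, a rough and untested sketch; it assumes a TBB version that ships the preview tbb/memory_pool.h header (hence TBB_PREVIEW_MEMORY_POOL) and that the program links against tbbmalloc:

#define TBB_PREVIEW_MEMORY_POOL 1

#include <iostream>
#include <utility>

#include <tbb/tbb.h>
#include <tbb/memory_pool.h>
#include <tbb/scalable_allocator.h>

// A concurrent_hash_map whose nodes come from a memory pool instead of tbbmalloc directly.
typedef tbb::memory_pool< tbb::scalable_allocator<char> >        Pool;
typedef tbb::memory_pool_allocator< std::pair<const int, int> >  PoolAllocator;
typedef tbb::concurrent_hash_map< int, int,
                                  tbb::tbb_hash_compare<int>,
                                  PoolAllocator >                PooledTable;

int main()
{
    const unsigned long n = 200000000;  // 2e+8 elements, as before

    {
        Pool pool;  // declared before the map, so the map is destroyed first

        // Reserve buckets for n elements up front to avoid segment-array reallocation.
        PooledTable table( n, PoolAllocator( pool ) );

        tbb::parallel_for( tbb::blocked_range<unsigned long>(0, n),
                           [&]( const tbb::blocked_range<unsigned long> &r )
        {
            for( unsigned long i = r.begin(); i != r.end(); i++ )
            {
                PooledTable::accessor a;
                table.insert( a, (int)i );
            }
        } );

        std::cout << " table size: " << table.size() << std::endl;
    }   // the map goes away here, then the pool releases its node memory in one shot

    // Leave this in as well, for whatever effect it has on the remaining tbbmalloc buffers.
    scalable_allocation_command( TBBMALLOC_CLEAN_ALL_BUFFERS, NULL );

    std::cout << " check the process memory usage now..." << std::endl;
    return 0;
}

Note that, as said above, only the node allocations go through the pool; the segment arrays still come from the default allocator.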
