scalable allocator problem (race conditions?)

marc_sanfacon · ‎12-12-2007

Hi all, I have been evaluating the TBB library lately and on of the thing that is really interesting to me is the scalable_allocator. What I did was override the global new and delete operators with the scalable_allocator found in TBB. Actually, we developed our own allocator (that we have been using for years) so I simply replaced the calls to our allocator by the calls to TBB. This seems to work really well and faster than our allocator, however we encountered what seems to be race conditions in the TBB allocator. The process is running and using CPU, but seems to be in an infinite loop, for several minutes. I did a core dump and found several threads in scalable_free. What's interesting, is that after the core dump, the process resumed its operation completely and was yet again running - no need to kill it or restart it. BTW, I am running this on 8 cores machines (Intel & AMD) that has 8-16GB of memory on Windows 2003 Server. The problem occurred on both hardware. Anyone has an idea on why this occurs or how I can fix it? It seems I am using the latest release of TBB for Windows, downloaded about 3 weeks ago. Thank you, Marc

Alexey-Kukanov · ‎12-13-2007

Hi Marc,

Thank you for reporting the issue. If you could provide a reproducer test, it would greatly simplify our investigations.

marc_sanfacon · ‎12-22-2007

Hi Alexey,
I have tried to reproduce it in a smaller sample but I was unable to. I will try to fin d a way to reproduce it easily in our application and then let you know, and probably send you our application for you to reproduce.

Thanks,
Marc

vanswaaij · ‎04-11-2008

Hi Alexey,

Have you been able to find the source of the problem? I might also have a race condition. In my case scalable_malloc returns 0 as if memory is exhausted but only when run in optimized mode using both processors (I'm testing on a 2 processor machine). When using a single thread or in debug mode with 2 threads or when run under valgrind with 2 threads all is fine.
I've tried replicating the problem in a simple test case, but haven't been able too. I could send the application, but it probably would require NDA's (it's a proprietary raytracer) to be signed etc. so I'm wondering if you'all have made any progress, or if I can produce a trace or something that could help you.

Thanks,
Maurice

Alexey-Kukanov · ‎04-13-2008

Maurice,

I never heard back from Marc about a reproducer test case for his problem, and I was not able to reproduce the issue on my own.

For your case, I suggest you read the thread http://software.intel.com/en-us/forums//topic/57610and check ifthe root of your problem is the same as there.

If it does not help, I would appreciate if you provide more info about the TBB package you use, as well as your OS version and other environment such as compiler, C RTL (e.g. glibc), etc.

vanswaaij · ‎04-14-2008

I'll check out the link and get back to you on it.

Thanks,

Maurice.

vanswaaij · ‎04-16-2008

Hi Alexey,

So it was not the errno problem, I had that one figured out before I read the article.

I've made a test case that bombs on my 2 processors box:

uname -a
Linux 2.6.11-1.35_FC3smp #1 SMP Mon Jun 13 01:16:04 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux

compile flags
-march=opteron -m64 -ansi -mieee-fp -fPIC -pthread -g -Wall -W -Wwrite-strings

compiler version:
g++344 (GCC) 3.4.4

tbb version 2.0.17 and 2.0.20080408 show the problem

The two defines in the top of the file control the test, bassically it runs without the scalable allocator or when running with 1 thread.
I hope you can replicate the problem on your hardware.

Thanks

Maurice

=======================================================================================

//#define USE_SCALABLE		// uncomment this, fails
//#define ONE_THREAD		// uncomment this, runs again

#include "errno.h"
#include "iostream"
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/scalable_allocator.h"

typedef unsigned short uint16;
typedef unsigned long long uint64;
typedef unsigned int uint32;

static  const uint64    a = 0x5deece66dULL;
static  const uint32    c = 11;
static  const uint64    mask48 = 0xffffffffffffULL;
static  const double    max48 = 281474976710656.;   // 2 ** 48

using namespace tbb;
using namespace std;

double  drand (uint16 seed[3])
{
    // Emulate erand48 random number generator.

    uint64  num = seed[2];
    num = (num << 16) + seed[1];
    num = (num << 16) + seed[0];

    num = (num * a + c) & mask48;

    double  result = (double) num / max48;
    seed[0] = (uint16) num;
    num >>= 16;
    seed[1] = (uint16) num;
    num >>= 16;
    seed[2] = (uint16) num;

    return result;
}

#ifdef USE_SCALABLE

void * ::operator new (size_t n) throw (std::bad_alloc)
{
    errno = 0;
    void * result = scalable_malloc (n);

    if (!result) {
        cout << "Can't allocate " << n << " bytes" << endl;
        ::exit (1);
    }

    return result;
}

void * ::operator new[] (size_t n) throw (std::bad_alloc)
{
    return ::operator new (n);
}


void ::operator delete (void * ptr)
{
    scalable_free (ptr);
}

void ::operator delete[] (void * ptr)
{
    scalable_free (ptr);
}

#endif
void foo (double *& x, uint16 seed[3])
{
    int size = 1 + (int)(2000 * drand (seed));
    bool alloc = drand (seed) > 0.7;

    if (x) {
        delete [] x;
        x = 0;
    }
    if (alloc) {
        x = new double[size];
    }
}


class parallel_foo {
public:
    parallel_foo (double * a[]) :
        my_a_ (a) {}

    void operator () (const blocked_range& r) const {
        double ** a = my_a_;
        uint16 seed[3] = {r.begin (), r.end (), (uint16)&r};

        for (size_t i=r.begin(); i!=r.end(); ++i) {
            for (size_t j = 0; j<16; ++j) {
                foo (a, seed);
            }
        }
    }
private:
    double ** const my_a_;
    uint16  seed_[3];
};

class parallel_init {
public:
  
  parallel_init (double * a[]) :
        my_a_ (a) {}

    void operator () (const blocked_range& r) const {
        double ** a = my_a_;
        for (size_t i=r.begin(); i!=r.end(); ++i) {
            a = 0;
        }
    }
private:
    double ** const my_a_;
};


int main () {

    size_t N = (size_t)1e6;

    double ** a = new double *;

#ifdef ONE_THREAD
    task_scheduler_init init (1);
#else
    task_scheduler_init init;
#endif
    parallel_for (blocked_range (0, N, 100), parallel_init (a));
    parallel_for (blocked_range (0, N, 100), parallel_foo (a));
    return 0;
}

Alexey-Kukanov · ‎04-16-2008

Well, your test might actually run out of memory, due to memory leak introduced in it (though I do not know if this is just an overlook and not a part of the test). The foo function allocates some memory and frees it on the second call and allocates again, etc. But the memory allocated in the last call to foo is never freed. So if you add one line (below in bold) after the loop calling foo, the test might run just fine (at least, this is the case on my Windows laptop).

for (size_t j = 0; j<16; ++j) {
foo (a, seed);
}
if( a ) delete[] a;

vanswaaij · ‎04-17-2008

Yes that is on purpose, only when a certain amount of memory is 'in play' does the problem occur.
I managed to make the allocation smaller though by replacing the amount '2000' by '1400' and the number of iterations (16) can be set to 1.
I also put in an endless print loop in case of failure so I can inspect the output of 'top', as well as an endless loop at the end, also to inspect 'top' in case of success.

With these numbers it doesn't always fail anymore on my box, but here are two different runs that did fail, notice the difference in allocated memory, at least according to top.

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

19647 maurice 16 0 1560m 454m 1088 S 55.2 5.7 0:03.82 mtbb_3mtg

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

21810 maurice 16 0 3099m 908m 1088 R 99.9 11.4 0:04.36 mtbb_3mtg

And here for a succesfull run with the scalable allocator, note it allocated more than the two unsuccessful

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
;

22241 maurice 25 0 3105m 909m 1056 R 99.8 11.4 0:30.52 mtbb_3mtg

And here a successful run not using the scalable allocator. Note that the virtual memory allocated is a lot less

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

23547 maurice 25 0 1636m 968m 1052 R 99.8 12.2 0:08.60 mtbb_3mtg

To convince you that my box can allocate enough memory, here's a run not using the scalable allocator but with a much larger allocation

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

3501 maurice 25 0 18.0g 4.4g 1056 R 99.9 56.1 0:22.81 mtbb_3mtg

Thanks

==============================================================================
#define USE_SCALABLE
//#define ONE_THREAD

#define SIZE_MEM 1400
#define SIZE_ARR 1e6
#define COUNT 1

#include "errno.h"
#include "iostream"
#inc
lude "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/scalable_allocator.h"

typedef unsigned short uint16;
typedef unsigned long long uint64;
typedef unsigned int uint32;

static  const uint64    a = 0x5deece66dULL;
static  const uint32    c = 11;
static  const uint64    mask48 = 0xffffffffffffULL;
static  const double    max48 = 281474976710656.;   // 2 ** 48

using namespace tbb;
using namespace std;

double  drand (uint16 seed[3])
{
    // Emulate erand48 random number generator.

    uint64  num = seed[2];
    num = (num << 16) + seed[1];
    num = (num << 16) + seed[0];

    num = (num * a + c) & mask48;

    double  result = (double) num / max48;
    seed[0] = (uint16) num;
    num >>= 16;
    seed[1] = (uint16) num;
    num >>= 16;
    seed[2] = (uint16) num;

    return result;
}

#ifdef USE_SCALABLE

void * ::operator new (size_t n) throw (std::bad_alloc)
{
    errno = 0;
    void * result = scalable_malloc (n);

    if (!result) {
        while (1) {
            cout << "Can't allocate " << n << " bytes" << endl;
        }
    }

    return result;
}

void * ::operator new[] (size_t n) throw (std::bad_alloc)
{
    return ::operator new (n);
}
   

void ::operator delete (void * ptr)
{
    scalable_free (ptr);
}

void ::operator delete[] (void * ptr)
{
    scalable_free (ptr);
}

#endif

void foo (double *& x, uint16 seed[3])
{
    int size = 1 + (int)(SIZE_MEM * drand (seed));
    bool alloc = drand (seed) > 0.7;

    if (x) {
        delete [] x;
        x = 0;
    }
    if (alloc) {
        x = new double[size];
    }
}

class parallel_foo {
public:
    parallel_foo (double * a[]) :
        my_a_ (a) {}

    void operator () (const blocked_range& r) const {
        double ** a = my_a_;
        uint16 seed[3] = {r.begin (), r.end (), (uint16)&r};

        for (size_t i=r.begin(); i!=r.end(); ++i) {
            for (size_t j = 0; j                foo (a, seed);
            }
        }
    }
private:
    double ** const my_a_;
    uint16  seed_[3];
};

class parallel_init {
public:
    parallel_init (double * a[]) :
        my_a_ (a) {}

    void operator () (const blocked_range& r) const {
        double ** a = my_a_;
        for (size_t i=r.begin(); i!=r.end(); ++i) {
            a = 0;
        }
    }
private:
    double ** const my_a_;
};
i
nt main () {

    size_t N = (size_t)SIZE_ARR;

    double ** a = new double *;

#ifdef ONE_THREAD
    task_scheduler_init init (1);
#else
    task_scheduler_init init;
#endif
    parallel_for (blocked_range (0, N, 100), parallel_init (a));
    parallel_for (blocked_range (0, N, 100), parallel_foo (a));
    while (1) {}
    return 0;
}

robert-reed · ‎04-17-2008

I think the virtual memory size may be a red herring. Certainly one of the failed scalable results has aboutthe same virtual size as the successful non-scalable allocator example, so having a smaller footprint is not correlated with success. This led me to look at what the expected memory usage might be so I did a little thought experiment;

The code represents an array of 1 million pointers to arrays of doubles, which due to the thresholding of the random variable, should have roughly a 30% occupancy rate (assuming a uniform distribution out of drand(), whichprobably isn'ttrue). The maximum size of these double arrays is 1400, so the upper bound of virtual memory allocation should be:

1400 * 8 (bytes per double) * 1,000,000 (pointers in the base array) * .3 (30% occupancy) = 3360000000 or roughly 3,204 MB

None of the examples, successful or otherwise, exceed this limit. Assuming a uniform or at least symmetric distribution of drand() and assuming that all memory allocated by the system is represented in the arrays of doubles, I'd expect the virtual memory size over repeated runs to cluster around half that upper limit, or pretty close to the size reported for the successful non-scalable allocator case.

However, those are two big assumptions. They don't account, for example, for memory allocated by the underlying malloc (which will show up in top)but held in pools for future allocation as is done in the scalable allocator. So a skew in the memory footprint above the expected average may not be significant. And certainly the seeds for drand are not uniform so the output is not likely uniform either, making any conclusions based on averaging results suspect.

Still, these footprints don't appear nearly large enough to raise any fears about out-of-memory issues. And I don't see anything in this code that might otherwise explain the reported hangs.

vanswaaij · ‎05-05-2008

So it turns out that Alexey was right, my program is simply running out of memory. The fact that 'top' didn't report it was because my program never accessed the memory allocated. Actually accessing it shows the true memory use.

However there is a big difference in total memory use (in this particular example) for the scalable allocator versus regular malloc, because the scalable allocator adds about 16K bytes to every allocation larger than about 8K. Not knowing this made it seem that the scalable memory allocator was running out of memory much too soon when using it in my original application. The (faulty) test seem to confirm this.

The question is then, can that overhead be reduced? An unfortunate application that allocates chunks of about 8K will actually use 3 times as much (8K + 16K + a small header).

Any thoughts?

Thanks,

Maurice.

Alexey-Kukanov · ‎05-07-2008

vanswaaij:
However there is a big difference in total memory use (in this particular example) for the scalable allocator versus regular malloc, because the scalable allocator adds about 16K bytes to every allocation larger than about 8K. Not knowing this made it seem that the scalable memory allocator was running out of memory much too soon when using it in my original application.

Right, we are aware of it, agree this is a problem, and will address it, though I can't say how soon. It requires more efforts and refactoring than it might seem.