Solved: TBB memory allocator crashes.

mrmod · ‎02-21-2011

Running the code below, compiled as 32-bit code under Microsoft Visual Studio 2008 SP1 on Windows 7 x64 with the latest TBB, crashes.

After changing the constant 4*1024*1024 to 2*1024*1024, the crashes disappear.

I think it may be a library bug, because there are no crashes with either the Google tcmalloc allocator or the standard Microsoft allocator.

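//
// code
//
#include "stdafx.h"
#include <windows.h>   // LONG, DWORD, CreateThread, WaitForMultipleObjects, max
#include <tchar.h>     // _tmain, _TCHAR
#include <stdio.h>     // printf
#include <stdlib.h>    // rand, RAND_MAX
#include <time.h>      // clock, CLOCKS_PER_SEC

#define INTEL_MALLOC
//#define GOOGLE_MALLOC

#if defined INTEL_MALLOC
// Header name presumed (it is missing from the post): the TBB proxy header
// that routes operator new/delete to the TBB allocator.
#include "tbb/tbbmalloc_proxy.h"
const char * AllocName = "Intel";
#elif defined GOOGLE_MALLOC
#pragma comment(lib, "libtcmalloc_minimal.lib")
#pragma comment(linker, "/include:__tcmalloc")
const char * AllocName = "Google";
#else
const char * AllocName = "Microsoft";
#endif

volatile LONG g_stallCount = 0;        // total stalls across all threads
const float StallTime = 0.005f;        // seconds; anything slower counts as a stall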

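DWORD WINAPI ThreadFunc(LPVOID)
{
    float avgStall = 0.0f, maxStall = 0.0f;
    size_t stallCount = 0;

    for (size_t i = 0; i < 10000000; ++i)
    {
        // Random size between 1M and 5M.
        size_t dataSize = (size_t)(4*1024*1024 * (rand()/(float)RAND_MAX) + 1024*1024);

        // Time the allocation. operator new throws std::bad_alloc on failure,
        // and nothing here catches it.
        clock_t begin = clock();
        char * data = new char[dataSize];
        float time = (clock() - begin)/(float)CLOCKS_PER_SEC;
        if (time > StallTime)
        {
            maxStall = max(time, maxStall);
            avgStall += time;
            ++stallCount;
        }

        // Time the deallocation.
        begin = clock();
        delete [] data;
        time = (clock() - begin)/(float)CLOCKS_PER_SEC;
        if (time > StallTime)
        {
            maxStall = max(time, maxStall);
            avgStall += time;
            ++stallCount;
        }
    }   // end of the for loop (brace missing in the post)

    if (stallCount)
        avgStall /= stallCount;

    // g_stallCount is printed by main but never updated in the post;
    // presumably each thread was meant to add its own count.
    InterlockedExchangeAdd(&g_stallCount, (LONG)stallCount);

    printf("avg stall: %f, max stall = %f, count = %u\n", avgStall, maxStall, (unsigned)stallCount);
    return 0;
}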

int _tmain(int argc, _TCHAR* argv[])
{
    HANDLE t[4] = {0};

    printf("Allocator: %s\n", AllocName);

    // Four threads run the same allocate/free loop concurrently.
    t[0] = CreateThread(NULL, 0, ThreadFunc, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, ThreadFunc, NULL, 0, NULL);
    t[2] = CreateThread(NULL, 0, ThreadFunc, NULL, 0, NULL);
    t[3] = CreateThread(NULL, 0, ThreadFunc, NULL, 0, NULL);

    clock_t begin = clock();
    WaitForMultipleObjects(sizeof(t)/sizeof(t[0]), t, TRUE, INFINITE);
    float time = (clock() - begin)/(float)CLOCKS_PER_SEC;

    printf(
        "stall count: %ld\n"
        "total time: %f\n", (long)g_stallCount, time);

    return 0;
}

Alexey-Kukanov · ‎02-22-2011

Thanks for reporting the issue. We investigated the case, and it is an interesting one. The benchmark crashes because it does not catch the std::bad_alloc exception that operator new can throw, and that is exactly what happens with the TBB allocator. I know it sounds weird, but the allocator runs out of memory. I will explain why it happens.
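For illustration, here is a minimal sketch of an allocate/free cycle that reports failure instead of letting std::bad_alloc escape the thread. The helper name alloc_free_cycle is hypothetical and not part of the original benchmark; what to do on failure (count it, back off, stop the thread) is left to the caller.

```cpp
#include <cstddef>
#include <new>

// Hypothetical helper, not from the original benchmark: performs one
// allocate/free cycle and returns false if operator new ran out of memory.
bool alloc_free_cycle(std::size_t size)
{
    try {
        char* data = new char[size];   // may throw std::bad_alloc
        data[0] = 0;                   // touch the block so it is actually used
        delete[] data;
        return true;
    } catch (const std::bad_alloc&) {
        return false;                  // out of memory: the caller decides what to do
    }
}
```

In the benchmark loop this would replace the bare new/delete pair, so an out-of-memory condition shows up as a counted failure rather than a crash.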

The TBB allocator is primarily tuned for speed, at the expense of somewhat excessive memory consumption. In particular, for sizes of 8K and bigger the allocator holds freed memory objects for some time, on the assumption that the application may need a similar object again. Since applications usually repeat allocations of the same size, the TBB allocator distributes all freed objects into "bins", each covering a certain size range, to reduce internal fragmentation and lock contention. Objects that are not reused for some time (with an adaptive time threshold separate for each bin) are finally released. Currently, there are 1024 bins covering sizes from 8K to ~8M with a step of 8K.
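As a rough illustration of the binning described above, the sketch below maps a request size to a bin number. It only assumes what the post states (1024 bins, 8K step, starting at 8K); the function bin_index, the constants, and the exact bin boundaries are hypothetical and not the actual TBB implementation.

```cpp
#include <cstddef>

// Figures taken from the post; names and exact boundaries are assumptions.
const std::size_t kBinStep  = 8 * 1024;   // 8K step between bins
const std::size_t kMinSize  = 8 * 1024;   // smallest size that goes into a bin
const std::size_t kBinCount = 1024;       // bins cover 8K up to roughly 8M

// Returns the bin number for a request size, or -1 if the size falls
// outside the binned range.
int bin_index(std::size_t size)
{
    if (size < kMinSize)
        return -1;
    std::size_t index = (size - kMinSize) / kBinStep;
    return index < kBinCount ? static_cast<int>(index) : -1;
}
```

Under these assumptions a 1M request lands in bin 127 and a 5M request in bin 639, so the benchmark's size range spans 513 bins.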

Now let's see what happens with this benchmark. It is kind of a worst-case flogger for the TBB allocator. It allocates objects of 1M to 5M, with a more or less even distribution of sizes in between. Of course the memory is immediately returned, so from the user's perspective the test consumes no more than 20M (max 5M multiplied by 4 threads) at any given moment. But internally, the allocator keeps these objects, and reuses one only when the requested size matches the object size. Since every size is regularly requested, the objects are unlikely to age above the thresholds, so lots of them get hoarded. The worst case (when each of the 513 bins between 1M and 5M keeps as many objects as there are threads running) would require more than 6G of memory: 4 threads * 513 bins * 3M average bin size. Even if a bin only hoards 2 objects on average, it is still ~3G and the test can run out of memory.
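The arithmetic above can be checked with a few lines of code. This is only a back-of-the-envelope sketch using the figures from the post; the function names are made up for illustration. Either result is more than a 32-bit process can address by default on Windows (2 GB of user address space).

```cpp
#include <cstddef>

// Number of bins needed to cover sizes lo..hi inclusive with a given step.
std::size_t bins_between(std::size_t lo, std::size_t hi, std::size_t step)
{
    return (hi - lo) / step + 1;
}

// Memory held by the bins, in megabytes, if every bin keeps
// objects_per_bin objects of avg_object_mb megabytes each.
std::size_t hoarded_mb(std::size_t objects_per_bin, std::size_t bins,
                       std::size_t avg_object_mb)
{
    return objects_per_bin * bins * avg_object_mb;
}
```

With 1M to 5M in 8K steps, bins_between gives 513. Four objects per bin at 3M average is 6156M (a bit over 6G), and two objects per bin is 3078M (about 3G).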

We are improving the allocator design so that it hoards less memory and more readily returns it to the system. One of the improvements is to forcefully release all the pooled objects when the OS refuses to give more memory (well, it will be more intelligent and do this as a last resort when nothing else helps). Because of that, fixing this particular corner case within the current design seems a waste of effort.

mrmod · ‎02-22-2011

Thanks for the detailed explanation. It was really the simplest benchmark I could write, and such a case is almost impossible in my real application (I just wanted to get some numbers to decide which allocator to choose instead of the standard one, because profiling results have shown that my multi-threaded 3D rendering application stalls primarily on memory allocations).

As for the benchmark results: they have shown that the TBB allocator is two orders of magnitude faster than the standard (VC 2008) one, and one order of magnitude faster than Google's. This holds both for total benchmark execution time and for the number of what I called "stalls" (most often the case where the allocator locks a mutex).

Thank you for your efforts.

Alexey-Kukanov · ‎02-22-2011

I hope that the TBB allocator will work just as well for your real application. The impact of an allocator on application performance is rarely obvious just from looking at microbenchmarks, unless those are crafted to reflect the application's behavior, more or less. Since changing allocators can be done with just a few lines of code, the best benchmark is running your app for real with different allocators.
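One way to keep the allocator swappable, sketched here under the assumption that the application's buffers are standard containers: make the allocator a template parameter. The function run_workload and its toy workload are hypothetical stand-ins for real application code.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an application workload. Allocator can be any
// standard-conforming allocator for char; std::allocator is the default.
template <class Allocator = std::allocator<char> >
std::size_t run_workload(std::size_t iterations)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        // Each buffer is allocated and freed through Allocator.
        std::vector<char, Allocator> buffer(1024 + i % 4096);
        bytes += buffer.size();
    }
    return bytes;
}
```

Calling run_workload<>(n) uses the standard allocator. With TBB installed, including "tbb/scalable_allocator.h" and calling run_workload<tbb::scalable_allocator<char> >(n) should run the same workload on the TBB allocator, so the two can be timed side by side.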