Hi Marc,
Thank you for reporting the issue. If you could provide a reproducer test, it would greatly simplify our investigations.
I have tried to reproduce it in a smaller sample, but I was unable to. I will try to find a way to reproduce it easily in our application and let you know, and will probably send you our application so you can reproduce it.
Thanks,
Marc
Have you been able to find the source of the problem? I might also have a race condition. In my case scalable_malloc returns 0 as if memory is exhausted, but only when run in optimized mode using both processors (I'm testing on a 2-processor machine). When using a single thread, in debug mode with 2 threads, or when run under valgrind with 2 threads, all is fine.
I've tried replicating the problem in a simple test case, but haven't been able to. I could send the application, but it would probably require NDAs to be signed (it's a proprietary raytracer), so I'm wondering if you have made any progress, or if I can produce a trace or something else that could help you.
Thanks,
Maurice
Maurice,
I never heard back from Marc about a reproducer test case for his problem, and I was not able to reproduce the issue on my own.
For your case, I suggest you read the thread http://software.intel.com/en-us/forums//topic/57610 and check if the root of your problem is the same as there.
If it does not help, I would appreciate it if you could provide more information about the TBB package you use, as well as your OS version and other environment details such as the compiler, C RTL (e.g. glibc), etc.
I'll check out the link and get back to you on it.
Thanks,
Maurice.
So it was not the errno problem; I had that one figured out before I read the article.
I've made a test case that bombs on my 2-processor box:
uname -a
Linux 2.6.11-1.35_FC3smp #1 SMP Mon Jun 13 01:16:04 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
compile flags:
-march=opteron -m64 -ansi -mieee-fp -fPIC -pthread -g -Wall -W -Wwrite-strings
compiler version:
g++344 (GCC) 3.4.4
TBB versions 2.0.17 and 2.0.20080408 both show the problem.
The two defines at the top of the file control the test: basically, it runs fine without the scalable allocator or when running with 1 thread.
I hope you can replicate the problem on your hardware.
Thanks
Maurice
=======================================================================================
//#define USE_SCALABLE // uncomment this, fails
//#define ONE_THREAD // uncomment this, runs again

#include "errno.h"
#include "stdlib.h"
#include "new"
#include "iostream"
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/scalable_allocator.h"

typedef unsigned short uint16;
typedef unsigned long long uint64;
typedef unsigned int uint32;

static const uint64 a = 0x5deece66dULL;
static const uint32 c = 11;
static const uint64 mask48 = 0xffffffffffffULL;
static const double max48 = 281474976710656.; // 2 ** 48

using namespace tbb;
using namespace std;

double drand (uint16 seed[3])
{
    // Emulate erand48 random number generator.
    uint64 num = seed[2];
    num = (num << 16) + seed[1];
    num = (num << 16) + seed[0];
    num = (num * a + c) & mask48;
    double result = (double) num / max48;
    seed[0] = (uint16) num;
    num >>= 16;
    seed[1] = (uint16) num;
    num >>= 16;
    seed[2] = (uint16) num;
    return result;
}

#ifdef USE_SCALABLE
void * operator new (size_t n) throw (std::bad_alloc)
{
    errno = 0;
    void * result = scalable_malloc (n);
    if (!result) {
        cout << "Can't allocate " << n << " bytes" << endl;
        ::exit (1);
    }
    return result;
}

void * operator new[] (size_t n) throw (std::bad_alloc)
{
    return ::operator new (n);
}

void operator delete (void * ptr) throw ()
{
    scalable_free (ptr);
}

void operator delete[] (void * ptr) throw ()
{
    scalable_free (ptr);
}
#endif

// Randomly frees and/or (re)allocates the array pointed to by x.
void foo (double *& x, uint16 seed[3])
{
    int size = 1 + (int)(2000 * drand (seed));
    bool alloc = drand (seed) > 0.7;
    if (x) {
        delete [] x;
        x = 0;
    }
    if (alloc) {
        x = new double[size];
    }
}

class parallel_foo {
public:
    parallel_foo (double * a[]) :
        my_a_ (a) {}
    void operator () (const blocked_range<size_t>& r) const {
        double ** a = my_a_;
        uint16 seed[3] = {(uint16) r.begin (), (uint16) r.end (), (uint16) (size_t) &r};
        for (size_t i = r.begin (); i != r.end (); ++i) {
            for (size_t j = 0; j < 16; ++j) {
                foo (a[i], seed);
            }
        }
    }
private:
    double ** const my_a_;
    uint16 seed_[3];
};

class parallel_init {
public:
    parallel_init (double * a[]) :
        my_a_ (a) {}
    void operator () (const blocked_range<size_t>& r) const {
        double ** a = my_a_;
        for (size_t i = r.begin (); i != r.end (); ++i) {
            a[i] = 0;
        }
    }
private:
    double ** const my_a_;
};

int main () {
    size_t N = (size_t) 1e6;
    double ** a = new double *[N];
#ifdef ONE_THREAD
    task_scheduler_init init (1);
#else
    task_scheduler_init init;
#endif
    parallel_for (blocked_range<size_t> (0, N, 100), parallel_init (a));
    parallel_for (blocked_range<size_t> (0, N, 100), parallel_foo (a));
    return 0;
}
Well, your test might actually run out of memory due to a memory leak introduced in it (though I do not know whether this is just an oversight or part of the test). The foo function allocates some memory, frees it on the second call, allocates again, etc. But the memory allocated in the last call to foo is never freed. So if you add one line (the last line below) after the loop calling foo, the test might run just fine (at least, this is the case on my Windows laptop).
for (size_t j = 0; j < 16; ++j) {
    foo (a[i], seed);
}
if (a[i]) delete[] a[i];
Yes, that is on purpose; the problem only occurs when a certain amount of memory is 'in play'.
I did manage to make the allocation smaller, though, by replacing the amount '2000' with '1400', and the number of iterations (16) can be set to 1.
I also put in an endless print loop in case of failure so I can inspect the output of 'top', as well as an endless loop at the end, also to inspect 'top' in case of success.
With these numbers it doesn't always fail anymore on my box, but here are two different runs that did fail; notice the difference in allocated memory, at least according to top.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19647 maurice 16 0 1560m 454m 1088 S 55.2 5.7 0:03.82 mtbb_3mtg
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21810 maurice 16 0 3099m 908m 1088 R 99.9 11.4 0:04.36 mtbb_3mtg
And here is a successful run with the scalable allocator; note that it allocated more than the two unsuccessful ones.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22241 maurice 25 0 3105m 909m 1056 R 99.8 11.4 0:30.52 mtbb_3mtg
And here is a successful run not using the scalable allocator. Note that the virtual memory allocated is a lot less.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23547 maurice 25 0 1636m 968m 1052 R 99.8 12.2 0:08.60 mtbb_3mtg
To convince you that my box can allocate enough memory, here's a run not using the scalable allocator but with a much larger allocation
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3501 maurice 25 0 18.0g 4.4g 1056 R 99.9 56.1 0:22.81 mtbb_3mtg
Thanks
=======================================================================================
#define USE_SCALABLE
//#define ONE_THREAD
#define SIZE_MEM 1400
#define SIZE_ARR 1e6
#define COUNT 1

#include "errno.h"
#include "new"
#include "iostream"
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/scalable_allocator.h"

typedef unsigned short uint16;
typedef unsigned long long uint64;
typedef unsigned int uint32;

static const uint64 a = 0x5deece66dULL;
static const uint32 c = 11;
static const uint64 mask48 = 0xffffffffffffULL;
static const double max48 = 281474976710656.; // 2 ** 48

using namespace tbb;
using namespace std;

double drand (uint16 seed[3])
{
    // Emulate erand48 random number generator.
    uint64 num = seed[2];
    num = (num << 16) + seed[1];
    num = (num << 16) + seed[0];
    num = (num * a + c) & mask48;
    double result = (double) num / max48;
    seed[0] = (uint16) num;
    num >>= 16;
    seed[1] = (uint16) num;
    num >>= 16;
    seed[2] = (uint16) num;
    return result;
}

#ifdef USE_SCALABLE
void * operator new (size_t n) throw (std::bad_alloc)
{
    errno = 0;
    void * result = scalable_malloc (n);
    if (!result) {
        // Spin forever on failure so the process can be inspected with 'top'.
        while (1) {
            cout << "Can't allocate " << n << " bytes" << endl;
        }
    }
    return result;
}

void * operator new[] (size_t n) throw (std::bad_alloc)
{
    return ::operator new (n);
}

void operator delete (void * ptr) throw ()
{
    scalable_free (ptr);
}

void operator delete[] (void * ptr) throw ()
{
    scalable_free (ptr);
}
#endif

// Randomly frees and/or (re)allocates the array pointed to by x.
void foo (double *& x, uint16 seed[3])
{
    int size = 1 + (int)(SIZE_MEM * drand (seed));
    bool alloc = drand (seed) > 0.7;
    if (x) {
        delete [] x;
        x = 0;
    }
    if (alloc) {
        x = new double[size];
    }
}

class parallel_foo {
public:
    parallel_foo (double * a[]) :
        my_a_ (a) {}
    void operator () (const blocked_range<size_t>& r) const {
        double ** a = my_a_;
        uint16 seed[3] = {(uint16) r.begin (), (uint16) r.end (), (uint16) (size_t) &r};
        for (size_t i = r.begin (); i != r.end (); ++i) {
            for (size_t j = 0; j < COUNT; ++j) {
                foo (a[i], seed);
            }
        }
    }
private:
    double ** const my_a_;
    uint16 seed_[3];
};

class parallel_init {
public:
    parallel_init (double * a[]) :
        my_a_ (a) {}
    void operator () (const blocked_range<size_t>& r) const {
        double ** a = my_a_;
        for (size_t i = r.begin (); i != r.end (); ++i) {
            a[i] = 0;
        }
    }
private:
    double ** const my_a_;
};

int main () {
    size_t N = (size_t) SIZE_ARR;
    double ** a = new double *[N];
#ifdef ONE_THREAD
    task_scheduler_init init (1);
#else
    task_scheduler_init init;
#endif
    parallel_for (blocked_range<size_t> (0, N, 100), parallel_init (a));
    parallel_for (blocked_range<size_t> (0, N, 100), parallel_foo (a));
    // Spin forever on success, also to allow inspection with 'top'.
    while (1) {}
    return 0;
}
I think the virtual memory size may be a red herring. Certainly one of the failed scalable results has about the same virtual size as the successful non-scalable allocator example, so having a smaller footprint is not correlated with success. This led me to look at what the expected memory usage might be, so I did a little thought experiment:
The code represents an array of 1 million pointers to arrays of doubles, which, due to the thresholding of the random variable, should have roughly a 30% occupancy rate (assuming a uniform distribution out of drand(), which probably isn't true). The maximum size of these double arrays is 1400, so the upper bound of virtual memory allocation should be:
1400 * 8 (bytes per double) * 1,000,000 (pointers in the base array) * 0.3 (30% occupancy) = 3,360,000,000 bytes, or roughly 3,204 MB
None of the examples, successful or otherwise, exceeds this limit. Assuming a uniform or at least symmetric distribution of drand(), and assuming that all memory allocated by the system is represented in the arrays of doubles, I'd expect the virtual memory size over repeated runs to cluster around half that upper limit (roughly 1,600 MB), or pretty close to the size reported for the successful non-scalable allocator case.
However, those are two big assumptions. They don't account, for example, for memory allocated by the underlying malloc (which will show up in top) but held in pools for future allocation, as is done in the scalable allocator. So a skew in the memory footprint above the expected average may not be significant. And certainly the seeds for drand are not uniform, so the output is not likely uniform either, making any conclusions based on averaging results suspect.
Still, these footprints don't appear nearly large enough to raise any fears about out-of-memory issues. And I don't see anything in this code that might otherwise explain the reported hangs.
However, there is a big difference in total memory use (in this particular example) for the scalable allocator versus regular malloc, because the scalable allocator adds about 16 KB to every allocation larger than about 8 KB. Not knowing this made it seem that the scalable memory allocator was running out of memory much too soon when using it in my original application. The (faulty) test seemed to confirm this.
The question then is: can that overhead be reduced? An unfortunate application that allocates chunks of about 8 KB will actually use three times as much memory (8 KB + 16 KB + a small header).
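In case anyone wants to check this on their own setup, here is a minimal stand-alone probe sketch of my own (not part of TBB or the test above): it estimates the average virtual-memory cost per scalable_malloc'd block of a chosen request size by sampling /proc/self/statm on Linux before and after a batch of allocations. The ~8 KB threshold and ~16 KB overhead figures come from the discussion above, not from this code, and the helper name vm_pages and the block/count values are just placeholders.
// Hypothetical probe: estimate the average virtual-memory footprint of one
// scalable_malloc'd block by sampling /proc/self/statm (Linux-specific).
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include "tbb/scalable_allocator.h"

static long vm_pages ()
{
    // First field of /proc/self/statm is the total program size in pages.
    long pages = 0;
    FILE * f = fopen ("/proc/self/statm", "r");
    if (f) {
        if (fscanf (f, "%ld", &pages) != 1) pages = 0;
        fclose (f);
    }
    return pages;
}

int main ()
{
    const size_t block = 8 * 1024;   // request size to probe
    const int    count = 10000;      // number of allocations in the batch
    void ** p = (void **) malloc (count * sizeof (void *));

    long before = vm_pages ();
    for (int i = 0; i < count; ++i)
        p[i] = scalable_malloc (block);
    long after = vm_pages ();

    double per_alloc =
        (double) (after - before) * sysconf (_SC_PAGESIZE) / count;
    printf ("requested %lu bytes, average VM consumed per block: %.0f bytes\n",
            (unsigned long) block, per_alloc);

    for (int i = 0; i < count; ++i)
        scalable_free (p[i]);
    free (p);
    return 0;
}
The numbers are of course only approximate (the allocator maps memory in larger chunks and may reuse previously mapped space), but varying the block size around the 8 KB boundary should make a per-allocation overhead of the magnitude discussed here clearly visible.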
Any thoughts?
Thanks,
Maurice.
vanswaaij: However, there is a big difference in total memory use (in this particular example) for the scalable allocator versus regular malloc, because the scalable allocator adds about 16 KB to every allocation larger than about 8 KB. Not knowing this made it seem that the scalable memory allocator was running out of memory much too soon when using it in my original application.
Right, we are aware of it, agree it is a problem, and will address it, though I can't say how soon. It requires more effort and refactoring than it might seem.