I'm putting together a first TBB program, which does a simple clustering problem, and I'm looking for some advice from more experienced TBB users on how to set this up.
My current code is of the form:
while !converged {
    converged = parallel reduction
}
Each task instance created in the parallel reduction does some scalable_mallocs and scalable_frees.
This seems kind of unbalanced; any suggestions?
I'd post the code, but I'm not clear on how to post a program to the forum.
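(For readers newer to TBB: the outline above maps onto tbb::parallel_reduce roughly as in the following self-contained sketch. The body class, the placeholder per-object work, and the convergence threshold are invented for illustration, not taken from the poster's code.)
[code=cpp]
#include <vector>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

// Minimal reduction body: splitting constructor, operator(), join().
struct delta_body {
    const std::vector<float>& data;
    double delta;                                  // value being reduced
    delta_body(const std::vector<float>& d) : data(d), delta(0) {}
    delta_body(delta_body& x, tbb::split) : data(x.data), delta(0) {}
    void operator()(const tbb::blocked_range<int>& r) {
        for (int i = r.begin(); i != r.end(); ++i)
            delta += data[i];                      // placeholder per-object work
    }
    void join(delta_body& rhs) { delta += rhs.delta; }
};

int main() {
    std::vector<float> data(1000000, 0.0f);
    bool converged = false;
    while (!converged) {                           // serial outer loop
        delta_body body(data);
        tbb::parallel_reduce(tbb::blocked_range<int>(0, (int)data.size()), body);
        converged = (body.delta < 1e-3);           // threshold is an assumption;
    }                                              // trivially met here, since the work is a placeholder
    return 0;
}
[/code]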
6 Replies
What do you mean, "unbalanced"? That the memory is freed in a different thread than where it was allocated? The scalable memory allocator has special provisions to mitigate the cost of freeing memory from a different thread than where it was allocated, so you should probably consider carefully whether this is where you would want to spend your time optimising your code (which is not to say that nothing can be done).
(Here are some ideas that just occurred to me, early on a Monday morning: the AtomicCompareExchange() in freePublicObject() should get a backoff anyway (backoffs are not just about waiting for something to happen!), maybe some boxcarring functions might improve performance (several (de)allocations could be treated together, or maybe it can be done automatically through a thread-local buffer), or how about a scalable_free overload that provides the size of the original allocation (I'm sure there must be a reason why free() doesn't have that, but it just won't come to me).)
(Corrected) See text, and it's not actually true that this is the first time I thought about this.
Sorry, I was not clear. The code that I have has an outer serial loop over a parallel reduction. Each task in the reduction does an allocation and a free, and the same task that allocates does the free. But each time around the loop, the tasks for this iteration allocate the same space used by the tasks from the last iteration, so there are repeated allocations and deallocations of the same addresses. I'm wondering if there is a way to keep the allocations around, and whether that would be a good idea.
Typical sizes would be:
- loop iterations: ~500
- num_clusters (constant for the entire run): 2-10
- num_attributes (constant for the entire run): 2-10
- num_objects (constant for the entire run): ~1,000,000
The allocation part looks like:
[code=cpp]
// splitting constructor used by the parallel reduction
kmeans(kmeans& x, split) {
    num_clusters = x.num_clusters;
    num_attributes = x.num_attributes;
    num_objects = x.num_objects;
    new_centers_len = (int*) scalable_malloc(num_clusters * sizeof(int));
    new_centers = (float**) scalable_malloc(num_clusters * sizeof(float*));
    new_centers[0] = (float*) scalable_malloc(num_clusters * num_attributes * sizeof(float));
    ...
}

kmeans(int nclusters, int nattrib, int num_obj)
    : delta(0), num_clusters(nclusters),
      num_attributes(nattrib), num_objects(num_obj) {
    new_centers_len = (int*) scalable_malloc(num_clusters * sizeof(int));
    new_centers = (float**) scalable_malloc(num_clusters * sizeof(float*));
    new_centers[0] = (float*) scalable_malloc(num_clusters * num_attributes * sizeof(float));
    ...
}

~kmeans() {
    scalable_free(new_centers[0]);
    scalable_free(new_centers);
    scalable_free(new_centers_len);
}
[/code]
It's not clear to me yet what this algorithm does, but if you can find a not-too-convoluted way to hoist that resource handling out of the outer loop without somehow introducing obvious cache nonlocalities, and perhaps unless the rest of the code takes significantly more time, I would say go for it. What makes you hesitate?
Quoting - norman.rubin @ amd.com
Sorry, I was not clear. The code that I have has an outer serial loop over a parallel reduction. Each task in the reduction does an allocation and a free, and the same task that allocates does the free. But each time around the loop, the tasks for this iteration allocate the same space used by the tasks from the last iteration, so there are repeated allocations and deallocations of the same addresses. I'm wondering if there is a way to keep the allocations around, and whether that would be a good idea.
Take a look at tbb::enumerable_thread_specific in the latest open-source developer updates; see include/tbb/enumerable_thread_specific.h there. The functionality was designed to help in situations like this. Basically, it is an advanced wrapper over the platform-specific APIs for thread-local storage. It is still new and not fully settled, though changes are rather unlikely, and no documentation is ready yet; so please ask here for further help if necessary.
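To make that concrete, here is a minimal self-contained sketch (the sizes, the field names, and the placeholder per-object work are invented, not taken from the original code) of enumerable_thread_specific holding per-thread scratch buffers: each worker thread allocates them once with scalable_malloc, reuses them across every iteration of the outer loop, and everything is freed once at the end.
[code=cpp]
#include "tbb/enumerable_thread_specific.h"
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/scalable_allocator.h"

// Per-thread scratch, created lazily on first use and then reused.
struct scratch {
    int*   new_centers_len;   // num_clusters ints
    float* new_centers;       // num_clusters * num_attributes floats
    scratch() : new_centers_len(0), new_centers(0) {}
};
typedef tbb::enumerable_thread_specific<scratch> scratch_store;

struct kmeans_body {
    scratch_store& store;
    int num_clusters, num_attributes;
    float delta;
    kmeans_body(scratch_store& s, int nc, int na)
        : store(s), num_clusters(nc), num_attributes(na), delta(0) {}
    kmeans_body(kmeans_body& x, tbb::split)
        : store(x.store), num_clusters(x.num_clusters),
          num_attributes(x.num_attributes), delta(0) {}
    void operator()(const tbb::blocked_range<int>& r) {
        scratch& s = store.local();                // same object every time this thread runs
        if (s.new_centers_len == 0) {              // allocate at most once per thread
            s.new_centers_len = (int*) scalable_malloc(num_clusters * sizeof(int));
            s.new_centers = (float*) scalable_malloc(
                num_clusters * num_attributes * sizeof(float));
        }
        (void) r;  // real code would update s.new_centers / s.new_centers_len and delta here
    }
    void join(kmeans_body& rhs) { delta += rhs.delta; }
};

int main() {
    scratch_store store;
    for (int iter = 0; iter != 500; ++iter) {      // the outer serial loop
        kmeans_body body(store, 10, 10);
        tbb::parallel_reduce(tbb::blocked_range<int>(0, 1000000), body);
        // The body and its splits are destroyed here, but each thread's
        // scratch lives on inside 'store' for the next iteration.
    }
    // All buffers can be walked and freed once, from one thread, at the end.
    for (scratch_store::iterator it = store.begin(); it != store.end(); ++it) {
        scalable_free(it->new_centers);
        scalable_free(it->new_centers_len);
    }
    return 0;
}
[/code]
The same store could just as well be used from the existing kmeans body; the point is only that the scratch memory outlives the reduction bodies, so the splitting constructor and destructor no longer have to malloc and free on every iteration.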
Just cache that memory in TSS/TLS. Not doing what does not have to be done is definitely worth doing.
[code=cpp]
__thread int* t_new_centers_len;   // or __declspec(thread) for MSVC
__thread float** t_new_centers;    // or __declspec(thread) for MSVC

void your_task::execute()
{
    // Allocate the per-thread buffers lazily, the first time this thread runs a task.
    if (t_new_centers_len == 0)
    {
        t_new_centers_len = (int*) scalable_malloc(max_num_clusters * sizeof(int));
        t_new_centers = (float**) scalable_malloc(max_num_clusters * sizeof(float*));
        t_new_centers[0] = (float*) scalable_malloc(max_num_clusters * max_num_attributes * sizeof(float));
    }
    int* __restrict new_centers_len = t_new_centers_len;
    float* __restrict* __restrict new_centers = t_new_centers;
    // your processing
}
[/code]
Quoting - Dmitriy Vyukov
Just cache that memory in TSS/TLS. Not doing what does not have to be done is definitely worth doing.
[code=cpp]
__thread int* t_new_centers_len;   // or __declspec(thread) for MSVC
__thread float** t_new_centers;    // or __declspec(thread) for MSVC

void your_task::execute()
{
    // Allocate the per-thread buffers lazily, the first time this thread runs a task.
    if (t_new_centers_len == 0)
    {
        t_new_centers_len = (int*) scalable_malloc(max_num_clusters * sizeof(int));
        t_new_centers = (float**) scalable_malloc(max_num_clusters * sizeof(float*));
        t_new_centers[0] = (float*) scalable_malloc(max_num_clusters * max_num_attributes * sizeof(float));
    }
    int* __restrict new_centers_len = t_new_centers_len;
    float* __restrict* __restrict new_centers = t_new_centers;
    // your processing
}
[/code]
[code=cpp]
__thread int* t_new_centers_len = 0;   // or __declspec(thread) for MSVC
__thread float** t_new_centers = 0;    // or __declspec(thread) for MSVC
[/code]
Then add a separate exit clean-up task that frees the memory; one possible shape for such a clean-up is sketched below.
Jim Dempsey
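(One possible shape for that clean-up, purely as an illustration and not from the original posts: rather than a literal clean-up task, record each per-thread block in a shared list at the moment it is first allocated, through a hypothetical tls_scalable_malloc wrapper used in place of the direct scalable_malloc calls in the example above, and release the whole list from the main thread once the outer loop has finished.)
[code=cpp]
#include <cstddef>
#include <vector>
#include "tbb/spin_mutex.h"
#include "tbb/scalable_allocator.h"

static std::vector<void*> g_tls_blocks;       // every block handed out to a thread
static tbb::spin_mutex    g_tls_blocks_lock;

// Wrapper for the lazy per-thread allocations: remembers the block so it can
// be freed later. The lock is taken only on the rare first allocation per thread.
static void* tls_scalable_malloc(std::size_t bytes) {
    void* p = scalable_malloc(bytes);
    tbb::spin_mutex::scoped_lock lock(g_tls_blocks_lock);
    g_tls_blocks.push_back(p);
    return p;
}

// Call once, after the outer loop, when no tasks are running any more.
static void free_tls_blocks() {
    for (std::size_t i = 0; i != g_tls_blocks.size(); ++i)
        scalable_free(g_tls_blocks[i]);
    g_tls_blocks.clear();
}
[/code]
The enumerable_thread_specific container mentioned earlier in the thread gives much the same effect without the hand-rolled list, since it can be iterated from one thread when it is time to free everything.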
