Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

space allocation question

norman_rubin
Beginner

I'm putting together my first TBB program, which solves a simple clustering problem.

I'm looking for some advice from more experienced TBB users on how to set this up.

My current code is of the form

while (!converged) {
    converged = parallel reduction
}

Each task instance created in the parallel reduction does some scalable_malloc and scalable_free calls.

This seems kind of unbalanced. Any suggestions?

I'd post the code, but I'm not clear on how to post a program to the forum.

RafSchietekat
Valued Contributor III
What do you mean, "unbalanced"? That the memory is freed in a different thread than where it was allocated? The scalable memory allocator has special provisions to mitigate the cost of freeing memory from a different thread than where it was allocated, so you should probably consider carefully whether this is where you would want to spend your time optimising your code (which is not to say that nothing can be done).

(Here are some ideas that just occurred to me, early on a Monday morning: the AtomicCompareExchange() in freePublicObject() should get a backoff anyway (backoffs are not just about waiting for something to happen!), maybe some boxcarring functions might improve performance (several (de)allocations could be treated together, or maybe it can be done automatically through a thread-local buffer), or how about a scalable_free overload that provides the size of the original allocation (I'm sure there must be a reason why free() doesn't have that, but it just won't come to me).)

(Corrected) See text, and it's not actually true that this is the first time I thought about this.
norman_rubin
Beginner
Sorry, I was not clear. The code I have has an outer serial loop around a parallel reduction.

Each task in the reduction does an allocation and a free, and the same task that allocates also does the free.

But each time around the loop, the tasks for this iteration allocate the same space that the tasks from the previous iteration used, so the same addresses are allocated and deallocated repeatedly.

I'm wondering whether there is a way to keep the allocations around, and whether that would be a good idea.

Typical sizes would be:
loop iterations: ~500
num_clusters: 2-10 (constant for the entire run)
num_attributes: 2-10 (constant for the entire run)
num_objects: ~1,000,000 (constant for the entire run)

The allocation part looks like

// Splitting constructor: called for each new task in the reduction.
kmeans(kmeans& x, split){
    num_clusters = x.num_clusters;
    num_attributes = x.num_attributes;
    num_objects = x.num_objects;

    new_centers_len = (int*) scalable_malloc(num_clusters * sizeof(int));
    new_centers = (float**) scalable_malloc(num_clusters * sizeof(float*));
    // One contiguous block holds all rows.
    new_centers[0] = (float*) scalable_malloc(num_clusters * num_attributes * sizeof(float));
    ...
}

kmeans(int nclusters, int nattrib, int num_obj)
    : delta(0), num_clusters(nclusters),
      num_attributes(nattrib), num_objects(num_obj){

    new_centers_len = (int*) scalable_malloc(num_clusters * sizeof(int));
    new_centers = (float**) scalable_malloc(num_clusters * sizeof(float*));
    new_centers[0] = (float*) scalable_malloc(num_clusters * num_attributes * sizeof(float));
    ...
}

// Frees in reverse order of allocation.
~kmeans() {
    scalable_free(new_centers[0]);
    scalable_free(new_centers);
    scalable_free(new_centers_len);
}
RafSchietekat
Valued Contributor III
It's not clear to me yet what this algorithm does, but if you can find a not too convoluted way to hoist that resource handling from the outer loop without somehow introducing obvious cache nonlocalities, and perhaps unless the rest of the code takes significantly more time, I would say go for it. What makes you hesitate?
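As a rough illustration of hoisting the allocation out of the serial loop, the sketch below allocates one accumulator per chunk before the loop and only zeroes it on each pass. It uses plain std::thread and std::vector in place of TBB tasks and scalable_malloc so that it stands alone. The names `Accumulator`, `accumulate` and `run_iterations`, the fixed chunking, and the "object index modulo cluster count" assignment rule are all assumptions made for the example, not part of the original program.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-chunk partial result: allocated once, reused on every pass.
struct Accumulator {
    std::vector<float> sums;    // num_clusters * num_attributes, row-major
    std::vector<int>   counts;  // num_clusters
    Accumulator(int clusters, int attributes)
        : sums(static_cast<std::size_t>(clusters) * attributes, 0.0f),
          counts(static_cast<std::size_t>(clusters), 0) {}
    void reset() {  // zero the contents without releasing memory
        std::fill(sums.begin(), sums.end(), 0.0f);
        std::fill(counts.begin(), counts.end(), 0);
    }
};

// Placeholder assignment rule: object i goes to cluster i % clusters.
// A real k-means would pick the nearest centre instead.
void accumulate(Accumulator& acc, const std::vector<float>& data,
                int clusters, int attributes,
                std::size_t begin, std::size_t end) {
    for (std::size_t obj = begin; obj < end; ++obj) {
        const std::size_t c = obj % static_cast<std::size_t>(clusters);
        ++acc.counts[c];
        for (int a = 0; a < attributes; ++a)
            acc.sums[c * attributes + a] += data[obj * attributes + a];
    }
}

// Runs `iterations` passes and returns the merged result of the last pass.
Accumulator run_iterations(const std::vector<float>& data, int clusters,
                           int attributes, int num_chunks, int iterations) {
    assert(clusters > 0 && attributes > 0 && num_chunks > 0);
    const std::size_t num_objects = data.size() / attributes;

    // Hoisted out of the loop: these are the only accumulator allocations.
    std::vector<Accumulator> partial(num_chunks, Accumulator(clusters, attributes));
    Accumulator total(clusters, attributes);

    for (int it = 0; it < iterations; ++it) {
        // Threads are created per pass only to keep the sketch short;
        // a task scheduler would reuse its worker threads.
        std::vector<std::thread> workers;
        for (int k = 0; k < num_chunks; ++k) {
            const std::size_t begin = num_objects * k / num_chunks;
            const std::size_t end   = num_objects * (k + 1) / num_chunks;
            workers.emplace_back([&, k, begin, end] {
                partial[k].reset();
                accumulate(partial[k], data, clusters, attributes, begin, end);
            });
        }
        for (std::thread& w : workers) w.join();

        // Join step of the reduction: merge the per-chunk partial results.
        total.reset();
        for (const Accumulator& p : partial) {
            for (std::size_t i = 0; i < total.sums.size(); ++i) total.sums[i] += p.sums[i];
            for (std::size_t i = 0; i < total.counts.size(); ++i) total.counts[i] += p.counts[i];
        }
    }
    return total;
}
```

With buffers this small (a few hundred bytes per task in the sizes quoted above), zeroing is cheap, so whether the hoisting pays off depends mostly on how the allocation cost compares with the work done per pass.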
Alexey-Kukanov
Employee
Take a look at tbb::enumerable_thread_specific in the latest open-source developer updates; see include/tbb/enumerable_thread_specific.h there. The functionality was designed to help in situations like that. Basically, it is an advanced wrapper over the platform-specific API for thread-local storage. It is still new and not fully settled, though changes are rather unlikely. There is also no documentation yet, so please ask here for further help if necessary.
Dmitry_Vyukov
Valued Contributor I
Just cache that memory in TSS/TLS. Not doing what does not have to be done is definitely worth doing.

[code=cpp]
__thread int* t_new_centers_len; // or __declspec(thread) for MSVC
__thread float** t_new_centers; // or __declspec(thread) for MSVC

void your_task::execute()
{
    // First use on this thread: allocate buffers sized for the largest case.
    if (t_new_centers_len == 0)
    {
        t_new_centers_len = (int*) scalable_malloc(max_num_clusters * sizeof(int));
        t_new_centers = (float**) scalable_malloc(max_num_clusters * sizeof(float*));
        t_new_centers[0] = (float*) scalable_malloc(max_num_clusters * max_num_attributes * sizeof(float));
    }
    // Later calls on the same thread reuse the cached buffers.
    int* __restrict new_centers_len = t_new_centers_len;
    float* __restrict* __restrict new_centers = t_new_centers;

    // your processing
}
[/code]
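The same caching idea can be sketched in portable C++11 with the `thread_local` keyword instead of the compiler-specific `__thread`. This is only an illustration under stated assumptions: `Scratch` and `local_scratch` are hypothetical names, std::vector stands in for the scalable_malloc calls, and the centres are stored as one flat array rather than through row pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread scratch space; each thread gets its own instance.
struct Scratch {
    std::vector<int>   centers_len;  // num_clusters entries
    std::vector<float> centers;      // num_clusters * num_attributes, row-major
};

// Returns the calling thread's scratch buffers, sized and zeroed for this run.
// The first call on a thread allocates; later calls with the same or smaller
// sizes refill the existing storage (std::vector implementations keep their
// capacity on assign in practice, though the standard does not spell this out).
Scratch& local_scratch(std::size_t num_clusters, std::size_t num_attributes) {
    assert(num_clusters > 0 && num_attributes > 0);
    thread_local Scratch scratch;  // constructed on first use per thread
    scratch.centers_len.assign(num_clusters, 0);
    scratch.centers.assign(num_clusters * num_attributes, 0.0f);
    return scratch;
}
```

A task body would call `local_scratch(num_clusters, num_attributes)` at the start of its work and accumulate into the returned buffers. Because the buffers are zeroed on every call, nothing carries over between tasks that happen to run on the same thread.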


jimdempseyatthecove
Honored Contributor III
Quoting - Dmitriy Vyukov
Just cache that memory in TSS/TLS. Not doing what does not have to be done is definitely worth doing.

[code=cpp]
__thread int* t_new_centers_len = 0; // or __declspec(thread) for MSVC
__thread float** t_new_centers = 0; // or __declspec(thread) for MSVC
[/code]

Then add a separate exit clean-up task that frees the memory.

Jim Dempsey
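One possible alternative to a hand-written clean-up step, sketched here in standard C++11, is to let the thread-local object own the memory so that it is released when the thread exits. This is a minimal illustration, not the approach described in the thread: `FreeDeleter`, `thread_buffer`, `frees_after_worker_exit` and the counter `g_frees` are hypothetical, and std::malloc/std::free stand in for scalable_malloc/scalable_free.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <thread>

std::atomic<int> g_frees{0};  // counts releases; for illustration only

// Deleter standing in for scalable_free.
struct FreeDeleter {
    void operator()(void* p) const {
        std::free(p);
        ++g_frees;
    }
};

// Returns the calling thread's cached buffer, allocating it on first use.
// The size is fixed by the first call on each thread; later calls ignore `count`.
int* thread_buffer(std::size_t count) {
    thread_local std::unique_ptr<int[], FreeDeleter> buffer;  // starts out null
    if (!buffer) {
        buffer.reset(static_cast<int*>(std::malloc(count * sizeof(int))));
        assert(buffer && "allocation failed");
    }
    return buffer.get();
}

// Uses the buffer on a short-lived worker thread and reports how many
// buffers had been freed by the time the worker exited.
int frees_after_worker_exit() {
    const int before = g_frees.load();
    std::thread worker([] {
        int* a = thread_buffer(8);
        int* b = thread_buffer(8);  // same block, no second allocation
        assert(a == b);
        a[0] = 1;
        (void)b;
    });
    worker.join();  // thread-local destructors have run by the time join returns
    return g_frees.load() - before;
}
```

The memory is only released when the owning thread ends. Scheduler worker threads typically live until the scheduler shuts down, so an explicit clean-up step may still be preferable if the memory has to be returned earlier.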