I have some code that I'm working with in TBB, and for some reason when using VTune's threading analysis, I see that it's not using the number of threads I'm initializing it with. Here's the gist of what I'm doing:
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <tbb/tbb.h>

#define NUM_VERTS 1189576
#define NUM_LOOPS 7137492
#define LOOP_TABLE_SIZE 1000

// this typedef's been
// copied from Blender
typedef struct MLoop {
    /* Vertex index. */
    unsigned int v;
    ...
} MLoop;

int main()
{
    srand(time(NULL)); // Initialization, should only be called once.
    tbb::task_arena this_arena;
    this_arena.initialize(64);

    // let's set up our data
    float(*vnors)[3] = (float(*)[3]) malloc(NUM_VERTS * sizeof(*vnors));
    float(*lnors_weighted)[3] = (float(*)[3]) malloc(NUM_LOOPS * sizeof(*lnors_weighted));

    // fill with random values, though
    // at some point we may want to
    // integrate OpenGL and some actual
    // mesh data
    for (int i = 0; i < NUM_VERTS; i++)
    {
        // set all three components, not just the first one
        for (int c = 0; c < 3; c++) vnors[i][c] = (float) rand();
    }
    for (int i = 0; i < NUM_LOOPS; i++)
    {
        for (int c = 0; c < 3; c++) lnors_weighted[i][c] = (float) rand();
    }

    // build the loop table, this just points to entries in
    // the vnors table
    MLoop* mloop = (MLoop*) malloc(NUM_LOOPS * sizeof(MLoop));
    for (int i = 0; i < NUM_LOOPS; i++)
    {
        mloop[i].v = (rand() % (NUM_VERTS));
    }

    int** vert_loop_lookup = NULL;
    if (vert_loop_lookup == NULL) {
        // let's make a lookup table with more contiguous memory access
        vert_loop_lookup = (int**) malloc(NUM_VERTS * sizeof(int*));
        for (int i = 0; i < NUM_VERTS; i++) {
            // making an assumption here that a vert is part of fewer than
            // LOOP_TABLE_SIZE loops (so a -1 sentinel always remains);
            // the real number I suspect will be lower
            vert_loop_lookup[i] = (int*) malloc(LOOP_TABLE_SIZE * sizeof(int));
            memset(vert_loop_lookup[i], -1, LOOP_TABLE_SIZE * sizeof(int));
        }

        // this is just to track the maximum index for the
        // loop entry for a given vertex, that way we avoid
        // a second loop
        int* index_counter = (int*) malloc(NUM_VERTS * sizeof(int));
        memset(index_counter, 0, NUM_VERTS * sizeof(index_counter[0]));

        // fill up our new table
        for (int lidx = 0; lidx < NUM_LOOPS; lidx++) {
            // get the vert index
            unsigned int vert_index = mloop[lidx].v;
            int curr_loop_table_value = index_counter[vert_index];
            vert_loop_lookup[vert_index][curr_loop_table_value] = lidx;
            index_counter[vert_index]++;
        }
        free(index_counter);
    }

    this_arena.execute([&] {
        // TBB here
        tbb::parallel_for(tbb::blocked_range<int>(0, NUM_VERTS, 1 /* Grain Size */),
            [&](const tbb::blocked_range<int>& r)
            {
                for (int i = r.begin(); i < r.end(); ++i)
                {
                    // loop through the ... loops
                    // of these verts and do some
                    // accumulation
                    int* loop_table = vert_loop_lookup[i];
                    for (int curr_index = 0; loop_table[curr_index] != -1; curr_index++) {
                        int lidx = loop_table[curr_index];
                        add_v3_v3(vnors[mloop[lidx].v], lnors_weighted[lidx]);
                    }
                }
            });
    });
}
VTune's Threading Analysis shows that the application used only 23 threads. There's more than enough work to justify more threads, so what's going on here?
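The lookup table in the code above reserves a fixed LOOP_TABLE_SIZE slots per vertex and relies on a -1 sentinel. As a possible alternative, not the poster's method, here is a minimal sketch of a compact offsets-plus-indices layout that needs no per-vertex size assumption. The names VertLoopLookup and build_vert_loop_lookup are hypothetical, and loop_verts[l] stands in for mloop[l].v.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compact vertex -> loop lookup (hypothetical type, for illustration only).
// The loops that use vertex v are loops[offsets[v]] .. loops[offsets[v + 1] - 1].
struct VertLoopLookup {
    std::vector<std::size_t> offsets;  // num_verts + 1 entries
    std::vector<int> loops;            // one entry per loop
};

// loop_verts[l] is the vertex index of loop l (mloop[l].v in the post).
VertLoopLookup build_vert_loop_lookup(const std::vector<unsigned int>& loop_verts,
                                      std::size_t num_verts) {
    VertLoopLookup lookup;
    lookup.offsets.assign(num_verts + 1, 0);

    // Pass 1: count the loops of each vertex, stored one slot to the right.
    for (unsigned int v : loop_verts) {
        assert(v < num_verts);
        ++lookup.offsets[v + 1];
    }

    // Prefix sum turns the counts into start offsets.
    for (std::size_t v = 0; v < num_verts; ++v) {
        lookup.offsets[v + 1] += lookup.offsets[v];
    }

    // Pass 2: scatter each loop index into its vertex's range.
    lookup.loops.resize(loop_verts.size());
    std::vector<std::size_t> cursor(lookup.offsets.begin(), lookup.offsets.end() - 1);
    for (std::size_t l = 0; l < loop_verts.size(); ++l) {
        lookup.loops[cursor[loop_verts[l]]++] = static_cast<int>(l);
    }
    return lookup;
}
```

With this layout the inner loop would run from offsets[i] to offsets[i + 1] instead of scanning for -1, and the whole table lives in two contiguous allocations.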
Hi,
Thank you for posting in the Intel Forums. We tried to run the code you sent, but we need some additional information from you. Could you please share the details of your use case and the project directory, along with all the steps needed to reproduce the issue? Please also let us know which VTune version you are using.
Regards,
Alekhya
Hi there, thank you for your response. In Visual Studio 2019 (Community Edition), you can create a console app and paste this code into it to run it (you'll also need the latest version of TBB, which you can get from GitHub). The use case here is work in computer graphics.
Hi,
We were able to reproduce your issue and are working on it internally. We will get back to you soon with an update.
Regards,
Alekhya
Hello,
You defined the number of slots in the task arena that you created. One can create many task arenas, each with a different number of slots. The number of slots is local to each arena and is conceptually different from the global number of threads. By specifying the number of slots for a given arena, you limit the concurrency of that arena only.
If you'd like to better understand how arena slots work with threads, please refer to one of the chapters in the (free) Pro TBB book, e.g. https://link.springer.com/chapter/10.1007/978-1-4842-4398-5_11
Note, however, that this book uses a deprecated API for controlling the number of threads. Currently, you can control the number of threads with tbb::global_control, e.g.:
int nth = 24; // number of threads
auto mp = tbb::global_control::max_allowed_parallelism;
tbb::global_control gc(mp, nth + 1); // One more thread, but sleeping
Here is an example of using tbb::global_control in oneTBB sample:
If you do not specify a number of threads (as is the case in your code), TBB creates one worker thread fewer than the number of logical cores on the node, leaving one core available to execute the main application thread. Since you did not specify a number of threads, the TBB runtime likely created 23 worker threads; I guess the number of logical cores on your platform is 24. That is what you see in VTune.
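The arithmetic in this reply can be written down as a small sketch. It is a simplified model, not TBB code: the function names are hypothetical, and it ignores other arenas competing for the same workers. It only captures the two statements above, namely that the default pool has one worker fewer than the logical core count, and that an arena's slot count caps concurrency inside that arena but cannot add threads to the pool.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Default number of TBB worker threads as described in the reply:
// one fewer than the number of logical cores (the main thread takes the rest).
unsigned default_worker_count(unsigned logical_cores) {
    return logical_cores > 1 ? logical_cores - 1 : 0;
}

// Simplified model (an assumption, not TBB's scheduler): the threads that can
// work inside one arena are limited both by the arena's slots and by the
// global pool of workers plus the thread that calls execute().
unsigned effective_arena_concurrency(unsigned arena_slots, unsigned worker_threads) {
    assert(arena_slots > 0);
    return std::min(arena_slots, worker_threads + 1);
}

// Logical core count as seen by the C++ standard library; may return 0 when
// the value cannot be determined.
unsigned detected_logical_cores() {
    return std::thread::hardware_concurrency();
}
```

Under this model, a 64-slot arena on a 24-core machine would still see at most 24 threads (23 workers plus the caller), while a 4-slot arena on the same machine would see at most 4.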
Hi, thanks for your response, but your assumption is incorrect: the number of logical processors on my system is 112 (dual Xeon).
Hi,
@Ravi__Jagannadhan what version of TBB are you using?
Could you please check the output of this code?
tbb::enumerable_thread_specific<int> ets(0);
this_arena.execute([&] {
    // TBB here
    tbb::parallel_for(tbb::blocked_range<int>(0, NUM_VERTS, 1 /* Grain Size */),
        [&](const tbb::blocked_range<int>& r)
        {
            // mark the current thread as having taken part
            ets.local() = 1;
            for (int i = r.begin(); i < r.end(); ++i)
            {
                // loop through the ... loops
                // of these verts and do some
                // accumulation
                int* loop_table = vert_loop_lookup[i];
                for (int curr_index = 0; loop_table[curr_index] != -1; curr_index++) {
                    int lidx = loop_table[curr_index];
                    add_v3_v3(vnors[mloop[lidx].v], lnors_weighted[lidx]);
                }
            }
        });
});
// sum the per-thread flags to get the number of participating threads
std::cout << "Number of threads: " << ets.combine([] (int r, int l) { return r + l; }) << std::endl;
If the result is still 23, please add a short sleep in the loop; there may not be enough work for the TBB workers to join.
tbb::enumerable_thread_specific<int> ets(0);
this_arena.execute([&] {
    // TBB here
    tbb::parallel_for(tbb::blocked_range<int>(0, NUM_VERTS, 1 /* Grain Size */),
        [&](const tbb::blocked_range<int>& r)
        {
            // mark the current thread as having taken part
            ets.local() = 1;
            // needs <thread> and <chrono>
            std::this_thread::sleep_for(std::chrono::nanoseconds(10));
            for (int i = r.begin(); i < r.end(); ++i)
            {
                // loop through the ... loops
                // of these verts and do some
                // accumulation
                int* loop_table = vert_loop_lookup[i];
                for (int curr_index = 0; loop_table[curr_index] != -1; curr_index++) {
                    int lidx = loop_table[curr_index];
                    add_v3_v3(vnors[mloop[lidx].v], lnors_weighted[lidx]);
                }
            }
        });
});
// sum the per-thread flags to get the number of participating threads
std::cout << "Number of threads: " << ets.combine([] (int r, int l) { return r + l; }) << std::endl;
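The counting idea in the snippets above (each thread sets a per-thread flag and the flags are summed afterwards) can be tried without TBB. The following is a minimal sketch using only std::thread; the function name count_participating_threads is hypothetical, and the shared atomic counter stands in for the work distribution that tbb::parallel_for does.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Starts num_threads threads that pull num_items trivial work items from a
// shared counter, and returns how many threads processed at least one item.
std::size_t count_participating_threads(std::size_t num_threads, std::size_t num_items) {
    assert(num_threads > 0);
    std::atomic<std::size_t> next_item{0};
    std::atomic<std::size_t> participants{0};

    auto worker = [&] {
        bool counted = false;  // per-thread flag, same role as ets.local() = 1
        while (next_item.fetch_add(1) < num_items) {
            if (!counted) {
                counted = true;
                participants.fetch_add(1);
            }
            // real work for this item would go here
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker);
    }
    for (auto& th : threads) {
        th.join();
    }
    return participants.load();
}
```

The result can be lower than num_threads: a thread that starts after the others have drained the counter never gets an item. That is the same effect the reply suspects when it asks for a short sleep in the loop body.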
Due to inactivity, Intel will no longer provide support on this issue. Community support may, of course, continue.