Task Arena being initialized with 64 but only using 23?

Ravi__Jagannadhan · ‎06-09-2021

I have some code that I'm working with in TBB, and for some reason when using VTune's threading analysis, I see that it's not using the number of threads I'm initializing it with. Here's the gist of what I'm doing:

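#include <cstdlib>   // malloc, free, rand, srand
#include <cstring>   // memset
#include <ctime>     // time
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

#define NUM_VERTS 1189576
#define NUM_LOOPS 7137492
#define LOOP_TABLE_SIZE 1000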

// this typedef's been
// copied from Blender
typedef struct MLoop {
    /* Vertex index. */
    unsigned int v;
	...
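} MLoop;

// stand-in for Blender's add_v3_v3: r += a, component-wise
static void add_v3_v3(float r[3], const float a[3])
{
    r[0] += a[0];
    r[1] += a[1];
    r[2] += a[2];
}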

int main()
{
    srand(time(NULL));   // Initialization, should only be called once.
    
    tbb::task_arena this_arena;
    this_arena.initialize(64);

    // let's set up our data
    float(*vnors)[3] = (float(*)[3]) malloc(NUM_VERTS * sizeof(*vnors));
    float(*lnors_weighted)[3] = (float(*)[3]) malloc(NUM_LOOPS * sizeof(*lnors_weighted));

    // fill with random values, though
    // at some point we may want to
    // integrate OpenGL and some actual
    // mesh data
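    for (int i = 0; i < NUM_VERTS; i++)
    {
        // fill all three components, not just the first one
        for (int c = 0; c < 3; c++) vnors[i][c] = (float)rand();
    }

    for (int i = 0; i < NUM_LOOPS; i++)
    {
        for (int c = 0; c < 3; c++) lnors_weighted[i][c] = (float)rand();
    }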

    // build the loop table, this just points to entries in 
    // the vnors table
    MLoop* mloop = (MLoop*)malloc(NUM_LOOPS * sizeof(MLoop));
    for (int i = 0; i < NUM_LOOPS; i++)
    {
        mloop[i].v = (rand() % (NUM_VERTS));
    }

    int** vert_loop_lookup = NULL;
    if (vert_loop_lookup == NULL) {
        // let's make a lookup table with more contiguous memory access
        vert_loop_lookup = (int**) malloc(NUM_VERTS * sizeof(int*));
        for (int i = 0; i < NUM_VERTS; i++) {
            // making an assumption here that a vert can be a part of up to
            // LOOP_TABLE_SIZE loops; the real number I suspect will be lower
            // (each entry is an int, so size by the element, not by the int* row)
            vert_loop_lookup[i] = (int*) malloc(LOOP_TABLE_SIZE * sizeof(vert_loop_lookup[0][0]));
            memset(vert_loop_lookup[i], -1, LOOP_TABLE_SIZE * sizeof(vert_loop_lookup[0][0]));
        }

        // this is just to track the maximum index for the
        // loop entry for a given vertex, that way we avoid
        // a second loop
        int* index_counter = (int*) malloc(NUM_VERTS * sizeof(int));
        memset(index_counter, 0, NUM_VERTS * sizeof(index_counter[0]));

        // fill up our new table
        for (int lidx = 0; lidx < NUM_LOOPS; lidx++) {
            // get the vert index
            unsigned int vert_index = mloop[lidx].v;
            int curr_loop_table_value = index_counter[vert_index];
            vert_loop_lookup[vert_index][curr_loop_table_value] = lidx;
            index_counter[vert_index]++;
        }

        free(index_counter);
    }

    this_arena.execute([&] {
        // TBB here
        tbb::parallel_for(tbb::blocked_range<int>(0, NUM_VERTS, 1 /* Grain Size */),
            [&](tbb::blocked_range<int> r)
            {
                for (int i = r.begin(); i < r.end(); ++i)
                {
                    // loop through the ... loops
                    // of these verts and do some
                    // accumulation
                    int* loop_table = vert_loop_lookup[i];
                    int curr_index = 0;
                    for (curr_index = 0; loop_table[curr_index] != -1; curr_index++) {
                        int lidx = loop_table[curr_index];
                        add_v3_v3(vnors[mloop[lidx].v], lnors_weighted[lidx]);
                    }
                }
            });
    });
}

When using VTune's Threading Analysis feature, I see that only 23 threads were used by the application. There's more than enough work to justify more threads, so what's going on here?

AlekhyaV_Intel · ‎06-10-2021

Hi,

Thank you for posting in the Intel forums. We tried to run the code you sent, but we need some additional information. Could you please provide the details of your use case, the project directory, and all the steps to reproduce? Please also let us know your VTune version.

Regards,

Alekhya

Ravi__Jagannadhan · ‎06-10-2021

Hi there, thank you for your response. In Visual Studio 2019 (Community Edition), you can create a console app and paste this code into it to run it (you'll also need the latest version of TBB, which you can get from GitHub). The use case here is work in computer graphics.

AlekhyaV_Intel · ‎06-17-2021

Hi,

We could reproduce your issue and we are working on this internally. We will get back to you soon with an update.

Regards,

Alekhya

Mark_L_Intel · ‎06-30-2021

Hello,

You defined the number of slots in the task arena you created. One can create many task arenas, each with a different number of slots. The number of slots in an arena is local to that arena and is conceptually different from the global number of threads. By specifying the number of slots for a given arena, you limit the concurrency of that arena only.

If you'd like to better understand how arena slots work with threads, please refer to one of the chapters in the (free) Pro TBB book, e.g. https://link.springer.com/chapter/10.1007/978-1-4842-4398-5_11

Note, however, that this book uses a deprecated API for controlling the number of threads. Currently, you can control the number of threads with tbb::global_control, e.g.:

int nth = 24; // number of threads

auto mp = tbb::global_control::max_allowed_parallelism;

tbb::global_control gc(mp, nth + 1); // One more thread, but sleeping

Here is an example of using tbb::global_control in a oneTBB sample:

https://github.com/oneapi-src/oneAPI-samples/blob/0494db82e947d7bd6bd681057cd6b33ebd842999/Libraries/oneTBB/tbb-async-sycl/src/tbb-async-sycl.cpp#L102

If you do not specify the number of threads (as is the case in your code), TBB creates one fewer worker thread than the number of logical cores on the node, leaving one core available to execute the main application thread. Since you did not specify a number of threads, the TBB runtime likely created 23 worker threads; I guess the number of logical cores on your platform is 24. That is what you see in VTune.

Ravi__Jagannadhan · ‎07-04-2021

Hi, thanks for your response, but your assumption is incorrect: the number of logical processors on my system is 112 (dual Xeon).

Pavel_K_Intel1 · ‎07-06-2021

Hi,

@Ravi__Jagannadhan what version of TBB are you using?

Ravi__Jagannadhan · ‎07-12-2021

The latest one from GitHub.

Pavel_K_Intel1 · ‎07-13-2021

Could you please check the output of this code?

tbb::enumerable_thread_specific<int> ets(0);
    this_arena.execute([&] {
        // TBB here
        tbb::parallel_for(tbb::blocked_range<int>(0, NUM_VERTS, 1 /* Grain Size */),
            [&](tbb::blocked_range<int> r)
            {
                ets.local() = 1;
                for (int i = r.begin(); i < r.end(); ++i)
                {
                    // loop through the ... loops
                    // of these verts and do some
                    // accumulation
                    int* loop_table = vert_loop_lookup[i];
                    int curr_index = 0;
                    for (curr_index = 0; loop_table[curr_index] != -1; curr_index++) {
                        int lidx = loop_table[curr_index];
                        add_v3_v3(vnors[mloop[lidx].v], lnors_weighted[lidx]);
                    }
                }
            });
    });
    std::cout << "Number of threads: " << ets.combine([] (int r, int l) { return r + l; }) << std::endl;

If the result is still 23, please add a short sleep in the loop; maybe there is not enough work for the TBB workers to join.

tbb::enumerable_thread_specific<int> ets(0);
    this_arena.execute([&] {
        // TBB here
        tbb::parallel_for(tbb::blocked_range<int>(0, NUM_VERTS, 1 /* Grain Size */),
            [&](tbb::blocked_range<int> r)
            {
                ets.local() = 1;
                std::this_thread::sleep_for(std::chrono::nanoseconds(10));
                for (int i = r.begin(); i < r.end(); ++i)
                {
                    // loop through the ... loops
                    // of these verts and do some
                    // accumulation
                    int* loop_table = vert_loop_lookup[i];
                    int curr_index = 0;
                    for (curr_index = 0; loop_table[curr_index] != -1; curr_index++) {
                        int lidx = loop_table[curr_index];
                        add_v3_v3(vnors[mloop[lidx].v], lnors_weighted[lidx]);
                    }
                }
            });
    });
    std::cout << "Number of threads: " << ets.combine([] (int r, int l) { return r + l; }) << std::endl;

Mark_L_Intel · ‎07-30-2021

Due to inactivity, Intel will no longer provide support on this issue. Of course, community support may continue.