Intel® oneAPI Threading Building Blocks

What is the current state-of-art solution to NUMA effects with TBB?

IvanKabadzhov
Novice

Hello,

I am failing to create a TBB benchmark which suffers from NUMA effects. On the other hand, I succeeded easily with MPI.

 

My questions are:
1. What is a scenario in which TBB suffers from NUMA effects? I read that TBB does not restrict threads to cores: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Guiding_Task_Scheduler_Execution.html . When should this pinning to a list of arenas be applied?
2. Is the work isolation (https://oneapi-src.github.io/oneTBB/main/tbb_userguide/work_isolation.html) helping in the context of such domains? 

NoorjahanSk_Intel
Moderator

Hi,

 

Thanks for reaching out to us.

 

The current approach to NUMA is to rely on task arenas – e.g., in the case of 2 sockets, we create 2 task arenas, one for each socket.

The NUMA approach is a performance optimization; depending on the application, there is also potential for performance degradation.

 

Please refer to the Pro TBB textbook, Chapter 20, “TBB on NUMA Architectures”.

Also, refer to the link below for more examples:

https://github.com/Apress/pro-TBB/tree/master/ch20

Refer to the below link about TBB and NUMA on YouTube: 

https://www.youtube.com/watch?v=2t79ckf1vZY

 

By default, the TBB runtime maps threads to cores based on hardware availability. If we want to bind threads to particular cores, we can pin oneTBB worker threads to hardware threads.

We can use a PinningObserver constructed via task_scheduler_observer{arena}, i.e., an observer that pins oneTBB worker threads to hardware threads.

Please refer to the below link for more details:

https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/task_arena/task_scheduler_observer_cls.html

 

You can use the hwloc library to allocate memory and pin the threads; the TBB task_arena and task_scheduler_observer classes are instrumental in identifying the threads entering a particular NUMA node.
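A minimal sketch of such an observer, along the lines of the chapter 20 PinningObserver (this is only an illustration, not the book's exact class; error handling is omitted and the topology handle and NUMA index are assumed to be set up by the caller):

```cpp
#include <hwloc.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_scheduler_observer.h>

// Sketch only: pins every thread that joins the observed arena to the
// cpuset of one NUMA node.
class PinningObserver : public tbb::task_scheduler_observer {
   hwloc_topology_t topo;
   hwloc_obj_t numa_node;
public:
   PinningObserver(tbb::task_arena &arena, hwloc_topology_t t, int numa_index)
      : tbb::task_scheduler_observer{arena}, topo{t},
        numa_node{hwloc_get_obj_by_type(t, HWLOC_OBJ_NUMANODE, numa_index)} {
      observe(true);  // start receiving scheduler entry/exit callbacks
   }
   void on_scheduler_entry(bool /*is_worker*/) override {
      // Bind the calling thread to the CPUs local to the chosen NUMA node.
      hwloc_set_cpubind(topo, numa_node->cpuset, HWLOC_CPUBIND_THREAD);
   }
};
```

The topology handle is created once with hwloc_topology_init()/hwloc_topology_load() before the observer is constructed, and the program is linked against hwloc (-lhwloc).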

 

Thanks & Regards,

Noorjahan.

 

IvanKabadzhov
Novice

Thanks a lot Noorjahan! It is super helpful to know that there is a default pinning.

Nevertheless, I was trying to follow Chapter 20, “TBB on NUMA Architectures”.

I do not have sudo access on the machines with NUMA domains, hence I am limited in what I can profile.

What I did was use `taskset` to restrict execution to specific cores and run https://github.com/Apress/pro-TBB/blob/master/ch20/fig_20_05.cpp on 4 cores (with vectors of size 1000000000), either with all 4 cores in 1 NUMA domain or split 2+2 across the 2 domains.

I should have minimal noise in my system. Unfortunately, I did not see any NUMA effects.

I gave a little bit more concrete details here: https://github.com/oneapi-src/oneTBB/issues/865.

Mark_L_Intel
Moderator

The first example does not use the TBB approach to NUMA based on TBB arenas; it is just a straight parallel_for. Since the calculation is rather small, I used 4 threads.

 

only 1st socket  -- all 4 threads on 1st socket

taskset -c 0,1,2,3 ./fig_20_05

Time: 0.134963 seconds; Bandwidth: 17782.6MB/s

 

only 2nd socket -- all 4 threads on 2nd socket

taskset -c 36,37,38,39 ./fig_20_05

Time: 0.132922 seconds; Bandwidth: 18055.7MB/s

 

both sockets (2 threads on 1 socket, 2 threads on 2nd socket)

taskset -c 0,36,2,37 ./fig_20_05

Time: 0.184636 seconds; Bandwidth: 12998.6MB/s

 

So the last one is slower, and that was the motivation for developing the new TBB approach based on arenas described in Chapter 20.
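For readers without the book at hand, the kernel timed by fig_20_05 is essentially a bandwidth-bound parallel_for over large arrays. The sketch below only illustrates that kind of benchmark; the array sizes and the exact arithmetic are assumptions, not the book's source:

```cpp
#include <chrono>
#include <iostream>
#include <vector>
#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>

int main() {
   // Illustrative sizes only: three arrays large enough to spill out of cache.
   const std::size_t n = 125'000'000;   // ~1 GB per double array
   std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

   auto t0 = std::chrono::steady_clock::now();
   tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
                     [&](const tbb::blocked_range<std::size_t> &r) {
                        for (std::size_t i = r.begin(); i != r.end(); ++i)
                           c[i] = a[i] + 0.5 * b[i];   // triad-style, memory bound
                     });
   auto t1 = std::chrono::steady_clock::now();

   double seconds = std::chrono::duration<double>(t1 - t0).count();
   double mbytes = 3.0 * n * sizeof(double) / 1e6;  // read a and b, write c
   std::cout << "Time: " << seconds << " seconds; Bandwidth: "
             << mbytes / seconds << "MB/s\n";
}
```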

By the way, I used the standard method of specifying the number of threads in oneTBB, based on global_control (the book examples on GitHub still rely on the "old" TBB, unfortunately). For reference, please see the PDF document about replacing old TBB APIs with new oneTBB APIs at https://www.intel.com/content/www/us/en/developer/articles/technical/tbb-revamp.html.

 

#include <oneapi/tbb/global_control.h>

int nth = 4;  // number of worker threads
auto mp = tbb::global_control::max_allowed_parallelism;
tbb::global_control gc(mp, nth + 1);  // +1 to account for the main thread

 

The second example uses the hwloc library and TBB arenas to illustrate the basic approach to getting better performance from TBB on NUMA systems. First, as explained in the book, you need to install the freely available hwloc library. You don't need any sudo privileges; you can install hwloc in your own directory.

Unfortunately, as you can see in the 2nd and 3rd examples, the methodology with arenas and hwloc requires some work from the user (more than I'd like, for my taste) -- but, as explained in the book, it gives you a more general solution. After you install hwloc, please create a script similar to the one below and source it.

export HWLOC_PREFIX=/scratch/users/mlubin/tbb/hwloc-2.0.4-build
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HWLOC_PREFIX}/lib

 

After compiling the 2nd example, it should give you output similar to the one below, e.g., in the case of 8 NUMA nodes:

bash-4.4$ ./fig_20_06
There are 8 NUMA node(s)
NUMA node 0 has cpu bitmask: 0x3fff0000,,,0x00003fff
Allocate data on node 0 with node bitmask 0x00000001
NUMA node 1 has cpu bitmask: 0x00000fff,0xc0000000,,,0x0fffc000
Allocate data on node 1 with node bitmask 0x00000002
NUMA node 2 has cpu bitmask: 0x03fff000,,,0x000003ff,0xf0000000
Allocate data on node 2 with node bitmask 0x00000004
NUMA node 3 has cpu bitmask: 0x000000ff,0xfc000000,,,0x00fffc00,0x0
Allocate data on node 3 with node bitmask 0x00000008
NUMA node 4 has cpu bitmask: 0x003fff00,,,0x0000003f,0xff000000,0x0
Allocate data on node 4 with node bitmask 0x00000010
NUMA node 5 has cpu bitmask: 0x0000000f,0xffc00000,,,0x000fffc0,,0x0
Allocate data on node 5 with node bitmask 0x00000020
NUMA node 6 has cpu bitmask: 0x0003fff0,,,0x00000003,0xfff00000,,0x0
Allocate data on node 6 with node bitmask 0x00000040
NUMA node 7 has cpu bitmask: 0xfffc0000,,,0x0000fffc,,,0x0
Allocate data on node 7 with node bitmask 0x00000080
I'm masterThread: 0 out of 8
Before: Thread: 0 with tid 139801072736000 on core 210
After: Thread: 0 with tid 139801072736000 on core 0
I'm masterThread: 1 out of 8
Before: Thread: 1 with tid 139801064343296 on core 211
After: Thread: 1 with tid 139801064343296 on core 14
I'm masterThread: 3 out of 8
Before: Thread: 3 with tid 139801047557888 on core 214
After: Thread: 3 with tid 139801047557888 on core 42
I'm masterThread: 2 out of 8
Before: Thread: 2 with tid 139801055950592 on core 212
After: Thread: 2 with tid 139801055950592 on core 28
I'm masterThread: 5 out of 8
Before: Thread: 5 with tid 139801030772480 on core 216
After: Thread: 5 with tid 139801030772480 on core 70
I'm masterThread: 4 out of 8
Before: Thread: 4 with tid 139801039165184 on core 215
After: Thread: 4 with tid 139801039165184 on core 56
I'm masterThread: 7 out of 8
Before: Thread: 7 with tid 139801013987072 on core 218
After: Thread: 7 with tid 139801013987072 on core 218
I'm masterThread: 6 out of 8
Before: Thread: 6 with tid 139801022379776 on core 217
After: Thread: 6 with tid 139801022379776 on core 84

 

You can continue to experiment using the last example, fig_20_10 (which is a modified version of the second example, fig_20_06). Please follow the instructions in the book.

I have started modifying these 3 examples so they can be compiled with the latest oneTBB; for starters, I have attached the modified examples.

Please let me know if you have any questions.   

 

IvanKabadzhov
Novice

Thanks a bunch, Mark. I am now taking a closer look. I previously used TBB as a black box, but thanks to you and Noorjahan I am in the process of opening the box.

I am always compiling my executables with `-O2`, and I have built TBB from source following https://github.com/oneapi-src/oneTBB/blob/master/INSTALL.md#single-configuration-generators.

I could not see any clear NUMA effects with the original vector sizes:

Machine 1:
```
$ lscpu | grep NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
```
I first did experiments on 2 cores only:

* 1st numa domain only:
```
Time: 0.181537 seconds; Bandwidth: 13220.5MB/s
Time: 0.181956 seconds; Bandwidth: 13190MB/s
Time: 0.182535 seconds; Bandwidth: 13148.1MB/s
```

* 2nd numa domain only:
```
Time: 0.17827 seconds; Bandwidth: 13462.7MB/s
Time: 0.177679 seconds; Bandwidth: 13507.5MB/s
Time: 0.17809 seconds; Bandwidth: 13476.3MB/s
```
* both domains:
```
Time: 0.179466 seconds; Bandwidth: 13373MB/s
Time: 0.179267 seconds; Bandwidth: 11224.3MB/s
Time: 0.179832 seconds; Bandwidth: 13345.8MB/s

```

 

Then, using 16 cores in total:
* 1st numa domain only:
```
Time: 0.0808281 seconds; Bandwidth: 29692.7MB/s
Time: 0.0816297 seconds; Bandwidth: 29401.1MB/s
Time: 0.080435 seconds; Bandwidth: 29837.8MB/s
```

* 2nd numa domain only:
```
Time: 0.0809949 seconds; Bandwidth: 29631.5MB/s
Time: 0.081562 seconds; Bandwidth: 29425.5MB/s
Time: 0.0806376 seconds; Bandwidth: 29762.8MB/s
```
* both domains:
```
Time: 0.0769107 seconds; Bandwidth: 31205MB/s
Time: 0.0763066 seconds; Bandwidth: 31452.1MB/s
Time: 0.0767881 seconds; Bandwidth: 31254.8MB/s
```

I should have no noise on my machine. Since the vector sizes seemed too small, I decided to increase them by a factor of 10. Then the trends were even more surprising:
2 cores:
```

$ perf stat -r 3 taskset -c 0,1 ./fig05 && perf stat -r 3 taskset -c 16,17 ./fig05 && perf stat -r 3 taskset -c 0,16 ./fig05
Time: 2.71473 seconds; Bandwidth: 8840.67MB/s
Time: 2.6746 seconds; Bandwidth: 8973.3MB/s
Time: 2.64174 seconds; Bandwidth: 9084.92MB/s

Performance counter stats for 'taskset -c 0,1 ./fig05' (3 runs):

10,923.37 msec task-clock:u # 1.309 CPUs utilized ( +- 0.64% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
1,963,889 page-faults:u # 178.375 K/sec ( +- 1.05% )
11,428,109,602 cycles:u # 1.038 GHz ( +- 0.34% )
11,006,272,541 instructions:u # 0.96 insn per cycle ( +- 0.00% )
1,502,716,426 branches:u # 136.488 M/sec ( +- 0.00% )
29,776 branch-misses:u # 0.00% of all branches ( +- 0.17% )

8.3463 +- 0.0533 seconds time elapsed ( +- 0.64% )

Time: 2.59589 seconds; Bandwidth: 9245.39MB/s
Time: 2.61791 seconds; Bandwidth: 9167.61MB/s
Time: 2.5538 seconds; Bandwidth: 9397.75MB/s

Performance counter stats for 'taskset -c 16,17 ./fig05' (3 runs):

11,843.29 msec task-clock:u # 1.273 CPUs utilized ( +- 0.16% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
3,187,960 page-faults:u # 268.513 K/sec ( +- 0.00% )
11,352,032,594 cycles:u # 0.956 GHz ( +- 0.12% )
11,008,239,240 instructions:u # 0.97 insn per cycle ( +- 0.00% )
1,504,083,255 branches:u # 126.685 M/sec ( +- 0.00% )
29,833 branch-misses:u # 0.00% of all branches ( +- 0.07% )

9.30231 +- 0.00456 seconds time elapsed ( +- 0.05% )

Time: 1.96067 seconds; Bandwidth: 12240.7MB/s
Time: 1.96149 seconds; Bandwidth: 12235.6MB/s
Time: 2.0585 seconds; Bandwidth: 11659MB/s

Performance counter stats for 'taskset -c 0,16 ./fig05' (3 runs):

9,456.77 msec task-clock:u # 1.164 CPUs utilized ( +- 3.49% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
278,244 page-faults:u # 27.505 K/sec ( +-186.58% )
11,332,539,900 cycles:u # 1.120 GHz ( +- 0.22% )
11,004,601,566 instructions:u # 0.97 insn per cycle ( +- 0.00% )
1,501,032,843 branches:u # 148.380 M/sec ( +- 0.03% )
30,439 branch-misses:u # 0.00% of all branches ( +- 0.20% )

8.127 +- 0.361 seconds time elapsed ( +- 4.44% )
```

16 cores:
```

$ perf stat -r 3 taskset -c 0-15 ./fig05 && perf stat -r 3 taskset -c 16-31 ./fig05 && perf stat -r 3 taskset -c 0-7,16-23 ./fig05
Time: 1.13578 seconds; Bandwidth: 21130.9MB/s
Time: 1.15266 seconds; Bandwidth: 20821.4MB/s
Time: 1.13669 seconds; Bandwidth: 21113.9MB/s

Performance counter stats for 'taskset -c 0-15 ./fig05' (3 runs):

11,330.55 msec task-clock:u # 1.666 CPUs utilized ( +- 0.20% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
1,960,315 page-faults:u # 172.506 K/sec ( +- 0.01% )
11,139,654,834 cycles:u # 0.980 GHz ( +- 0.05% )
11,006,990,155 instructions:u # 0.99 insn per cycle ( +- 0.00% )
1,502,845,848 branches:u # 132.249 M/sec ( +- 0.00% )
37,139 branch-misses:u # 0.00% of all branches ( +- 0.54% )

6.80129 +- 0.00800 seconds time elapsed ( +- 0.12% )

Time: 1.13709 seconds; Bandwidth: 21106.6MB/s
Time: 1.13615 seconds; Bandwidth: 21123.9MB/s
Time: 1.1468 seconds; Bandwidth: 20927.8MB/s

Performance counter stats for 'taskset -c 16-31 ./fig05' (3 runs):

12,558.93 msec task-clock:u # 1.572 CPUs utilized ( +- 0.16% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
3,187,827 page-faults:u # 254.109 K/sec ( +- 0.00% )
11,174,513,842 cycles:u # 0.891 GHz ( +- 0.09% )
11,008,184,303 instructions:u # 0.99 insn per cycle ( +- 0.00% )
1,504,067,788 branches:u # 119.893 M/sec ( +- 0.00% )
37,266 branch-misses:u # 0.00% of all branches ( +- 0.86% )

7.9900 +- 0.0205 seconds time elapsed ( +- 0.26% )

Time: 0.967749 seconds; Bandwidth: 24799.8MB/s
Time: 0.973296 seconds; Bandwidth: 24658.5MB/s
Time: 0.980055 seconds; Bandwidth: 24488.4MB/s

Performance counter stats for 'taskset -c 0-7,16-23 ./fig05' (3 runs):

10,403.46 msec task-clock:u # 1.605 CPUs utilized ( +- 0.20% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
1,204,034 page-faults:u # 116.113 K/sec ( +- 1.30% )
11,247,948,917 cycles:u # 1.085 GHz ( +- 0.17% )
11,006,233,640 instructions:u # 0.98 insn per cycle ( +- 0.00% )
1,502,089,922 branches:u # 144.856 M/sec ( +- 0.00% )
36,527 branch-misses:u # 0.00% of all branches ( +- 1.27% )

6.48192 +- 0.00754 seconds time elapsed ( +- 0.12% )
```

I saw very similar effects on my other machine. I am speculating that I am hitting some memory limit. I am sharing this to check whether it hints at something rotten in my system; I am investigating further myself anyway.

And thanks again for the quick replies.

Mark_L_Intel
Moderator

I learned that there is a newer approach to the TBB NUMA API, based on the following extension:

https://oneapi-src.github.io/oneTBB/main/reference/constraints_extensions.html


with the example here:

https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/task_arena/task_arena_cls.html


This new approach still relies on the hwloc library underneath, but it does not require the user to understand the hwloc API, write code around hwloc, or use an observer as described in Ch. 20 of the Pro TBB book. I have not experimented with this task arena extensions API -- just to let you know that this method looks a lot simpler and should therefore be recommended.
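A minimal sketch of that approach, following the arena/constraints example in the spec (the parallel_for body here is just a placeholder):

```cpp
#include <cstddef>
#include <vector>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>

int main() {
   // One arena per NUMA node, constrained to that node.
   std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
   std::vector<tbb::task_arena> arenas(numa_nodes.size());
   std::vector<tbb::task_group> groups(numa_nodes.size());

   for (std::size_t i = 0; i < numa_nodes.size(); ++i)
      arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]));

   // Submit one chunk of work per node ...
   for (std::size_t i = 0; i < numa_nodes.size(); ++i)
      arenas[i].execute([&groups, i] {
         groups[i].run([] {
            tbb::parallel_for(0, 1000, [](int) { /* node-local work */ });
         });
      });

   // ... and wait for each task_group inside its own arena.
   for (std::size_t i = 0; i < numa_nodes.size(); ++i)
      arenas[i].execute([&groups, i] { groups[i].wait(); });
}
```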


Mark_L_Intel
Moderator

Have you tried the newer API described in the previous post?


IvanKabadzhov
Novice

Dear Mark,

 

Thanks a lot for the responses. I attempted the arena solution. I have access to 3 machines that have 2 NUMA domains. These are:
1. Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
2. Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
3. AMD EPYC 7302 16-Core Processor
For the benchmarks we used, pinning across the 2 NUMA domains vs. staying in 1 NUMA domain had an impact on the first 2 machines, but not on the third one. However, calling

tbb::info::numa_nodes()

on the first 2 machines returns a vector of length 1 with entry -1, so we were not able to test the solution in the problematic cases. The topology is correctly obtained on the third machine, but there it makes no performance difference compared with the current solution of 1 arena (no NUMA effects there anyway). Both problematic machines have `hwloc-info`. I referred to https://spec.oneapi.io/versions/latest/elements/oneTBB/source/info_namespace.html , but that did not tell me what was missing. A cross-check with gdb showed a direct call into `../oneTBB/include/oneapi/tbb/info.h:90`. The line `std::vector<numa_node_id> node_indices(r1::numa_node_count());` wrongly obtained a vector of size 1 with entry 0; then, after `r1::fill_numa_indices(node_indices.data())`, the only entry of `node_indices` became -1.
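For completeness, the check above boils down to a few standalone lines (illustrative only):

```cpp
#include <iostream>
#include <vector>
#include <oneapi/tbb/info.h>

int main() {
   // Prints one id per NUMA node that oneTBB can see (e.g. "0 1" on a
   // 2-socket machine); a single "-1" means the topology was not resolved.
   std::vector<tbb::numa_node_id> ids = tbb::info::numa_nodes();
   for (tbb::numa_node_id id : ids)
      std::cout << id << " ";
   std::cout << "\n";
}
```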

**So my question now is: how can I understand what is missing to obtain the topology on these machines?**


FYI, we are directly interested in the performance of multithreaded executions of a framework on top of TBB.
In particular, the body of the driver parallel_for is: https://github.com/root-project/root/blob/915c6ba995abeb52a57fded0db014a4fd7e79766/core/imt/src/TThreadExecutor.cxx#L161-L175.

The patched version needs a vector of task arenas and a vector of task groups, hence becomes:

void TThreadExecutor::ParallelFor(unsigned int start, unsigned int end, unsigned int step,
                                  const std::function<void(unsigned int i)> &f)
{
   ...
   auto arenas = fTaskArenaW->Access();
   auto size = arenas.size();
   auto task_groups = fTaskArenaW->GroupAccess();
   for (auto i = 0u; i < size; i++) {
      arenas[i]->execute([&task_groups, &f, i, start, end, step, size] {
         task_groups[i]->run([&task_groups, &f, i, start, end, step, size] {
            tbb::this_task_arena::isolate([&task_groups, &f, i, start, end, step, size] {
               unsigned int lowerBound = start + i * (end - start + size) / size;
               unsigned int upperBound = std::min((unsigned int)(start + (i + 1) * (end - start + size) / size), end);
               tbb::parallel_for(lowerBound, upperBound, step, f);
            });
         });
      });
   }
   for (auto i = 0u; i < arenas.size(); i++) {
      arenas[i]->execute([&task_groups, i] {
         task_groups[i]->wait();
      });
   }
}

 

Mark_L_Intel
Moderator

hwloc can be tested separately using the tests that come with the library. If it works on some systems and not on others, running the systematic tests of standalone hwloc on the problematic machines might help.


The new API I referenced above does not rely on hwloc explicitly.


Finally, sorry for the long delay. I'm not sure whether you are still interested in this issue?


Mark_L_Intel
Moderator

Due to the lack of response, this thread will no longer be monitored by Intel. If you need any additional information, please post a new question.

 
