I have a large program that is threaded using TBB. For various reasons, we explicitly set the number of threads using scheduler = new tbb::task_scheduler_init(numThreads) at the beginning of the program. There is plenty of parallel work in parallel_for loops, and the grain size is selected for multiple grains per core to allow task stealing if needed.
I run the program under likwid-perfctr (version 4.2) to pin the threads to physical cores and to monitor the activity of individual cores. I'm using the NUMA group and looking at Memory data volume as a way to measure how much work each core is doing, but other performance counters show similar behavior.
% likwid-perfctr -C 4-11 -g NUMA myprogram -T8
where -T8 tells myprogram to use 8 threads. likwid-topology shows threads 0-15 as the physical threads, so I'm not using hyperthreads. The computer is a dual-socket E5-2667 v2 with 16 physical cores and 32 threads, running RHEL 6.9. Only standard operating system tasks are running when I do the measurements. The program uses less data than there is physical RAM, so no swapping is happening.
When I run it multiple times, I see a different number of threads being used. Most often it is only using 5 of the 8 requested threads; sometimes 6, rarely 7, never the full 8. Here is some data [in GB] for Memory data volume:
66 37 37 38 0 31 0 0 // 5 active
66 37 37 38 0 31 0 0 // 5 active
66 36 37 38 0 31 0 0 // 5 active
61 32 32 30 0 28 0 27 // 6 active
66 37 37 38 0 31 0 0 // 5 active
If I request some other number of threads (say 4 to 16), TBB consistently uses 1 to 3 fewer threads than requested.
I know the documentation says that the argument to tbb::task_scheduler_init() sets the maximum number of threads, but only creating 5 threads when 8 were requested on an otherwise idle computer with 16 physical cores seems wrong. Is there some way to force it to use the full 8 requested threads?
Thanks,
Rick
Hi Rick,
I do not have much experience with the likwid-perfctr tool, but measuring CPU utilization through memory traffic does not look accurate. Is it possible that some data is reused from the caches and therefore not counted as memory data volume? As far as I know, the tool can report CPU utilization metrics such as INSTR_RETIRED_ANY or CPU_CLK_UNHALTED. Have you checked them?
Have you tried the "top" command or some sort of task/system manager to observe the number of threads and CPU utilization?
Regards,
Alex
Hi Alex,
Sorry for the slow reply. Other CPU metrics such as CPU_CLK_UNHALTED and double-precision flops showed a similar discrepancy. The top command confirmed that fewer cores were in use, as did the runtime of a test program. It appears to be a problem with likwid: when I used taskset for core affinity instead, all the requested threads were active.
Moral of the story: use likwid to look at CPU metrics, but use taskset to enforce affinity.
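Concretely, that combination might look like this (an illustrative command line built from the invocation above; as I understand the likwid-perfctr options, lowercase -c selects cores to measure without pinning, whereas the uppercase -C used earlier also pins):

```shell
# Pin the threads with taskset, then let likwid-perfctr count events on
# the same cores; -c (lowercase) measures without pinning, so taskset's
# affinity mask stays in effect.
taskset -c 4-11 likwid-perfctr -c 4-11 -g NUMA ./myprogram -T8
```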
Thanks,
Rick