Intel® oneAPI DPC++/C++ Compiler

Is there a way to load balance tasks across multiple CPUs in a system?

VictorD
Novice
1,260 Views

Is there a way to control how tasks are load-balanced across multiple CPUs in a system? I'm using a dual-CPU AWS node with 24 cores and hyper-threading. When I split the work among 24 tasks (on Windows), I see that all 24 run on CPU 0 and none on CPU 1. Is there a way to change this to balance the load across the two CPUs, running 12 tasks on CPU 0 and 12 on CPU 1?

Thank you,

-Victor

5 Replies
PriyanshuK_Intel
Moderator
1,223 Views

Hi,

Thank you for posting in Intel Communities.


Could you elaborate on your issue, for example which program or application you are running when you see this behavior, and how you are launching the 24 tasks?


Could you please provide the details below so that we can investigate your issue further?

 1. A sample reproducer code.

 2. The steps to follow in order to reproduce your issue.


Thanks & Regards,

Priyanshu.



VictorD
Novice
1,200 Views

The following is a parallel summation function, using TBB, which recursively splits the array in half until the sub-array size falls below parallelThreshold, at which point it performs a serial summation.

 

#include <cstddef>               // size_t
#include <tbb/parallel_invoke.h> // tbb::parallel_invoke

// left (l) boundary is inclusive and right (r) boundary is exclusive
inline unsigned long long SumParallel(unsigned long long in_array[], size_t l, size_t r, size_t parallelThreshold = 16 * 1024)
{
    if (l >= r) // zero elements to sum
        return 0;
    if ((r - l) <= parallelThreshold)
    {
        unsigned long long sum_left = 0;
        for (size_t current = l; current < r; current++)
            sum_left += in_array[current];
        return sum_left;
    }

    unsigned long long sum_left = 0, sum_right = 0;

    size_t m = r / 2 + l / 2 + (r % 2 + l % 2) / 2; // average without overflow

    tbb::parallel_invoke(
        [&] { sum_left = SumParallel(in_array, l, m, parallelThreshold); },
        [&] { sum_right = SumParallel(in_array, m, r, parallelThreshold); }
    );
    sum_left += sum_right;  // Combine left and right results

    return sum_left;
}

 

When this code is run on a 96-core AWS c5.24xlarge node by calling it in the following way, it runs on all 96 cores on both CPUs:

unsigned long long sum = ParallelAlgorithms::SumParallel(u64Array, 0, u64array_size);

When this code is run on the same machine in the following way, it runs on 32 cores of CPU 0 only with CPU 1 not running any work at all:

unsigned long long sum = ParallelAlgorithms::SumParallel(u64Array, 0, u64array_size, u64array_size / 24);

The array size (u64array_size) is 1 billion elements, with each element being an unsigned long long (64-bit unsigned).

I found a way in TBB to use only one thread per core (to effectively not use hyperthreading).
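
For reference, one way oneTBB exposes this is a task_arena constrained to at most one thread per physical core; a minimal sketch is below (it assumes oneTBB 2021 or later with the hwloc-based TBBBind library available at run time, and the arena name is just for illustration):

#include <tbb/task_arena.h>

// Arena whose worker threads are limited to one per physical core,
// so work executed inside it does not use the hyper-threaded siblings.
// Requires the TBBBind (hwloc) support library at run time.
tbb::task_arena ht_off_arena(tbb::task_arena::constraints{}.set_max_threads_per_core(1));

// e.g.  sum = ht_off_arena.execute([&] { return SumParallel(u64Array, 0, u64array_size); });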

I would love to be able to load balance the 32 threads (not 24 as I thought earlier) across the two CPUs on this machine (CPU 0 and CPU 1), as a first-order performance optimization. Ideally, the first recursive split would go across the CPUs, i.e., the left half of the array to CPU 0 and the right half to CPU 1, with the rest of each recursion sub-tree staying on its respective CPU, roughly as in the sketch below.
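
For concreteness, here is a rough sketch of that first split, built on top of the SumParallel above and on oneTBB's NUMA-aware task_arena constraints (again assuming oneTBB 2021 or later with the hwloc-based TBBBind library available at run time; without it, tbb::info::numa_nodes() reports a single node and the code just falls back to plain SumParallel; the function name is illustrative only):

#include <cstddef>
#include <vector>
#include <tbb/info.h>        // tbb::info::numa_nodes()
#include <tbb/task_arena.h>
#include <tbb/task_group.h>

// Sum in_array[l, r): the first split is pinned to the two NUMA nodes (CPU packages),
// and each half's recursion sub-tree then runs on the threads of its own arena.
inline unsigned long long SumParallelTwoSockets(unsigned long long in_array[], size_t l, size_t r,
                                                size_t parallelThreshold = 16 * 1024)
{
    std::vector<tbb::numa_node_id> nodes = tbb::info::numa_nodes();
    if (nodes.size() < 2 || (r - l) <= parallelThreshold)
        return SumParallel(in_array, l, r, parallelThreshold);   // single node: original path

    size_t m = l + (r - l) / 2;
    unsigned long long sum_left = 0, sum_right = 0;

    // One arena per NUMA node; worker threads of each arena are confined to that node.
    tbb::task_arena arena0(tbb::task_arena::constraints(nodes[0]));
    tbb::task_arena arena1(tbb::task_arena::constraints(nodes[1]));
    tbb::task_group tg0, tg1;

    arena0.execute([&] { tg0.run([&] { sum_left  = SumParallel(in_array, l, m, parallelThreshold); }); });
    arena1.execute([&] { tg1.run([&] { sum_right = SumParallel(in_array, m, r, parallelThreshold); }); });
    arena0.execute([&] { tg0.wait(); });
    arena1.execute([&] { tg1.wait(); });

    return sum_left + sum_right;
}

Each nested parallel_invoke inside SumParallel then executes on the threads of whichever arena it started in, so the recursion sub-trees should stay on their respective sockets.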

Thanks and Regards,

-Victor

PriyanshuK_Intel
Moderator
1,094 Views

Hi,


You can refer to the concept of thread-to-core affinity.

It is used when we want to influence the OS so that it schedules particular software threads onto particular core(s).

You can refer to the Pro TBB textbook, page 358.
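
For illustration, the idea looks roughly like the sketch below: a tbb::task_scheduler_observer pins each worker thread to a logical CPU as it joins the scheduler. This is a minimal Linux sketch using pthread_setaffinity_np, not the book's exact listing; on Windows, SetThreadAffinityMask or SetThreadGroupAffinity plays the same role, and the mapping of CPU numbers to sockets is system specific. The class name is just for illustration.

#include <atomic>
#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <tbb/task_scheduler_observer.h>

// Pins every thread that enters the TBB scheduler to its own logical CPU,
// assigned round-robin over the first num_cpus CPUs.
class PinningObserver : public tbb::task_scheduler_observer {
    std::atomic<int> next_cpu{0};
    int num_cpus;
public:
    explicit PinningObserver(int num_cpus_) : num_cpus(num_cpus_) { observe(true); }
    void on_scheduler_entry(bool /*is_worker*/) override {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(next_cpu++ % num_cpus, &mask);
        pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
    }
};

// Usage: create one observer before running the parallel work, e.g.
//   PinningObserver pin((int)std::thread::hardware_concurrency());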


Thanks,

Priyanshu.


PriyanshuK_Intel
Moderator
1,028 Views

Hi,


We haven't heard back from you. Could you please provide an update on your issue?


Thanks,

Priyanshu.


PriyanshuK_Intel
Moderator
969 Views

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel.

If you need further assistance, please post a new question.  


Thanks,

Priyanshu.

