Intel® oneAPI Threading Building Blocks

TBB on NUMA Platform?

michaelnikelsky1
Beginner
Hi there,

I recently had the chance to test my raytracer, based on TBB 3.0, on a 48-core AMD NUMA platform (8 processors organized on 2 boards connected via HyperTransport, with 4 CPUs per board). Sadly, the results were disastrous.
My first try was to simply use parallel_for with a grain size of 24x24, which has given me the best results on SMP machines so far. This actually resulted in 48 cores being about 20% slower than 24 cores.
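
For reference, that first attempt was essentially the following (a sketch only; render_pixel() stands in for the actual per-pixel raytracing work):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"
#include "tbb/partitioner.h"

void render_pixel(int x, int y);   // placeholder for the per-pixel work

void render_frame(int width, int height)
{
    // 2D parallel_for over the image; simple_partitioner makes the 24x24
    // grain size the actual block size handed to each task.
    tbb::parallel_for(
        tbb::blocked_range2d<int>(0, height, 24, 0, width, 24),
        [](const tbb::blocked_range2d<int>& r) {
            for (int y = r.rows().begin(); y != r.rows().end(); ++y)
                for (int x = r.cols().begin(); x != r.cols().end(); ++x)
                    render_pixel(x, y);
        },
        tbb::simple_partitioner());
}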

So my new approach was to use a small parallel_for loop from 1 to the number of cores and maintain a simple stack from which I pull blocks to render (just 1 atomic increment per tile, with about 2000 tiles per Full HD frame and no mutexes whatsoever). The results were a lot better: 24 cores were about 10% faster, and 48 cores were about 30% faster than before.
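
In outline, the second scheme looks something like this (again just a sketch; the tile bookkeeping shown here is illustrative, not my actual code):

#include "tbb/parallel_for.h"
#include "tbb/atomic.h"

void render_pixel(int x, int y);   // placeholder for the per-pixel work

void render_frame_tiled(int width, int height, int num_threads)
{
    const int tile      = 24;
    const int tiles_x   = (width  + tile - 1) / tile;
    const int tiles_y   = (height + tile - 1) / tile;
    const int num_tiles = tiles_x * tiles_y;      // roughly 2000 for a Full HD frame

    tbb::atomic<int> next_tile;
    next_tile = 0;

    // One iteration per worker; each worker keeps pulling tile indices from
    // the shared counter (one atomic increment per tile) until none remain.
    tbb::parallel_for(0, num_threads, [&](int) {
        for (;;) {
            const int t = next_tile.fetch_and_increment();
            if (t >= num_tiles)
                break;
            const int x0 = (t % tiles_x) * tile;
            const int y0 = (t / tiles_x) * tile;
            for (int y = y0; y < y0 + tile && y < height; ++y)
                for (int x = x0; x < x0 + tile && x < width; ++x)
                    render_pixel(x, y);
        }
    });
}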

Nonetheless: 48 cores are only about 4% faster than 24 cores, which is a little bit ridiculous. What's even more interesting: using 24 cores I get about 50% CPU usage (so exactly 4 of the 8 NUMA nodes run close to 100%, just as it should be). Upping to 48 cores gives me about 60% CPU usage, with 4 NUMA nodes still peaking at 100% while the other 4 are more or less idle at 5% maximum. It also doesn't improve if I massively increase the amount of work to be done per pixel, so I doubt that the atomic operations for pulling from my stack have any influence.

Although HyperTransport will slow down memory access a little (from what I have read so far it should be about 30% slower compared to direct memory access on the same board), this is nowhere near the performance that should be possible. It actually looks to me like the TBB scheduler / Windows Server 2008 R2 scheduler puts 2 threads on each core of one board and leaves the second board pretty much idle.

Does anyone have an idea what might be going wrong?

Michael

P.S. By the way, scaling up to 24 cores is pretty OK, considering there is still some serial part in my tracer:

1 Core: Factor 1.0
2 Cores: Factor 2.0
4 Cores: Factor 3.98
8 Cores: Factor 7.6
12 Cores: Factor 10.9
16 Cores: Factor 14.2
24 Cores: Factor 19.5
32 Cores: Factor 23.18
48 Cores: Factor 22.0 <- This is not good

jimdempseyatthecove
Honored Contributor III

>>That is probably a much more efficient source of randomness. I don't agree with the algorithm, because it increases the potential wait on each unsuccessful probe/attempt, leading to longer waits earlier than needed and more unfairness: the idea is to wait only to avoid the thundering-herd problem, which resembles a collision on Ethernet. But I don't have the relevant hardware and budget to easily test any of this, so it's all just hypothetical. Surely somebody else must have already explored this, though?

This is the point of having the clipping mask with an upper limit. When it is set to 4 you will obtain pseudo-random numbers in the range 0:15.

While you could always take 4 bits of the counter, in the case where 2 threads are competing for a short-lived resource, a wait of 15 _mm_pause()s, which will happen 1/16th of the time, may be too long a latency. As threads collide for the resource, the universe of random counts increases (up to an upper limit), thus avoiding unnecessary LOCK XCHG instructions that "whack" the cache coherency.

Now, as to what a good upper limit is... the programmer will have to make this decision.
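
A minimal sketch of the idea, assuming a Windows/MSVC environment (the lock flag, rand() as the random source, and the clip value of 4 are illustrative only, not the code from #41):

#include <intrin.h>      // _InterlockedExchange, _mm_pause
#include <cstdlib>       // rand

volatile long g_lock = 0;                // 0 = free, 1 = taken
const int nMax = 4;                      // clip: random pause counts stay within 0:15

void acquire()
{
    int shift = 0;
    while (_InterlockedExchange(&g_lock, 1) != 0)    // the LOCK XCHG that "whacks" the cache
    {
        int mask  = (1 << shift) - 1;    // 0, 1, 3, 7, 15 = 2^shift - 1
        int count = rand() & mask;       // pseudo-random pause count in 0:mask
        for (int i = 0; i < count; ++i)
            _mm_pause();
        if (shift < nMax)
            ++shift;                     // widen the universe of counts, up to the clip
    }
}

void release()
{
    g_lock = 0;                          // plain volatile store; adequate for this x86-only sketch
}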

Jim Dempsey

RafSchietekat
Valued Contributor III
We both use clipping, but the code in #41 still starts the backoff at an arbitrary point and will therefore try again at an arbitrary point, with the wait time range always increasing, even though it is now random. It seems better to keep monitoring (within a fraction of a quantum) until the lock appears free, and to only then back off for an increasing random amount of time, so that initially the reaction time remains limited even if the new contender has been blocked for a while.

Instead of pauses<10... as in #31 (I don't know what a good number would be in this case), the test should probably be to get the current time quantum from sched_rr_get_interval() or whatever applies to Windows or any other O.S., divide it by 5 or so, limit it to 10 ms or so (to avoid surprises), multiply by CLOCKS_PER_SEC, and then use clock(), which I hope is cheap enough to be used in a test? This would then be the backstop against inappropriately long blocking or preempted owners, causing the thread to yield instead.
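
In rough outline, and only as a sketch (Linux-flavoured, with __sync_lock_test_and_set standing in for the lock attempt; the division by 5, the 10 ms cap and the use of clock() are the guesses described above):

#include <sched.h>      // sched_rr_get_interval, sched_yield (a Windows port would differ)
#include <time.h>       // struct timespec
#include <ctime>        // clock, CLOCKS_PER_SEC
#include <cstdlib>      // rand
#include <emmintrin.h>  // _mm_pause

volatile long g_lock = 0;
static bool try_lock() { return __sync_lock_test_and_set(&g_lock, 1L) == 0; }

// A fraction of the scheduling quantum, capped, expressed in clock() ticks.
static clock_t spin_budget()
{
    struct timespec ts;
    sched_rr_get_interval(0, &ts);                   // current time quantum
    double quantum = ts.tv_sec + ts.tv_nsec * 1e-9;  // in seconds
    double budget  = quantum / 5.0;                  // divided by 5 or so
    if (budget > 0.010) budget = 0.010;              // limited by 10 ms or so
    return (clock_t)(budget * CLOCKS_PER_SEC);
}

void acquire()
{
    const clock_t budget = spin_budget();
    clock_t start = clock();
    int shift = 0;
    for (;;) {
        while (g_lock != 0) {                        // keep monitoring, no atomics yet
            _mm_pause();
            if (clock() - start > budget) {          // backstop: owner blocked or preempted
                sched_yield();
                start = clock();
            }
        }
        if (try_lock()) return;                      // lock appeared free: one attempt
        int count = rand() & ((1 << shift) - 1);     // then back off for a random time
        for (int i = 0; i < count; ++i) _mm_pause();
        if (shift < 4) ++shift;                      // increasing, clipped as before
    }
}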
jimdempseyatthecove
Honored Contributor III
>>We both use clipping, but the code in #41 still starts the backoff at an arbitrary point and will therefore try again at an arbitrary point, with the wait time range always increasing, even though it is now random.

The code in #41 starts with an _mm_pause() loop iteration count of 0, {0:1}, {0:3}, {0:7}, ..., {0:2^nMax-1}.


>>It seems better to keep monitoring (within a fraction of a quantum) until the lock appears free, and to only then back off for an increasing random amount of time, so that initially the reaction time remains limited even if the new contender has been blocked for a while.

You do have a point in that in #41 the late comers start with a shift count of 0, thus giving them a higher average lock-attempt frequency. This could be adjusted (if desired) by placing the initializer for the shift count in a static and then dynamically adjusting it. E.g., if any thread's local count is bumped after initialization and acquisition is not attained, then perform a clipped increment of the initializer (this need not be an interlocked operation, as we are not keeping accurate counts). When a thread obtains the lock immediately, it can either reset the shift-count initializer with a direct write of 0, or back it off with a decrement and a test for underflow:

if (initializer && (--initializer < 0))
    initializer = 0;

Where initializer is a volatile static variable.

You need not use interlocked operations on this, since it is an aggregate tuning parameter, but you do need protections against quirks, e.g. initializer seen as -n (you set it to 0), initializer seen above nMax (you reset it to nMax), etc.
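
Putting the pieces together, a rough sketch of how the shared initializer might be wired in (the lock primitive and nMax mirror the earlier sketch; the exact placement of the increments is a guess, not prescriptive):

#include <intrin.h>      // _InterlockedExchange, _mm_pause (MSVC, as before)
#include <cstdlib>       // rand

static volatile long g_lock = 0;
static bool try_lock() { return _InterlockedExchange(&g_lock, 1) == 0; }
const int nMax = 4;                      // clip limit, as in the earlier sketch

static volatile int initializer = 0;     // shared starting shift count, not interlocked

static int read_initializer()
{
    int s = initializer;                 // may be momentarily out of range
    if (s < 0)    s = 0;                 // quirk: seen as -n -> treat as 0
    if (s > nMax) s = nMax;              // quirk: seen above nMax -> clamp to nMax
    return s;
}

void acquire_adaptive()
{
    int shift = read_initializer();      // late comers start at the shared hint
    bool first_try = true;
    while (!try_lock()) {
        first_try = false;
        int count = rand() & ((1 << shift) - 1);
        for (int i = 0; i < count; ++i)
            _mm_pause();
        if (shift < nMax) ++shift;                   // bump the local count...
        if (initializer < nMax) ++initializer;       // ...and the shared hint (clipped)
    }
    if (first_try && initializer && (--initializer < 0))
        initializer = 0;                 // obtained immediately: back the hint off
}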

Jim Dempsey