Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

TBB on a NUMA platform?

michaelnikelsky1
Beginner
Hi there,

I recently had the chance to test my raytracer, based on TBB 3.0, on a 48-core AMD NUMA platform (8 processors organized on 2 boards connected via HyperTransport, 4 CPUs per board). Sadly, the results were disastrous.
My first attempt was to just use parallel_for with a grain size of 24x24, which had given me the best results on SMP machines so far. This actually resulted in 48 cores being about 20% slower than 24 cores.

So my new approach was to use a small parallel_for loop from 1 to the number of cores and maintain a simple stack that I pull blocks to render from (so just one atomic increment per tile, with about 2000 tiles per Full HD frame and no mutexes whatsoever). The results were a lot better: 24 cores were about 10% faster and 48 cores were about 30% faster than before.
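The pulling scheme looks roughly like this (a simplified sketch, not my actual code; renderTile stands in for the real per-tile work):

#include "tbb/parallel_for.h"
#include "tbb/atomic.h"

void renderTile( int tileIndex );        // the real per-tile work lives elsewhere

tbb::atomic<int> nextTile;               // shared tile counter

void renderFrame( int numThreads, int numTiles )
{
    nextTile = 0;
    tbb::parallel_for( 0, numThreads, [&]( int /*worker*/ ) {
        for (;;) {
            int tile = nextTile.fetch_and_increment();   // one atomic op per tile
            if ( tile >= numTiles )
                break;                                   // "stack" is empty
            renderTile( tile );
        }
    } );
}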

Nonetheless, 48 cores are only about 4% faster than 24 cores, which is a little bit ridiculous. What's even more interesting: using 24 cores I get about 50% CPU usage (so exactly 4 of the 8 NUMA nodes run close to 100%, just as it should be). Going up to 48 cores gives me about 60% CPU usage, with still only 4 NUMA nodes peaking at 100% while the other 4 are more or less idle at 5% maximum. It also doesn't improve if I massively increase the amount of work done per pixel, so I doubt that the atomic operations for pulling from my stack have any influence.

Although HyperTransport will slow down memory access a little (from what I have read so far it should be about 30% slower than a direct memory access on the same board), this is nowhere near the performance that should be possible. It actually looks to me like the TBB scheduler / Windows Server 2008 R2 scheduler puts 2 threads on each core of one board and leaves the second board pretty much idle.

Does anyone have an idea what might be going wrong?

Michael

P.S. By the way, scaling up to 24 cores is pretty OK, considering there is still some serial part in my tracer:

1 Core: Factor 1.0
2 Cores: Factor 2.0
4 Cores: Factor 3.98
8 Cores: Factor 7.6
12 Cores: Factor 10.9
16 Cores: Factor 14.2
24 Cores: Factor 19.5
32 Cores: Factor 23.18
48 Cores: Factor 22.0 <- This is not good

Andrey_Marochko
New Contributor III
Which partitioner do you use with parallel_for: simple, auto, or the default?
michaelnikelsky1
Beginner
Good point. I had the affinity_partitioner running; maybe that's why the parallel_for stopped at 69%. I will retry with the simple_partitioner and a grain size of 1.

Just did another test:
I let the parallel_for loop run not from 0 to the number of cores but from 0 to the number of tiles, with an auto_partitioner. That turned out to be about 40% slower (72s vs. 53s) on my dual Nehalem compared to just popping the tiles from the stack.

In my opinion the partitioners don't work well if the amount of work done for each entry varies a lot. I guess they assume that each segment does basically the same amount of work, which is not the case for a raytracer (for example, if you look at a car, the headlights require orders of magnitude more work than the rest or the background).
RafSchietekat
Valued Contributor III
Ah, yes, you have that stack... Weird indeed.

Wild idea: have you tried both tbb::mutex and tbb::spin_mutex for the stack, and do you get 100% CPU use in both cases? Have you tried attaching a debugger at various times, to find out how threads are spending their time?
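Something along these lines, so the lock type can be swapped with a single typedef (just a sketch; popTile stands in for whatever your stack really does):

#include "tbb/spin_mutex.h"
#include "tbb/mutex.h"

typedef tbb::spin_mutex StackMutex;   // swap in tbb::mutex here to compare
StackMutex stackMutex;
int nextTile = 0;                     // protected by stackMutex

// Returns the next tile index, or -1 once everything has been handed out.
int popTile( int numTiles )
{
    StackMutex::scoped_lock lock( stackMutex );
    return nextTile < numTiles ? nextTile++ : -1;
}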
RafSchietekat
Valued Contributor III

"In my opinion the partitioners dont work if the amount of work that is done for each entry varies a lot, I guess they assume that each segment does basicly the same which is not the case for a raytracer (for example if you look at a car the headlights require magnitues more work than the rest or the background)."
That is correct for auto_partitioner, so if you want to use that you should probably hash the tiles, e.g., by inverting the bits in a binary index.

(Added) To avoid confusion: auto_partitioner (the default) has that "problem", and simple_partitioner doesn't, so you might want to specify simple_partitioner explicitly, or use a hashed index with auto_partitioner.
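For the hashed index, something like this (just a sketch, assuming the tile count is padded to a power of two):

// Bit-reverse a tile index so that neighbouring tiles (and hence neighbouring
// amounts of work) end up scattered across the iteration range.
unsigned reverseBits( unsigned index, unsigned log2Range )
{
    unsigned r = 0;
    for ( unsigned b = 0; b < log2Range; ++b ) {
        r = ( r << 1 ) | ( index & 1 );
        index >>= 1;
    }
    return r;
}
// In the parallel_for body, render tile reverseBits(i, log2NumTiles) instead of
// tile i, and skip any result that falls beyond the real tile count.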

michaelnikelsky1
Beginner
OK, the simple_partitioner changed the CPU usage to 100%. But it seems to hold true that using less than 100% CPU is faster than using 100% CPU, no matter what I actually use to distribute the work. So something really seems to be wrong here.

Currently I have been using an atomic fetch_and_increment to get a tile from the stack. I just changed that to a spin_mutex, and it slightly increased the time from 300s to 285s (the 70% version was at 245s).

I guess I am really hitting a bandwidth limit here. As far as I can understand what the AMD CodeAnalyst tells me, there are a lot of non-local memory accesses, and the functions that take up most of the time are the intersection functions, which are read-only, while the locks nearly disappear in the overall time.
robert-reed
Valued Contributor II
Is there any chance you can break that tile stack up into multiple stacks, and partition the threads to seek their work on different stacks? It may ultimately lead to some load imbalance as you eventually try to balance the stacks, but it would reduce the number of threads simultaneously contending for stack access. And being up in the range of 48 threads, you're definitely in territory where the number of agents contending for a particular lock can produce enough serializing overhead to hold up the whole system.
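Schematically, something along these lines (just a sketch; the stack count and the per-stack tile ranges are placeholders):

#include "tbb/atomic.h"

const int NUM_STACKS = 8;                 // e.g. one stack per NUMA node
tbb::atomic<int> nextTile[NUM_STACKS];    // each counts within its own slice of tiles

// A thread asks its "home" stack first and only then looks at the others, so far
// fewer threads contend on any single counter most of the time.
int popTile( int homeStack, int tilesPerStack )
{
    for ( int k = 0; k < NUM_STACKS; ++k ) {
        int s = ( homeStack + k ) % NUM_STACKS;
        int t = nextTile[s].fetch_and_increment();
        if ( t < tilesPerStack )
            return s * tilesPerStack + t; // global tile index
    }
    return -1;                            // everything rendered
}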

I think ultimately the formula for good work with so many players may be much like playing good jazz: you can get out front and impress everyone with your machine-gun precision and virtuosity, but if you pay attention to the other players and leave room for them, you'll generally get a better result.
RafSchietekat
Valued Contributor III

"Currently I have used a atomic fetch_and_increment to get a tile from the stack. Just changed that to a spin_mutex and in slightly increased the time from 300s to 285s (the 70% version was at 245s)."
From 300 seconds to 285 seconds would be a decrease, not an increase, so did you mean a "decrease" with those numbers or are the numbers different? But could you then also try tbb::mutex? My wild idea is about the potential overhead associated with 48 threads contending for access to a single resource, and I'm not 100.00% sure that a tbb::spin_mutex scales even if it is more efficient than tbb::mutex for low levels of contention.

(Added) It's sort of like several forum participants falling over each other all trying to be the first to solve your problem and claim the glory. :-)

Andrey_Marochko
New Contributor III
Both mutices should work worse than the lock-free variant that Michael used. And a spin mutex normally works better than a kernel-mode mutex (for which tbb::mutex is a wrapper), unless there is oversubscription. Though tbb::mutex may result in less CPU load because threads will wait inside the kernel instead of spinning.
RafSchietekat
Valued Contributor III

"Both mutices should work worse than lock-free variant that Michael used."
Hmm... yes, fetch_and_increment uses a hardware lock on x86. But then why doesn't the use of spin_mutex decrease performance? (Plural is just "mutexes", though.)

"And spin mutex normally works better than kernel mode mutex (for which tbb::mutex is a wrapper), unless there is oversubscription."
Can you remind us what the oversubscription is about? You have convoying with both kinds of mutexes, but even without oversubscription there's a potential problem with the thundering-herd effect when using a spin_mutex, and I'm not sure that __TBB_LockByte handles that well, although I'm going by what I remember from an earlier version.

"Though tbb::mutex may result in less CPU load because threads will wait inside the kernel instead of spinning."
That was the reason for checking: getting a lock through the interconnect might make the spin_mutex scalability more of an issue. But I'm just guessing, really, until we know what happens after Michael removes "spin_".

(Added) __TBB_LockByte has not changed, I see. Still, each tile takes so much processing effort that the contention levels on this lock should be low enough for a significant benefit to be a surprise, anyway. That means I'm out of ideas for now.

michaelnikelsky1
Beginner
Ah, sorry, I meant the performance increased; of course the time decreased. The numbers are correct: with fetch_and_increment the rendering took 300 seconds, with the spin mutex it was 285 seconds. I will try the normal mutex (what about a queuing mutex?) as well.

I also tried having multiple stacks without any lock, but it seems my scene was so imbalanced that it didn't show much of a difference. But I can recheck that tomorrow.

I will also try to increase the block size; maybe with the stack approach larger block sizes work better than the last time I tried it (which still used a simple 2D parallel_for). I will try to get about 500 blocks overall, so each thread will process about 10 blocks on average. Maybe this helps.

I just did another test, starting 2 instances of the software with 24 cores assigned to each instance. Those required between 260 and 270 seconds to render 2 frames of the scene (so essentially the 24 cores are faster than the 48 cores for that particular scene), which would mean about 135 seconds per frame. I am still not sure how to interpret these numbers; one thing that's for sure is that both instances use different memory areas, so they don't interfere with each other.
RafSchietekat
Valued Contributor III
I have for a long time wanted somebody to try the following potential optimisation of spin locks with a high number of cores (48 certainly qualifies), provided here as a replacement for include/tbb/tbb_machine.h in tbb30_035oss.

(Correction) __TBB_LockByte, if #define'd, now refers to a redundant copy of the same code instead of to a tight loop, but it might then still be replaced just the same.
Andrey_Marochko
New Contributor III
To #29.

[After skimming the web a little I agree that plural of "mutex" should be "mutexes", as it is not a loan word :) ]

In the case of oversubscription a pure spin mutex aggravates the convoying issue, because if a thread holding the lock gets preempted, the remaining threads will just dumbly spin through the remainders of their time quanta, while with a kernel-mode mutex they quickly begin getting parked in the kernel, the ready queue becomes empty, and the preempted thread is allowed to continue.

The exponential backoff used in TBB spin mutexes alleviates the issue to some extent, especially when oversubscription is not too high and the locked regions of code are short enough (so that if preemption occurs, there are enough yields in the short spinning loop for the threads to quickly concede their CPUs to all waiting threads, including the one holding the lock).
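The pattern is roughly like this (a simplified, Windows-only sketch, not the exact TBB code):

#include <emmintrin.h>   // _mm_pause
#include <windows.h>     // SwitchToThread

// Called each time an acquisition attempt fails; 'count' starts at 1.
void backoff_pause( unsigned &count )
{
    const unsigned LOOPS_BEFORE_YIELD = 16;
    if ( count <= LOOPS_BEFORE_YIELD ) {
        for ( unsigned i = 0; i < count; ++i )
            _mm_pause();             // short on-CPU wait that grows each time
        count *= 2;                  // exponential growth while still spinning
    } else {
        SwitchToThread();            // concede the CPU, e.g. to a preempted lock holder
    }
}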

I am honestly surprised that the FetchAndIncrement variant takes longer than the one with a mutex, because
  • with an interlocked instruction the locked region is the shortest possible;
  • FetchAndIncrement never fails (that is, there is no need to use it in a loop);
  • a mutex also uses an interlocked operation, and a heavier one at that (a CAS that can fail, and thus generally requires repetition in a loop).
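Schematically, the difference is (a sketch using tbb::atomic; the counter is only there to show the retry loop):

#include "tbb/atomic.h"

tbb::atomic<int> counter;

// One interlocked instruction, never retried:
int takeTicketXadd()
{
    return counter.fetch_and_increment();
}

// The CAS pattern a mutex acquisition relies on: the compare-and-swap can fail
// under contention and therefore sits in a retry loop.
int takeTicketCas()
{
    for (;;) {
        int old = counter;
        if ( counter.compare_and_swap( old + 1, old ) == old )
            return old;
    }
}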
RafSchietekat
Valued Contributor III

"In case of oversubscription pure spin mutex aggravates convoying issue, because if a thread holding the lock gets preempted, remaining threads will just dumbly spin through the tremainders of their time quanta, while with kernel mode mutex they quickly begin getting parked in the kernel, the ready queue becomes empty, and preempted thread is allowed to continue."
That seems somewhat plausible. Still, with spin_mutex preferably used only for resources that are held only briefly, a thread waiting for access should get its turn fairly quickly, so it may be more efficient to spin a little while longer than to get preempted too quickly. So I find this a bit strange. I'll assume it has nevertheless been observed and measured?

"Exponential backoff used in TBB spin mutexes alleviates the issue to some extent, especially when oversubscription is not too high and locked regions of code are short enough (so that if preemption occurs, there was enough mutexes in the short spinning loop to quickly concede their CPUs to all waiting threads including the one holding the lock)."
I'm afraid I don't see what you mean. I see the exponential backoff primarily as a way to dynamically tune down contention for exclusive ownership of a lock's cache line, and my proposal in #31 elaborates on that. A problem with the current implementation is that the backoff increases while a thread is waiting, not just when an acquisition attempt fails, so on a heavily contended resource many candidates spend more time waiting to try again than needed. Another is that after waiting a bit the contender goes straight to an expensive acquisition attempt instead of first listening for whether it's a good time. The current implementation is also rather unfair, favouring newly arrived threads whose backoff has not yet increased so much.

"I honestly surprized that FetchAndIncrement variant takes longer than the one with mutex, because
- with interlocked instruction the locked region is the shortest possible;
- FetchAndIncrement necer fails (that is there is no need to use it in a loop);
- mutex also use interlocked operation, and it is a heavier one (CAS that can fail, and thus generally requires repetitions in a loop)."
Indeed. Unless the hardware implementation of an interlocked increment has its own problems, one would expect it to be rather efficient, although I'm not sure that the difference is so significant in view of processor speed relative to communication across even a fast interconnect (well, "I don't know" might be more accurate). FetchAndIncrement doesn't fail on x86 because it has a lock prefix on the instruction, but on other architectures it can still be a loop. I have no real idea of its performance relative to a CAS loop, though.

Andrey_Marochko
New Contributor III
I see the exponential backoff primarily as a way to dynamically tune down contention for exclusive ownership of a lock's cache line, and my proposal in #31 elaborates on that. A problem with the current implementation is that the backoff increases while a thread is waiting, not just when an acquisition attempt fails, so on a heavily contended resource many candidates spend more time waiting to try again than needed. Another is that after waiting a bit the contender goes straight to an expensive acquisition attempt instead of first listening for whether it's a good time. The current implementation is also rather unfair, favouring newly arrived threads whose backoff has not yet increased so much.

Well, actually the original purpose of the backoff was to
  • gracefully handle the oversubscription situation I mentioned above (by dunking the spinning thread into the kernel periodically so that the kernel can reschedule an earlier-preempted thread holding the lock);
  • decrease power consumption if the code does a blocking operation under the spin lock (not the best practice generally, but people do it, and sometimes it may even be necessary).
But your idea of decreasing bus traffic by avoiding premature attempts to bring the cache line into exclusive state makes sense to me. I remember you already suggested this optimization a year or two ago. Soon after that, one of our developers made measurements on the new Intel platforms (Nehalem), and as far as I know what we have now in our machine.h was what gave better results on average. It is possible that (recent) architectures do exactly this kind of state check internally in CAS before attempting more costly actions, and thus achieve a similar effect to your implementation (or even a slightly better one, due to packing it all into a single instruction).

More generally speaking, doing this kind of performance analysis is a time-consuming endeavour, as we have to cover multiple platforms/OSes/compilers, and the test results tend not to be reproducible enough to quickly draw reliable conclusions. This is why optimizations that are not absolutely, obviously beneficial often do not make it into the codebase fast enough. Anyway, I'll ask our managers whether they have someone who could investigate the benefit of your proposal once again.
RafSchietekat
Valued Contributor III
I guess I'm still in love with the idea, and if the number 48 (plus interconnect) won't tip the scales nothing will, so I decided to try and abuse Michael's predicament to give it one more chance. :-)
michaelnikelsky1
Beginner
I tried to replace the tbb_machine.h but it gives me errors during compilation, probably because I am using TBB 3.0 Update 1.
To be exact, these lines won't work (line 119):

#if !defined(__TBB_CompareAndSwap4) \
|| !defined(__TBB_CompareAndSwap8) \
|| !defined(__TBB_Yield) \
|| !defined(__TBB_full_memory_fence) \
|| !defined(__TBB_release_consistency_helper)

Or is there any compiler flag I need to set for these?

RafSchietekat
Valued Contributor III
That's the version, building from source tbb30_035oss_src.tgz after replacing include/tbb/tbb_machine.h. On Windows you can probably use rand() instead of rand_r(), leaving out the context argument for this test, but I don't see why it would not build if you can build the unmodified source. I've never tried any of the prebuilt packages, so I don't know what the issues are.

I have high hopes, so I dare ask for a few minutes of your time, but don't count on anything: it hasn't actually been tested on many cores, it apparently disappointed on lower core counts when tested by Intel, and it might even need some tweaking and tuning, if it does anything for you at all. Still, I think I remember seeing reports about good results with something similar.
michaelnikelsky1
Beginner
Mmhh, OK, I tried to replace the rand_r with rand (which is not thread-safe, so probably not a good idea anyway). But this does not work at all; it just brings the whole program to a halt, even on my 16-core machine.

I am not sure what exactly you are trying to do. From skimming over the code I guess the idea is this:

Usually, if you have many cores trying to access the same code block at the same time, this will keep happening over and over again.
Now, if you assign a random sleep to the threads when they encounter the mutex, the chance increases that the next time they want to access the same piece of code they arrive in a different order, probably better interleaved, so they don't interfere as much.

Am I right with that?
RafSchietekat
Valued Contributor III

Again, it's just an educated guess at this point, but it also might be just the thing for your 48 cores. Although first you should probably try to reproduce the difference you observed between the fetch-and-increment and the lock (with the lock as the very surprising winner), and try other locks (plain mutex, queueing mutex) to see what the variability is, and whether there's anything to work with. If the idea turns out to be successful, it might also make a difference for scheduling tasks.

The idea is to treat the lock like an Ethernet medium: wait until the line/lock appears to be free, wait a random amount of time after that, and if nobody is talking/has the lock yet just go ahead and try, and if unsuccessful increase the interval from which to choose a random delay, unless yielding seems to be a better solution in the end (that's only for locks). An improvement could be to record in the lock what the current contention levels are, so that newly arrived contenders know how to behave politely, because there are many locks and they are not being constantly monitored by each thread the way the few Ethernet media are by the connected hardware dedicated to them. Perhaps it might also be better to adopt the actual Ethernet backoff algorithm instead of improvising one, especially with that possibly expensive rand(_r) call.
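In rough code the acquisition would look something like this (illustrative only, not the actual code in the modified tbb_machine.h; the MSVC intrinsics are just stand-ins for whatever atomic exchange is available):

#include <intrin.h>      // _mm_pause, _InterlockedExchange
#include <cstdlib>       // rand

// flag: 0 = free, 1 = held. Capping 'window' and falling back to a yield are left out.
void ethernetStyleAcquire( volatile long &flag )
{
    unsigned window = 16;                        // interval from which to pick a random delay
    for (;;) {
        while ( flag )                           // 1) listen until the lock looks free
            _mm_pause();
        unsigned delay = rand() % window;        // 2) then wait a random amount
        for ( unsigned i = 0; i < delay; ++i )
            _mm_pause();
        if ( !flag && _InterlockedExchange( &flag, 1 ) == 0 )
            return;                              // 3) still free: take it
        window *= 2;                             // 4) "collision": widen the window and retry
    }
}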

As a workaround for the freeze-up, you could omit the rand() call and take the full backoff value to see what that gives, or decrease that large number I've put in (if you do a diff you should see it), but I'll have another look tonight for what might have failed.

(Added) Just rand() seems to work on Core Duo/Linux.

(Added 2010-06-24) That big number at line 649 or so appears to need to be decreased to perhaps about 1000 or less, probably for oversubscribed longer blocks. Since I don't know how long a pause really lasts on x86 and friends, maybe the criterion should be a number of ticks instead. The maximum backoff should probably be decreased again as well: it was 16, which I thought was a bit small, but 1<<16 may be a bit too much again, although now it takes a larger number of failures to get there.

jimdempseyatthecove
Honored Contributor III
Raf,

For the random wait, consider using __rdtsc(), mask off the low n bits, and use that as the count for your _mm_pause loop. Note that the value of n can increase (to a ceiling level) on each failure to attain the lock. Pseudo code, fleshed out here with MSVC intrinsics:

#include <intrin.h>   // __rdtsc, _mm_pause, _InterlockedExchange

void acquire( volatile long &lock, int nMax )
{
    int n = 0;
    // Test first, then try to take the lock with an exchange; back off on failure.
    while ( lock || _InterlockedExchange( &lock, 1 ) == 1 )
    {
        unsigned long long mask = ( 1ULL << n ) - 1;   // widen the backoff window...
        if ( n < nMax )
            ++n;                                       // ...up to the ceiling nMax
        unsigned long long c = __rdtsc() & mask;       // pseudo-random pause count
        for ( unsigned long long i = 0; i < c; ++i )
            _mm_pause();
    }
}

Note, if the hold time can be very long, then think of inserting event or condition variable waits.

TBB is a tasking system, not a thread system, so try to rework the code such that you never have long block times.

Jim Dempsey
RafSchietekat
Valued Contributor III

That is probably a much more efficient source of randomness. I don't agree with the algorithm because it increases the potential wait on each unsuccessful probe/attempt, leading to longer waits earlier than needed and more unfairness: the idea is to wait only to avoid the thundering herd problem, which resembles a collision on Ethernet. But I don't have relevant hardware and budget to easily test any of this, so it's all just hypothetical. Surely somebody else must have already explored this, though?
