I recently had the chance to test my raytracer, based on TBB 3.0, on a 48-core AMD NUMA platform (8 processors organized on 2 boards connected via HyperTransport, each board carrying 4 CPUs). Sadly, the results were disastrous.
My first attempt was to simply use parallel_for with a grain size of 24x24, which had given me the best results on SMP machines so far. This resulted in 48 cores being about 20% slower than 24 cores.
So my new approach was to use a small parallel_for loop from 1 to the number of cores and maintain a simple stack from which I pull blocks to render (just one atomic increment per tile, with about 2000 tiles per Full HD frame and no mutexes whatsoever). The results were a lot better: 24 cores were about 10% faster and 48 cores about 30% faster than before.
Nonetheless, 48 cores are only about 4% faster than 24 cores, which is a little ridiculous. What's even more interesting: using 24 cores I get about 50% CPU usage (exactly 4 of the 8 NUMA nodes run close to 100%, just as it should be). Upping to 48 cores gives me about 60% CPU usage, with 4 NUMA nodes still peaking at 100% while the other 4 are more or less idle at 5% maximum. It also doesn't improve if I massively increase the amount of work done per pixel, so I doubt the atomic operation for pulling from my stack has any influence.
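A minimal sketch of that tile-pulling scheme, written here with std::thread and std::atomic so it stands alone (in the real code the workers would be the bodies of the small parallel_for; all names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each worker repeatedly claims the next unrendered tile with a single
// fetch_add on a shared atomic counter; no mutex is involved, and load
// balancing is fully dynamic because fast workers simply pull more tiles.
void render_tiles(int num_tiles, int num_workers, std::vector<int>& hits)
{
    std::atomic<int> next_tile{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                int tile = next_tile.fetch_add(1, std::memory_order_relaxed);
                if (tile >= num_tiles)
                    break;
                hits[tile] += 1;   // stand-in for rendering one tile
            }
        });
    }
    for (auto& t : workers)
        t.join();
}
```

With ~2000 tiles per frame, each worker executes roughly one atomic increment per tile it renders, which is why the counter itself is unlikely to be the bottleneck.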
Although HyperTransport will slow down memory access a little (from what I have read so far, it should be about 30% slower than direct memory access on the same board), this is nowhere near the performance that should be possible. It actually looks to me as if the TBB scheduler / Windows Server 2008 R2 scheduler puts 2 threads on each core of one board and leaves the second board pretty much idle.
Does anyone have an idea what might be going wrong?
Michael
P.S. By the way, scaling up to 24 cores is pretty OK, considering there is still some serial part in my tracer:
1 Core: Factor 1.0
2 Cores: Factor 2.0
4 Cores: Factor 3.98
8 Cores: Factor 7.6
12 Cores: Factor 10.9
16 Cores: Factor 14.2
24 Cores: Factor 19.5
32 Cores: Factor 23.18
48 Cores: Factor 22.0 <- This is not good
Just did another test:
This time I let the parallel_for loop run not from 0 to the number of cores but from 0 to the number of tiles, with an auto_partitioner. It turned out to be about 40% slower (72s vs. 53s) on my dual Nehalem compared to just popping the tiles from the stack.
In my opinion the partitioners don't work well if the amount of work per entry varies a lot; I guess they assume that each segment does basically the same amount, which is not the case for a raytracer (for example, if you look at a car, the headlights require magnitudes more work than the rest or the background).
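The effect can be illustrated with a toy cost model (the numbers are invented, not the actual scene): under a static equal split of the index range, the worker that happens to own the expensive region dominates the frame time.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Worst-case per-worker cost when the tile range is split into equal
// contiguous chunks (what a static partitioning effectively assumes).
// Assumes 'workers' divides the tile count evenly, for simplicity.
int max_chunk_cost(const std::vector<int>& cost, int workers)
{
    int per = (int)cost.size() / workers;
    int worst = 0;
    for (int w = 0; w < workers; ++w) {
        int sum = 0;
        for (int i = w * per; i < (w + 1) * per; ++i)
            sum += cost[i];
        worst = std::max(worst, sum);
    }
    return worst;
}
```

If 100 of 1600 tiles (the "headlights") each cost 100 units while the rest cost 1, a static 8-way split leaves one worker with 10100 units while the ideal balance is about 1440 per worker; a dynamic pull from a shared counter approaches the ideal automatically.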
Wild idea: have you tried both tbb::mutex and tbb::spin_mutex for the stack, and do you get 100% CPU use in both cases? Have you tried attaching a debugger at various times to find out how the threads are spending their time?
"In my opinion the partitioners don't work well if the amount of work per entry varies a lot; I guess they assume that each segment does basically the same amount, which is not the case for a raytracer (for example, if you look at a car, the headlights require magnitudes more work than the rest or the background)."
That is correct for auto_partitioner, so if you want to use it you should probably hash the tiles, e.g., by inverting the bits of the binary index.
(Added) To avoid confusion: auto_partitioner (the default) has that "problem" and simple_partitioner doesn't, so you might want to specify simple_partitioner explicitly, or use a hashed index with auto_partitioner.
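A sketch of that bit-inversion hash (the 11-bit width assumes roughly 2048 tiles, matching the ~2000 tiles per frame mentioned earlier; purely illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Reverse the low 'bits' bits of a tile index. Consecutive indices,
// which tend to sit in the same expensive image region, are mapped far
// apart, so each contiguous split produced by the partitioner receives
// a mix of cheap and expensive tiles.
uint32_t reverse_bits(uint32_t x, int bits)
{
    uint32_t r = 0;
    for (int i = 0; i < bits; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

Inside the parallel_for body one would then render tile `reverse_bits(i, 11)` instead of tile `i`.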
Currently I have used an atomic fetch_and_increment to get a tile from the stack. Just changed that to a spin_mutex and it slightly increased the time from 300s to 285s (the 70% version was at 245s).
I guess I am really hitting a bandwidth limit here. As far as I can understand what the AMD CodeAnalyst tells me, there are a lot of non-local memory accesses, and the functions that take up most of the time are the intersection functions, which are read-only, while the locks nearly disappear in the overall time.
I think ultimately the formula for good work with so many players may be much like playing good jazz: you can get out front and impress everyone with your machine-gun precision and virtuosity, but if you pay attention to the other players and leave room for them, you'll generally get a better result.
"Currently I have used an atomic fetch_and_increment to get a tile from the stack. Just changed that to a spin_mutex and it slightly increased the time from 300s to 285s (the 70% version was at 245s)."
From 300 seconds to 285 seconds would be a decrease, not an increase, so did you mean "decrease" with those numbers, or are the numbers different? But could you then also try tbb::mutex? My wild idea is about the potential overhead associated with 48 threads contending for access to a single resource, and I'm not 100.00% sure that a tbb::spin_mutex scales, even if it is more efficient than tbb::mutex at low levels of contention.
(Added) It's sort of like several forum participants falling over each other all trying to be the first to solve your problem and claim the glory. :-)
"Both mutices should work worse than lock-free variant that Michael used."
Hmm... yes, fetch_and_increment uses a hardware lock on x86. But then why doesn't the use of spin_mutex decrease performance? (Plural is just "mutexes", though.)
"And spin mutex normally works better than kernel mode mutex (for which tbb::mutex is a wrapper), unless there is oversubscription."
Can you remind us what the oversubscription is about? You have convoying with both kinds of mutexes, but even without oversubscription there's a potential problem with the thundering-herd effect when using a spin_mutex, and I'm not sure that __TBB_LockByte handles that well, although I'm going by what I remember from an earlier version.
"Though tbb::mutex may result in less CPU load because threads will wait inside the kernel instead of spinning."
That was the reason for checking: getting a lock through the interconnect might make the spin_mutex scalability more of an issue. But I'm just guessing, really, until we know what happens after Michael removes "spin_".
(Added) __TBB_LockByte has not changed, I see. Still, each tile takes so much processing effort that the contention levels on this lock should be low enough for a significant benefit to be a surprise, anyway. That means I'm out of ideas for now.
I also tried having multiple stacks without any lock, but it seems my scene was so imbalanced that it didn't show much of a difference. But I can recheck that tomorrow.
I will also try to increase the block size; maybe with the stack approach larger block sizes will work better than the last time I tried them (which still used a simple 2D parallel_for). I will try to get about 500 blocks overall, so each thread will process about 10 blocks on average. Maybe this helps.
I just did another test, starting 2 instances of the software with 24 cores assigned to each instance. Those required between 260 and 270 seconds to render 2 frames of the scene (so essentially 24 cores are faster than 48 cores for that particular scene), which would mean 135 seconds per frame. I am still not sure how to interpret these numbers; one thing that's for sure is that both instances use different memory areas, so they don't interfere with each other.
(Correction) __TBB_LockByte, if #define'd, now refers to a redundant copy of the same code instead of to a tight loop, but it might then still be replaced just the same.
[After skimming the web a little, I agree that the plural of "mutex" should be "mutexes", as it is not a loan word :) ]
In case of oversubscription, a pure spin mutex aggravates the convoying issue: if the thread holding the lock gets preempted, the remaining threads just dumbly spin through the remainders of their time quanta, while with a kernel-mode mutex they quickly get parked in the kernel, the ready queue empties, and the preempted thread is allowed to continue.
The exponential backoff used in TBB spin mutexes alleviates the issue to some extent, especially when oversubscription is not too high and the locked regions of code are short enough (so that if preemption occurs, the spinning threads quickly concede their CPUs to all waiting threads, including the one holding the lock).
I am honestly surprised that the FetchAndIncrement variant takes longer than the one with a mutex, because:
- with an interlocked instruction the locked region is the shortest possible;
- FetchAndIncrement never fails (that is, there is no need to use it in a loop);
- a mutex also uses an interlocked operation, and a heavier one at that (a CAS that can fail, and thus generally requires repetition in a loop).
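The difference between the two primitives can be sketched with std::atomic (a stand-in illustration, not TBB's internals): fetch_add compiles to a single `lock xadd` on x86 and always succeeds, while the CAS formulation must loop until no other thread has raced in between.

```cpp
#include <atomic>
#include <cassert>

// Fetch-and-increment as a single interlocked instruction:
// one shot, cannot fail.
int fai_direct(std::atomic<int>& v)
{
    return v.fetch_add(1);
}

// The same operation built on compare-and-swap, as a mutex acquire
// typically is: the CAS can fail under contention and must retry.
int fai_via_cas(std::atomic<int>& v)
{
    int old = v.load();
    while (!v.compare_exchange_weak(old, old + 1))
        ;   // on failure, 'old' is refreshed with the current value
    return old;
}
```

Both return the pre-increment value; the CAS version simply pays extra round trips whenever another thread wins the race.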
"In case of oversubscription, a pure spin mutex aggravates the convoying issue: if the thread holding the lock gets preempted, the remaining threads just dumbly spin through the remainders of their time quanta, while with a kernel-mode mutex they quickly get parked in the kernel, the ready queue empties, and the preempted thread is allowed to continue."
That seems somewhat plausible. Still, with spin_mutex preferably used only for resources that block only briefly, a thread waiting for access should get its turn fairly quickly, so it may be more efficient to spin a little longer than to get preempted too quickly. So I find this a bit strange. I'll assume it has nevertheless been observed and measured?
"The exponential backoff used in TBB spin mutexes alleviates the issue to some extent, especially when oversubscription is not too high and the locked regions of code are short enough (so that if preemption occurs, the spinning threads quickly concede their CPUs to all waiting threads, including the one holding the lock)."
I'm afraid I don't see what you mean. I see the exponential backoff primarily as a way to dynamically tune down contention for exclusive ownership of a lock's cache line, and my proposal in #31 elaborates on that. One problem with the current implementation is that the backoff increases while a thread is waiting, not just when an acquisition attempt fails, so on a heavily contended resource many candidates spend more time waiting to try again than needed. Another is that after waiting a bit, a contender goes straight to an expensive acquisition attempt instead of first listening whether it's a good time. The current implementation is also rather unfair, favouring newly arrived threads whose backoff has not yet increased as much.
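The "first listening whether it's a good time" part is essentially a test-and-test-and-set lock: spin on a plain read until the lock looks free, and only then issue the expensive interlocked exchange. A minimal std::atomic sketch of the idea (not TBB's actual implementation):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct ttas_lock {
    std::atomic<int> locked{0};

    void lock() {
        for (;;) {
            // Cheap read-only spin: the line stays shared in cache and
            // generates no interconnect traffic while the lock is held.
            while (locked.load(std::memory_order_relaxed))
                ;
            // Only when the lock looks free do we pay for the exchange.
            if (!locked.exchange(1, std::memory_order_acquire))
                return;
        }
    }
    void unlock() { locked.store(0, std::memory_order_release); }
};
```

The exchange still causes a thundering herd at the moment of release, which is exactly the gap a randomized backoff would fill.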
"I am honestly surprised that the FetchAndIncrement variant takes longer than the one with a mutex, because:
- with an interlocked instruction the locked region is the shortest possible;
- FetchAndIncrement never fails (that is, there is no need to use it in a loop);
- a mutex also uses an interlocked operation, and a heavier one at that (a CAS that can fail, and thus generally requires repetition in a loop)."
Indeed. Unless the hardware implementation of an interlocked increment has its own problems, one would expect it to be rather efficient, although I'm not sure the difference is that significant in view of processor speed relative to communication across even a fast interconnect (well, "I don't know" might be more accurate). FetchAndIncrement doesn't fail on x86 because it has a lock prefix on the instruction, but on other architectures it can still be a loop. I have no real idea of its performance relative to a CAS loop.
Well, actually the original purpose of the backoff was to:
- graciously handle the oversubscription situation I mentioned above (by dunking the spinning thread into the kernel periodically, so that the kernel can reschedule an earlier-preempted thread holding the lock);
- decrease power consumption if the code does a blocking operation under the spin lock (not the best practice generally, but people do it, and sometimes it may even be necessary).
More generally, doing this kind of performance analysis is a time-consuming endeavour, as we have to cover multiple platforms/OSes/compilers, and the test results tend not to be reproducible enough to quickly draw reliable conclusions. This is why optimizations that are not absolutely, obviously beneficial often do not make it into the codebase fast enough. Anyway, I'll ask our managers whether they have someone who could investigate the benefit of your proposal once again.
To be exact, these won't work (line 119):
#if !defined(__TBB_CompareAndSwap4) \
|| !defined(__TBB_CompareAndSwap8) \
|| !defined(__TBB_Yield) \
|| !defined(__TBB_full_memory_fence) \
|| !defined(__TBB_release_consistency_helper)
Or is there any compiler flag I need to set for these?
I have high hopes, so I dare ask for a few minutes of your time, but don't count on anything: it hasn't actually been tested on many cores, it apparently disappointed at lower core counts when tested by Intel, and it might even need some tweaking and tuning, if it does anything for you at all. Still, I think I remember seeing reports of good results with something similar.
I am not sure what exactly you are trying to do. From skimming over the code, I guess the idea is this:
Usually, if many cores try to access the same code block at the same time, this will happen over and over again.
Now, if you assign a random sleep to the threads when they encounter the mutex, the chance increases that the next time they want to access the same piece of code they arrive in a different order, probably better interleaved, so they don't interfere as much.
Am I right about that?
Again, it's just an educated guess at this point, but it might be just the thing for your 48 cores. First, though, you should probably try to reproduce the difference you observed between the fetch-and-increment and the lock (with the lock as the very surprising winner), and try other locks (plain mutex, queuing mutex) to see what the variability is and whether there's anything to work with. If the idea turns out to be successful, it might also make a difference for scheduling tasks.
The idea is to treat the lock like an Ethernet medium: wait until the line/lock appears to be free, wait a random amount of time after that, and if nobody is talking/holds the lock yet, just go ahead and try; if unsuccessful, increase the interval from which to choose a random delay, unless yielding seems to be a better solution in the end (that's only for locks). An improvement could be to record the current contention level in the lock itself, so that newly arrived contenders know how to behave politely, because there are many locks and they are not constantly monitored by each thread the way the few Ethernet media are by the connected hardware dedicated to them. Perhaps it might also be better to adopt the actual Ethernet backoff algorithm instead of improvising one, especially with that possibly expensive rand(_r) call.
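A minimal sketch of that Ethernet-style scheme with std::atomic (all names and constants, such as the kMaxShift ceiling, are illustrative guesses, not the actual patch):

```cpp
#include <atomic>
#include <cassert>
#include <random>
#include <thread>

// Listen until the lock appears free, back off a random slot, then try.
// Each failed attempt (a "collision") doubles the window from which the
// random delay is drawn: truncated binary exponential backoff.
struct backoff_lock {
    std::atomic<int> held{0};
    static constexpr int kMaxShift = 10;   // illustrative ceiling

    void lock() {
        thread_local std::minstd_rand rng(std::random_device{}());
        int shift = 1;
        for (;;) {
            while (held.load(std::memory_order_relaxed))
                ;                                   // listen first
            int window = 1 << shift;
            int delay = (int)(rng() % (unsigned)window);
            for (volatile int i = 0; i < delay; ++i)
                ;                                   // crude pause loop
            if (!held.exchange(1, std::memory_order_acquire))
                return;                             // no collision: acquired
            if (shift < kMaxShift)
                ++shift;                            // collision: widen window
        }
    }
    void unlock() { held.store(0, std::memory_order_release); }
};
```

Unlike a plain spin, the random slot spreads out the contenders released by an unlock, so most of them see the lock taken again before paying for an interlocked exchange.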
As a workaround for the freeze-up, you could omit the rand() call and take the full backoff value to see what that gives, or decrease that large number I've put in (if you do a diff you should see it), but I'll have another look tonight for what might have failed.
(Added) Just rand() seems to work on Core Duo/Linux.
(Added 2010-06-24) That big number at line 649 or so appears to need decreasing, to perhaps about 1000 or less, probably for oversubscribed longer blocks. Since I don't know how long a pause really lasts on x86 and friends, maybe the criterion should be a number of ticks instead. The maximum backoff should probably be decreased again as well: it was 16, which I thought was a bit small, but 1<<16 may be a bit too much, although now it takes a larger number of failures to get there.
For the random wait, consider using _rdtsc(), masking off the low n bits, and using the result as the count for your _mm_pause loop. Note that the value of n can increase (up to a ceiling) on each failure to attain the lock. Pseudo-code, written here against a std::atomic<int> lock:

// 'lock' is a std::atomic<int>; nMax caps the backoff window.
int n = 0;
while (lock.load() || lock.exchange(1) == 1)
{
    unsigned mask = (1u << n) - 1;      // 0, 1, 3, 7, ... (2^n - 1)
    if (n < nMax)
        ++n;                            // widen the window on each failure
    unsigned c = (unsigned)__rdtsc() & mask;  // cheap randomness from the TSC
    for (unsigned i = 0; i < c; ++i)
        _mm_pause();
}
Note, if the hold time can be very long, then consider inserting event or condition-variable waits.
TBB is a tasking system, not a threading system, so try to rework the code so that you never have long block times.
Jim Dempsey
That is probably a much more efficient source of randomness. I don't agree with the algorithm, though, because it increases the potential wait on each unsuccessful probe/attempt, leading to longer waits earlier than needed and more unfairness: the idea is to wait only to avoid the thundering-herd problem, which resembles a collision on Ethernet. But I don't have the relevant hardware and budget to easily test any of this, so it's all just hypothetical. Surely somebody else must have already explored this, though?