TBB on NUMA Platform?

michaelnikelsky1 · ‎06-17-2010

Hi there,

I recently had the chance to test my raytracer, based on TBB 3.0, on a 48-core AMD NUMA platform (8 processors organized on 2 boards connected with HyperTransport, each board with 4 CPUs). Sadly, the results were disastrous.
My first try was to just use parallel_for with a grainsize of 24x24, which had given me the best results on SMP machines so far. This resulted in 48 cores being about 20% slower than 24 cores.

So my new approach was to just use a small parallel_for loop from 1 to the number of cores and maintain a simple stack from which I pull blocks to render (so just 1 atomic increment per tile, with about 2000 tiles per Full HD frame, and no mutexes whatsoever). The results were a lot better: 24 cores were about 10% faster and 48 cores about 30% faster than before.
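For readers who want to see the tile-pulling scheme in code, here is a minimal sketch. It is not the poster's code: it uses plain std::thread workers instead of a TBB parallel_for, and `render_tile` and `render_frame` are hypothetical names invented for illustration. The only shared state is one atomic counter that each worker increments to claim its next tile.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-tile rendering work.
// Here it only records that the tile was visited.
inline void render_tile(std::size_t tile, std::vector<int>& visits) {
    visits[tile] += 1;  // each tile index is claimed by exactly one worker
}

// Renders `num_tiles` tiles with `num_workers` threads. Each worker claims
// the next tile with a single atomic increment; no mutex is involved.
// Returns how often each tile was rendered (expected: once per tile).
inline std::vector<int> render_frame(std::size_t num_tiles, unsigned num_workers) {
    assert(num_workers > 0);
    std::vector<int> visits(num_tiles, 0);
    std::atomic<std::size_t> next_tile{0};

    auto worker = [&] {
        for (;;) {
            const std::size_t tile = next_tile.fetch_add(1, std::memory_order_relaxed);
            if (tile >= num_tiles) break;  // no tiles left for this frame
            render_tile(tile, visits);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < num_workers; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();  // join makes all writes visible to the caller
    return visits;
}
```

Calling `render_frame(2000, 48)` would correspond roughly to the setup described above; every worker keeps running until the counter passes the last tile.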

Nonetheless, 48 cores are only about 4% faster than 24 cores, which is a little bit ridiculous. What's even more interesting: using 24 cores I get about 50% CPU usage (so exactly 4 of the 8 NUMA nodes run close to 100%, just as it should be). Going up to 48 cores gives me about 60% CPU usage, with still 4 NUMA nodes peaking at 100% while the other 4 are more or less idle at 5% maximum. It also doesn't improve if I massively increase the amount of work to be done for each pixel, so I doubt that the atomic operations for pulling from my stack have any influence.

Although HyperTransport will slow down memory access a little (from what I have read so far, it should be 30% slower than a direct memory access on the same board), this is nowhere near the performance that should be possible. It actually looks to me like the TBB scheduler / Windows 2008 Server R2 scheduler puts 2 threads on each core of one board and leaves the second board pretty much idle.

Does anyone have an idea what might be going wrong?

Michael

P.S. By the way, scaling up to 24 Cores is pretty ok, considering there is still some serial part in my tracer:

1 Core: Factor 1.0
2 Cores: Factor 2.0
4 Cores: Factor 3.98
8 Cores: Factor 7.6
12 Cores: Factor 10.9
16 Cores: Factor 14.2
24 Cores: Factor 19.5
32 Cores: Factor 23.18
48 Cores: Factor 22.0 <- This is not good
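One way to read a table like this is to turn each factor into a parallel efficiency and into the serial fraction it would imply under Amdahl's law (the Karp-Flatt metric). The sketch below is only an illustration; the function names are hypothetical, and the metric assumes a fixed problem size.

```cpp
#include <cassert>

// Parallel efficiency: measured speedup divided by the number of cores.
inline double efficiency(double speedup, int cores) {
    assert(cores > 0 && speedup > 0.0);
    return speedup / cores;
}

// Karp-Flatt metric: the serial fraction implied by a measured speedup,
// e = (1/S - 1/p) / (1 - 1/p).
inline double karp_flatt(double speedup, int cores) {
    assert(cores > 1 && speedup > 0.0);
    const double inv_p = 1.0 / cores;
    return (1.0 / speedup - inv_p) / (1.0 - inv_p);
}
```

With the figures above, `karp_flatt(19.5, 24)` comes out at roughly 1%, while `karp_flatt(22.0, 48)` is roughly 2.5%. A value that rises with the core count points to growing overhead or contention rather than to a fixed serial part.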

Andrey_Marochko · ‎06-17-2010

The TBB task scheduler does not control how threads are distributed across the available cores. That belongs entirely to the OS scheduler's domain.

One possible reason for the described behavior is that your process has an affinity mask tying its threads to the cores of only one board, though I have no idea who could have set it. But this does not explain the peak at 32 cores unless your code does some blocking operations (like waiting on OS synchronization primitives or doing I/O).

Another possible explanation is that the amount of work is not sufficient to load up 48 cores. Though in this case you'd likely see all your cores busy (either due to the high overhead of too fine-grained partitioning, or because workers uselessly spin trying to find some work to do, occasionally returning to the OS kernel just to be woken up again soon).

And lastly, the most probable reason is that the memory bandwidth gets saturated at around 32 cores.

RafSchietekat · ‎06-17-2010

Unlike a recent linear situation, correct alignment seems very important in a 2D situation, because cuts in the horizontal are repeated over a large number of rows, and because grainsize is not there to provide alignment, you have to make each unit in grainsize space represent a correctly aligned unit of work in hardware space. Maybe sometime TBB will provide the sugar for that, but it's easy enough to do it yourself.

But if I can assume that pixels don't need to interact, I would suggest either not to use 2D ranges, which seem made for physics problems that want to minimize border-to-surface ratios, just vertical ranges of rows that could very well have horizontal parallel_for loops nested inside them, or to let the 2D grainsize degenerate to something that's horizontal, e.g., 256x1. The parallel_for logic will then process wide tiles that should have very little false-sharing overhead. Make it 10000x1 to almost guarantee horizontal stripes of work.

Just setting a very wide grainsize should give you a big boost for less than a minute of recoding, because 24x24 just doesn't sound right. Let us know what this does for you.

robert-reed · ‎06-17-2010

Your query suggests so many interesting questions. To start off, since you're talking about a ray tracer, presumably the overhead is NOT in writing the final pixels but doing the ray casting and intersection testing, bouncing around the 3D data structures following reflections and refractions. Any idea how those data are situated? Do they all reside in the memory on one card or the other? On one socket of the one card?

How's the comparative bandwidth saturation on the two cards? If we postulate a model where all the data are on one side and so you have fast and slow accesses, these scaling results would suggest that 24 cores don't saturate memory bandwidth because 32 cores give a higher number, but maybe saturation occurs between 32 and 48? Or maybe the data are distributed and this is a completely shallow analysis. More details are needed.

RafSchietekat · ‎06-17-2010

Do try that different grainsize ratio just to see what happens, but I've been way too optimistic about what you may expect with this amount of work involved per tile before the result is written, as Robert rightly pointed out, so quite probably the difference will disappear in the noise instead. I just quickly wrote this because there was this other question about grainsize just the other day and because 24x24 looks so strange: you might as well not provide a grainsize at all if you're using the (default) auto_partitioner, and with simple_partitioner the tiles are still going to be smaller than 24x24 (horizontally you'll get 1920->...->15 and vertically you'll get 1200->...->75->37,38->18,19, so the tiles will be 15x18 and 15x19, or just a little over 8000 per frame).
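The tile arithmetic for simple_partitioner can be checked with a small model. This sketch assumes the usual blocked-range rule (a range is split in half for as long as its size exceeds the grainsize) and treats the two dimensions independently. `count_leaves` and `smallest_leaf` are hypothetical helper names, not TBB functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Counts the leaf ranges produced by halving [0, size) for as long as a
// piece is larger than `grainsize`.
inline std::size_t count_leaves(std::size_t size, std::size_t grainsize) {
    assert(grainsize > 0);
    if (size <= grainsize) return 1;
    const std::size_t left = size / 2;  // lower half, rounded down
    return count_leaves(left, grainsize) + count_leaves(size - left, grainsize);
}

// Returns the size of the smallest leaf produced by the same halving rule.
inline std::size_t smallest_leaf(std::size_t size, std::size_t grainsize) {
    assert(grainsize > 0);
    if (size <= grainsize) return size;
    const std::size_t left = size / 2;
    return std::min(smallest_leaf(left, grainsize),
                    smallest_leaf(size - left, grainsize));
}
```

For a 1920x1200 frame and a grainsize of 24, `count_leaves(1920, 24) * count_leaves(1200, 24)` gives 128 * 64 = 8192 tiles, which matches the "little over 8000" estimate.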

michaelnikelsky1 · ‎06-18-2010

The grainsize is not the problem. 24x24 pixel tiles are the optimum, while 16x16 tiles are nearly as good and sometimes better because the workload can be distributed more evenly. It has to be 2D as well because of the way the raytracer works, and since it is actually one of the fastest tracers on the market it is safe to assume that this is not the problem. As I said, just keeping the tiles on a stack and retrieving the work from the stack works best and is faster than any parallel_for loop, which actually makes sense since this keeps the task running for the longest time possible while limiting the amount of interaction required. Also, when I increase the amount of work to be done (so more indirections with many, many more rays to send through the scene), the necessary lock for the pop operation will vanish into the noise.

I think the memory issue might be the problem. Indeed, most of the time is spent in the intersection functions; at least this is what the profiler tells me. And it seems to be during the traversal of the acceleration structure, when a new node is fetched.

I allocate all memory in the main thread (using scalable_aligned_malloc), so I assume that the whole scene is on one board. According to a paper I found on the AMD website, cross-board accesses should take 30% longer, but I am not sure what bandwidth the system has. My scenes generally don't fit into any cache (usual memory requirements are between 2GB and 8GB, but we also had some scenes that required 24GB), so it is really likely that the inter-board connection becomes the bottleneck.
I also tried a small test scene that should fit completely into cache while still doing a lot of work per pixel (many indirections and so on), and the render times were a bit more promising, with 24 cores requiring 18 seconds and 48 cores requiring 12 seconds, so it shows a much better speedup. CPU usage is still peaking at 50-60% though.

The question is: what can I do about it? My idea would be to actually duplicate the data so each board (or even each NUMA node) gets its own copy of the scene to work with. But how can I let the tasks know which board they are running on? How can I determine this at runtime? Basically I would need something like "NUMANodeId()" that I could query when a task is executed so I can redirect my memory accesses. Of course, the OS shouldn't interfere afterwards and move the thread to another node.

Any ideas?

RafSchietekat · ‎06-18-2010

Are you saying that "Windows 2008 Server R2" does not have "NUMANodeId()" yet, or wondering how to apply it with TBB and memory allocation? Last time I looked (not very recently, though, and I can't check again right now) there was no NUMA awareness in the scalable allocator, but that's where open source can work miracles.

Andrey_Marochko · ‎06-18-2010

Actually, the allocator does not necessarily need to be NUMA aware. Modern OSes internally use a first-touch mapping policy. Thus, even if you allocated the whole array on one node but made the first access to a particular element from another one, the physical memory will be mapped from the second node's bank.

Evidently, to benefit from this feature the program must not initialize the whole array right where it was allocated, but postpone initialization until the moment the iteration space is partitioned and its chunks are ready to be processed by worker threads (so that the "first touch" happens on the nodes where the processing of the corresponding chunk will take place).

michaelnikelsky1 · ‎06-18-2010

I was wondering how I should apply this with the tbb.
All data is initialized in the main thread, so I assume it always ends up on one board.

Andrey_Marochko · ‎06-18-2010

I think you could try the following algorithm:

1. Allocate the array(s) on the main thread, but do not load any data into them at this point.
2. Use your "small parallel_for from 1 to number of cores" to do higher-level partitioning, and initialize the corresponding parts of your array(s) in its body.
3. Use a nested parallel_for to further partition the large chunks you get at the first level.

This approach may not achieve optimal load balance, but at least it should demonstrate significant improvement IF our problem really is in inter-board communication.
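The first two steps can be sketched without TBB as follows. This is only an illustration under assumptions: it uses plain std::thread workers in place of the outer parallel_for, it does not pin threads to nodes, and `first_touch_sum` is a hypothetical name. Whether pages really end up on the touching thread's node depends on the OS policy and on the threads staying where they first ran.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Allocates `n` floats without writing to them, then lets each worker thread
// initialize (first touch) and afterwards process its own contiguous chunk.
// Returns the sum over all elements as a simple check.
inline double first_touch_sum(std::size_t n, unsigned num_workers) {
    assert(num_workers > 0);
    // new float[n] leaves the elements uninitialized, so the allocating
    // thread does not write to any of the pages here.
    std::unique_ptr<float[]> data(new float[n]);
    std::vector<double> partial(num_workers, 0.0);

    auto worker = [&](unsigned w) {
        const std::size_t begin = n * w / num_workers;
        const std::size_t end = n * (w + 1) / num_workers;
        for (std::size_t i = begin; i < end; ++i) data[i] = 1.0f;  // first touch
        double sum = 0.0;
        for (std::size_t i = begin; i < end; ++i) sum += data[i];  // processing
        partial[w] = sum;
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) pool.emplace_back(worker, w);
    for (auto& t : pool) t.join();

    double total = 0.0;
    for (double s : partial) total += s;
    return total;
}
```

The third step, further partitioning inside each chunk, would replace the two inner loops with a nested parallel loop over `[begin, end)`.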

RafSchietekat · ‎06-18-2010

#7 "Actually, the allocator does not necessarily need to be NUMA aware."
The scalable allocator recognises threads for block-level ownership because that's all that matters on non-NUMA systems, but aren't blocks allocated all together from an arbitrary (from the point of view of NUMA) mmap region, so that even if a thread first touches a block it still ends up with memory whose mmap region was first touched on another NUMA node? Or have I missed a development? I'll have another look tonight...

Andrey_Marochko · ‎06-18-2010

From what I know (unfortunately I cannot find an appropriate document at the moment), neither Linux nor Windows actually commits physical memory right at the moment of the request (even when you call VirtualAlloc with the MEM_COMMIT flag). Instead they postpone the mapping of a virtual memory page until the first access to this page. Thus your contiguous region of virtual address space ends up being mapped to physical memory pages belonging to different NUMA nodes.

RafSchietekat · ‎06-18-2010

#11 "Thus your contiguous region of virtual address space ends up being mapped to physical memory pages belonging to different NUMA nodes."
Ah, so it's per page, not per mmap region? But then the question remains what people who thought they'd get better performance with large memory pages are going to do (including Intel's own Page Size Extension up to 4MB, and with Westmere even going to 1 GB, if I can believe Wikipedia)?

RafSchietekat · ‎06-18-2010

#8 "I was wondering how I should apply this with the tbb."
Assuming embarrassing parallelism except for shared access to the scene description, I suppose you need something like NLS (node-local storage), so to say, with the first user making a deep copy from the original? But how about letting each NUMA node process only its "own" frame, instead, or is there a serial dependency between successive frames that would prevent such a solution?

robert-reed · ‎06-20-2010

Even if there aren't dependence issues between frames, memory has to hold twice the 2-24 GB of scene data, with half the number of available threads working on each frame. But if you're going to double the scene graph, you might as well have a copy of the same scene on each board and let the threads on each chase the local graph to process pixels for, say, its side of the viewport, or interleaved bands, whatever works best.

However, aren't we getting ahead of ourselves? Interboard bandwidth saturation is still only a theory. Do you have any data to determine what might be happening? Intel Core processors have means to measure various bandwidth indicators. I presume those AMD chips have something similar via Oprofile or some such?

Finally, perhaps I should ask what OS. I've wondered about the description provided for VirtualAlloc:

It's the "You can commit reserved pages in subsequent calls..." that gives me pause. I haven't found chapter and verse on Linux mmap (-1) (no file mapping) but what I have seen refers to copy-on-write semantics, which would suggest a dynamic commit of pages. As it stands, the TBB MapMemory call that uses VirtualAlloc does so like this:

return VirtualAlloc(NULL, bytes, (MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN), PAGE_READWRITE);

I haven't studied this code for quite a while, but the way I read it, it looks like VirtualAlloc would commit the whole BigBlock on the thread that initializes tbbmalloc, whereas if mmap works the way I think it does, threads writing to allocated pages would commit them locally then. Switching back to wild speculation, allocating the entire scene graph on one board would limit HyperTransport data transfer to one direction, whereas if elements of the scene graph were scattered by the whim of the initializing threads, the randomized accesses would at least use the HyperTransport in both directions.

So, got data?

michaelnikelsky1 · ‎06-21-2010

Yes, you are right, we were ahead of ourselves. I just did a burn in test, a task going into a loooooong loop doing nothing more than summing up numbers.

So for the test the task

-runs 100% inside the cache
-there are no memory accesses that could limit bandwidth
-no locks whatsoever

Result: It sticks at 69% CPU usage.

Then I just created 48 threads myself and reran the test:

Result: 100% CPU usage.

Next step will be to really do the raytracing using the other threads, I wonder how that will turn out. But right now I would say there is a serious bug inside the TBB, whatever it is.

I will report back once I have the results.

michaelnikelsky1 · ‎06-21-2010

OK, I finally managed to get TBB to use 100% CPU as well; using a parallel_for to start the tasks was a bad idea, it seems.

Now it scales the same as the plain thread version, so when doing a lot of work per pixel, 48 cores are faster than 32 cores.

Scaling is still not optimal, but there may be another cause for this (I only get a factor of 27.5x for 48 cores, 24x for 32 cores and 20x for 24 cores).

Andrey_Marochko · ‎06-21-2010

Do you mean that you are not using parallel_for now? What do you do then?

michaelnikelsky1 · ‎06-21-2010

That's what I am doing now:

_task_group_context->reset();

tbb::task_list tasks;

for( size_t idx = 0; idx < numCores; ++idx)
    tasks.push_back( *new( _main_dummy_task->allocate_child())
        TraceImageTask(_private, pixelOffset));

_main_dummy_task->set_ref_count(numCores + 1);
_main_dummy_task->spawn_and_wait_for_all( tasks);

This is called for every subframe I render (128 subframes). I'd like to skip all the push_back business, but I am not sure how I can prevent the tasks from getting deleted once they finish.

BUT: I get 100% CPU usage with that, yet it is slower than using

tbb::parallel_for( tbb::blocked_range(0, numCores, 1), vrRTHQTileTraceTask<....

which only uses at most 69% of the CPU on scenes that don't do much work between two calls.

And it is not a small difference: the task-based version renders for 300 seconds while the parallel_for version renders for 245 seconds.

So I am not sure why this is, and I am finally out of ideas. To me it looks like something is burning CPU cycles without doing any good.

RafSchietekat · ‎06-21-2010

"This is called for every subframe I render (128 subframes)."
So numCores is 48 and you give each of 128 subframes to each of those 48 cores? How does each core know what to do? Also, this way not enough parallel slack is being generated, causing the cores that finish first to sit idle until the next round of tasks. You should always try to generate a number of tasks that is many times the number of cores, so that the fraction of time that some cores are idling is relatively small. Normally parallel_for() does this for you, but if you are ever tempted to use something like numCores an alarm flag should go up and you should probably use something like 10*numCores instead.

"Id like to skip all the push_back thing, but I am not shure how I can prevent the tasks from getting deletet once they finish."
recycle_as_child_of()

"tbb::parallel_for( tbb::blocked_range(0, numCores, 1), vrRTHQTileTraceTask<...."
Again very suspicious code, with use of numCores preventing parallel slack instead of encouraging it.

(Added) As for burning cycles: wouldn't you then be able to catch one in the act by attaching a debugger?
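The parallel-slack argument can be illustrated with a toy scheduling model. This is a sketch under stated assumptions, not a measurement of TBB: the work items are cut into contiguous chunks, each chunk goes to whichever worker becomes free first, and the item costs are a made-up ramp. `makespan` and `ramp_costs` are hypothetical names. The model also does not describe a renderer whose workers already pull small tiles dynamically.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `costs` into `num_chunks` contiguous chunks, hands each chunk in
// order to the worker that becomes free first, and returns the finish time
// of the last worker (the makespan).
inline long makespan(const std::vector<long>& costs,
                     std::size_t num_chunks, std::size_t num_workers) {
    assert(num_chunks > 0 && num_workers > 0);
    std::vector<long> busy_until(num_workers, 0);
    const std::size_t n = costs.size();
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t begin = n * c / num_chunks;
        const std::size_t end = n * (c + 1) / num_chunks;
        long chunk_cost = 0;
        for (std::size_t i = begin; i < end; ++i) chunk_cost += costs[i];
        // The earliest-free worker takes the next chunk.
        auto next_free = std::min_element(busy_until.begin(), busy_until.end());
        *next_free += chunk_cost;
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}

// Hypothetical workload: item i costs i + 1 units, so later items are heavier.
inline std::vector<long> ramp_costs(std::size_t n) {
    std::vector<long> costs(n);
    for (std::size_t i = 0; i < n; ++i) costs[i] = static_cast<long>(i) + 1;
    return costs;
}
```

With 400 ramp items on 4 workers, one chunk per worker leaves three workers waiting for the one that got the heaviest quarter, whereas 40 chunks (10 per worker) finish noticeably closer to the ideal of a quarter of the total cost.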

michaelnikelsky1 · ‎06-21-2010

No, for rendering an image (1920x1080) for example, about 3600 tiles are created on a stack. Now for each Core a task is started that does something like

while( stack not empty)
doWork;

Once the stack is empty, the task returns (so each task will run until there is no more work to do). This is a pretty common approach in raytracing software and it seems to work fine....usually....

For the next subframe, the same is done (this can't be changed since there might be some changes between subframes, so it is essential to restart the tasks). So basically you render 128 x (3600 tiles). I'd say there is plenty of work to do in parallel for 48 cores, especially since a single pixel uses up many thousands of CPU cycles.

And as I've written, I have tried using parallel_for to pull from the stack, or even just a parallel_for over the whole image, and it was consistently slower by a large margin, like everything else I tried. CPU usage went up, performance went down, that's it. And this is true for all platforms I tried (2x 4-core Nehalem, 2x 6-core Nehalem and the 8x 4-core AMD). It just doesn't make any sense at all.

But I just tried normal Windows threads and they show the same behaviour: 100% CPU usage and a slower speed compared to the 69% CPU usage.