I recently had the chance to test my raytracer, based on TBB 3.0, on a 48-core AMD NUMA platform (8 processors organized on 2 boards connected via HyperTransport, each board holding 4 CPUs). Sadly, the results were disastrous.
My first try was to just use parallel_for with a grainsize of 24x24, which gave me the best results on SMP machines so far. This resulted in 48 cores actually being about 20% slower than 24 cores.
So my new approach was to use a small parallel_for loop from 1 to the number of cores and maintain a simple stack from which I pull blocks to render (so just 1 atomic increment per tile, with about 2000 tiles per Full HD frame and no mutexes whatsoever). The results were a lot better: 24 cores were about 10% faster, 48 cores about 30% faster than before.
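The scheme described above can be sketched like this (a minimal sketch using plain std::thread and names I invented, not the original code): a single shared atomic counter hands out tile indices, so claiming a tile costs one fetch_add and no mutex.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal sketch of the tile-pulling scheme: one shared atomic counter,
// each worker claims the next tile with a single fetch_add until all
// tiles are taken. "hits" stands in for the actual per-tile rendering.
void render_tiles(int numThreads, int numTiles, std::vector<int>& hits) {
    std::atomic<int> nextTile{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back([&] {
            for (;;) {
                int tile = nextTile.fetch_add(1, std::memory_order_relaxed);
                if (tile >= numTiles) break;   // "stack" is empty, worker exits
                ++hits[tile];                  // stand-in for "render this tile"
            }
        });
    for (auto& w : workers) w.join();
}
```

Since fetch_add hands each index to exactly one worker, every tile is rendered exactly once with no further synchronization.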
Nonetheless: 48 cores are about 4% faster than 24 cores, which is a little ridiculous. What's even more interesting: using 24 cores I get about 50% CPU usage (so exactly 4 of the 8 NUMA nodes run close to 100%, just like it should be). Upping to 48 cores gives me about 60% CPU usage, with still 4 NUMA nodes peaking at 100% while the other 4 are more or less idle at 5% maximum. It also doesn't improve if I massively increase the amount of work to be done per pixel, so I doubt that the atomic operations for pulling from my stack have any influence.
Although HyperTransport will slow down memory access a little (from what I have read so far it should be about 30% slower compared to direct memory access on the same board), this is nowhere near the performance that should be possible. It actually looks to me like the TBB scheduler / Windows Server 2008 R2 scheduler puts 2 threads on each core of one board and leaves the second board pretty much idle.
Does anyone have an idea what might go wrong?
P.S. By the way, scaling up to 24 Cores is pretty ok, considering there is still some serial part in my tracer:
1 Core: Factor 1.0
2 Cores: Factor 2.0
4 Cores: Factor 3.98
8 Cores: Factor 7.6
12 Cores: Factor 10.9
16 Cores: Factor 14.2
24 Cores: Factor 19.5
32 Cores: Factor 23.18
48 Cores: Factor 22.0 <- This is not good
One possible reason for the described behavior could be your process having an affinity mask tying its threads to the cores of only one board, though I have no idea who could have set it. But this does not explain the peak at 32 cores, unless your code does some blocking operations (like waiting on OS synchronization primitives or doing IO).
Another possible explanation is that the amount of work is not sufficient to load up 48 cores. Though in this case you'd likely see all your cores busy (either due to the high overhead of too fine-grained partitioning, or because workers uselessly spin trying to find some work to do, occasionally returning to the OS kernel just to be woken up soon again).
And lastly, the most probable reason is that the memory bandwidth gets saturated at around 32 cores.
But if I can assume that pixels don't need to interact, I would suggest one of two things. Either don't use 2D ranges at all (they seem made for physics problems that want to minimize border-to-surface ratios) and instead use vertical ranges of rows, which could very well have horizontal parallel_for loops nested inside them. Or let the 2D grainsize degenerate to something horizontal, e.g., 256x1: the parallel_for logic will then process wide tiles that should have very little false-sharing overhead. Make it 10000x1 to almost guarantee horizontal stripes of work.
Just setting a very wide grainsize should give you a big boost for less than a minute of recoding, because 24x24 just doesn't sound right. Let us know what this does for you.
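To see why a wide grain yields stripes, here is a toy model of how a 2D range with per-dimension grain sizes gets subdivided (my own simplification for illustration, not TBB source): the divider keeps bisecting whichever dimension is proportionally larger than its grain, which mirrors how blocked_range2d picks the dimension to split.

```cpp
#include <vector>

struct Tile { int rows, cols; };

// Toy model of recursive 2D range splitting with per-dimension grains:
// bisect the dimension whose size/grain ratio is larger, stop when both
// dimensions fit under their grain. (Simplified for illustration.)
void split(int rows, int cols, int rowGrain, int colGrain,
           std::vector<Tile>& leaves) {
    if (rows <= rowGrain && cols <= colGrain) {
        leaves.push_back({rows, cols});   // this becomes one unit of work
        return;
    }
    if ((double)rows / rowGrain >= (double)cols / colGrain) {
        split(rows / 2, cols, rowGrain, colGrain, leaves);
        split(rows - rows / 2, cols, rowGrain, colGrain, leaves);
    } else {
        split(rows, cols / 2, rowGrain, colGrain, leaves);
        split(rows, cols - cols / 2, rowGrain, colGrain, leaves);
    }
}
```

For a Full HD frame (1080 rows x 1920 columns) with a grain of 1 row x 10000 columns, every leaf ends up a full-width 1x1920 stripe: only the row dimension ever gets split.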
Your query suggests so many interesting questions. To start off, since you're talking about a ray tracer, presumably the overhead is NOT in writing the final pixels but doing the ray casting and intersection testing, bouncing around the 3D data structures following reflections and refractions. Any idea how those data are situated? Do they all reside in the memory on one card or the other? On one socket of the one card?
How's the comparative bandwidth saturation on the two cards? If we postulate a model where all the data are on one side and so you have fast and slow accesses, these scaling results would suggest that 24 cores don't saturate memory bandwidth because 32 cores give a higher number, but maybe saturation occurs between 32 and 48? Or maybe the data are distributed and this is a completely shallow analysis. More details are needed.
I think the memory issue might be the problem. Indeed, most of the time is spent in the intersection functions, at least that is what the profiler tells me. And it seems to be during the traversal of the acceleration structure, when a new node is fetched.
I allocate all memory in the main thread (using scalable_aligned_malloc), so I assume the whole scene is on one board. According to a paper I found on the AMD website, cross-board accesses should take 30% longer, but I am not sure what bandwidth the system has. My scenes generally don't fit into any cache (usual memory requirements are between 2GB and 8GB, but we have also had scenes that required 24GB). So it is really likely that the inter-board connection becomes the bottleneck.
I also tried a small test scene that should fit completely into cache while still doing a lot of work per pixel (many indirections and so on), and the render times were a bit more promising: 24 cores required 18 seconds while 48 cores required 12 seconds, so it shows a much better speedup. CPU usage still peaks at 50-60% though.
The question is what I can do about it. My idea would be to actually duplicate the data so each board (or even each NUMA node) gets its own scene to work with. But how can I let the tasks know on which board they run? How can I determine this at runtime? Basically I would require something like a "NUMANodeId()" I could query when a task is executed, so I can redirect my memory accesses. Of course, the OS shouldn't interfere afterwards and move the thread to another node.
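For what it's worth, both OS families do expose roughly the query being asked for; a hedged sketch (the Windows branch uses GetCurrentProcessorNumber, whose result can then be mapped to a node with GetNumaProcessorNode; the fallback uses glibc's sched_getcpu, mappable to a node via libnuma's numa_node_of_cpu):

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <sched.h>   // sched_getcpu() is a glibc extension
#endif

// Sketch of a "which processor am I running on?" query that a task could
// call at the start of its body. Translating the processor number to a
// NUMA node is a separate lookup (GetNumaProcessorNode on Windows,
// numa_node_of_cpu from libnuma on Linux).
int current_processor() {
#ifdef _WIN32
    return static_cast<int>(GetCurrentProcessorNumber());
#else
    return sched_getcpu();
#endif
}
```

Note the caveat from the post: the OS may migrate the thread right after the call, so for stable placement you would also have to pin worker threads via affinity masks.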
Are you saying that "Windows 2008 Server R2" does not have "NUMANodeId()" yet, or wondering how to apply it with TBB and memory allocation? Last time I looked (not very recently, though, and I can't check again right now) there was no NUMA awareness in the scalable allocator, but that's where open source can work miracles.
Evidently, to benefit from this feature the program must not initialize the whole array right where it was allocated, but postpone initialization until the moment the iteration space is partitioned and its chunks are ready to be processed by worker threads (so that the "first touch" happens on the nodes where the processing of the corresponding chunk will take place).
- Allocate the array(s) on the main thread, but do not load any data into them at this point
- Use your "small parallel_for from 1 to number of cores" to do higher-level partitioning, and initialize the corresponding parts of your array(s) in its body
- Use nested parallel_for to further partition the large chunks you get at the first level
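The three steps above might look like this (a bare std::thread sketch under the assumption that the OS places pages on first write; the names are mine):

```cpp
#include <cstdlib>
#include <thread>
#include <vector>

// First-touch sketch: allocate on the main thread WITHOUT writing anything,
// then let each worker initialize its own chunk, so that (under a first-touch
// page placement policy) the pages backing the chunk land on the worker's node.
float* first_touch_alloc(std::size_t n, int numWorkers) {
    // Step 1: allocation only -- no data is loaded here.
    float* data = static_cast<float*>(std::malloc(n * sizeof(float)));
    // Step 2: high-level partitioning; each worker first-touches its part.
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w)
        workers.emplace_back([=] {
            std::size_t begin = w * n / numWorkers;
            std::size_t end = (w + 1) * n / numWorkers;
            for (std::size_t i = begin; i < end; ++i)
                data[i] = 1.0f;   // the first touch happens on the worker
        });
    for (auto& t : workers) t.join();
    // Step 3 (not shown): nested parallelism would further split each chunk.
    return data;
}
```

Whether this actually helps depends on the worker threads staying on the nodes where they did the first touch, which is exactly the affinity concern raised earlier.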
#7 "Actually allocator should not necessary be NUMA aware."
The scalable allocator recognises threads for block-level ownership because that's all that matters on non-NUMA systems, but aren't blocks allocated all together from an arbitrary (from the point of view of NUMA) mmap region, so that even if a thread first touches a block it still ends up with memory whose mmap region was first touched on another NUMA node? Or have I missed a development? I'll have another look tonight...
Ah, so it's per page, not per mmap region? But then the question remains what people who thought they'd get better performance with large memory pages are going to do (including Intel's own Page Size Extension up to 4MB, and with Westmere even going to 1 GB, if I can believe Wikipedia)?
#8 "I was wondering how I should apply this with the tbb."
Assuming embarrassing parallelism except for shared access to the scene description, I suppose you need something like NLS (node-local storage), so to say, with the first user making a deep copy from the original? But how about letting each NUMA node process only its "own" frame, instead, or is there a serial dependency between successive frames that would prevent such a solution?
However, aren't we getting ahead of ourselves? Interboard bandwidth saturation is still only a theory. Do you have any data to determine what might be happening? Intel Core processors have means to measure various bandwidth indicators. I presume those AMD chips have something similar via Oprofile or some such?
Finally, perhaps I should ask what OS. I've wondered about the description provided for VirtualAlloc:
It's the "You can commit reserved pages in subsequent calls..." that gives me pause. I haven't found chapter and verse on Linux mmap (-1) (no file mapping) but what I have seen refers to copy-on-write semantics, which would suggest a dynamic commit of pages. As it stands, the TBB MapMemory call that uses VirtualAlloc does so like this:
return VirtualAlloc(NULL, bytes, (MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN), PAGE_READWRITE);
I haven't studied this code for quite a while, but the way I read it, it looks like VirtualAlloc would commit the whole BigBlock on the thread that initializes tbbmalloc, whereas if mmap works the way I think it does, threads writing to allocated pages would commit them locally then. Switching back to wild speculation, allocating the entire scene graph on one board would limit HyperTransport data transfer to one direction, whereas if elements of the scene graph were scattered by the whim of the initializing threads, the randomized accesses would at least use the HyperTransport in both directions.
So, got data?
So for the test, the task:
- runs 100% inside the cache
- has no memory accesses that could limit bandwidth
- uses no locks whatsoever
Result: It sticks at 69% CPU usage.
Then I just created 48 threads myself and reran the test:
Result: 100% CPU usage.
Next step will be to really do the raytracing using the other threads; I wonder how that will turn out. But right now I would say there is a serious bug inside TBB, whatever it is.
I will report back once I have the results.
Now it scales equally to the normal thread version, so when doing a lot of work per pixel, 48 cores are faster than 32 cores.
Scaling is still not optimal (I only get a factor of 27.5x for 48 cores, 24x for 32 cores and 20x for 24 cores), but there may be another cause for this.
for( size_t idx = 0; idx < numCores; ++idx )
    tasks.push_back( *new( _main_dummy_task->allocate_child() ) RenderTask() ); // "RenderTask" is a placeholder; the actual task type was cut off in the post
_main_dummy_task->set_ref_count( numCores + 1 );
This is called for every subframe I render (128 subframes). I'd like to skip the whole push_back thing, but I am not sure how I can prevent the tasks from getting deleted once they finish.
BUT: I get 100% CPU usage with that, yet it is slower than using
which only uses at most 69% of the CPU on scenes that don't do so much work between two calls.
And it is not a small difference: the task-based version renders for 300 seconds while the parallel_for version renders for 245 seconds.
So I am not sure why this is, and I am finally out of ideas. To me it looks like something is burning CPU cycles without doing any good.
"This is called for every subframe I render (128 subframes)."
So numCores is 48 and you give each of the 128 subframes to those 48 cores, one task per core? How does each core know what to do? Also, this way not enough parallel slack is being generated, causing the cores that finish first to sit idle until the next round of tasks. You should always try to generate a number of tasks that is many times the number of cores, so that the fraction of time some cores are idling is relatively small. Normally parallel_for() does this for you, but if you are ever tempted to use something like numCores, an alarm flag should go up and you should probably use something like 10*numCores instead.
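The idle-cores effect is easy to reproduce with a toy scheduling model (my numbers, not the tracer's): greedily schedule 480 tile costs onto 48 cores, once as 48 coarse per-core chunks and once as 480 individual tasks. With uneven costs, the coarse version's finish time is dominated by the single most expensive chunk.

```cpp
#include <algorithm>
#include <vector>

// Greedy list scheduling: each task is given to the core that becomes free
// first; returns the makespan (the time at which the last core finishes).
long makespan(const std::vector<long>& tasks, int cores) {
    std::vector<long> busyUntil(cores, 0);
    for (long t : tasks)
        *std::min_element(busyUntil.begin(), busyUntil.end()) += t;
    return *std::max_element(busyUntil.begin(), busyUntil.end());
}
```

With 480 tiles where the first 10 cost 50 units and the rest cost 1, grouping them into 48 contiguous 10-tile chunks gives a makespan of 500 (one chunk holds all the expensive tiles while 47 cores idle after 10 units), whereas scheduling the 480 tiles individually finishes at 50.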
"I'd like to skip the whole push_back thing, but I am not sure how I can prevent the tasks from getting deleted once they finish."
(Added) As for burning cycles: wouldn't you then be able to catch one in the act by attaching a debugger?
while( stack not empty)
Once the stack is empty, the task returns (so each task runs until there is no more work to do). This is a pretty common approach in raytracing software and it seems to work fine... usually.
For the next subframe, the same is done (this can't be changed since there might be some changes between subframes, so it is essential to restart the tasks). So basically you render 128 x 3600 tiles; I'd say there is plenty of work to do in parallel for 48 cores, especially since a single pixel uses up many thousands of CPU cycles.
And as I've written, I have tried using parallel_for to pull from the stack, or even just a parallel_for over the whole image, and it was consistently slower by a large margin, like everything else I tried. CPU usage went up, performance went down, that's it. And this is true for all platforms I tried (2x 4-core Nehalem, 2x 6-core Nehalem and the 8x 4-core AMD). It just doesn't make any sense at all.
But I just tried normal Windows threads and they show the same behaviour: 100% CPU usage and a slower speed compared to the 69% CPU usage version.