Your query suggests so many interesting questions. To start off, since you're talking about a ray tracer, presumably the overhead is NOT in writing the final pixels but in doing the ray casting and intersection testing, bouncing around the 3D data structures to follow reflections and refractions. Any idea how those data are situated? Do they all reside in the memory attached to one card or the other? To one socket of one card?
How's the comparative bandwidth saturation on the two cards? If we postulate a model where all the data are on one side and so you have fast and slow accesses, these scaling results would suggest that 24 cores don't saturate memory bandwidth because 32 cores give a higher number, but maybe saturation occurs between 32 and 48? Or maybe the data are distributed and this is a completely shallow analysis. More details are needed.
Are you saying that "Windows 2008 Server R2" does not have "NUMANodeId()" yet, or wondering how to apply it with TBB and memory allocation? Last time I looked (not very recently, though, and I can't check again right now) there was no NUMA awareness in the scalable allocator, but that's where open source can work miracles.
#7 "Actually allocator should not necessary be NUMA aware."
The scalable allocator recognises threads for block-level ownership, because that's all that matters on non-NUMA systems. But aren't the blocks all carved out of a single mmap region that is arbitrary from the point of view of NUMA, so that even if a thread is the first to touch a block, it can still end up with memory whose mmap region was first touched on another NUMA node? Or have I missed a development? I'll have another look tonight...
#8 "I was wondering how I should apply this with the tbb."
Assuming embarrassing parallelism except for shared access to the scene description, I suppose you need something like NLS (node-local storage), so to say, with the first user making a deep copy from the original? But how about letting each NUMA node process only its "own" frame, instead, or is there a serial dependency between successive frames that would prevent such a solution?
It's the "You can commit reserved pages in subsequent calls..." in the VirtualAlloc documentation that gives me pause. I haven't found chapter and verse on anonymous mmap on Linux (fd = -1, no file mapping), but what I have seen refers to copy-on-write semantics, which would suggest that pages are committed dynamically. As it stands, the TBB MapMemory call that uses VirtualAlloc does so like this:
return VirtualAlloc(NULL, bytes, (MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN), PAGE_READWRITE);
I haven't studied this code for quite a while, but the way I read it, it looks like VirtualAlloc would commit the whole BigBlock on the thread that initializes tbbmalloc, whereas if mmap works the way I think it does, threads writing to allocated pages would commit them locally then. Switching back to wild speculation, allocating the entire scene graph on one board would limit HyperTransport data transfer to one direction, whereas if elements of the scene graph were scattered by the whim of the initializing threads, the randomized accesses would at least use the HyperTransport in both directions.
So, got data?
"This is called for every subframe I render (128 subframes)."
So numCores is 48 and you create one task per core for each of the 128 subframes? How does each core know what to do? Also, this way not enough parallel slack is generated, so the cores that finish first sit idle until the next round of tasks. You should always try to generate many times more tasks than cores, so that the fraction of time some cores spend idling is relatively small. Normally parallel_for() does this for you, but whenever you are tempted to use something like numCores an alarm flag should go up, and you should probably use something like 10*numCores instead.
"Id like to skip all the push_back thing, but I am not shure how I can prevent the tasks from getting deletet once they finish."
(Added) As for burning cycles: wouldn't you then be able to catch one in the act by attaching a debugger?