We're using Embree as our raytracing engine, but are having some problems with memory usage when it comes to large scenes (10M+ triangles).
I believe that a large proportion of memory usage is taken up by the acceleration structure, and I was wondering if there were any variables we could adjust to cause Embree to build a smaller acceleration structure and save us some memory on large scenes? (At the expense of rendering speed, of course).
Any other memory reduction news / ideas would also be gratefully received! :-)
What acceleration structure and builder are you using? It is best to avoid the spatial split builder for large scenes, as it consumes a lot of memory. Start the application with -accel bvh4.
Finding a better tradeoff between memory consumption and performance is something we are looking into. Most memory is consumed while the data structures are being built. While the builders do not touch much memory, they need a large virtual address space. You should try increasing the swap file on your hard disk if you run into memory allocation problems.
Yep, memory is the biggest flaw in Embree and makes it impractical for use in real applications... My 6GB rig wasn't able to run even the crown scene...
Fast ways to reduce the memory:
- remove the precalculated normal in Triangle4. For better alignment, I removed id1 too, and moved it to a separate array. This change doesn't seem to affect the speed notably on my i7, but saves a good amount of memory.
- free the vertices from the meshes, since they are already in the tree and can be restored from there.
- in the builder classes, change the calculation of allocatedNodes and allocatedPrimitives. This is a very tricky part, since there are multiple threads and space is wasted in an unpredictable way. But the space currently wasted there is HUGE. The memory allocated here is equal to the number of triangles multiplied by 4, plus some extra per-thread memory. What I did worked on the crown scene, but is probably not safe in other scenes:
I think allocatedNodes is "safe" to divide by 4, since it doesn't waste as much memory as the triangles do, and the number of nodes must be lower than the number of triangles. allocatedPrimitives is the most unpredictable part, so you must play it safe here. It must definitely be multiplied by something > 0.25.
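To get a feel for what the first bullet (dropping the normal and id1) buys, here is a rough sizing sketch. The struct layouts are hypothetical stand-ins for a 4-wide triangle record, not Embree's actual Triangle4, and packedBytes assumes perfect packing of 4 triangles per record, which the real builders do not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for SSE-sized values; not Embree's own types.
struct alignas(16) Float4 { float v[4]; };          // 4 floats, 16 bytes
struct alignas(16) Int4   { std::int32_t v[4]; };   // 4 ints, 16 bytes
struct Vec3x4 { Float4 x, y, z; };                  // x/y/z for 4 triangles, 48 bytes

// Assumed layout with a precalculated normal and two ID vectors.
struct Triangle4Full {
    Vec3x4 v0, e1, e2;   // base vertex and two edges
    Vec3x4 Ng;           // precalculated geometric normal
    Int4 id0, id1;       // two IDs per triangle
};

// Slimmed layout: the normal is recomputed as cross(e1, e2) when needed,
// and id1 lives in a separate array (4 bytes per triangle).
struct Triangle4Slim {
    Vec3x4 v0, e1, e2;
    Int4 id0;
};

static_assert(sizeof(Triangle4Full) == 224, "4 * 48 + 2 * 16 bytes");
static_assert(sizeof(Triangle4Slim) == 160, "3 * 48 + 16 bytes");

// Bytes for the triangle records, assuming every record holds 4 triangles.
constexpr std::size_t packedBytes(std::size_t numTriangles, std::size_t recordSize) {
    return ((numTriangles + 3) / 4) * recordSize;
}
```

Under these assumptions a 10M triangle scene needs 560 MB of records with the full layout and 400 MB with the slim one, plus 40 MB for the separate id1 array. With imperfect packing both figures grow, but the ratio stays the same.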
Also, for every practical application, the huge arrays must be replaced with some kind of deque that allocates memory in chunks. That way you not only have a better chance of using all your available memory (on my 6GB rig I had the memory, but it wasn't contiguous), but can also easily free the unused chunks after building, instead of relying on a reallocating function that may choose to allocate another big block of memory (i.e. another chance for a failed allocation).
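A minimal sketch of such a chunked container might look like the following. ChunkedArray and all of its members are made-up names for illustration, and this version is single-threaded only; it just shows the indexing and the freeing of unused chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Storage grows one fixed-size chunk at a time, so no single huge
// contiguous allocation is ever requested. Single-threaded sketch.
template <typename T, std::size_t ChunkShift = 16>   // 2^16 elements per chunk
class ChunkedArray {
public:
    static constexpr std::size_t kChunkSize = std::size_t(1) << ChunkShift;
    static constexpr std::size_t kChunkMask = kChunkSize - 1;

    // Make sure indices [0, count) are backed by memory.
    void reserve(std::size_t count) {
        const std::size_t needed = (count + kChunkSize - 1) >> ChunkShift;
        while (chunks_.size() < needed)
            chunks_.push_back(std::make_unique<T[]>(kChunkSize));
    }

    // Free whole chunks that lie entirely beyond the first `used` elements.
    void trim(std::size_t used) {
        const std::size_t needed = (used + kChunkSize - 1) >> ChunkShift;
        if (needed < chunks_.size()) chunks_.resize(needed);
    }

    // Random access costs a shift, a mask and one extra pointer load.
    T& operator[](std::size_t i) {
        assert((i >> ChunkShift) < chunks_.size());
        return chunks_[i >> ChunkShift][i & kChunkMask];
    }

    std::size_t chunkCount() const { return chunks_.size(); }
    std::size_t capacity() const { return chunks_.size() << ChunkShift; }

private:
    std::vector<std::unique_ptr<T[]>> chunks_;
};
```

After the build, calling trim with the number of elements actually used hands the spare chunks back to the allocator without moving any data, which is the part a plain realloc cannot promise.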
I was wrong about allocatedPrimitives :(... The wasted space can be much larger, since the packing may be far from perfect. I was testing with bvh2, which has fewer leaves and seems to waste less memory. But bvh4 is another story.
So the Builder class probably must be able to allocate extra chunks of memory in globalAllocPrimitives() when needed. It is worth experimenting with how this can be done. For example, the current atomic may stay, but when we fail to return memory, we allocate a new chunk (bigger than allocBlockSize). But generally... everything has a trade-off. The purpose of Embree was to show the fastest CPU path tracer. With more tweaking, it can probably be made practical without sacrificing too much of the original performance.
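One way the "keep the atomic, grow on failure" idea could be sketched is below. GrowingBlockAllocator is a hypothetical class, not Embree's builder code: it hands out raw pointers rather than array offsets, does no alignment handling beyond what the caller's sizes provide, and simply abandons the unused tail of a full chunk.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical growable bump allocator; names are illustrative only.
class GrowingBlockAllocator {
public:
    explicit GrowingBlockAllocator(std::size_t chunkBytes) : chunkBytes_(chunkBytes) {
        addChunk(chunkBytes_);
    }

    // Fast path: a single atomic add on the current chunk.
    // Slow path: take the lock and append a fresh chunk, then retry.
    void* alloc(std::size_t bytes) {
        for (;;) {
            Chunk* c = current_.load(std::memory_order_acquire);
            const std::size_t offset = c->used.fetch_add(bytes, std::memory_order_relaxed);
            if (offset + bytes <= c->size) return c->data.get() + offset;

            std::lock_guard<std::mutex> lock(growMutex_);
            if (current_.load(std::memory_order_relaxed) == c)   // nobody grew it yet
                addChunk(bytes > chunkBytes_ ? bytes : chunkBytes_);
        }
    }

    std::size_t chunkCount() const {
        std::lock_guard<std::mutex> lock(growMutex_);
        return chunks_.size();
    }

private:
    struct Chunk {
        explicit Chunk(std::size_t n) : data(new unsigned char[n]), size(n), used(0) {}
        std::unique_ptr<unsigned char[]> data;
        std::size_t size;
        std::atomic<std::size_t> used;
    };

    // Called from the constructor or with growMutex_ held.
    void addChunk(std::size_t bytes) {
        chunks_.push_back(std::make_unique<Chunk>(bytes));
        current_.store(chunks_.back().get(), std::memory_order_release);
    }

    const std::size_t chunkBytes_;
    mutable std::mutex growMutex_;
    std::vector<std::unique_ptr<Chunk>> chunks_;   // owns all chunks; old ones stay valid
    std::atomic<Chunk*> current_{nullptr};
};
```

Because old chunks are never moved or freed while the allocator lives, memory handed out earlier stays valid while another thread is adding a chunk. The price is the wasted tail of each chunk and a lock on the rare slow path.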
About the BVH... I was looking into the code, and I didn't find anything that can be improved. I suppose Intel did their job very well there, since this is what Embree is about. They even use tree rotation to improve the BVH a bit.
The other thing I was experimenting with is the maximum number of triangles that can be stored. Currently, you can see this in the code:
if (nextTriangle >= (1<<26)) throw std::runtime_error("cannot encode triangle, ID too large");
67M triangles may not be enough, and this is not the actual count: it is again related to the unpredictable wasting of memory described above. I just changed the Node class to hold int64 children instead of int32. There is already space for this, it is just not used. The Node classes are even cache aligned.
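To illustrate where a limit like (1<<26) can come from, here is a hedged sketch of packing a triangle offset and a small payload into one reference. The 6/26 bit split and the LeafRef name are assumptions chosen only to match the check above; the actual encoding in Embree may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical leaf reference: the low PayloadBits hold a small payload
// (e.g. a count or flags), the remaining high bits hold the triangle offset.
template <typename Ref, unsigned PayloadBits>
struct LeafRef {
    static constexpr unsigned kOffsetBits = sizeof(Ref) * 8 - PayloadBits;
    static constexpr std::uint64_t kMaxOffset = std::uint64_t(1) << kOffsetBits;

    static Ref encode(std::uint64_t offset, unsigned payload) {
        assert(payload < (1u << PayloadBits));
        if (offset >= kMaxOffset)
            throw std::runtime_error("cannot encode triangle, ID too large");
        return static_cast<Ref>((offset << PayloadBits) | payload);
    }
    static std::uint64_t offset(Ref r) { return std::uint64_t(r) >> PayloadBits; }
    static unsigned payload(Ref r) {
        return static_cast<unsigned>(r & ((Ref(1) << PayloadBits) - 1));
    }
};

using Ref32 = LeafRef<std::uint32_t, 6>;   // 26 offset bits: limit of 67,108,864
using Ref64 = LeafRef<std::uint64_t, 6>;   // 58 offset bits
```

With a 32-bit reference the offset runs out at 1<<26, and since the offset counts allocated slots rather than real triangles, wasted slots eat into that budget. Widening the reference to 64 bits moves the limit far out of reach at the cost of bigger child fields.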
I've implemented the chunk allocation and it's working pretty much as I expected. The container must be specially written for this purpose though, for fast random access, and to allow allocating new chunks while the old ones are accessed by other threads. There is a small slowdown of course, but I can live with that :).
That's really interesting, nice work! What sort of percentage memory saving are you finding, and would you say the slowdown you're getting mainly affects the building phase, or the traversal?
I implemented some of the things you mentioned earlier, and saw some gains, but to be honest, I was a bit worried about the robustness with arbitrary scenes. Your chunk system does sound cool though, I'd be fascinated to see your implementation if ever you felt like sharing. :-)
It's just an experiment, so don't expect it to be bug free or finished :). The slowdown is during the traversal. If your scene fits in a single chunk, it will render faster. You can play with the chunk size to see exactly what the difference is. It's something like 2-3%.