Some memory optimizations

Max_Liani · ‎06-02-2012

Hi,
thank you for sharing embree source code. I'm learning a lot from it.
During last month I had a crack at optimising embee memory footprint. Major area of interest was reducing peak memory during tree build is a such way to have a minimal impact on performance (especially traversal speed). I ended up using almost 1/4th of the peak memory with minimal impact on traversal speed while achieving an even faster tree build.

In the process I account memory in the following sections:
- nodes, internal tree nodes,
- prims, tree leaf nodes,
- splits, boxes and some other temp memory used by tree builder,
- build, triangle soup and other temp data,
Data is collected and counters updated at allocation/release so I have a reasonable estimate of the individual/overall peaks. Of course, in the rendering process, there is more memory used by other subsystems which I'm not accounting for. Here is some data:

Original BVH4 implementation:
triangles = 4452410
sah = 1.85e+08
nodes = 417728 (44.6 MB) (92.3 %)
leaves = 1123778 (282.8 MB) (84.1 %)
Scene memory 170.1 MB (peak 170.1 MB)
BVH memory 328.7 MB
nodes 45.2 MB (peak 476.4 MB)
prims 283.5 MB (peak 952.9 MB)
splits 0.0 MB (peak 271.8 MB)
build 0.0 MB (peak 384.0 MB)
Total memory 498.7 MB (peak 2255.1 MB)

New BVH4 implementation:
triangles = 4452410
sah = 1.75e+08
nodes = 416685 (50.9 MB) (92.3 %)
leaves = 1122036 (221.9 MB) (84.2 %)
Scene memory 153.0 MB (peak 153.0 MB)
BVH memory 273.3 MB
nodes 51.1 MB (peak 52.0 MB)
prims 222.2 MB (peak 225.5 MB)
splits 0.0 MB (peak 169.8 MB)
build 0.0 MB (peak 0.0 MB)
Total memory 426.4 MB (peak 600.4 MB)

Overall memory peak is ~26% of the original, resident memory after tree build is ~85%.

Here are some of the changes I have made:
- To avoid overallocation for tree nodes and leaf triangles I have implemented some sort of deque where I allocate element in chunks of 2^15. When addressing the elements in the data structure I use the lower 15 bits of the ID to address the item in the chunk, the higher bits to address the chunk. I use this for both nodes and primitives (Triangle4) but I compact the nodes array in continuous memory at the end of the tree build to not affect traversal performance. I do that reallocation in the shadow of other build temp memory release to not affect the overall peak.

- Original implementation tries to free memory overallocated for nodes and leaves. This can increase peak sometimes. It might need to allocate brand new memory and copy the data across. Allocating memory in segments, I don't need to.

- I don't build the triangle soup in embrue::rtNewScene. That alone account for a big slice of the build peak and it is also a slow single threaded process. Besides the actual use of it, the std::vector used for the triangle soup grows in a geometric way. You could end up allocating almost twice the memory needed. Instead I have implemented a simple primitive iterator so I don't need the soup at all. I can address all triangles in the scene with a continuous uint64 id (max 2^32 objects with max 2^32 triangles each). Memory used by the data structure is pretty much 8 bytes per object; negligible in the total. Tree builder and bounds are computed using this iterator which is now a slightly slower process because of the fragmentation of memory but at least that extra cost is split into multiple threads; the soup build is a non threaded serial process instead. If I count the triangle soup creation as part of the tree build (which we should), it ends up that the overall tree build is now 35% faster (on 8 cores, a bit slower than the original if on a single thread).

- I don't use 2*N Boxes in the parallel split process, instead I use only N boxes and a gather/scatter addressing vector of 2*N indices. I don't shuffle the boxes from front to back of the longer array, I leave the boxes where they are and shuffle indices in the addressing vector. There is a minimal performance reduction of the tree builder but it is way repaid by previous point.

- I removed Ng from Triangle4. It is a simple cross product to compute during leaf intersection. The performance reduction seems to be a small price to pay, but the data structure is considerably smaller (the majority of the resident memory reduction comes from this little move).

Other than the above, I have noticed that internal nodes are not aligned to cache lines. They are rather referred to with a memory offset. I remove that and aligned the nodes, which actually increase memory usage, but it gives me more space to address a larger scene. Original have a limit of 2^26 primitives. Now I have 2^32 (up to 4 triangles each) and I have other 28 bits per node to spare where to store visibility flags, like "visible to camera", "cast shadows" or user-defined trace-subsets... which were not supported before and it is usually a mcuh needed feature.

I believe there is still a certain amount of space for improvements without changing the overall structure or technique. I will have another crack at it when I will find some time to spare.

Cheers,
Max

taylorius · ‎06-08-2012

Hi Max

It sounds like you've done some really exciting work. I've done some "tinkering around the edges" work on the memory usage, but nothing nearly as dramatic as you have undertaken. Are you thinking of making your modified source code available? It would be fascinating to compare performance.

Best Regards

Matt

Max_Liani · ‎06-09-2012

Hi Matt,
I did lot of changes to embed the core into some other tool before undertaking the memory optimization round. I can share some tricks and code fragments, but at this point I can't share the full code that would compile and run. I hope you understand. Is there any particular aspect you are interested to? I can provide more details.

Cheers,
Max