This technical deep dive was written by Gabor Liktor, Tobias Zirr, Miroslaw Pawlowski, and Anton Sochenov as part of their research efforts at the Visual Compute and Graphics lab, within Intel Labs.
Highlights:
- There are two main challenges in handling complex geometry in a path tracer: memory footprint and acceleration structure updates.
- In this article, we discuss the solutions we developed to address the challenges of geometric complexity to make the real-time path tracing of trillion-triangle scenes possible on an Intel Arc B580 GPU.
- We found that partitioning the top-level acceleration structure to a linkage of fragments provides an essential tool to manage updates at a finer granularity while preserving the integrity of ray traversal within a single structure.
Figure 1. Visualization of the AS-fragment partitioning in Jungle Ruins.
Ray tracing scales well (logarithmically) with scene complexity thanks to the efficiency of traversing spatial data structures (acceleration structures), in contrast to rasterization, which linearly streams through the geometry of the entire scene. However, in a path tracer there are two main challenges in handling complex geometry:
- Memory footprint: to accurately simulate light transport, even indirectly visible geometry and its corresponding acceleration structures need to persist in memory.
- Acceleration structure updates: before rays can be dispatched in the scene, the acceleration structures that contain animated geometry need to be updated. This must be done for large parts of the scene beyond the direct view frustum, since it is hard to predict which areas would be traversed by indirect rays.
In Jungle Ruins, we have optimized the scene so that the entire geometry and its acceleration structures fit into GPU memory, so no streaming updates are required. However, the large number of dynamic instances (animated foliage) made the maintenance of our scene acceleration structures challenging. With more than 9 million instances, the cost of updating our top-level acceleration structure (TLAS) pushed our frame times well beyond the budget of our 30 FPS real-time target. It became clear early in the project that a novel solution was required to amortize the TLAS update costs across frames.
In this article, we discuss the solutions we developed to address the challenges of geometric complexity to make the real-time path tracing of trillion-triangle scenes possible on an Intel Arc B580 GPU.
TLAS Partitioning
Modern real-time ray tracing APIs, such as Direct3D and Vulkan, model the acceleration structure of a scene on two levels. Each unique geometry (e.g., triangle mesh) has a corresponding bottom-level acceleration structure (BLAS). A single top-level acceleration structure (TLAS) is then built over the transformed instances of all geometry. This abstraction allows an efficient implementation of underlying data layouts and traversal methods but also comes with limitations.
In the Jungle Ruins scene, we have already optimized our assets for unique triangle count to fit into our memory budget. The complexity of our animated geometry is in the range of 300-500K unique triangles, and the cost of BLAS updates easily fits into our frame budget. However, the high number of instances made the cost of dynamic TLAS updates prohibitively expensive, and this is not something we can easily solve using the existing two-level TLAS-BLAS model of the Vulkan Ray Tracing API. Some potential ideas that we considered:
- Multi-TLAS: The cost of maintaining a monolithic TLAS can be amortized by partitioning the scene into a set of tiles, with each tile having an independent TLAS. However, this would introduce a significant problem during ray tracing, where rays that cross tile boundaries would require relaunching into a new TLAS. While such a spatial partitioning might be applicable in distributed rendering of large scenes, we never considered this practical on a single GPU.
- Multi-level instancing: A possible way to limit the cost of individual AS updates is to introduce additional levels into the TLAS-BLAS hierarchy. By allowing nested instances within instances, we can keep the size of individual acceleration structures under control. While multi-level instancing is certainly a promising direction for other use cases, it would also add more complexity during ray traversal that is not needed for our open-world scene. Instances are “heavy-weight” objects that also assume ray transformations during traversal, not only adding computations for each ray, but also increasing the footprint of the ray’s state in memory (e.g., additional stack levels). In our case, all scene partitions share the same world space transformation.
- Fragmented TLAS: While recognizing that we needed to control the update of our TLAS at a finer granularity, we wanted to avoid additional transformations or other state management during ray traversal. The solution we have evaluated in Jungle Ruins allows the construction of the TLAS from sub-trees (AS-fragments) that can be built and updated independently but are then linked together into a single TLAS. This required extending the Vulkan API with a fragmented AS build mechanism. The key benefit of this solution is that it limits the cost of TLAS updates to a sparse set of AS-fragments, while keeping the representation monolithic during ray traversal.
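The fragmented TLAS idea can be illustrated with a toy model. The Python sketch below is not the Vulkan extension described above; the names `Fragment`, `FragmentedTLAS`, `link`, and `update` are hypothetical, and each fragment is reduced to a flat list of instance AABBs instead of a real sub-tree. It only shows the update pattern: dirty fragments are refit independently, and the shared root is then relinked so that traversal still sees one structure.

```python
def union(a, b):
    """Smallest AABB enclosing AABBs a and b, each given as (min_xyz, max_xyz)."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))


def enclose(boxes):
    """AABB enclosing a non-empty list of AABBs."""
    result = boxes[0]
    for box in boxes[1:]:
        result = union(result, box)
    return result


class Fragment:
    """Hypothetical AS-fragment: an independently updatable subset of instances."""

    def __init__(self, instance_boxes):
        self.instance_boxes = list(instance_boxes)
        self.refit_count = 0
        self.refit()

    def refit(self):
        # Recompute only this fragment's bounds from its own instances.
        self.bounds = enclose(self.instance_boxes)
        self.refit_count += 1


class FragmentedTLAS:
    """Hypothetical single root that links all fragments together."""

    def __init__(self, fragments):
        self.fragments = fragments
        self.link()

    def link(self):
        # The root only needs the fragment bounds, not their instances.
        self.bounds = enclose([f.bounds for f in self.fragments])

    def update(self, dirty):
        # Sparse update: touch only the listed fragments, then relink the root.
        for index in dirty:
            self.fragments[index].refit()
        self.link()
```

In this model, calling `update([1])` refits one fragment and leaves all others untouched, which is the property that makes sparse, amortized updates possible.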
AS-Fragment Partitioning in Jungle Ruins
Figure 2. TLAS partitioning scheme used in Jungle Ruins: the outer AABB represents the boundaries of the root of the TLAS, while the nested AABBs are bounds of instance partitions (tiles) in the scene, built independently as AS-fragments and linked together in a single TLAS.
Each AS-fragment corresponds to an independent subset of instances in the TLAS. We refer to the mapping of instances to AS-fragments as our instance partitioning. The selection of the right instance partitioning scheme is critical for reliable performance, and is driven by two key factors:
- Keeping the size of partitions (number of instances) as regular as possible, which would lead to fewer fluctuations during sparse updates.
- Minimizing the spatial overlap of AS-fragment bounds. We still traverse a single TLAS; a large overlap of fragment AABBs (axis-aligned bounding boxes) would degrade the performance of ray traversal.
The Jungle Ruins scene contains about 9 million instances that are roughly uniformly distributed in the scene, except for a “hero area of interest” near the pyramid. For such a large open-world scene, a natural partitioning scheme is to break down the scene into a set of tiles and assign instances to them based on their centroid. Of course, some overlap between AS-fragment bounds is inevitable, but minimizing this overlap is key to achieving good traversal performance. To cater to these needs, we use a simple adaptive scheme that incrementally subdivides the set of instances with hierarchically applied binary spatial splits. Each split partitions the instances into groups of approximately equal counts while taking care to choose the split planes creating the lowest overlap of the resulting partitions.
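A minimal sketch of such a scheme is shown below, under several assumptions that are not taken from our implementation: instances are reduced to world-space AABBs, each split is a median split on centroids (which yields equal counts), the candidate split planes are limited to the three coordinate axes, and overlap is measured as the volume of the intersection of the two child bounds. The names `Instance`, `partition`, and `max_count` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    lo: tuple  # AABB minimum (x, y, z)
    hi: tuple  # AABB maximum (x, y, z)


def centroid(inst, axis):
    return 0.5 * (inst.lo[axis] + inst.hi[axis])


def bounds(instances):
    """AABB enclosing all instances of a partition."""
    lo = tuple(min(i.lo[a] for i in instances) for a in range(3))
    hi = tuple(max(i.hi[a] for i in instances) for a in range(3))
    return lo, hi


def overlap_volume(a, b):
    """Volume of the intersection of two AABBs (0.0 if they are disjoint)."""
    (alo, ahi), (blo, bhi) = a, b
    volume = 1.0
    for k in range(3):
        extent = min(ahi[k], bhi[k]) - max(alo[k], blo[k])
        if extent <= 0.0:
            return 0.0
        volume *= extent
    return volume


def partition(instances, max_count):
    """Recursively halve the instance set until each part has <= max_count."""
    if len(instances) <= max_count:
        return [instances]
    best = None
    for axis in range(3):
        # Median split on centroids keeps both halves at (almost) equal counts.
        ordered = sorted(instances, key=lambda i: centroid(i, axis))
        mid = len(ordered) // 2
        left, right = ordered[:mid], ordered[mid:]
        cost = overlap_volume(bounds(left), bounds(right))
        if best is None or cost < best[0]:
            best = (cost, left, right)
    _, left, right = best
    return partition(left, max_count) + partition(right, max_count)
```

Because every split is an exact median split, partition sizes differ by at most a factor determined by rounding, which addresses the regularity requirement; the axis choice addresses the overlap requirement.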
Our task is made a lot simpler by having only foliage with a limited range of motion. Therefore, we can rely on static splitting that tries to spatially optimize the number of instances within the AS-fragments. If we had significant changes in the location of instances, like in the case of animated characters, we could either rebuild AS-fragments when instances cross spatial boundaries, or we could create separate AS-fragments that contain only such moving instances, leaving the rest of the structure static. We leave such experiments to future work. In our scene we have a total of 78 partitions, visualized in Figure 1 above.
Although we traverse a single unified TLAS, significant overlap between the bounds of instance partitions can still degrade ray-tracing performance compared to a monolithic TLAS. To reduce this overlap, we introduced a mesh splitter in the content authoring pipeline, which we applied to larger, unique geometry in the scene, in particular the terrain and the pyramid.
AS-Fragments Update Performance
Figure 3. The relative cost of AS-fragment updates compared to monolithic TLAS updates over our cinematic sequence. On average, AS-fragments require about 15% of the time of a full TLAS update, but there are some differences across camera sequences due to the view-dependent sparse update heuristic.
The update of AS-fragments is also amortized across frames, with only a subset of them being updated (refit) in each frame based on a maximum screen-space error heuristic. We compute the required update frequency for each AS-fragment to stay below 1px screen-space error. Based on this frequency, we compute the average number of instances that require refitting in each frame. Note that this number is dependent on the camera position, and it can change suddenly with camera cuts.
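One possible way to turn a screen-space error bound into an update frequency is sketched below. This is a simplified model and not necessarily the heuristic used in Jungle Ruins: it assumes a pinhole camera, a known per-frame world-space drift for each fragment (`drift_per_frame`, a hypothetical input), and a single representative camera distance per fragment. A fragment is then refit every `N` frames, where `N` is the largest interval that keeps the accumulated projected drift below the error threshold.

```python
import math


def pixels_per_world_unit(distance, fov_y_rad, screen_height_px):
    """Projected size in pixels of one world unit at the given view distance."""
    return screen_height_px / (2.0 * distance * math.tan(0.5 * fov_y_rad))


def update_interval(drift_per_frame, distance, fov_y_rad, screen_height_px,
                    max_error_px=1.0):
    """Largest number of frames between refits that keeps the error in bounds."""
    drift_px = drift_per_frame * pixels_per_world_unit(
        distance, fov_y_rad, screen_height_px)
    if drift_px <= 0.0:
        return math.inf  # static fragment: never needs a refit
    return max(1, math.floor(max_error_px / drift_px))


def expected_refits_per_frame(fragments, fov_y_rad, screen_height_px):
    """Average instance count refit per frame.

    `fragments` is a list of (instance_count, drift_per_frame, distance).
    """
    return sum(
        count / update_interval(drift, distance, fov_y_rad, screen_height_px)
        for count, drift, distance in fragments)
```

In this model, distant fragments project their motion onto fewer pixels and therefore get longer update intervals, which is why the per-frame cost depends on the camera position and can jump at camera cuts.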
Ray tracing time increased slightly because the partitioned TLAS is of lower quality than a monolithic TLAS: there is some overlap between partitions, and the monolithic AS builder can do a better job of distributing instances between nodes based on traversal heuristics than our static spatial partitioning. However, this increase in ray traversal time is negligible compared to the reduction in total frame time from faster TLAS updates. On average, our TLAS updates are 7x faster than a monolithic TLAS update, and the combined time of AS updates and path tracing is 20% faster.
BLAS Performance Optimizations
Given that the animation of foliage and trees swaying in the wind is spatially localized, we implemented a performance optimization to reduce the cost of bottom-level acceleration structure (BLAS) updates. Instead of performing a full BLAS rebuild each frame using the PREFER_FAST_BUILD flag, we perform a full build only once at initialization, using the PREFER_FAST_TRACE flag to prioritize trace performance. For subsequent frames, we rely on BLAS updates (refits), which are significantly faster.
While this approach may result in a lower-quality BVH (bounding volume hierarchy) compared to a full rebuild, in our case the performance gain from faster updates outweighs the potential loss in traversal speed. Overall, this optimization reduces BLAS update time by approximately 40–45%, leading to a 5–10% reduction in the total frame rendering time of our workload.
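The difference between a full build and a refit can be illustrated with a toy BVH in Python. This is a conceptual sketch, not the driver's BLAS builder: the median split on the widest axis and the names `Node`, `build`, and `refit` are assumptions made for illustration. The full build chooses the tree topology; the refit keeps that topology and only recomputes the bounding boxes bottom-up, which is cheaper but can leave the tree less well fitted once primitives have moved.

```python
def union(a, b):
    """Smallest AABB enclosing AABBs a and b, each given as (min_xyz, max_xyz)."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))


class Node:
    def __init__(self, left=None, right=None, prim=None):
        self.left, self.right, self.prim = left, right, prim
        self.box = None


def build(prims, boxes):
    """Full build: pick the topology with a median split on the widest axis."""
    if len(prims) == 1:
        node = Node(prim=prims[0])
        node.box = boxes[prims[0]]
        return node
    lo, hi = boxes[prims[0]]
    for p in prims[1:]:
        lo, hi = union((lo, hi), boxes[p])
    axis = max(range(3), key=lambda a: hi[a] - lo[a])
    # Sorting by (min + max) orders primitives by their box centers.
    ordered = sorted(prims, key=lambda p: boxes[p][0][axis] + boxes[p][1][axis])
    mid = len(ordered) // 2
    node = Node(build(ordered[:mid], boxes), build(ordered[mid:], boxes))
    node.box = union(node.left.box, node.right.box)
    return node


def refit(node, boxes):
    """Update: keep the topology, recompute the bounds bottom-up."""
    if node.prim is not None:
        node.box = boxes[node.prim]
    else:
        node.box = union(refit(node.left, boxes), refit(node.right, boxes))
    return node.box
```

A refit visits each node once and performs no sorting, so its cost is linear in the number of nodes, while the bounds it produces remain conservative for the moved primitives.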
Conclusion
In this article, we have discussed how partitioning the top-level acceleration structure into a linkage of fragments provides an essential tool to manage updates at a finer granularity while preserving the integrity of ray traversal within a single structure. In the future, we may explore further avenues to improve the management of geometric complexity in our path tracer by experimenting with ideas such as level-of-detail methods and global-illumination-aware culling.
Series
- Path Tracing a Trillion Triangles
- Neural Image Reconstruction for Real-Time Path Tracing
- Jungle Ruins Scene: Technical Art Meets Real-Time Path-Tracing Research
- Path Tracing Massive Dynamic Geometry in Jungle Ruins - this post
- Assessing Video Quality in Real-time Computer Graphics