
Neural Image Reconstruction for Real-Time Path Tracing

Anton_Sochenov

This technical deep dive was written by Manu Mathew Thomas and Anton Sochenov as part of their research efforts at the Visual Compute and Graphics Lab within Intel Labs.

Highlights:

  • This blog details our research journey and key insights from building a joint neural denoising and supersampling technique designed to bring real-time path tracing to budget-friendly hardware.
  • Our in-house Jungle Ruins scene features animated trees, vegetation, foliage, varied materials, and dynamic shadows and lighting conditions, and is fully path traced at 1 SPP, running at 30 FPS at 1440p on an Intel B580 GPU.
  • Jungle Ruins serves as a foundation for advancing neural reconstruction research, particularly research aimed at scalability and at handling highly complex scenes with high-frequency geometric detail.


Despite the remarkable recent advances in rendering, real-time path tracing remains an elusive goal. Each breakthrough reveals new, often more challenging problems, continually testing the limits of current hardware and software. The computational demand of tracing light paths increases with scene complexity, material properties, and image resolution, preventing the rendering from fully converging to a noise-free image within real-time frame rates.

Unlike traditional image processing, where noise is treated as an error to be removed, noise in path tracing is fundamentally inherent to the rendering process. Each noisy sample contributes to the final solution. By understanding the rich, structured, and correlated information between samples, a meaningful signal can be reconstructed by deriving filter kernels based on pixel similarity. When the rendering is excessively noisy, geometric and material data are used to guide the reconstruction filtering process. In addition to spatial reconstruction, samples can be adaptively accumulated across multiple frames, improving the image quality over time.
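To make the filtering idea concrete, below is a minimal sketch, in NumPy, of a cross-bilateral filter whose per-pixel weights are derived from guide-buffer similarity, together with a simple exponential-moving-average temporal accumulator. The guide buffers and sigma values are illustrative assumptions, not the parameters of any production denoiser.

```python
import numpy as np

def neighbor_weight(guides, cy, cx, y, x, sigmas):
    """Similarity weight of neighbor (y, x) w.r.t. center (cy, cx),
    computed from guide buffers (e.g. albedo, normal, depth)."""
    w = 1.0
    for g, sigma in zip(guides, sigmas):
        d = np.atleast_1d(g[cy, cx] - g[y, x])
        w *= np.exp(-np.dot(d, d) / (2.0 * sigma * sigma))
    return w

def cross_bilateral_filter(color, guides, sigmas, radius=2):
    """Reconstruct each pixel as a similarity-weighted average of neighbors."""
    h, w, c = color.shape
    out = np.zeros_like(color)
    for cy in range(h):
        for cx in range(w):
            acc, wsum = np.zeros(c), 0.0
            for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
                for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
                    wgt = neighbor_weight(guides, cy, cx, y, x, sigmas)
                    acc += wgt * color[y, x]
                    wsum += wgt
            out[cy, cx] = acc / wsum
    return out

def temporal_accumulate(current, history, alpha=0.1):
    """Blend the current filtered frame into the running history (EMA);
    a smaller alpha accumulates more samples across frames."""
    return alpha * current + (1.0 - alpha) * history
```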

In the early days of hardware-accelerated ray tracing, sophisticated hand-engineered reconstruction filters were carefully designed around heuristics to reconstruct individual noisy signals such as shadows, reflections, and diffuse global illumination, which were then composited to form the final image. To attain playable frame rates, shading is usually done at low resolution before the reconstruction step, followed by an ML-based supersampling (combined anti-aliasing and upscaling) model. Due to their reliance on fixed heuristics, these hand-tuned filters often struggle in diverse and complex scenes, leading to artifacts such as blurring or loss of detail. This, combined with the rise of dedicated ML hardware, paved the way for machine learning approaches that enable unified, adaptive, and data-driven reconstruction.

A joint denoising and supersampling model is beneficial in terms of both performance and image quality. Instead of training two nearly identical models for separate tasks, a single model learns to reconstruct multiple signals simultaneously with greater computational efficiency. While powerful, ML-based reconstruction techniques come with their own set of challenges, including generalization, data bias, and real-time inference performance.

In this blog, we take you through our research journey and key insights from building a joint neural denoising and supersampling technique designed to bring real-time path tracing to budget-friendly hardware.


Real-Time Joint Neural Denoising and Supersampling

Beginning in early 2020, our research into neural methods for denoising started with the development of a spatial denoiser leveraging a reduced-precision kernel prediction network [Thomas 2020]. The following image shows the output of our spatial denoiser applied to a noisy image rendered at 4 samples per pixel (SPP).

Figure 1 – Left: noisy 4 SPP. Right: our model.

As our focus shifted towards game workloads, we refined our goals to prioritize a unified denoiser capable of working with low sample counts while maintaining temporal stability, one that could replace separate analytical denoisers. To this end, we introduced a new, improved model with a single low-precision feature extractor shared between different signals, along with multiple higher-precision filter stages. To reduce cost further, our model uses low-resolution inputs to reconstruct a high-resolution, denoised, and supersampled output [Thomas 2022].

Figure 2 – Noisy 1 SPP 720p input, and denoised and supersampled 1440p output.
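For intuition, here is a toy PyTorch sketch of the kernel prediction idea: a small convolutional network predicts a softmax-normalized k × k kernel per pixel, which is then applied to the noisy color. Layer counts and widths are our assumptions for illustration; the published model additionally shares a low-precision feature extractor across signals and uses multiple higher-precision filter stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyKPN(nn.Module):
    """Toy kernel prediction network: per-pixel k x k filtering of the input."""
    def __init__(self, in_ch=3, k=5, feat=32):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, k * k, 3, padding=1),  # one k*k kernel per pixel
        )

    def forward(self, noisy):                       # noisy: (B, C, H, W)
        b, c, h, w = noisy.shape
        kernels = F.softmax(self.net(noisy), dim=1)           # weights sum to 1
        patches = F.unfold(noisy, self.k, padding=self.k // 2)
        patches = patches.view(b, c, self.k * self.k, h * w)
        weights = kernels.view(b, 1, self.k * self.k, h * w)
        return (patches * weights).sum(dim=2).view(b, c, h, w)

denoised = TinyKPN()(torch.rand(1, 3, 64, 64))  # random stand-in for a 1 SPP frame
```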


To streamline computation, lower the overall cost, and simplify integration with game engines, we reduced the input requirements to a noisy RGB composite. Here is our simplified kernel prediction model operating on noisy color inputs from the Unreal Engine hybrid renderer.

Figure 3 – Simplified kernel prediction model.


While our work primarily focuses on kernel prediction models, we observe that direct prediction models also offer notable benefits. Although direct prediction is often overlooked due to its challenging controllability and stability during training, our results suggest that, with an appropriate dataset and training strategy, it can achieve image quality comparable to more advanced models. Its simpler architecture and lower computational cost make it better suited for entry-level graphics hardware and integrated GPUs. Here is our non-quantized direct prediction reconstruction model running on path-traced scenes at 1440p output resolution at 30+ FPS on an Intel B580 GPU.

Figure 4 – Non-quantized direct prediction reconstruction model.
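By contrast with the kernel prediction sketch above, a direct prediction model regresses the reconstructed color itself rather than filter weights, which is cheaper per pixel. The sketch below, with an assumed residual connection on top of the noisy input, is again a toy stand-in rather than the model shown above:

```python
import torch
import torch.nn as nn

class TinyDirect(nn.Module):
    """Toy direct prediction network: regress the clean RGB directly."""
    def __init__(self, in_ch=3, feat=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 3, 3, padding=1),
        )

    def forward(self, noisy):
        # Predict a correction on top of the noisy input (residual learning),
        # often easier to optimize than predicting colors from scratch.
        return noisy + self.net(noisy)

reconstructed = TinyDirect()(torch.rand(1, 3, 64, 64))
```

Unlike the softmax-normalized kernels above, the output here is unconstrained, which is one reason controllability and training stability are harder for direct prediction, as noted earlier.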


Jungle Ruins Scene and Its Challenges

Our in-house Jungle Ruins scene serves as a foundation for advancing neural reconstruction research, particularly research aimed at scalability and at handling highly complex scenes with high-frequency geometric detail.

Ideally, we want our models to generalize to all kinds of scenes. However, as this remains a research effort and similar scenes are not widely available to add to our training dataset, we train on a subset of our in-house scene and evaluate on unseen regions. This also highlights a key challenge of data-driven approaches: the network can only learn features, correlations, and priors that are represented in the training data. In graphics, the range of geometric and appearance variations is vast, making true generalization particularly difficult. Inductive bias, or the assumptions built into the model, can help it generalize within the distribution of the data. For instance, CNNs are biased towards processing local receptive fields, whereas Transformer models are designed to capture global context using a self-attention mechanism. However, if the input falls outside the training distribution, the model may produce artifacts or hallucinations due to the lack of learned priors for those unfamiliar scenarios.

To train an effective reconstruction network, it is essential to capture a wide spectrum of noise levels and scene characteristics, including lighting, materials, and geometry. Creating such a dataset is time-consuming and difficult to scale. The model must also adapt differently to each variation in the data. Here we show the same scene under different lighting conditions: one requires aggressive temporal accumulation to produce a stable image, while the other can achieve the same level of stability with spatial information alone.

Figure 5 – 1 SPP images of the same area under different lighting conditions.
Left: low light. Right: direct lighting.


Generating clean reference images for training is one of the most resource-intensive steps in the data capture pipeline. In path tracing, a truly converged image requires rendering with an exceedingly high SPP, which is not practical, especially when our coverage requirements are high. Although small amounts of residual noise in the reference may be averaged out during training, using a cleaner reference is important for stable training and to avoid unintended bias.

Figure 6 – Two images of the same area rendered at different SPP.
Left: 1024 SPP. Right: 32K SPP.
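One common way to build such a reference incrementally is to average many independent lower-SPP renders with a running mean and stop once the estimate stabilizes. In this sketch, `render_pass` is a hypothetical hook into the path tracer, and the convergence tolerance is an illustrative assumption:

```python
import numpy as np

def accumulate_reference(render_pass, spp_per_pass=64, max_passes=512, tol=1e-4):
    """Average independent renders until the running mean stops changing."""
    mean = None
    for i in range(1, max_passes + 1):
        frame = render_pass(spp_per_pass)   # independent noisy render (H, W, 3)
        prev = mean
        mean = frame if prev is None else prev + (frame - prev) / i
        if prev is not None and np.abs(mean - prev).mean() < tol:
            break                           # estimate has stabilized
    return mean
```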


Variations in scene and camera configurations are beneficial for training but can also create dataset imbalance. As a result, certain types of content are overrepresented, while more challenging or less frequent cases receive limited training coverage. This imbalance can bias the network towards one set of patterns while reducing its effectiveness in less common but visually important scenarios. Weighted sampling of the training data, combined with image augmentations, can increase the presence of underrepresented configurations. This encourages the network to learn more evenly across a diverse range of visual and structural conditions. A high-quality dataset can make a significant impact on the reconstruction capability of the model.
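As one concrete way to implement such weighted sampling, PyTorch's built-in `WeightedRandomSampler` can draw crops from underrepresented content more often. The content categories and the inverse-frequency weighting below are assumptions about how a dataset might be tagged:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Hypothetical per-crop content labels, e.g. 0 = foliage, 1 = ruins, 2 = water.
labels = torch.tensor([0, 0, 0, 0, 1, 1, 2])       # water is underrepresented
counts = torch.bincount(labels).float()
weights = 1.0 / counts[labels]                      # inverse-frequency weights

# Draw crops with replacement so rare categories appear more often per epoch.
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
# loader = DataLoader(dataset, batch_size=8, sampler=sampler)  # dataset assumed
```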


Challenges

Although our model addresses several common reconstruction issues, a few challenges persist. In many cases, we observed noticeable gains, but there’s still room for refinement.

  • Fine texture details – Denoisers generally tend to produce smoother results because they are optimized to reduce visible noise using loss functions that favor averaging. As a result, finer details can be lost, especially when the model cannot distinguish between high-frequency noise and the actual signal.

    Figure 7 – Left: our model. Right: reference image.

  • Flickering – While a single denoised frame might look clean, small inconsistencies from frame to frame can result in visible shimmer over time. These inconsistencies can occur due to changes in lighting, motion, or a lack of temporal context in the model itself. A good temporal loss can encourage the model to keep outputs stable, but if applied too aggressively, we end up with ghosting artifacts (see the temporal-loss sketch after this list).

    Figure 8 – Left: low temporal loss. Right: strong temporal loss.

  • Moiré patterns – These appear when high-frequency details are undersampled, causing interference between the scene detail and the pixel grid. The result is wavy patterns that are not present in the scene but emerge due to insufficient resolution or sampling precision. One way to address this is to train the model on more samples containing the textures and structures that commonly cause these artifacts. With sufficiently diverse and representative training data, the model learns to resolve these issues during denoising.

    Figure 9 – Before and after adding moiré pattern data.

  • Shadow reconstruction – Shadows are inherently tricky for denoisers when there is no supporting information in motion vectors or guide buffers, and the model relies solely on the noisy color input. When we introduce training samples with different lighting conditions and animation, our model gradually learns to reconstruct shadows more effectively.

    Figure 10 – Left: noisy 1SPP. Middle/Right: insufficient/extensive shadow data variations.

  • Disocclusion – One of the most challenging aspects for our model is handling disocclusions. These occur in regions that were occluded in the previous frame but become visible in the current one due to camera or object motion. Because these newly visible areas lack information from the previously denoised frame, reconstruction becomes difficult. The absence of consistent patterns makes it hard for our model to generalize, sometimes resulting in ghosting artifacts. As with other artifacts, adding diverse and representative training data can help mitigate the issue.

    Figure 11 – Before and after adding similar scene data in training.

  • Reflections – Similar to shadows, when reconstructing reflections the model relies only on the noisy color input. Providing the first non-specular hit in the auxiliary buffers can significantly improve the quality of reflections, especially for mirror-like surfaces.

    Figure 12 – Left: Without first non-specular hit data in albedo. Right: With it included.
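Here is the temporal-loss sketch referenced in the flickering item above: the previous output is warped into the current frame with motion vectors, and the model is penalized for deviating from it. The warp uses bilinear `grid_sample`, and the `lambda_t` weight that trades stability against ghosting is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def warp(prev, motion):
    """Backward-warp prev (B, C, H, W) by motion vectors in pixels (B, 2, H, W)."""
    b, _, h, w = prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys)).float().unsqueeze(0) + motion  # sample coords
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0  # normalize x to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0  # normalize y to [-1, 1]
    return F.grid_sample(prev, grid.permute(0, 2, 3, 1), align_corners=True)

def reconstruction_loss(curr_out, prev_out, motion, reference, lambda_t=0.2):
    spatial = F.l1_loss(curr_out, reference)                # match the reference
    temporal = F.l1_loss(curr_out, warp(prev_out, motion))  # stay consistent
    return spatial + lambda_t * temporal  # too large a lambda_t -> ghosting
```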

Perceived quality of distortions

Each of the artifacts above affects perceived rendering quality in a different way. The complex interdependencies between artifact types and perceived visual quality can be effectively modeled using objective quality metrics. The results from these metrics can then be used to optimize model parameters or rebalance training data to achieve perceptually optimal outcomes.
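A sketch of what such metric-driven rebalancing could look like is below. PSNR stands in only because it is trivial to compute; a perceptual metric would better track perceived quality, and the per-category scores are made up for the example:

```python
import numpy as np

def psnr(img, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB between an output and its reference."""
    mse = np.mean((img - ref) ** 2)
    return 10.0 * np.log10(peak * peak / mse)

def rebalance_weights(scores, strength=0.5):
    """Map per-category quality scores (higher = better) to sampling weights:
    categories scoring below the best get proportionally more training data."""
    best = max(scores.values())
    return {cat: 1.0 + strength * (best - s) for cat, s in scores.items()}

# Hypothetical per-artifact PSNR averages over a validation set (dB).
scores = {"moire": 28.1, "shadows": 31.4, "textures": 33.0}
print(rebalance_weights(scores))  # moiré crops get the largest weight
```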


Conclusion

Despite the many challenges, we have come a long way from our initial explorations reconstructing simple scenes to the extremely challenging large-scale Jungle Ruins scene, featuring animated trees, vegetation, foliage, varied materials, and dynamic shadows and lighting conditions, fully path traced at 1 SPP and achieving 30 FPS at 1440p on an Intel B580 GPU.

 

About the Author
I support the Real-Time Graphics Research team at Intel, focusing on a combination of classical and novel neural technologies that push the boundaries of industrial research. Previously, I led the software engineering team within Meta Reality Labs' Graphics Research team, working on a graphics stack for socially acceptable augmented reality (AR) glasses. Before that, I worked on bringing telepresence technology to the Microsoft HoloLens AR headset.