Applications scaled with rtcIntersect4/8/16 and Importance of Scheduler

lalith-mcw · ‎04-09-2024

I have a few questions about the performance and scheduler side of my application. I was looking into the usage and parallelism of rtcIntersect4/8/16 and rtcOccluded4/8/16 and went through a few comments:

https://github.com/embree/embree/issues/356#issuecomment-1003924412

https://github.com/embree/embree/issues/413#issuecomment-1341321901

1. The default examples (i.e., pathtracer, motionblur) shipped with `Embree` do not seem to use rtcIntersect4/8/16 or rtcOccluded4/8/16. Is there a possible performance improvement if the rays are coherent across the packet? Please provide some examples that take advantage of these functions.

2. Are there other provisions for processing multiple pixels together to scale performance while rendering?

3. The default task scheduler is TBB, and most of the time is spent on task stitching (synchronization). Are there better ways to take advantage of SIMD?

4. Most of the application is restricted to Vec2 and Vec3 formats (within the ray tracer core), which stick to 128-bit vectors. A few other rendering applications use 256-bit formats.

FlorianR_Intel · ‎04-16-2024

Hi,

Regarding 1.) Which rendering mode (ray1, ray4, ray8, ray16) is fastest depends on many parameters, so please experiment with the different ray packet sizes. For simple "primary hit" use cases, ray packets are expected to perform better (usually the wider the better). Not all workloads can use ray packets without code modifications, though. For example, a "mega-kernel"-style path tracer (such as the Embree path tracer tutorial) usually uses a single-ray model, and using ray packets requires restructuring it into a "wave-front" style.
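
To make the "wave-front" point a bit more concrete: such a tracer keeps a queue of the rays that are still alive after each bounce and regroups them into packets before the next intersection call. Below is a minimal sketch of just that regrouping step. `Packet4` and `makePackets` are hypothetical names for illustration and are not part of Embree; the `-1`/`0` lane values follow the convention of the valid mask passed to `rtcIntersect4`.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not an Embree type: four ray IDs plus a lane mask.
struct Packet4 {
  std::array<int, 4> rayID;  // index into the application's ray queue
  std::array<int, 4> valid;  // -1 = lane active, 0 = lane inactive
};

// Regroup the IDs of rays that are still alive into 4-wide packets.
// The last packet is padded with inactive lanes.
std::vector<Packet4> makePackets(const std::vector<int>& activeRays) {
  std::vector<Packet4> packets;
  for (std::size_t base = 0; base < activeRays.size(); base += 4) {
    Packet4 p;
    for (std::size_t lane = 0; lane < 4; ++lane) {
      const std::size_t i = base + lane;
      const bool live = i < activeRays.size();
      p.rayID[lane] = live ? activeRays[i] : -1;
      p.valid[lane] = live ? -1 : 0;
    }
    packets.push_back(p);
  }
  return packets;
}
```

With six live rays this yields two packets, the second with only its first two lanes active. How coherent the regrouped rays are depends on how the queue is ordered, which this sketch does not address.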

Regarding 2.) That's a very generic rendering design question and not easy to answer from the Embree perspective.

Regarding 3.) This should not be the case for real-life workloads, and TBB has nothing to do with SIMD. Embree's internal implementation uses SIMD instructions for efficient BVH traversal, for example, but this is orthogonal to the tasking system used by the application.

Regarding 4.) Again, very generic. Embree tries to be flexible enough to handle most use cases. For example, one could also SIMDfy a spectral rendering application by using 4-wide vectors for the spectral resolution (per path/ray). So SIMDfying over rays is not always the best or only option.
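
As a rough illustration of SIMD over the spectral dimension rather than over rays, here is a small sketch. `Spectrum4` and `average` are hypothetical names, not Embree types, and the choice of four wavelengths per path is only an assumption taken from the "4-wide" remark above.

```cpp
#include <array>

// Hypothetical type: the value carried by one path at four sampled wavelengths.
struct Spectrum4 {
  std::array<float, 4> v;
};

// Element-wise product, e.g. path throughput times surface reflectance.
// Four independent multiplies fit one 128-bit SIMD register.
inline Spectrum4 operator*(const Spectrum4& a, const Spectrum4& b) {
  Spectrum4 r;
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
  return r;
}

// Collapse the four wavelengths to a single value by averaging.
inline float average(const Spectrum4& s) {
  return 0.25f * (s.v[0] + s.v[1] + s.v[2] + s.v[3]);
}
```

Here each path still calls the single-ray intersection routine, and the 4-wide arithmetic happens in shading.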

Cheers,

Embree Team

Siyabonga · ‎05-28-2024

Addressing Performance and Scheduler Queries for Embree

Using rtcIntersect4/8/16 and rtcOccluded4/8/16 for Performance Improvements:
- The rtcIntersect4/8/16 and rtcOccluded4/8/16 functions can offer performance improvements if the rays are coherent across the packet. These functions utilize SIMD (Single Instruction, Multiple Data) instructions to process multiple rays simultaneously, which can be more efficient than processing rays individually.
- Example of Utilizing These Functions:
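  ```cpp
  // Embree 3 call shown (context is an RTCIntersectContext*); Embree 4 drops that argument.
  // Packets are structure-of-arrays: each field holds one value per lane.
  RTCRayHit4 rayhit4;
  alignas(16) int valid[4] = {-1, -1, -1, -1}; // -1 = active lane, 0 = inactive
  for (int i = 0; i < 4; ++i) {
    // Initialize lane i of the packet
    rayhit4.ray.org_x[i] = ...; // likewise org_y/org_z, dir_x/dir_y/dir_z, tnear, tfar, ...
    rayhit4.hit.geomID[i] = RTC_INVALID_GEOMETRY_ID;
  }
  // Perform intersection using rtcIntersect4
  rtcIntersect4(valid, scene, context, &rayhit4);
  // Check results
  for (int i = 0; i < 4; ++i) {
    if (rayhit4.hit.geomID[i] != RTC_INVALID_GEOMETRY_ID) {
      // Process hit
    }
  }
  ```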
- This approach can improve performance when rays are coherent, as the packetized intersection can take advantage of data locality and SIMD instructions.
Processing Multiple Pixels Together for Performance Scaling:
- To process multiple pixels together, you can batch rays into packets and use the aforementioned packetized ray tracing functions. Grouping rays that are spatially close can enhance coherence, leading to better SIMD utilization.
- Pixel Batching Example:
  ```cpp
  // Example for a 4x4 pixel block: one 4-ray packet per row of the block.
  // Assumes width and height are multiples of 4; Embree 3 call shown.
  alignas(16) int valid[4] = {-1, -1, -1, -1}; // -1 = active lane
  for (int y = 0; y < height; y += 4) {
    for (int x = 0; x < width; x += 4) {
      RTCRayHit4 rayHits[4];
      // Initialize rays for the 4x4 block
      for (int i = 0; i < 4; ++i) {     // packet i covers row y + i
        for (int j = 0; j < 4; ++j) {   // lane j covers column x + j
          rayHits[i].ray.org_x[j] = ...; // Set up the remaining ray fields likewise
          rayHits[i].hit.geomID[j] = RTC_INVALID_GEOMETRY_ID; // Initialize hit
        }
      }
      // Perform intersection
      for (int i = 0; i < 4; ++i) {
        rtcIntersect4(valid, scene, context, &rayHits[i]);
      }
      // Process results
      for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
          if (rayHits[i].hit.geomID[j] != RTC_INVALID_GEOMETRY_ID) {
            // Process hit for pixel (x + j, y + i)
          }
        }
      }
    }
  }
  ```
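
When the image width is not a multiple of the packet width, the last packet in a row has lanes with no pixel behind them. Those lanes can be switched off through the valid mask instead of being traced. A small sketch of computing such a mask follows; `Mask4` and `rowMask` are hypothetical helpers, not Embree API, and the 16-byte alignment is there because `rtcIntersect4` expects an aligned mask.

```cpp
// Hypothetical helper type: a 16-byte-aligned lane mask for a 4-wide packet.
struct alignas(16) Mask4 {
  int lane[4];
};

// Lane mask for the 4-pixel row segment starting at pixel column x.
// Lanes whose pixel falls outside the image are inactive (0);
// active lanes are -1.
Mask4 rowMask(int x, int width) {
  Mask4 m;
  for (int i = 0; i < 4; ++i) {
    m.lane[i] = (x + i < width) ? -1 : 0;
  }
  return m;
}
```

For a 6-pixel-wide image, `rowMask(4, 6)` activates the first two lanes only, and `m.lane` can then be passed as the `valid` argument.
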
Improving Task Scheduling and SIMD Utilization:
- Task Scheduling with TBB:
  - Intel Threading Building Blocks (TBB) is the default scheduler and can effectively manage parallelism. However, excessive synchronization can lead to overhead.
  - To mitigate this, consider fine-tuning TBB tasks to minimize synchronization points. For instance, grouping more work into each task or using task partitioning strategies can help.
- SIMD Usage:
  - Ensure that your ray processing kernels are vectorized effectively. Using explicit SIMD intrinsics or relying on the compiler's auto-vectorization can help.
  - You can also look into using larger vector types if your hardware supports it (e.g., AVX2 or AVX-512).
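
The idea of grouping more work into each task can be sketched without TBB. The example below uses plain `std::thread` and an atomic counter so that it stays self-contained; `forEachTile` and its parameters are hypothetical names. Each worker claims `grain` tiles at a time, so a larger grain means fewer synchronization points. With TBB, the grain size argument of `tbb::blocked_range` plays a comparable role.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Run renderTile(tileIndex) for every tile in [0, tileCount), handing out
// tiles in chunks of `grain` so each synchronization covers more work.
void forEachTile(int tileCount, int grain, unsigned threadCount,
                 const std::function<void(int)>& renderTile) {
  assert(grain > 0);  // a zero grain would never make progress
  std::atomic<int> next{0};
  auto worker = [&] {
    for (;;) {
      const int begin = next.fetch_add(grain);  // the only shared write
      if (begin >= tileCount) return;
      const int end = std::min(begin + grain, tileCount);
      for (int t = begin; t < end; ++t) renderTile(t);
    }
  };
  std::vector<std::thread> pool;
  for (unsigned i = 0; i < threadCount; ++i) pool.emplace_back(worker);
  for (std::thread& th : pool) th.join();
}
```

A call such as `forEachTile(tileCount, 8, std::thread::hardware_concurrency(), renderTile)` would hand out eight tiles per claim. The best grain depends on how uneven the per-tile cost is, so it is worth measuring.
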
Using 256-bit Vectors:
- Advantages:
  - Switching to 256-bit vectors can potentially double the throughput of SIMD operations, provided that your hardware and compiler support it.
- Implementation:
  - Modify your data structures to use the AVX types __m256 (eight single-precision floats) or __m256d (four double-precision values). Ensure that your algorithms are updated to leverage these wider registers.
  - Example:
    ```cpp
    #include <immintrin.h> // AVX intrinsics

    __m256 rayDirsX = _mm256_set_ps(...); // Set 8 float values
    __m256 rayDirsY = _mm256_set_ps(...);
    __m256 rayDirsZ = _mm256_set_ps(...);
    // Perform SIMD operations on these vectors
    __m256 results = _mm256_add_ps(rayDirsX, rayDirsY);
    ```
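
On the data-structure side, a structure-of-arrays layout is what lets eight rays share one 256-bit register per component. The sketch below uses plain C++ rather than intrinsics and leaves vectorization to the compiler, which may or may not happen depending on optimization and target flags. `Dir8` and `normalize8` are hypothetical names, and the directions are assumed to be non-zero.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical structure-of-arrays bundle of eight ray directions.
// 32-byte alignment matches one 256-bit register per component.
struct alignas(32) Dir8 {
  float x[8], y[8], z[8];
};

// Normalize all eight directions. The loop body has no cross-lane
// dependencies, so an optimizing compiler can vectorize it.
void normalize8(Dir8& d) {
  for (std::size_t i = 0; i < 8; ++i) {
    const float len =
        std::sqrt(d.x[i] * d.x[i] + d.y[i] * d.y[i] + d.z[i] * d.z[i]);
    d.x[i] /= len;
    d.y[i] /= len;
    d.z[i] /= len;
  }
}
```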

By utilizing these techniques, you can enhance the performance of your ray tracing application with Embree, making better use of SIMD instructions and improving task scheduling efficiency.