I'm currently developing a fancy GI renderer on top of the rendering primitives, but I've had some trouble using the "blocked" versions of the functions. All of the intersection functions, and most of the others, are designed primarily with rendering on-screen blocks (for example, 32x32 pixels) in mind. From a previous discussion on the forum, I know that block sizes that are multiples of 4 are speed-optimized (presumably due to SSE vectorization).
The problem is that almost every modern renderer shoots a lot of secondary rays (diffuse reflections/refractions, ambient occlusion, final gathering, adaptive sampling, soft shadows, area lights, etc.), and for these, calling the functions with an "IppiSize block" argument is rather cumbersome. Almost every time, those secondary rays determine the final performance.
Also, the number of secondary rays usually cannot be arranged into a block whose dimensions are both multiples of 4, and this makes the caller code clumsy.
For example: suppose I need to trace 921 secondary rays. (The number is determined dynamically from surface shader parameters and the importance of the rendering, or increased adaptively, e.g. for soft shadows to detect shadow edges, so it cannot be fixed in advance.)
To perform this kind of intersection test, I'm currently splitting the work into two calls: one using a 228x4 block to get the most out of the SSE optimization, and one using a 9x1 block to trace the remaining samples. (I measured the overhead of the extra function call, and it is smaller than the gain from the optimization.)
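The caller-side splitting pattern looks roughly like the sketch below. The names here (Size2D, intersect_block) are hypothetical stand-ins, not the library's actual API; the stand-in just tallies how many rays each block covers so the split can be sanity-checked.

```c
#include <stddef.h>

/* Stand-in for the library's IppiSize (name assumed for illustration). */
typedef struct { int width; int height; } Size2D;

/* Stand-in for the real block-based intersect call: it only counts
   the rays a block covers, so the split logic can be verified. */
static size_t g_rays_traced = 0;
static void intersect_block(size_t first_ray, Size2D block)
{
    (void)first_ray;
    g_rays_traced += (size_t)block.width * (size_t)block.height;
}

/* Caller-side split: one big multiple-of-4 block for the SSE path,
   then one narrow block for the leftover samples. */
static void trace_secondary(size_t n)
{
    size_t w4  = n / 4;   /* width of the (w4 x 4) SSE-friendly block  */
    size_t rem = n % 4;   /* leftover rays, traced as a (rem x 1) block */
    if (w4)  intersect_block(0,      (Size2D){ (int)w4,  4 });
    if (rem) intersect_block(4 * w4, (Size2D){ (int)rem, 1 });
}
```

For 921 rays this simple n/4 split produces a 230x4 block plus a 1x1 block; the 228x4 + 9x1 split I use is the same idea with a slightly different grouping.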
So here's my suggestion: would it be possible, in a future release, to add versions of these functions that take one-dimensional ray arrays, with the library handling the multiple-of-4-plus-remainder splitting transparently for the caller? That would simplify development quite a bit.
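To make the request concrete, here is a minimal sketch of what such a 1-D entry point could look like. All names (Size2D, intersect_rays_1d, the callback type) are made up for illustration and are not the library's actual API; the existing block-based call is passed in as a callback purely to keep the sketch self-contained.

```c
#include <stddef.h>

typedef struct { int width; int height; } Size2D;  /* stand-in for IppiSize */

/* Signature standing in for the existing block-based intersect call. */
typedef void (*block_intersect_fn)(size_t first_ray, Size2D block);

/* Proposed 1-D front end: the caller hands over a flat ray count and
   the library issues the multiple-of-4 call plus the remainder call
   internally, invisible to the caller. */
static void intersect_rays_1d(size_t n, block_intersect_fn intersect)
{
    size_t w4  = n / 4;
    size_t rem = n % 4;
    if (w4)  intersect(0,      (Size2D){ (int)w4,  4 });
    if (rem) intersect(4 * w4, (Size2D){ (int)rem, 1 });
}

/* Example callback that counts rays, to show the wrapper covers
   every sample exactly once. */
static size_t g_count = 0;
static void count_block(size_t first_ray, Size2D block)
{
    (void)first_ray;
    g_count += (size_t)block.width * (size_t)block.height;
}
```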