Intel® ISA Extensions

Rasterizer optimization help...

kevin-bray
Beginner
367 Views
I'm currently in the unusual position of writing a software rasterizer. So far I've vectorized the C code with SSE2 and, between vectorization and prefetching, managed a 9x speedup over the compiler-generated code. I suspect better instruction scheduling could buy me more, so I'd like to rewrite some of my assembly to make better use of the execution units within the CPU. My target CPUs are the Pentium 4 and up, which limits me to SSE2.

That all being said, is there a chart somewhere that lists instructions, their latency, their issue port, and their respective execution unit?

Thanks,
Kevin B
3 Replies
capens__nicolas
New Contributor I
Quoting - Kevin Bray
That all being said, is there a chart somewhere that lists instructions, their latency, their issue port, and their respective execution unit?
You can find instruction latency tables and such in the Optimization Reference Manual: http://www.intel.com/products/processor/manuals/

For the rasterizer itself, what algorithm are you using? In my experience it's often not worth trying to optimize by improving instruction scheduling (the processor does out-of-order execution anyway). Optimizing at a higher level, for example with a specialized path for small triangles, can sometimes give you much greater speedups...
kevin-bray
Beginner
Quoting - c0d1f1ed
You can find instruction latency tables and such in the Optimization Reference Manual: http://www.intel.com/products/processor/manuals/

For the rasterizer itself, what algorithm are you using? In my experience it's often not worth trying to optimize by improving instruction scheduling (the processor does out-of-order execution anyway). Optimizing at a higher level, for example with a specialized path for small triangles, can sometimes give you much greater speedups...

Thanks a lot! This actually helped me out. =)

I'm using an algorithm that I think is called tiled rasterization. I calculate line equations for the edges of the triangle, then do coarse rasterization over blocks of 16x16 pixels, classifying each block as falling partially or fully inside the triangle. For blocks that are entirely inside the triangle, and where the early z-test passes, I have a very fast path available.
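In outline, the per-block classification works something like this (a simplified scalar sketch of the idea, not my actual SSE2 code; the names here are made up). Since each edge equation is linear, its extremes over a rectangular block occur at the corners, so testing the four corners is enough:

    /* Edge equation E(x,y) = a*x + b*y + c; E >= 0 means "inside" that edge. */
    typedef struct { float a, b, c; } Edge;

    typedef enum { BLOCK_OUTSIDE, BLOCK_PARTIAL, BLOCK_INSIDE } BlockClass;

    static BlockClass classify_block(const Edge edge[3],
                                     float x0, float y0, float size)
    {
        BlockClass result = BLOCK_INSIDE;
        for (int i = 0; i < 3; ++i) {
            /* Evaluate the edge equation at all four block corners. */
            float e00 = edge[i].a * x0          + edge[i].b * y0          + edge[i].c;
            float e10 = edge[i].a * (x0 + size) + edge[i].b * y0          + edge[i].c;
            float e01 = edge[i].a * x0          + edge[i].b * (y0 + size) + edge[i].c;
            float e11 = edge[i].a * (x0 + size) + edge[i].b * (y0 + size) + edge[i].c;

            int inside = (e00 >= 0.0f) + (e10 >= 0.0f) + (e01 >= 0.0f) + (e11 >= 0.0f);
            if (inside == 0) return BLOCK_OUTSIDE;   /* trivially rejected     */
            if (inside <  4) result = BLOCK_PARTIAL; /* edge crosses the block */
        }
        return result; /* BLOCK_INSIDE if every corner passed every edge */
    }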

As it turns out, the reason I initially posted is that I was seeing a very bizarre spike on one particular instruction. I had a situation where I was taking 8 floating-point values across two XMM registers, converting them to integers, and then merging the results into a single SSE register using a 'packssdw' instruction. According to VTune, that pack instruction was taking up a huge share of the function's time, on the order of 50% (and there were plenty of other instructions).
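Isolated from the surrounding code, the hot sequence boils down to something like this (an illustrative sketch; the function name and how the values arrive are made up):

    #include <emmintrin.h> /* SSE2 intrinsics */

    static inline __m128i convert_and_pack(__m128 lo, __m128 hi)
    {
        __m128i ilo = _mm_cvtps_epi32(lo); /* cvtps2dq: 4 floats -> 4 int32 */
        __m128i ihi = _mm_cvtps_epi32(hi);
        return _mm_packs_epi32(ilo, ihi);  /* packssdw: 8 int32 -> 8 int16 */
    }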

My initial thought was that, due to the nature of the code, out-of-order execution wasn't being as effective as it could be, which is why I asked for the latency tables. However, VTune seemed to indicate that the instruction cache was full and that, as a result, the processor would stop fetching on that particular instruction. After seeing that, I guessed that the instruction was slow to decode because of its 66h operand-size prefix. So in response, I split the work into two loops: the first simply converts floats to integers using SSE into a temporary buffer, and the second packs those intermediate results into the final output. That had a *major* impact on the performance of the code, so here is my guess as to what was going on:

Due to loop unrolling in the original combined loop, the trace cache was being overrun on each loop iteration, so instructions constantly needed to be re-decoded. The slow decode of the packssdw instruction was then causing the major stall I was seeing. Once the code was refactored, the packssdw instruction could stay resident in the trace cache, so its decode time was only a factor on the first loop iteration.
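For reference, the refactored version is structured roughly like this (an illustrative sketch, not my real code; the buffer size, names, and the assumption that n is a multiple of 8 and at most 1024 are made up):

    #include <emmintrin.h>

    void convert_then_pack(const float *src, short *dst, int n)
    {
        __m128i tmp[256]; /* intermediate int32 results */

        /* Loop 1: cvtps2dq only -- convert 4 floats per iteration. */
        for (int i = 0; i < n / 4; ++i)
            tmp[i] = _mm_cvtps_epi32(_mm_loadu_ps(src + 4 * i));

        /* Loop 2: packssdw only -- pack two int32 vectors into 8 int16. */
        for (int i = 0; i < n / 8; ++i)
            _mm_storeu_si128((__m128i *)(dst + 8 * i),
                             _mm_packs_epi32(tmp[2 * i], tmp[2 * i + 1]));
    }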

whew.

So that's what I think was going on. It could very well be that I'm wrong and that I just got lucky, so if someone has some other ideas, please let me know. The more I can figure out about how these chips work, the better!

Thanks again for your help!

Kevin B

kevin-bray
Beginner
Also, I just went back to the original code and reduced the amount of work done per loop iteration (handling 4 pixels per loop instead of 8). This code seems to run at the same speed as the version that uses a temporary buffer.
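Roughly, that narrower loop looks like this (again an illustrative sketch with made-up names). packssdw with the same register in both operands duplicates the four results, so only the low 64 bits need to be stored:

    #include <emmintrin.h>

    void convert_pack_4wide(const float *src, short *dst, int n)
    {
        for (int i = 0; i < n / 4; ++i) {
            __m128i v = _mm_cvtps_epi32(_mm_loadu_ps(src + 4 * i));
            __m128i p = _mm_packs_epi32(v, v);             /* low 8 bytes valid */
            _mm_storel_epi64((__m128i *)(dst + 4 * i), p); /* movq: 4 int16s */
        }
    }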

Kevin B
