I'm not sure I understand the context of the question. Are you referring to coding techniques or to compiler command-line options for best optimization? For Intel Compiler options, I would recommend using the O3 switch and the proper instruction set switch to enable architecture-specific SSE/AVX instructions (/QxAVX, SSE4.2, SSSE3, SSE3, SSE2).
>8 core CPU is comparable to an older GPU for 500x500x500 volumes
It would be more accurate to say from 500x500x500 and up... CPU VR scales much better, so its advantage grows with bigger volumes; the same is true for rendering quality: with a higher sampling rate the CPU advantage goes up (this holds for the latest GPU vs. the latest Intel CPU as well). The combination of both, [bigger size + higher rendering quality], makes the GPU look hopelessly lost. The MIC architecture is the platform where CPU VR development will gain a dramatic boost...
>ispc on some sample datasets and see if it is fast enough.
No, it is definitely not a good example...
Porting an SPMD volume ray-casting design to the CPU via ispc will likely show quite mediocre performance. The rendering algorithms must be designed from the ground up for a flexible MIMD machine, which is what a multi-core CPU in fact is. Some SIMD optimizations can definitely speed up the math (faster lighting, interpolation, etc.), but that is complementary and does not affect how the design scales.
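To make the MIMD point a bit more concrete, here is a minimal C++ sketch of one common way to spread a frame over CPU cores: the image is cut into tiles and each worker thread pulls the next tile index from a shared atomic counter, so threads that hit cheap tiles (empty space, early ray termination) simply take more of them. The names `renderTiles` and `renderTile` are hypothetical and only for illustration; this is not taken from any particular renderer, and a real engine would add cache-aware tile ordering and SIMD inside each tile.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: calls renderTile(i) exactly once for every i in
// [0, tileCount). Tiles are handed out dynamically, so each thread follows
// its own control flow and finishes whenever the shared counter runs out.
inline void renderTiles(std::size_t tileCount, unsigned threadCount,
                        const std::function<void(std::size_t)>& renderTile) {
    assert(renderTile && "renderTile callback must be set");
    if (threadCount == 0) threadCount = 1;

    std::atomic<std::size_t> next{0};  // index of the next unclaimed tile

    auto worker = [&] {
        for (;;) {
            // Claim one tile; fetch_add guarantees no tile is taken twice.
            const std::size_t tile = next.fetch_add(1, std::memory_order_relaxed);
            if (tile >= tileCount) return;
            renderTile(tile);
        }
    };

    std::vector<std::thread> pool;
    pool.reserve(threadCount);
    for (unsigned i = 0; i < threadCount; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();  // all tiles are done after the joins
}
```

A caller would typically pass `std::thread::hardware_concurrency()` as the thread count and a lambda that ray-casts the pixels of one tile. Tiles should be small enough that the dynamic hand-out balances the load, yet large enough to keep counter traffic negligible.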
There is no good public CPU VR at this point (to my knowledge; btw, I would love to find one).
I've just read the paper "Full-Resolution Interactive CPU Volume Rendering with Coherent BVH Traversal". Quite an interesting read. The rendering quality of the described algorithm is quite poor, however.
The way the gradient is calculated, though fast to compute, results in severe image artifacts. I tried it in my Volumize rendering engine, and the resulting image quality was much worse. That's basically because the gradient is point sampled in the 3 component directions. Also, the preintegration is not suitable for 16-bit voxel data, as it does point sampling based on 8-bit data.
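As a rough illustration of the difference, the C++ sketch below contrasts a point-sampled gradient (nearest-voxel forward differences along x, y, z) with central differences taken over trilinearly interpolated samples. This is my reading of "point sampled", not the paper's actual code, and the `Volume` struct and function names are hypothetical. The point-sampled version changes in steps as the sample position crosses voxel boundaries, which is one plausible source of shading artifacts; the interpolated version varies continuously but costs six trilinear fetches.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec3 = std::array<float, 3>;

// Hypothetical minimal volume: 16-bit voxels stored x-fastest.
struct Volume {
    int nx, ny, nz;
    std::vector<std::uint16_t> data;

    // Voxel fetch with clamp-to-edge addressing.
    float at(int x, int y, int z) const {
        x = std::min(std::max(x, 0), nx - 1);
        y = std::min(std::max(y, 0), ny - 1);
        z = std::min(std::max(z, 0), nz - 1);
        return static_cast<float>(
            data[(static_cast<std::size_t>(z) * ny + y) * nx + x]);
    }

    // Point sample: value of the nearest voxel.
    float nearest(float x, float y, float z) const {
        return at(static_cast<int>(std::lround(x)),
                  static_cast<int>(std::lround(y)),
                  static_cast<int>(std::lround(z)));
    }

    // Trilinear interpolation of the 8 surrounding voxels.
    float trilinear(float x, float y, float z) const {
        const int x0 = static_cast<int>(std::floor(x));
        const int y0 = static_cast<int>(std::floor(y));
        const int z0 = static_cast<int>(std::floor(z));
        const float fx = x - x0, fy = y - y0, fz = z - z0;
        auto lerp = [](float a, float b, float t) { return a + (b - a) * t; };
        const float c00 = lerp(at(x0, y0, z0), at(x0 + 1, y0, z0), fx);
        const float c10 = lerp(at(x0, y0 + 1, z0), at(x0 + 1, y0 + 1, z0), fx);
        const float c01 = lerp(at(x0, y0, z0 + 1), at(x0 + 1, y0, z0 + 1), fx);
        const float c11 = lerp(at(x0, y0 + 1, z0 + 1), at(x0 + 1, y0 + 1, z0 + 1), fx);
        return lerp(lerp(c00, c10, fy), lerp(c01, c11, fy), fz);
    }
};

// Cheap variant: forward differences between nearest-voxel samples.
inline Vec3 gradientPointSampled(const Volume& v, float x, float y, float z) {
    const float c = v.nearest(x, y, z);
    return {v.nearest(x + 1, y, z) - c,
            v.nearest(x, y + 1, z) - c,
            v.nearest(x, y, z + 1) - c};
}

// Smoother variant: central differences over trilinear samples, step h voxels.
inline Vec3 gradientCentralTrilinear(const Volume& v, float x, float y, float z,
                                     float h = 1.0f) {
    return {(v.trilinear(x + h, y, z) - v.trilinear(x - h, y, z)) / (2 * h),
            (v.trilinear(x, y + h, z) - v.trilinear(x, y - h, z)) / (2 * h),
            (v.trilinear(x, y, z + h) - v.trilinear(x, y, z - h)) / (2 * h)};
}
```

On a linear ramp both functions return the same gradient; they diverge on curved data, where the point-sampled result stays constant inside a voxel and then jumps at its boundary.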
The Volumize engine uses AVX2 for CPU rendering and DX11 for GPU rendering. Both the CPU and the GPU path perform exactly the same calculations and produce exactly the same uncompromised image quality.