AVX + Triangle8 slow

theigors · 06-26-2013

Hi All

Unfortunately, I have no machine with AVX, so I can't debug this myself. Users tell me that with AVX, rendering is about 3 times slower when bvh4 + triangle8 is used. If I set bvh4 + triangle4, the render time with AVX is approximately the same as with SSSE3. Any hint or advice is much appreciated.

Thanks

SvenW_Intel · 06-27-2013

Are you using Embree as is, or did you extract the ray traversal kernels into your application? If the latter is the case, you have to be careful that the _mm256_zeroupper() intrinsic is active at the beginning and end of the Embree traversal kernel. Otherwise there will be a performance penalty if the user code is NOT compiled with AVX enabled.
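A minimal sketch of the zero-upper advice above, in C++. The names `ZeroUpperGuard` and `sum_kernel` are hypothetical and are not part of Embree; `sum_kernel` merely stands in for an extracted traversal kernel. The only real API assumed is the `_mm256_zeroupper()` intrinsic (spelled with a single leading underscore) from `<immintrin.h>`, and the AVX path is compiled only when the compiler defines `__AVX__`:

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical RAII guard (not an Embree type): clears the upper 128 bits of
// all YMM registers when a scope is entered and again when it is left.
struct ZeroUpperGuard {
    ZeroUpperGuard()  { clear(); }
    ~ZeroUpperGuard() { clear(); }
    static void clear() {
#if defined(__AVX__)
        _mm256_zeroupper();  // emits VZEROUPPER
#endif
        // Without __AVX__ this is a no-op: there is no 256-bit state to clear.
    }
};

// Hypothetical stand-in for an extracted kernel: sums n floats, using
// 8-wide AVX adds when available and a scalar loop for the remainder.
float sum_kernel(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    ZeroUpperGuard guard;  // zero-upper at the beginning and at the end
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (float v : lanes) total += v;
#endif
    for (; i < n; ++i) total += data[i];  // scalar tail (or whole array without AVX)
    return total;
}
```

With a guard like this, code built without AVX that calls `sum_kernel` enters and leaves the 256-bit section with the upper register halves cleared. Whether this removes the slowdown reported in this thread is not established here.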

Furthermore, bvh4.triangle8 will give you only a small performance benefit (if any) over bvh4.triangle4 anyway. Thus the best workaround is simply to use bvh4.triangle4.

theigors · 06-28-2013

Hi, Sven

>> Are you using Embree as is, or did you extract the ray traversal kernels into your application? If the latter is the case, you have to be careful that the _mm256_zeroupper() intrinsic is active at the beginning and end of the Embree traversal kernel. Otherwise there will be a performance penalty if the user code is NOT compiled with AVX enabled.<<

Yes, I've extracted the kernels (btw, it was much easier than I expected). I've built with __AVX__ (and __SSE4_2__), so _mm256_zeroupper() should be active. I can use Triangle8 with SSSE3 etc. with no time penalty, but also no speedup. I'll debug on the user's side (it will take time) and let you know.

>> Furthermore, bvh4.triangle8 will give you only a small performance benefit (if any) over bvh4.triangle4 anyway. Thus the best workaround is simply to use bvh4.triangle4.<<

Oops! That's really surprising; I expected something like a 1.5x speedup. If possible, could you tell me why?

Thx for your help

theigors · 06-28-2013

Hi, Sven

1) Yes, I've extracted the ray tracing cores and built with __AVX__ (and __SSE4_2__), so _mm256_zeroupper() should be active. If I use Triangle8 with SSSE3 etc., there is no time penalty; the speed is approximately the same as for Triangle4. I'll debug on the user's side (it will take time) and let you know.

2) It's a surprise to me that Triangle8 is not significantly faster! If possible, please explain why.

Thx for your help