topic AVX + Triangle8 slow in Intel® Embree Ray Tracing Kernels

theigors — Thu, 27 Jun 2013 05:03:06 GMT

Hi All

Unfortunately I have no machine with AVX, so I can't debug this myself. Users tell me that with AVX the render time is about 3 times slower when bvh4 + triangle8 is used. If I set bvh4 + triangle4, the render time with AVX is approximately the same as with SSSE3. Any hint or advice is much appreciated.

Thanks

SvenW_Intel — Thu, 27 Jun 2013 12:12:33 GMT

Are you using Embree as is, or did you extract the ray traversal kernels into your application? If the latter is the case, you have to be careful that the _mm256_zeroupper() intrinsic is active at the beginning and end of the Embree traversal kernel. Otherwise there will be a performance penalty if the user code is NOT compiled with AVX enabled.

Further, bvh4.triangle8 will only give you a small performance benefit (if any) over bvh4.triangle4 anyway. Thus the best workaround is simply to use bvh4.triangle4.
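
For readers who extracted the kernels, here is a minimal sketch of what "zeroupper at the beginning and end of the kernel" could look like. It is an illustration only, not Embree's actual code: `ZeroUpperGuard` and `sum8` are hypothetical names, and `sum8` is just a stand-in for a traversal kernel. The only real API used is the standard `_mm256_zeroupper()` intrinsic (plus a few ordinary AVX/SSE intrinsics), and everything falls back to scalar code when the file is not compiled with `__AVX__` defined.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical RAII helper (not part of Embree): clears the upper halves of
// the YMM registers on entry to and exit from an AVX code region, so that
// surrounding non-AVX (legacy SSE) code does not hit a transition penalty.
struct ZeroUpperGuard {
    ZeroUpperGuard()  { clear(); }
    ~ZeroUpperGuard() { clear(); }
    static void clear() {
#if defined(__AVX__)
        _mm256_zeroupper();
#endif
    }
};

// Hypothetical stand-in for a traversal kernel: sums 8 floats.
float sum8(const float* v) {
    assert(v != nullptr);
    ZeroUpperGuard guard;  // zeroupper now, and again when the function returns
#if defined(__AVX__)
    __m256 x  = _mm256_loadu_ps(v);             // load all 8 floats
    __m128 lo = _mm256_castps256_ps128(x);      // lower 4 lanes
    __m128 hi = _mm256_extractf128_ps(x, 1);    // upper 4 lanes
    __m128 s  = _mm_add_ps(lo, hi);             // 4 partial sums
    s = _mm_hadd_ps(s, s);                      // 2 partial sums
    s = _mm_hadd_ps(s, s);                      // total in lane 0
    return _mm_cvtss_f32(s);
#else
    float s = 0.0f;                             // scalar fallback without AVX
    for (std::size_t i = 0; i < 8; ++i) s += v[i];
    return s;
#endif
}
```

Placing one guard object at the top of each extracted kernel entry point covers both the entry and every return path, which is easier to keep correct than pairing the calls by hand.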

theigors — Fri, 28 Jun 2013 14:01:34 GMT

Hi, Sven

>> Are you using Embree as is, or did you extract the ray traversal kernels into your application? If the latter is the case, you have to be careful that the _mm256_zeroupper() intrinsic is active at the beginning and end of the Embree traversal kernel. Otherwise there will be a performance penalty if the user code is NOT compiled with AVX enabled.<<

Yes, I've extracted the kernels (btw, it was much easier than I expected). I've built with __AVX__ (and __SSE4_2__) defined, so _mm256_zeroupper() should be active. I can use Triangle8 with SSSE3 etc.: there is no time penalty, but also no speedup. I'll debug on the user's side (it will take time) and let you know.

>>Further, bvh4.triangle8 will only give you a small performance benefit (if any) over bvh4.triangle4 anyway. Thus the best workaround is simply to use bvh4.triangle4.<<

Oops! That's really surprising; I expected something like a 1.5x speedup. If possible, could you tell me why?

Thx for your help

theigors — Fri, 28 Jun 2013 14:09:12 GMT

Hi, Sven

1) Yes, I've extracted the rt cores and built with __AVX__ (and __SSE4_2__) defined, so _mm256_zeroupper() should be active. If I use Triangle8 with SSSE3 etc., there is no time penalty; the speed is approximately the same as for Triangle4. I'll debug on the user's side (it will take time) and let you know.

2) It's a surprise to me that Triangle8 is not significantly faster! If possible, please explain why.

Thx for your help