Hi All
Unfortunately I have no machine with AVX, so I can't debug this myself. Users tell me that with AVX the render time is about 3 times slower when bvh4 + triangle8 are used. If I set bvh4 + triangle4, the render time with AVX is approximately the same as with SSSE3. Any hint or advice is much appreciated.
Thanks
Are you using Embree as is, or did you extract the ray traversal kernels into your application? If the latter is the case, you have to be careful that the _mm256_zeroupper() intrinsic is active at the beginning and end of the Embree traversal kernel. Otherwise there will be a performance penalty if the user code is NOT compiled with AVX enabled.
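A minimal sketch of that placement, assuming the extracted kernel is wrapped in a function you control. The names `AvxZeroUpperGuard` and `traverse_kernel` are hypothetical and not part of Embree, and the loop body only stands in for the real traversal code. The guard issues `_mm256_zeroupper()` on entry and on exit, and compiles to a no-op when the translation unit is built without AVX:

```cpp
#if defined(__AVX__)
#include <immintrin.h>  // _mm256_zeroupper
#endif

// Hypothetical RAII helper (not an Embree type): clears the upper halves of
// the YMM registers when the guarded scope is entered and when it is left.
struct AvxZeroUpperGuard {
  AvxZeroUpperGuard() { zero(); }
  ~AvxZeroUpperGuard() { zero(); }

  static void zero() {
#if defined(__AVX__)
    _mm256_zeroupper();  // avoids AVX <-> legacy SSE transition penalties
#endif
  }
};

// Hypothetical stand-in for an extracted traversal kernel. The real kernel
// body would go where the loop is; the loop only exists so that the sketch
// has something to compute.
float traverse_kernel(const float* values, int n) {
  AvxZeroUpperGuard guard;  // zeroupper at the beginning of the kernel
  float sum = 0.0f;
  for (int i = 0; i < n; ++i) {
    sum += values[i];
  }
  return sum;  // guard destructor runs here: zeroupper at the end
}
```

Using a scope guard rather than two explicit calls means every return path out of the kernel is covered. Note that the `__AVX__` check only tells you how this file was compiled; it says nothing about whether the calling code was built with AVX.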
Further, bvh4.triangle8 will give you only a small performance benefit (if any) over bvh4.triangle4 anyway. Thus the best workaround is simply to use bvh4.triangle4.
Hi, Sven
>> Are you using Embree as is, or did you extract the ray traversal kernels into your application? If the latter is the case, you have to be careful that the _mm256_zeroupper() intrinsic is active at the beginning and end of the Embree traversal kernel. Otherwise there will be a performance penalty if the user code is NOT compiled with AVX enabled.<<
Yes, I've extracted the kernels (btw it was much easier than I expected). I've built with __AVX__ (and __SSE4_2__) defined, so _mm256_zeroupper() should be active. With SSSE3 and similar builds I can use Triangle8 with no time penalty, but also no speedup. I'll debug on the user's side (it needs time) and let you know.
>>Further, the bvh4.triangle8 will anyway only give you a small performance benefit (if any) over using the bvh4.triangle4. Thus the best workaround is to simply use the bvh4.triangle4.<<
Oops! That's really surprising, I expected something like a 1.5 times speedup. If possible, could you tell me why that is?
Thx for your help
Hi, Sven
1) Yes, I've extracted the ray tracing kernels and built with __AVX__ (and __SSE4_2__) defined, so _mm256_zeroupper() should be active. If I use Triangle8 with SSSE3 and similar builds, there is no time penalty; the speed is approximately the same as for Triangle4. I'll debug on the user's side (it needs time) and let you know.
2) It was unexpected for me that Triangle8 is not significantly faster! If possible, please explain why.
Thx for your help