I have an application that performs a large number of rtcPointQueryV() calls over moderately-sized triangle meshes. I am seeing significant differences in performance depending on the hardware it runs on. I'm using Embree 3.8.0 and ISPC 1.12.0.
On my i9 MacBookPro, these jobs execute reliably in a handful of seconds or less (usually much less). Perfectly acceptable performance.
On less capable hardware (e.g. CPUs not reporting avx2 support), the same binary executing the same job can take well over a minute. I certainly expected some slowdown but a >10X drop seems excessive (but maybe it isn't???). And it isn't consistent -- some jobs are reasonably performant while others seem to get lost somewhere in the rtcPointQuery() calls (some of these jobs make 10,000+ calls).
In some cases, dialing the ISPC compiler optimizations down to -O0 actually significantly improved performance, but not in all cases.
I've tried changing the ISPC --target but it didn't seem to make a significant difference. My last debugging iteration targeted avx only.
I'm trying to figure out if this is an ISPC problem, an embree problem, or a _me_ problem. Any recommendations on next steps I should take? Or am I just expecting too much out of limited hardware?
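One way to narrow down where the time goes is to time each query call individually and compare the median against the slowest call. The following is only a minimal sketch under stated assumptions: `CallStats`, `summarize`, and `timeCalls` are hypothetical helpers, not part of Embree or ISPC, and the `query` callable stands in for whatever wraps the real point-query call in the application.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Embree): summary of per-call latencies.
struct CallStats {
    double medianUs;  // upper median latency, in microseconds
    double maxUs;     // slowest single call, in microseconds
    double totalUs;   // sum over all calls, in microseconds
};

// Reduce a non-empty list of per-call latencies (microseconds)
// to median, maximum, and total.
inline CallStats summarize(std::vector<double> samplesUs) {
    assert(!samplesUs.empty());
    std::sort(samplesUs.begin(), samplesUs.end());
    CallStats s{};
    s.medianUs = samplesUs[samplesUs.size() / 2];
    s.maxUs = samplesUs.back();
    for (double v : samplesUs) s.totalUs += v;
    return s;
}

// Time `count` invocations of `query(i)` one by one with a monotonic clock.
// `query` is an assumption: it stands in for the application's own wrapper
// around the point-query call for the i-th query point.
template <typename Query>
CallStats timeCalls(std::size_t count, Query&& query) {
    using Clock = std::chrono::steady_clock;
    std::vector<double> samplesUs;
    samplesUs.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const auto t0 = Clock::now();
        query(i);
        const auto t1 = Clock::now();
        samplesUs.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return summarize(std::move(samplesUs));
}
```

If the maximum is orders of magnitude above the median, a few calls are stalling rather than every call being uniformly slower, which points away from a pure instruction-set slowdown.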
I agree, a 10X drop is suspicious. Disabling ISPC optimizations should not improve performance either. I'll take a look into it. Can you share some more details about the slower hardware?
The jobs are run as AWS Lambdas, where /proc/cpuinfo reports:
INFO: CPUINFO:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) Processor @ 2.50GHz
stepping : 4
microcode : 0x1
cpu MHz : 2500.012
cache size : 33792 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 5000.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) Processor @ 2.50GHz
stepping : 4
microcode : 0x1
cpu MHz : 2500.012
cache size : 33792 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 5000.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Embree logging reports:
Embree Ray Tracing Kernels 3.8.0 ()
  Compiler : GCC 7.3.1 20180712 (Red Hat 7.3.1-6)
  Build : Release
  Platform : Linux (64bit)
  CPU : Unknown CPU (GenuineIntel)
    Threads : 2
    ISA : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND
    Targets : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI
    MXCSR : FTZ=1, DAZ=1
  Config
    Threads : default
    ISA : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND
    Targets : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI (supported)
              SSE2 SSE4.2 AVX AVX2 (compile time enabled)
    Features: intersection_filter
    Tasking : TBB2019.9 TBB_header_interface_11009 TBB_lib_interface_11009
general:
  build threads = 0
  build user threads = 0
  start_threads = 0
  affinity = 0
  frequency_level = simd256
  hugepages = enabled
  verbosity = 3
  cache_size = 134.218 MB
  max_spatial_split_replications = 1.2
triangles:
  accel = default
  builder = default
  traverser = default
motion blur triangles:
  accel = default
  builder = default
  traverser = default
quads:
  accel = default
  builder = default
  traverser = default
motion blur quads:
  accel = default
  builder = default
  traverser = default
line segments:
  accel = default
  builder = default
  traverser = default
motion blur line segments:
  accel = default
  builder = default
  traverser = default
hair:
  accel = default
  builder = default
  traverser = default
motion blur hair:
  accel = default
  builder = default
  traverser = default
subdivision surfaces:
  accel = default
grids:
  accel = default
  builder = default
motion blur grids:
  accel = default
  builder = default
object_accel:
  min_leaf_size = 1
  max_leaf_size = 1
object_accel_mb:
  min_leaf_size = 1
  max_leaf_size = 1
Thanks in advance...
After further experimentation, it appears this blog post explains a lot of what I've been seeing: https://engineering.opsgenie.com/how-does-proportional-cpu-allocation-work-with-aws-lambda-41cd44da3cac
Thanks for the response.
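A quick way to check whether a job is being starved of CPU rather than running slow code is to compare wall-clock time with the CPU time the process actually received. The sketch below is a rough illustration under stated assumptions, not a definitive diagnostic: `TimeSplit`, `measure`, and `cpuShare` are hypothetical names, and it relies on `std::clock()` reporting process CPU time, which holds on Linux but not on every platform.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical helper: elapsed wall-clock time versus CPU time consumed.
struct TimeSplit {
    double wallSeconds;  // elapsed real time
    double cpuSeconds;   // CPU time charged to the whole process
};

// Run `work()` once and record both clocks around it.
// `work` is an assumption: it stands in for one complete job.
template <typename Work>
TimeSplit measure(Work&& work) {
    const auto wall0 = std::chrono::steady_clock::now();
    const std::clock_t cpu0 = std::clock();
    work();
    const std::clock_t cpu1 = std::clock();
    const auto wall1 = std::chrono::steady_clock::now();

    TimeSplit t{};
    t.wallSeconds = std::chrono::duration<double>(wall1 - wall0).count();
    t.cpuSeconds = static_cast<double>(cpu1 - cpu0) / CLOCKS_PER_SEC;
    return t;
}

// Fraction of the elapsed time during which the process was on a CPU.
// Because std::clock() sums over all threads, a job using N busy threads
// can report a value close to N rather than 1.
inline double cpuShare(const TimeSplit& t) {
    assert(t.wallSeconds >= 0.0);
    return t.wallSeconds > 0.0 ? t.cpuSeconds / t.wallSeconds : 0.0;
}
```

For a compute-bound job, a share well below the number of busy threads suggests the process spent much of the elapsed time descheduled, which would be consistent with a restricted CPU allocation rather than with a problem in the query code itself.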