Performance issues on older hardware

Mike_M_6 · 04-08-2020

I have an application that performs a large number of rtcPointQueryV() calls over moderately-sized triangle meshes. I am seeing significant differences in performance depending on the hardware it runs on. I'm using Embree 3.8.0 and ISPC 1.12.0.

On my i9 MacBookPro, these jobs execute reliably in a handful of seconds or less (usually much less). Perfectly acceptable performance.

On less capable hardware (e.g. CPUs not reporting avx2 support), the same binary executing the same job can take well over a minute. I certainly expected some slowdown, but a >10X drop seems excessive (but maybe it isn't???). And it isn't consistent -- some jobs are reasonably performant while others seem to get lost somewhere in the rtcPointQuery() calls (some of these jobs make 10,000+ calls).

In some cases, dialing the ISPC compiler optimizations down to -O0 actually significantly improved performance, but not in all cases.

I've tried changing the ISPC --target but it didn't seem to make a significant difference. My last debugging iteration targeted avx only.

I'm trying to figure out if this is an ISPC problem, an Embree problem, or a _me_ problem. Any recommendations on next steps I should take? Or am I just expecting too much out of limited hardware?
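
One way to narrow this down is to record how long each query call takes and check whether the slowdown is spread evenly over all calls or concentrated in a few long stalls. The following is a minimal Python sketch of that summary step only. It assumes the per-call durations (in seconds) have already been collected by the application, and the name `summarize_latencies` is hypothetical, not part of Embree or ISPC.

```python
import statistics


def summarize_latencies(seconds):
    """Summarize per-call durations (in seconds) to expose outliers.

    A median close to the max suggests uniformly slow calls; a max far
    above the median suggests a few stalls dominate the total.
    """
    ordered = sorted(seconds)
    n = len(ordered)
    if n == 0:
        raise ValueError("no samples")
    # Nearest-rank style index for the 99th percentile, in integer math.
    p99_index = min(n - 1, (99 * n) // 100)
    return {
        "count": n,
        "total": sum(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
        "max": ordered[-1],
    }
```

If the median is similar on both machines and only the p99 and max differ, the cause is more likely to be in how the process is scheduled than in the instruction set the kernels were compiled for.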

FlorianR_Intel · 04-09-2020

I agree, a 10X drop is suspicious. Disabling ISPC optimizations should not improve performance either. I'll take a look into it. Can you share some more details about the slower hardware?

Mike_M_6 · 04-09-2020

The jobs are run as AWS Lambdas, where /proc/cpuinfo reports:

INFO: CPUINFO: 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) Processor @ 2.50GHz
stepping	: 4
microcode	: 0x1
cpu MHz		: 2500.012
cache size	: 33792 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms smap xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 5000.02
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) Processor @ 2.50GHz
stepping	: 4
microcode	: 0x1
cpu MHz		: 2500.012
cache size	: 33792 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms smap xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 5000.02
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

Embree logging reports:

Embree Ray Tracing Kernels 3.8.0 ()
Compiler  : GCC 7.3.1 20180712 (Red Hat 7.3.1-6)
Build     : Release 
Platform  : Linux (64bit)
CPU       : Unknown CPU (GenuineIntel)
Threads  : 2
ISA      : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND 
Targets  : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI 
MXCSR    : FTZ=1, DAZ=1
Config
Threads : default
ISA     : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND 
Targets : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI  (supported)
SSE2 SSE4.2 AVX AVX2  (compile time enabled)
Features: intersection_filter 
Tasking : TBB2019.9 TBB_header_interface_11009 TBB_lib_interface_11009 
general:
build threads      = 0
build user threads = 0
start_threads      = 0
affinity           = 0
frequency_level    = simd256
hugepages          = enabled
verbosity          = 3
cache_size         = 134.218 MB
max_spatial_split_replications = 1.2
triangles:
accel              = default
builder            = default
traverser          = default
motion blur triangles:
accel              = default
builder            = default
traverser          = default
quads:
accel              = default
builder            = default
traverser          = default
motion blur quads:
accel              = default
builder            = default
traverser          = default
line segments:
accel              = default
builder            = default
traverser          = default
motion blur line segments:
accel              = default
builder            = default
traverser          = default
hair:
accel              = default
builder            = default
traverser          = default
motion blur hair:
accel              = default
builder            = default
traverser          = default
subdivision surfaces:
accel              = default
grids:
accel              = default
builder            = default
motion blur grids:
accel              = default
builder            = default
object_accel:
min_leaf_size      = 1
max_leaf_size      = 1
object_accel_mb:
min_leaf_size      = 1
max_leaf_size      = 1

Thanks in advance...
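
The flags lines in the /proc/cpuinfo output above list avx but not avx2, which matches the ISA line in the Embree log. For checking this on other hosts, here is a minimal Python sketch that parses cpuinfo text and reports a few SIMD flags. The function names `cpu_flags` and `simd_report` are hypothetical helpers, and the choice of flags to report is an assumption about what is relevant here.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags" field is of interest; it repeats per processor.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def simd_report(flags):
    """Map a few SIMD-related flag names to whether each is present."""
    wanted = ("sse4_2", "avx", "avx2", "fma", "avx512f")
    return {name: name in flags for name in wanted}
```

On Linux, passing the contents of /proc/cpuinfo to `cpu_flags` and the result to `simd_report` gives a quick per-host summary that can be logged next to the job timings.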

Mike_M_6 · 04-13-2020

After further experimentation, it appears this blog post explains a lot of what I've been seeing: https://engineering.opsgenie.com/how-does-proportional-cpu-allocation-work-with-aws-lambda-41cd44da3cac

Thanks for the response.
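
The linked post is about AWS Lambda allocating CPU in proportion to the memory configured for the function. A rough way to see this from inside a function is to compare the CPU time a busy loop receives with the wall-clock time that elapsed. The Python sketch below is only an illustration under that assumption. The name `cpu_share` is hypothetical, and the result is a coarse indicator, not a precise measurement of the allocation.

```python
import time


def cpu_share(wall_seconds=1.0):
    """Busy-loop for about wall_seconds; return CPU time / wall time.

    A single-threaded loop on an unthrottled core should give a ratio
    near 1. A clearly lower ratio suggests the process was given only
    a fraction of a core during the interval.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    counter = 0
    while time.perf_counter() - wall_start < wall_seconds:
        counter += 1  # trivial work to keep the core busy
    wall_elapsed = time.perf_counter() - wall_start
    cpu_elapsed = time.process_time() - cpu_start
    return cpu_elapsed / wall_elapsed
```

Logging `cpu_share()` at the start of a job, at a few different memory settings, should show whether the slow runs coincide with a reduced share of CPU time.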