The i9-7940X runs my thread-heavy application SLOWLY: total execution time is 70-80% longer than on the i7-8700 and i7-6700.
The "traditional" single-threaded CPU benchmarks like Mandelbrot fractal and prime number factorization run only 8-10% slower on the i9 than the i7 (as expected!). A thread-heavy (with little or no synchronization) CPU-bound workload such as raytracing also shows the expected (small) performance difference.
I wrote a script to analyze the differences in time spent per stack frame. The results showed that frames like `__pthread_mutex_lock`, `pthread_cond_timedwait`, and `futex_wait` were consuming SECONDS of additional execution time per thread, which adds up pretty quickly across an entire run.
I then wrote a test case using pthread to just spin up a bunch of threads with a mutex -- there is a 2x difference in runtime with the i9 vs the i7 using my mutex test case. My test case (lots of threads and synchronization) is fairly representative of the kind of workflow done by my application.
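For illustration, a mutex test of the kind described above could look roughly like the following. This is a minimal sketch and not the original test case: the names `shared_t`, `worker`, and `run_mutex_test`, as well as the thread and iteration counts, are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical shared state for this sketch (not from the original test). */
typedef struct {
    pthread_mutex_t lock;
    long counter;
    long iters;
} shared_t;

/* Each thread repeatedly takes the mutex and bumps a shared counter. */
static void *worker(void *arg) {
    shared_t *s = arg;
    for (long i = 0; i < s->iters; i++) {
        pthread_mutex_lock(&s->lock);
        s->counter++;
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/* Runs nthreads workers, each doing iters lock/unlock pairs.
 * Returns the final counter value, or -1 on setup failure. */
long run_mutex_test(int nthreads, long iters) {
    assert(nthreads > 0 && iters >= 0);
    shared_t s = { .counter = 0, .iters = iters };
    pthread_t *tids = malloc(sizeof *tids * (size_t)nthreads);
    if (tids == NULL)
        return -1;
    if (pthread_mutex_init(&s.lock, NULL) != 0) {
        free(tids);
        return -1;
    }
    int started = 0;
    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, worker, &s) != 0)
            break;
    }
    /* Join only the threads that were actually created. */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    pthread_mutex_destroy(&s.lock);
    free(tids);
    return started == nthreads ? s.counter : -1;
}
```

Calling `run_mutex_test` from a small `main`, building with `-pthread`, and running the binary under `perf stat` or `perf record` on each machine should give a comparable cycle count for the lock path.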
`perf` shows that the hotspot is in `__pthread_mutex_lock`, where the i9 spends 2x as many cycles as the i7, particularly _near_ the x86 `lock` instruction and the loop after it (possibly _at_ it, but I haven't been able to attribute cycle counts to individual instructions, even at a higher sampling frequency). This result makes sense given the differential stack traces I measured while running my application and all of its subprocesses.
The other hardware event metrics (cache misses, branch mispredictions, page faults, etc.) look uniform across both CPUs, and the pthread mutex fast path doesn't invoke any system calls (it stays in user space unless the lock is contended). Context switch time is actually SLOWER on the i7 (2173.8 ns/ctxsw) than on the i9 (1546.9 ns/ctxsw).
The analogous futex test case also shows a 3x difference in cycle counts on the i9 vs. the i7, in the `mov` that reads the register set by the immediately preceding `lock` instruction (suggesting a pipeline stall and further ruling out system call overhead). The analogous spinlock test case shows a similar 3x difference in cycle count and execution time (and uses both the `lock` and `pause` instructions in its hotspot).
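A spinlock variant of the same test could be sketched as below. This assumes the POSIX `pthread_spin_lock` API; the original test may have used a different spinlock, and `spin_shared_t`, `spin_worker`, and `run_spin_test` are hypothetical names introduced for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical shared state for this sketch (not from the original test). */
typedef struct {
    pthread_spinlock_t lock;
    long counter;
    long iters;
} spin_shared_t;

/* Each thread busy-waits for the lock instead of sleeping in the kernel. */
static void *spin_worker(void *arg) {
    spin_shared_t *s = arg;
    for (long i = 0; i < s->iters; i++) {
        pthread_spin_lock(&s->lock);
        s->counter++;
        pthread_spin_unlock(&s->lock);
    }
    return NULL;
}

/* Runs nthreads workers, each doing iters lock/unlock pairs.
 * Returns the final counter value, or -1 on setup failure. */
long run_spin_test(int nthreads, long iters) {
    assert(nthreads > 0 && iters >= 0);
    spin_shared_t s = { .counter = 0, .iters = iters };
    pthread_t *tids = malloc(sizeof *tids * (size_t)nthreads);
    if (tids == NULL)
        return -1;
    if (pthread_spin_init(&s.lock, PTHREAD_PROCESS_PRIVATE) != 0) {
        free(tids);
        return -1;
    }
    int started = 0;
    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, spin_worker, &s) != 0)
            break;
    }
    /* Join only the threads that were actually created. */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    pthread_spin_destroy(&s.lock);
    free(tids);
    return started == nthreads ? s.counter : -1;
}
```

Because a spinlock never sleeps in the kernel, a slowdown that shows up here as well would be hard to attribute to system call or context switch overhead.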
Test case UBUNTU BIONIC (Linux cmoyes-dt-02 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux):
No, I already tested with `nopti` and it only accounts for around 5% of the slowdown. Both the i7 and the i9 benefit equally from it, so it doesn't explain the CPU difference.