
i9-7940X 14-core CPU running thread synchronization-heavy applications almost TWICE as SLOW as other CPUs

Charles_Moyes

The i9-7940X runs my thread-heavy application SLOWLY with a 70-80% slower total execution time than the i7-8700 and i7-6700.

The "traditional" single-threaded CPU benchmarks like Mandelbrot fractal and prime number factorization run only 8-10% slower on the i9 than the i7 (as expected!). A thread-heavy (with little or no synchronization) CPU-bound workload such as raytracing also shows the expected (small) performance difference.

I wrote a script to analyze the differences in time spent per stack frame. The results showed that functions like `__pthread_mutex_lock`, `pthread_cond_timedwait`, and `futex_wait` were consuming SECONDS of additional execution time per thread -- which adds up pretty quickly across an entire run.

I then wrote a test case using pthread to just spin up a bunch of threads with a mutex -- there is a 2x difference in runtime with the i9 vs the i7 using my mutex test case. My test case (lots of threads and synchronization) is fairly representative of the kind of workflow done by my application.
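A minimal sketch of how the runtime difference described above could be expressed as a per-operation cost, assuming a Linux/POSIX system with `clock_gettime`. The names `contended_t`, `worker`, and `mutex_ns_per_op` are hypothetical and not part of the original test case; the measured interval also includes thread creation and join, so the figure is only a rough average.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <time.h>

/* Shared state for one measurement (hypothetical helper type). */
typedef struct {
    pthread_mutex_t mutex;
    long counter;   /* protected by mutex */
    long iters;     /* lock/unlock cycles per thread */
} contended_t;

/* Each worker repeatedly takes the mutex, bumps the counter, and releases. */
static void *worker(void *arg) {
    contended_t *c = (contended_t *)arg;
    for (long i = 0; i < c->iters; i++) {
        pthread_mutex_lock(&c->mutex);
        c->counter++;
        pthread_mutex_unlock(&c->mutex);
    }
    return NULL;
}

/* Runs nthreads workers and returns the average wall-clock nanoseconds
 * per lock/increment/unlock cycle, or -1.0 on error. The timed interval
 * includes thread creation and join. The final counter is stored in
 * *final_count so the caller can check that no increments were lost. */
static double mutex_ns_per_op(int nthreads, long iters, long *final_count) {
    enum { MAX_THREADS = 64 };
    pthread_t tids[MAX_THREADS];
    contended_t c;
    struct timespec t0, t1;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS || iters < 1)
        return -1.0;
    c.counter = 0;
    c.iters = iters;
    if (pthread_mutex_init(&c.mutex, NULL) != 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, worker, &c) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    pthread_mutex_destroy(&c.mutex);
    *final_count = c.counter;
    if (started != nthreads)
        return -1.0;

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)nthreads * (double)iters);
}
```

Calling `mutex_ns_per_op(2, 100000, &n)` from a `main` on each machine and comparing the returned values would give a ns-per-cycle figure that is easier to compare across CPUs than total runtime; it would need to be linked with `-pthread`.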

`perf` shows that the performance hotspot is the i9 spending 2x as many cycles as the i7 in `__pthread_mutex_lock`, particularly _near_ the `lock` x86 instruction and the loop after it (possibly _at_ it, but I haven't been able to measure cycle counts at per-instruction granularity, even at a higher sampling frequency). This result makes sense given the differential stack traces I measured while running my application and all of its subprocesses.

The other hardware event metrics (cache misses, branch mispredictions, page faults, etc.) look uniform across both CPUs, and pthread mutex doesn't invoke any system calls on its fast path (that part lives only in user space). Context switch time is actually SLOWER on the i7 (2173.8 ns/ctxsw) than the i9 (1546.9 ns/ctxsw).

Trying the analogous futex test case also shows a 3x difference in cycle counts on the i9 vs. the i7 in the `mov` that accesses the register set by the immediately preceding `lock` instruction (suggesting a pipeline stall and further ruling out system call overhead). The analogous spinlock test case shows a similar 3x difference in cycle count and execution time (and uses both the `lock` and `pause` instructions in its hotspot).
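To make the spinlock observation above concrete, here is a minimal sketch of a user-space spinlock built from a C11 atomic compare-and-swap plus a `pause` hint, timed with two contending threads. It is an illustration only, not glibc's `pthread_spin_lock` implementation; the names `spin_state_t`, `spin_acquire`, `spin_release`, `cpu_relax`, and `spin_ns_per_op` are hypothetical. On x86 the compare-and-swap typically compiles to a `lock cmpxchg`, though that depends on the compiler.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Emit the x86 `pause` hint while spinning; no-op on other architectures. */
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)
#endif

typedef struct {
    atomic_int locked;  /* 0 = free, 1 = held */
    long counter;       /* protected by the spinlock */
    long iters;         /* acquire/release cycles per thread */
} spin_state_t;

/* Test-and-test-and-set: try the atomic CAS, then spin on plain loads. */
static void spin_acquire(atomic_int *l) {
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(
                l, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            cpu_relax();
    }
}

static void spin_release(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}

static void *spin_worker(void *arg) {
    spin_state_t *s = (spin_state_t *)arg;
    for (long i = 0; i < s->iters; i++) {
        spin_acquire(&s->locked);
        s->counter++;
        spin_release(&s->locked);
    }
    return NULL;
}

/* Two threads contend on one spinlock. Returns average wall-clock
 * nanoseconds per acquire/increment/release cycle, or -1.0 on error. */
static double spin_ns_per_op(long iters, long *final_count) {
    pthread_t a, b;
    spin_state_t s;
    struct timespec t0, t1;

    if (iters < 1)
        return -1.0;
    atomic_init(&s.locked, 0);
    s.counter = 0;
    s.iters = iters;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (pthread_create(&a, NULL, spin_worker, &s) != 0)
        return -1.0;
    if (pthread_create(&b, NULL, spin_worker, &s) != 0) {
        pthread_join(a, NULL);
        return -1.0;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *final_count = s.counter;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (2.0 * (double)iters);
}
```

Because the lock word bounces between the two cores on every handoff, the returned figure is dominated by the cost of the atomic read-modify-write and the cache-line transfer rather than by any kernel involvement, which is the part of the cost the measurements above point at.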

Test case UBUNTU BIONIC (Linux cmoyes-dt-02 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux):

 

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Define USE_SPINLOCK to test pthread spinlocks instead of mutexes. */
const int NUM_RUNS = 1000;   /* number of thread pairs created and joined */
const int NUM_ITER = 100000; /* lock/unlock cycles per thread */

#ifdef USE_SPINLOCK
pthread_spinlock_t spinlock;
#else
pthread_mutex_t mutex; /* initialized once in main() */
#endif

int counter = 0; /* shared counter protected by the lock */

/* Thread A: repeatedly take the lock, bump the counter, release. */
void *functionA(void *arg) {
   (void)arg;
   for (int i = 0; i < NUM_ITER; i++) {
#ifdef USE_SPINLOCK
      pthread_spin_lock(&spinlock);
#else
      pthread_mutex_lock(&mutex);
#endif
      counter++;
      //printf("A: Counter value: %d\n", counter);
#ifdef USE_SPINLOCK
      pthread_spin_unlock(&spinlock);
#else
      pthread_mutex_unlock(&mutex);
#endif
   }
   return NULL;
}

/* Thread B: same work as thread A, contending on the same lock. */
void *functionB(void *arg) {
   (void)arg;
   for (int i = 0; i < NUM_ITER; i++) {
#ifdef USE_SPINLOCK
      pthread_spin_lock(&spinlock);
#else
      pthread_mutex_lock(&mutex);
#endif
      counter++;
      //printf("B: Counter value: %d\n", counter);
#ifdef USE_SPINLOCK
      pthread_spin_unlock(&spinlock);
#else
      pthread_mutex_unlock(&mutex);
#endif
   }
   return NULL;
}

int main() {
#ifdef USE_SPINLOCK
   printf("Use spinlock\n");
   pthread_spin_init(&spinlock, 0);
#else
   printf("Use mutex\n");
   pthread_mutex_init(&mutex, NULL);
#endif

   for (int i = 0; i < NUM_RUNS; i++) {
      counter = 0;
      int rc1, rc2;
      pthread_t thread1, thread2;

      if ((rc1 = pthread_create(&thread1, NULL, functionA, NULL))) {
         printf("Thread creation failed: %d\n", rc1);
         return EXIT_FAILURE;
      }
      if ((rc2 = pthread_create(&thread2, NULL, functionB, NULL))) {
         printf("Thread creation failed: %d\n", rc2);
         return EXIT_FAILURE;
      }

      pthread_join(thread1, NULL);
      pthread_join(thread2, NULL);
   }
   printf("Done!\n");

#ifdef USE_SPINLOCK
   pthread_spin_destroy(&spinlock);
#else
   pthread_mutex_destroy(&mutex);
#endif

   return 0;
}
Richard_Nutman

At a guess, is this possibly caused by Spectre/Meltdown measures in the silicon?

Charles_Moyes

No, I already tested with `nopti` and it only accounts for around 5% of the slowdown. Both the i7 and the i9 benefit equally from it as well, so it doesn't explain away the CPU difference.

Chaze__Olivier

Hi Charles,

Did you find the reason by any chance, and moreover a fix?