Randomly slower cores

Simon_H_2

Hi,

I am seeing a severe performance imbalance on our Xeon Phi (5110P, latest MPSS): a few (1-3) random cores are 10-20% slower than all the other cores. I created a minimal example that demonstrates this (see below).

Observations:

  • happens for any number of threads, but less often with fewer threads (with just a few threads, often no core is slow, but within a few runs one run typically has a slow core)
  • if two threads run on the same core, typically both or neither of them are slow
  • the "slow" core is random and differs from run to run
  • the "niceness" of the process has no influence
  • moving most other Linux processes running on the MIC to core 0 (with taskset) and excluding core 0 from the test has no influence
  • because the example is so minimal, I can basically rule out any cache/memory access effects
  • manual thread pinning vs. automatic assignment makes no difference (I typically use KMP_AFFINITY=granularity=fine,scatter)
  • the slowdown is proportional to the total work, i.e., it is not a constant overhead (try varying the first parameter of the sample code)
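
One way to check the last observation is to compare the slow/fast cycle ratio at two different iteration counts. The sketch below is a hypothetical helper, not part of the test program: the names `Sample`, `slowdown_ratio` and `looks_proportional`, and the 2% tolerance, are assumptions made for illustration. If the slowdown scales with the work, the ratio stays roughly the same; a constant overhead would make the ratio shrink toward 1 as the work grows.

```cpp
#include <cmath>

// Hypothetical record: cycle counts of a typical (fast) thread and of a
// slow thread, both measured for the same iteration count.
struct Sample {
    double fast_cycles;
    double slow_cycles;
};

// Ratio of slow to fast; 1.15 means the slow thread needed 15% more cycles.
inline double slowdown_ratio(const Sample& s) {
    return s.slow_cycles / s.fast_cycles;
}

// True if the ratio is about the same for a small and a large run, i.e. the
// slowdown scales with the work instead of being a fixed cost.
// The default tolerance is an arbitrary assumption.
inline bool looks_proportional(const Sample& small_run, const Sample& large_run,
                               double tolerance = 0.02) {
    return std::fabs(slowdown_ratio(small_run) - slowdown_ratio(large_run)) <= tolerance;
}
```

The two samples would come from running the test with, say, a 10x different first parameter and picking the slowest and a typical thread from each run.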

By now I am quite perplexed.

In a parallel application with many equal threads the 20% slowdown will obviously transfer to all other threads at synchronization points, resulting in an overall 20% loss.
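
To make the synchronization argument concrete: at a barrier every thread waits for the slowest one, so the time per phase is the maximum over all threads, not the average. A minimal sketch, assuming the per-thread times have been collected into a vector (the function name `barrier_slowdown` is made up for illustration):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Relative extra time caused by waiting at a barrier for the slowest thread,
// compared with a perfectly balanced run (every thread taking the mean time).
// 0.2 means the phase takes 20% longer than the balanced ideal.
inline double barrier_slowdown(const std::vector<double>& thread_seconds) {
    if (thread_seconds.empty()) return 0.0;
    const double slowest =
        *std::max_element(thread_seconds.begin(), thread_seconds.end());
    const double mean =
        std::accumulate(thread_seconds.begin(), thread_seconds.end(), 0.0) /
        static_cast<double>(thread_seconds.size());
    if (mean <= 0.0) return 0.0;  // avoid dividing by zero on degenerate input
    return slowest / mean - 1.0;
}
```

With 59 threads where 58 take 1.0 s and one takes 1.2 s, the mean barely moves, so the result is just under 0.2: a single slow core costs almost its full slowdown for the whole application.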

Any hints? Thanks,

Simon

[cpp]

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <math.h>


// Read the CPU time stamp counter (cycle count) via the rdtsc instruction.
static __inline__ unsigned long getCC()
{
  unsigned a, d;
  asm volatile("rdtsc" : "=a" (a), "=d" (d));
  return ((unsigned long)a) | (((unsigned long)d) << 32);
}


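int main(int argc, char *argv[])
{
    if (argc < 3)
    {
        fprintf(stderr, "usage: %s <iterations> <threads>\n", argv[0]);
        return 1;
    }
    int repeat = atoi(argv[1]);    // iterations per thread
    int threads = atoi(argv[2]);   // number of OpenMP threads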

#pragma omp parallel num_threads(threads)
    {
        int id = omp_get_thread_num();
        //kmp_affinity_mask_t mask;
        //kmp_create_affinity_mask(&mask);
        //kmp_set_affinity_mask_proc(4*id+1, &mask);
        //kmp_set_affinity(&mask);


#pragma omp barrier

        double x = 1.0;
        unsigned long start = getCC();
        for(int r=0; r<repeat; r++)
        {
            x += sin(x);
        }
        unsigned long end = getCC();

#pragma omp barrier

        // cycles are converted to seconds with a hard-coded clock rate of 1052630000 Hz (about 1.053 GHz)
        printf("%02d x: %e cycles: %lu seconds: %lf\n", id, x, end-start, (double)(end-start)/1052630000);
    }

    return 0;
}

[/cpp]

Compile and run:

[bash]

# on the host:

icpc -openmp -mmic main.cc -o test.mic

# on the mic (parameters: iteration count, number of threads):

./test.mic 10240000 59

[/bash]
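
With 59 threads it is tedious to spot the slow ones by eye. The sketch below flags threads whose cycle count is noticeably above the median. It assumes the cycle counts have been collected into a vector indexed by thread id, which the test program above does not do (it only prints them); the function name `find_slow_threads` and the 5% default threshold are also assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the indices of all threads whose cycle count exceeds the median
// by more than `threshold` (0.05 = 5%). For an even number of threads the
// upper of the two middle values is used as the median.
inline std::vector<std::size_t> find_slow_threads(
    const std::vector<unsigned long>& cycles, double threshold = 0.05) {
    std::vector<std::size_t> slow;
    if (cycles.empty()) return slow;

    // Work on a copy so the caller's thread order is preserved.
    std::vector<unsigned long> sorted(cycles);
    std::nth_element(sorted.begin(), sorted.begin() + sorted.size() / 2,
                     sorted.end());
    const double median = static_cast<double>(sorted[sorted.size() / 2]);

    for (std::size_t i = 0; i < cycles.size(); ++i) {
        if (static_cast<double>(cycles[i]) > median * (1.0 + threshold)) {
            slow.push_back(i);
        }
    }
    return slow;
}
```

Comparing against the median rather than the mean keeps the reference value stable even when a few threads are 10-20% slower.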


Florian_R_

I can confirm your result on a 60-core card with B1 stepping (stepping ID 3). Our software stack is:

OS Version: 2.6.32-220.el6.x86_64
Driver Version: 6720-13
MPSS Version: 2.1.6720-13

Maybe this issue has been fixed in a more recent HW revision, which is why most people don't have that problem.
