
Simon_H_2 · ‎05-08-2013

Hi,

I am seeing a severe performance imbalance on our Xeon Phi (5110P, latest MPSS): a few (1-3) random CPU cores are 10-20% slower than all the other cores. The minimal example below demonstrates this.

Observations:

It happens for any number of threads, but less often with fewer threads (with just a few threads, often no core is slow, but after a few runs one run typically shows a slow core).
If two threads run on the same core, typically both or neither of them are slow.
The "slow" core is random and differs from run to run.
The "niceness" of the process has no influence.
Moving most other Linux processes running on the MIC to core 0 (with taskset) and excluding core 0 from the test has no influence.
Because the example is minimal, I can basically rule out cache/memory access effects.
Manual thread pinning vs. automatic assignment has no influence (I typically use KMP_AFFINITY=granularity=fine,scatter).
The slowdown scales with the total work, i.e., it is not a constant overhead (try varying the first parameter of the sample code).

By now I am quite perplexed.

In a parallel application with many equal threads, the 20% slowdown of one core propagates to all other threads at every synchronization point, so the whole application loses about 20%.

Any hints? Thanks,

Simon

[cpp]

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <math.h>

// Read the time-stamp counter (low half in eax, high half in edx).
static __inline__ unsigned long getCC()
{
    unsigned a, d;
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

int main(int argc, char *argv[])
{
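    if (argc < 3)
    {
        fprintf(stderr, "usage: %s <repeat> <threads>\n", argv[0]);
        return 1;
    }
    int repeat = atoi(argv[1]);   // iterations per thread
    int threads = atoi(argv[2]);  // number of OpenMP threads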

#pragma omp parallel num_threads(threads)
    {
        int id = omp_get_thread_num();
        //kmp_affinity_mask_t mask;
        //kmp_create_affinity_mask(&mask);
        //kmp_set_affinity_mask_proc(4*id+1, &mask);
        //kmp_set_affinity(&mask);

#pragma omp barrier

        double x = 1.0;
        unsigned long start = getCC();
        for(int r=0; r<repeat; r++)
        {
            x += sin(x);
        }
        unsigned long end = getCC();

#pragma omp barrier

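        // 1052630000 is the hard-coded clock rate in Hz (about 1.05 GHz) used to convert cycles to seconds.
        printf("%02d x: %e cycles: %lu seconds: %lf\n", id, x, end-start, (double)(end-start)/1052630000);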
    }

    return 0;
}

[/cpp]

Compile and run:

[bash]

# on the host:

icpc -openmp -mmic main.cc -o test.mic

# on the mic (parameters: iteration count, number of threads):

./test.mic 10240000 59

[/bash]
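To pick out the slow threads from the per-thread cycle counts the program prints, one option is to compare each count against the median. The following is only a sketch: the helper names `median` and `slow_threads` and the tolerance parameter are assumptions for illustration and do not appear in the test program.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of a copy of the input; returns 0 for empty input.
double median(std::vector<double> v)
{
    if (v.empty())
        return 0.0;
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Hypothetical helper: indices of threads whose cycle count exceeds the
// median by more than `tolerance` (0.05 means 5% above the median).
std::vector<std::size_t> slow_threads(const std::vector<double>& cycles, double tolerance)
{
    std::vector<std::size_t> slow;
    const double limit = median(cycles) * (1.0 + tolerance);
    for (std::size_t i = 0; i < cycles.size(); ++i)
    {
        if (cycles[i] > limit)
            slow.push_back(i);
    }
    return slow;
}
```

The median is used rather than the mean so that the slow threads themselves do not shift the reference value. With a tolerance of 0.05, a thread that is 10-20% slower than the rest is flagged, while ordinary run-to-run jitter of a few percent is not.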

Florian_R_ · ‎05-29-2013

I can confirm your result on our card: the stepping is B1 (stepping ID 3) and it has 60 cores. Our software stack is:

OS Version: 2.6.32-220.el6.x86_64
Driver Version : 6720-13
MPSS Version : 2.1.6720-13

Maybe this issue has been fixed in a more recent HW revision, which is why most people don't have that problem.
