Software Archive
Read-only legacy content

Randomly slower cores

Simon_H_2
Beginner
4,536 Views

Hi,

I am seeing a severe performance imbalance on our Xeon Phi (5110P, latest MPSS): in each run, a few (1-3) random CPU cores are 10-20% slower than all the other cores. I created a minimal example which demonstrates this (see below).

Observations:

  • it happens for any number of threads, though less often with fewer threads (with just a few cores it often happens that no core is slow in a given run, but within a few runs one typically does have a slow core)
  • if two threads run on a core, typically either both or neither of them are slow
  • the "slow" core is random, different in every run
  • the "niceness" of the process has no influence
  • moving most other Linux processes running on the mic to core 0 (with taskset) and excluding core 0 from the test: no influence
  • given how minimal the example is, I can basically rule out any cache/memory access effects
  • manual thread pinning vs. automatic assignment has no influence (I typically use KMP_AFFINITY=granularity=fine,scatter)
  • the slowdown scales with the total work, i.e., it is not a constant overhead (try varying the first parameter of the sample code)

By now I am quite perplexed.

In a parallel application with many equal threads, the 20% slowdown will obviously propagate to all other threads at synchronization points, resulting in an overall 20% loss.

Any hints? Thanks,

Simon

[cpp]

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <math.h>


/* Read the CPU's time-stamp counter (RDTSC) and return it as a 64-bit cycle count. */
static __inline__ unsigned long getCC()
{
  unsigned a, d;
  asm volatile("rdtsc" : "=a" (a), "=d" (d));   /* low 32 bits in EAX, high 32 bits in EDX */
  return ((unsigned long)a) | (((unsigned long)d) << 32);
}


int main(int argc, char *argv[])
{
    int repeat = atoi(argv[1]);    /* iteration count */
    int threads = atoi(argv[2]);   /* number of OpenMP threads */

#pragma omp parallel num_threads(threads)
    {
        int id = omp_get_thread_num();
        //kmp_affinity_mask_t mask;
        //kmp_create_affinity_mask(&mask);
        //kmp_set_affinity_mask_proc(4*id+1, &mask);
        //kmp_set_affinity(&mask);


#pragma omp barrier

        double x = 1.0;
        unsigned long start = getCC();
        for(int r=0; r<repeat; r++)    /* purely core-local, compute-bound work */
        {
            x += sin(x);
        }
        unsigned long end = getCC();

#pragma omp barrier

        printf("%02d x: %e cycles: %ld seconds: %lf\n", id, x, end-start, (double)(end-start)/1052630000);
    }

    return 0;
}

[/cpp]

Compile and run:

[bash]

# on the host:

icpc -openmp -mmic main.cc -o test.mic

# on the mic (parameters: iteration count, number of threads):

./test.mic 10240000 59

[/bash]


0 Kudos
21 Replies
TimP
Honored Contributor III
4,144 Views

In the past, in some such cases, it has been possible to identify events that occasionally tie up a thread by analyzing with VTune.

As you are running a current MPSS, you should be able to control which cores your applications use with KMP_PLACE_THREADS.  KMP_AFFINITY still has an influence within the allocation you set with KMP_PLACE_THREADS.
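For example (a sketch, assuming the <cores>c,<threads-per-core>t form of KMP_PLACE_THREADS; adjust the counts to your card and combine with KMP_AFFINITY as needed):

[bash]

# restrict the OpenMP runtime to 59 cores with 1 hardware thread per core,
# then let the affinity setting place threads within that subset
export KMP_PLACE_THREADS=59c,1t
export KMP_AFFINITY=granularity=fine,scatter
./test.mic 10240000 59

[/bash]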

0 Kudos
Simon_H_2
Beginner
4,144 Views

A quick test with VTune reveals "ramoops" as a possible culprit. It consumes some time on all cores, but significantly more on the slow core (causing many L1 misses).

For now it is unclear to me what exactly this means, or how I can work around it.

0 Kudos
TimP
Honored Contributor III
4,144 Views

In current MPSS, the thread running the idle process and some other OS functions (including VTune data collection) is labeled ramoops.  This should be core 0, which you should avoid for most applications (though not because of the idle process).  We had a discussion and the developers agreed to attempt to label the idle process better for future revisions.  This is just my rough summary; I'm not an expert on these things.

Accounting for the idle process leads to frequent confusion when using VTune.  It may appear to count as OpenMP waits, for example.  On the other hand, what you are looking for might turn out to be OpenMP events such as races or false sharing.

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,144 Views

To your printf add a printout of "start". This will show if the slowdown is due to skew or slower running.
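For example, extending the existing printf (just a sketch):

printf("%02d start: %lu cycles: %lu\n", id, start, end-start);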

Jim Dempsey

0 Kudos
Simon_H_2
Beginner
4,144 Views

TimP (Intel) wrote:

In current MPSS, the thread running the idle process and some other OS functions (including VTune data collection) is labeled ramoops.  This should be core 0, which you should avoid for most applications (though not because of the idle process). [...]

  • I *do* avoid core 0
  • the slow core is random, it may be core 0, but just as well core 41
  • there are slow cores even if fewer than 60 cores are used, i.e., other cores should be available for the OS functions you mention

TimP (Intel) wrote:

Accounting for the idle process leads to frequent confusion when using VTune.  It may appear to count as OpenMP waits, for example.  On the other hand, what you are looking for might turn out to be OpenMP events such as races or false sharing.

Please refer to my code example: all threads are independent; there is no shared data or synchronization inside the code I am timing.

jimdempseyatthecove wrote:

To your printf add a printout of "start". This will show if the slowdown is due to skew or slower running.

Jim Dempsey

Are you implying that the clock frequency might be different on some cores?

0 Kudos
DubitoCogito
Novice
4,144 Views

I ran the code and only noticed the issue when not running the same number of threads on every core. That will cause a load imbalance for a compute-bound application. I suspect the "slow" cores are not actually random. You could easily check by looking at the thread assignments, e.g., by setting KMP_AFFINITY to something like "compact,verbose". I also have a few comments.

(a) You could also do something like the following to place threads on specific cores.

KMP_AFFINITY=granularity=fine,proclist=[0-119:1,123,125],explicit

It offers more flexibility, because KMP_PLACE_THREADS assigns the cores' hardware threads to application threads in ascending order.

(b) You need not use inline assembly to read the TSC; you can simply call the __rdtsc() intrinsic:

#include <stdint.h>     /* uint64_t */
#include <immintrin.h>  /* __rdtsc() */

uint64_t tsc_start;

tsc_start = __rdtsc();

(c) The MIC uses a time-multiplexed, round-robin thread scheduler. Consequently, each thread will read slightly different values for the start and stop variables, and you will get different thread run times, although the difference should be negligible.

(d) The core frequency will not change unless a thermal event has occurred, in which case the frequencies of all cores are throttled.

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,144 Views
jimdempseyatthecove wrote:

To your printf add a printout of "start". This will show if the slowdown is due to skew or slower running.

Jim Dempsey

>>Are you implying that the clock frequency might be different on some cores?

No, I am saying that you are assuming that, immediately after the barrier, all threads start at ~the same time.
Whereas I am saying that by printing out the value of the start time (RDTSC), you will know whether all threads start at ~the same time (or with a 20% skew).

Another post seems to state that each core has a different RDTSC; I would like confirmation of this.

Jim Dempsey

0 Kudos
DubitoCogito
Novice
4,144 Views

From the Intel MIC documentation:

Each core has a 64-bit counter that monotonically increments the time-stamp counter every clock cycle and reset to 0 whenever the processor is reset. Having multiple counters in the coprocessor increases the complexity to synchronize all of them when time measurements are required on different cores.

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,144 Views

After looking at your code snippet again, the skew effect can be ignored in your printout.

You may, though, have an issue with the HT siblings of each core not getting equal access to the FPU (AVX sincos).

Jim Dempsey

0 Kudos
Simon_H_2
Beginner
4,144 Views

DubitoCogito wrote:

I ran the code and only noticed the issue when not running the same number of threads on every core. That will cause a load imbalance for a compute-bound application.

Obviously. But that is NOT the issue: on my MIC I see the effect even with exactly one thread per core (or zero threads on some cores).

DubitoCogito wrote:

I suspect the "slow" cores are not actually random. You could easily check by looking at the thread assignments, e.g., by setting KMP_AFFINITY to something like "compact,verbose". I also have a few comments.

The "random" distribution is not uniform, that is right. (For another code) I saw a slight dip around core 30, and maybe one around 0. Additionally, consequent runs can be correlated, in the sense that occasionally I observe in runs (i, i+1, i+2, ..) that cores (x, x-1, x-2, ...) are slow. I observed such correlations up to lengths of about 10.

DubitoCogito wrote:

(a) You could also do something like the following to place threads on specific cores.

KMP_AFFINITY=granularity=fine,proclist=[0-119:1,123,125],explicit

It offers more flexibility, because KMP_PLACE_THREADS assigns the cores' hardware threads to application threads in ascending order.

As I wrote in my original post, manual pinning to ensure that exactly one thread runs per core (and that core 0 is excluded) does not change anything. So unless you have a specific suggestion as to which cores I should use/exclude, I would not know what to do with KMP_AFFINITY.

Is there anybody who can confirm my observation by running my code? In some cases more than one run might be necessary, because the phenomenon does not occur with 100% probability.

Thanks

Simon

0 Kudos
McCalpinJohn
Honored Contributor III
4,145 Views

I have run on several of the Xeon Phi SE10P coprocessors on the Stampede system at TACC and see no evidence of variability using this code.   I then added some additional instrumentation to see the skew in the starting and stopping times, and although the time required to come out of the barrier was variable, the values were not large enough to significantly perturb the overall performance.  

Using "10240000" as the repeat count argument, I measured on various core counts and ran 100 trials for each core count.  In each trial I recorded the *maximum* relative added time for thread completion (maxcycles-mincycles)/mincycles, then computed statistics of that maximum relative added time across the 100 trials.

For KMP_AFFINITY="scatter,granularity=fine" and 60 or 61 threads, the maximum slowdowns averaged in the 0.3% range, with worst-case values in the 0.6% range.  

Increasing the number of threads did increase the maximum slowdowns slightly, but actually increased the consistency of the values.  With 2 threads per core (122 threads), the maximum slowdown across the threads averaged 1.8% over the 100 runs.  With 3 threads per core (183 threads), the maximum slowdowns averaged 1.5% across the 100 runs.  Even with 4 threads per core (244 threads), the maximum slowdowns averaged less than 1.4% across the 100 runs.  The worst-case slowdown across all 300 of the multiple-thread-per-core cases was 2.65%.

I saw the same behavior on revision B0 and B1 silicon, though I only tested a few nodes -- not all 6880 that we have installed....

I also saw no variation in performance when I ran with KMP_AFFINITY="compact,granularity=fine".

The only time I saw variations in performance of the ~20% scale was when I ran with uneven numbers of threads across the physical cores.

0 Kudos
Simon_H_2
Beginner
4,144 Views

John, thank you very much for your tests!

I see the following possible differences/conclusions:

  • we have a 5110P, while you used an SE10P (60 cores vs. 61); maybe this happens only with the 5110P?
  • there might be a software issue with our machine
  • our silicon is buggy? I cannot check from here which revision it is, but I think it was produced in autumn (edit: it is B1)

Any further ideas? Is there someone with access to a 5110P who could run a test?

Simon

0 Kudos
McCalpinJohn
Honored Contributor III
4,144 Views

It might be useful to check your MIC power and frequency settings using the "micsmc" application on the host.

On Stampede we have most of the options disabled.

In /etc/sysconfig/mic/default.conf we include the line:
PowerManagement "cpufreq_off;corec6_off;pc3_off;pc6_off"

This results in the following output from "micsmc"

[root@c557-201.stampede]# ./micsmc --pwrstatus
mic0 (Power Status):
   cpufreq power management feature: .. disabled
   corec6 power management feature: ... disabled
   pc3 power management feature: ...... disabled
   pc6 power management feature: ...... disabled


Of course we cannot (and don't want to) disable the thermal protection mechanisms that will reduce performance when triggered.
I don't know the best way to look for them, but "micsmc --temp" should tell you how well your chip is being cooled.  The numbers I saw on a moderately busy node looked like they were well into the "safe" range:

[root@c557-201.stampede]# ./micsmc --temp
mic0 (temp):
   Cpu Temp: ................ 44.00 C
   Memory Temp: ............. 31.00 C
   Fan-In Temp: ............. 23.00 C
   Fan-Out Temp: ............ 32.00 C
   Core Rail Temp: .......... 35.00 C
   Uncore Rail Temp: ........ 35.00 C
   Memory Rail Temp: ........ 35.00 C

0 Kudos
Simon_H_2
Beginner
4,144 Views

Regarding power management: I have now verified that it makes no difference; I get the same effect with and without these features. The temperatures in our system are also very low.

An example of my results:

[plain]

$ ./slow 1024000 50 | sort -n -k5
45 x: 3.333333e-02 cycles: 16411879 seconds: 0.015591
11 x: 3.333333e-02 cycles: 16413658 seconds: 0.015593
39 x: 3.333333e-02 cycles: 16413722 seconds: 0.015593
22 x: 3.333333e-02 cycles: 16413964 seconds: 0.015593
35 x: 3.333333e-02 cycles: 16414469 seconds: 0.015594
43 x: 3.333333e-02 cycles: 16414593 seconds: 0.015594
44 x: 3.333333e-02 cycles: 16414708 seconds: 0.015594
36 x: 3.333333e-02 cycles: 16414810 seconds: 0.015594
05 x: 3.333333e-02 cycles: 16415006 seconds: 0.015594
24 x: 3.333333e-02 cycles: 16415184 seconds: 0.015594
10 x: 3.333333e-02 cycles: 16415754 seconds: 0.015595
28 x: 3.333333e-02 cycles: 16415948 seconds: 0.015595
08 x: 3.333333e-02 cycles: 16416397 seconds: 0.015596
03 x: 3.333333e-02 cycles: 16416704 seconds: 0.015596
15 x: 3.333333e-02 cycles: 16417163 seconds: 0.015596
09 x: 3.333333e-02 cycles: 16417632 seconds: 0.015597
29 x: 3.333333e-02 cycles: 16417948 seconds: 0.015597
32 x: 3.333333e-02 cycles: 16418003 seconds: 0.015597
49 x: 3.333333e-02 cycles: 16418031 seconds: 0.015597
18 x: 3.333333e-02 cycles: 16418133 seconds: 0.015597
13 x: 3.333333e-02 cycles: 16418359 seconds: 0.015597
46 x: 3.333333e-02 cycles: 16418533 seconds: 0.015598
41 x: 3.333333e-02 cycles: 16418614 seconds: 0.015598
47 x: 3.333333e-02 cycles: 16418842 seconds: 0.015598
12 x: 3.333333e-02 cycles: 16418940 seconds: 0.015598
19 x: 3.333333e-02 cycles: 16419204 seconds: 0.015598
14 x: 3.333333e-02 cycles: 16419362 seconds: 0.015598
16 x: 3.333333e-02 cycles: 16419518 seconds: 0.015599
04 x: 3.333333e-02 cycles: 16419931 seconds: 0.015599
07 x: 3.333333e-02 cycles: 16419964 seconds: 0.015599
21 x: 3.333333e-02 cycles: 16420452 seconds: 0.015599
48 x: 3.333333e-02 cycles: 16420526 seconds: 0.015600
31 x: 3.333333e-02 cycles: 16421080 seconds: 0.015600
42 x: 3.333333e-02 cycles: 16421804 seconds: 0.015601
01 x: 3.333333e-02 cycles: 16422007 seconds: 0.015601
37 x: 3.333333e-02 cycles: 16422098 seconds: 0.015601
26 x: 3.333333e-02 cycles: 16422498 seconds: 0.015601
17 x: 3.333333e-02 cycles: 16422529 seconds: 0.015601
27 x: 3.333333e-02 cycles: 16422827 seconds: 0.015602
23 x: 3.333333e-02 cycles: 16423332 seconds: 0.015602
20 x: 3.333333e-02 cycles: 16425107 seconds: 0.015604
38 x: 3.333333e-02 cycles: 16425441 seconds: 0.015604
30 x: 3.333333e-02 cycles: 16425965 seconds: 0.015605
40 x: 3.333333e-02 cycles: 16426336 seconds: 0.015605
06 x: 3.333333e-02 cycles: 16426545 seconds: 0.015605
02 x: 3.333333e-02 cycles: 16428388 seconds: 0.015607
25 x: 3.333333e-02 cycles: 16456341 seconds: 0.015634
00 x: 3.333333e-02 cycles: 16480017 seconds: 0.015656
34 x: 3.333333e-02 cycles: 16627974 seconds: 0.015797
33 x: 3.333333e-02 cycles: 18057162 seconds: 0.017154

[/plain]

In this case thread 33 is the slowest; which thread is slow varies semi-randomly when I rerun the binary. Each thread is pinned to logical CPU "4*id+1" (roughly as sketched below), but automatic assignment gives the same result.
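For reference, the explicit pinning is essentially the commented-out kmp_* affinity code from the original example (a sketch; these calls are Intel OpenMP runtime extensions):

[cpp]

int id = omp_get_thread_num();
kmp_affinity_mask_t mask;
kmp_create_affinity_mask(&mask);
kmp_set_affinity_mask_proc(4*id+1, &mask);   /* bind this thread to logical CPU 4*id+1 */
kmp_set_affinity(&mask);

[/cpp]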

0 Kudos
McCalpinJohn
Honored Contributor III
4,144 Views

This looks like a case where performance counter interval measurements would be very helpful...
Dumb question for the Intel folks: Does VTune allow you to specify instrumentation points in the code for interval measurements, or is it all sampling based?

Normally I would use inline calls to RDPMC (at the same spots where you currently call RDTSC), but the kernel on the Xeon Phi does not set CR4.PCE, so user mode code cannot call RDPMC directly.   I have a loadable kernel module to correct this problem, but I have not yet figured out how to compile it for the kernel that runs on the MIC.

The good news is that you can run your code as long as you want, so you can make the run time long enough that the overhead of reading the counters through the /dev/msr* interfaces is negligible.  An inner loop time of 1 second is probably enough.   I pulled the MSR reading and writing code out of "rdmsr.c" and "wrmsr.c" from msrtools-1.2 and modified it to conform to the unusual naming of the device drivers in the MIC kernel.  (Most versions of Linux use /dev/cpu/<m>/msr as the device driver for logical cpu "m", while the version on the MIC uses /dev/msr<m>.)

If you run as root, you can have each thread open the /dev/msr<m> device corresponding to its logical processor number, program the relevant performance counter control registers, then read the performance counts at the same places where you currently have RDTSC calls.   There are only two counters per logical processor, but it should only take a few tests to determine whether there are any correlations between the slow cores and performance counter anomalies.
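As an illustration of the read side only, here is a minimal sketch (not the code described above): the /dev/msr<N> naming follows the description in the previous paragraph, and the register number would have to be taken from the Knights Corner performance-monitoring documentation.

[cpp]

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Read one 64-bit MSR from logical cpu 'cpu' through the MIC's /dev/msr<cpu> device.
   The msr driver uses the file offset as the register address.  Must run as root.
   Returns 0 on success, -1 on failure. */
static int read_msr(int cpu, uint32_t reg, uint64_t *value)
{
    char path[64];
    int fd;
    ssize_t n;

    snprintf(path, sizeof(path), "/dev/msr%d", cpu);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, value, sizeof(*value), reg);
    close(fd);
    return (n == (ssize_t)sizeof(*value)) ? 0 : -1;
}

[/cpp]

Each thread would call something like this around its timed loop, at the same spots as the RDTSC calls, after the counter control registers have been programmed (writing them works the same way via pwrite).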

I would probably just put a loop in the code and have it run through all of the available events -- I think there are fewer than 100.  Then plug the results into Matlab, R, Excel, or whatever you use for data analysis and look for correlations.  At the very least it should be immediately clear whether the CPU frequency is being modulated, and you might find evidence of more subtle and interesting phenomena.

0 Kudos
Simon_H_2
Beginner
4,144 Views

I managed to get access to a different Xeon Phi system (ES2-P1750, stepping B0, different system administration). Result: I see the same problem as on our own system.

Can anyone comment?

0 Kudos
McCalpinJohn
Honored Contributor III
4,144 Views

I was working on this issue on our systems yesterday and noticed that the run times above are extremely short.


The posted results from 2013-05-15 only run for 0.0156 seconds, with an outlier of 0.0172 seconds.   It does not take much O/S interference to create a 1.5 millisecond delay on one core.


The first example (from 2013-05-08) uses an iteration count that is 10x larger, so it should still have a relatively short run time of 0.156 seconds.   A 20% slowdown would be about 0.03 seconds, or about 30 million cycles.


In my testing on some Xeon Phi SE10P nodes on stampede.tacc.utexas.edu yesterday I found that thread completion delays in the range of 10M to 30M cycles are quite common.  These are a problem with 0.15 second run times, but are much less troublesome when I increase the iteration count by another factor of 10, so that the measured section takes between 1-2 seconds (depending on the number of threads/core).

Using the first 60 cores with 1,2,3,4 threads per core, I saw a continuous distribution of thread completion times that varied from run to run.   The fastest threads completed in the same amount of time in each case, but each case had different distributions of slowdowns across the processors. 

For each value of threads per core, I ran 300 test cases (1..60 cores, 5 trials per core count).  Across these tests:

  • For 1 thread per core the typical runtime was 1.00 seconds and the largest slowdown was under 0.7%.
  • For 2 threads per core the typical runtime was 1.27 seconds and the largest slowdown was just under 2.5%.
  • For 3 threads per core the typical runtime was 1.59 seconds and the largest slowdown was just under 1.9%.
  • For 4 threads per core the typical runtime was 1.91 seconds and the largest slowdown was just over 2.1%.

The median values of max slowdown were under 1/2 of these maximum values, but plotting performance vs physical core number showed similar patterns of variability -- cores 16-44 tended to run slower, but by different amounts on different runs.

These slowdowns are larger than I like -- certainly orders of magnitude larger than the variability implied by the OpenMP synchronization overheads (which are in the 10k-80k cycle range -- much less than the 10M-30M cycle variations in completion time).  If I had time to pursue this I would probably start by looking for OS-related overhead. 

0 Kudos
Lawrence_M_Intel
Employee
4,144 Views

Regarding the CR4.PCE issue, you should be able to just run something like "perf stat -e r0016,r001a -a", which will program those events on all cpus and also enable CR4.PCE.

You can also use vtune:

amplxe-runss -count -target=mic-0 -d 0 -event-config CPU_CLK_UNHALTED,INSTRUCTIONS_EXECUTED

And then:

amplxe-runss -C stop -target=mic-0 -r r001

For example. Hope this helps.

0 Kudos
Simon_H_2
Beginner
4,144 Views

John D. McCalpin wrote:

I was working on this issue on our systems yesterday and noticed that the run times above are extremely short.

The posted results from 2013-05-15 only run for 0.0156 seconds, with an outlier of 0.0172 seconds.   It does not take much O/S interference to create a 1.5 millisecond delay on one core.

You are right, this is short; however, on my system the slowdown also happens for longer runs. If I run longer (102400000 repetitions), I observe typical slowdowns between 4% and 8%. This is for a single thread per core. I left the OS core empty, and I also tried using only (e.g.) 50 cores, with the same result.

I have another code where the typical slowdown is more than 10%, but it is much more complex, so I cannot post it here. One reason for the larger slowdown could be that this code accesses the cache frequently. This might indicate that the slowdown is caused by some other kind of thread running on the same core.

0 Kudos
McCalpinJohn
Honored Contributor III
3,806 Views

An issue to be aware of on Xeon Phi is that the L2 cache is inclusive of both the L1 Instruction and L1 Data caches.  This means that whenever a line is chosen as victim from the L2, that line must also be evicted from the L1 Instruction cache and L1 Data cache.

It is not difficult to imagine a case in which each thread is working primarily on different address ranges and so causing evictions of different L2 congruence classes.  If any of these threads are consistently evicting lines that map to the same congruence class(es) as your "hot" code, those threads will certainly run slower than the rest.

Although it is not trivial to work around this issue, it should be easy to identify -- just look for cases where you have anomalously increased values of *both* "CODE_CACHE_MISS" (Event 0x0E, Umask 0x00) and either "L2_CODE_READ_MISS_CACHE_FILL" (Event 0xF0, Umask 0x10) or "L2_CODE_READ_MISS_MEM_FILL" (Event 0xF5, Umask 0x10).   For threaded codes, I would expect most L2 code read misses to be satisfied by other L2 caches, so I would start by looking at CODE_CACHE_MISS and L2_CODE_READ_MISS_CACHE_FILL (since there are only 2 counters per logical processor).
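Following the perf invocation shown earlier in this thread, these two events could presumably be requested as raw codes; assuming the raw format is r<umask><event>, CODE_CACHE_MISS would be r000e and L2_CODE_READ_MISS_CACHE_FILL would be r10f0 (a sketch, to be checked against the event list on your MPSS):

[bash]

# assumed raw encodings: r000e = event 0x0E/umask 0x00, r10f0 = event 0xF0/umask 0x10
perf stat -e r000e,r10f0 ./test.mic 102400000 59

[/bash]

Multiplying any excess fill count by the ~250-275 cycles per fill quoted below and comparing against the extra cycles reported for the slow thread gives a quick consistency check.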

Each L2_CODE_READ_MISS_*_FILL event should take an average of ~250-275 cycles on a lightly loaded system, so you can check whether the observed slowdown is consistent with the stalls that you would expect from any excess of these events.

0 Kudos