Hi,
I am seeing a severe performance imbalance on our Xeon Phi (5110P, latest MPSS): a few (1-3) random cores are 10-20% slower than all the other cores. I created a minimal example that demonstrates this (see below).
Observations:
- It happens for any number of threads, but less often with fewer threads (with only a few cores in use, a run often has no slow core, but after a few runs one typically shows up).
- If two threads run on a core, typically both or neither of them are slow.
- The "slow" core is random and different in every run.
- The "niceness" of the process has no influence.
- Moving most other Linux processes on the MIC to core 0 (with taskset) and excluding core 0 from the test has no influence.
- Because the example is so minimal, I can basically rule out cache/memory access effects.
- Manual thread pinning vs. automatic assignment makes no difference (I typically use KMP_AFFINITY=granularity=fine,scatter).
- The slowdown is proportional to the total work, i.e., it is not a constant overhead (try varying the first parameter of the sample code).
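To make "slow" concrete when comparing runs, the per-thread cycle counts can be checked against their median. This is only a minimal sketch: the helper names `median_cycles` and `find_slow_threads` and the 5% default tolerance are assumptions for illustration, not part of the test program.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the per-thread cycle counts (upper median for even sizes).
// The vector is taken by value because nth_element reorders it.
double median_cycles(std::vector<double> cycles)
{
    assert(!cycles.empty());
    const std::size_t mid = cycles.size() / 2;
    std::nth_element(cycles.begin(), cycles.begin() + mid, cycles.end());
    return cycles[mid];
}

// Returns the ids of all threads whose cycle count exceeds the median by
// more than `tolerance` (0.05 means 5%). The default is an arbitrary
// assumption; with a 10-20% slowdown any value below 0.10 should work.
std::vector<int> find_slow_threads(const std::vector<double>& cycles,
                                   double tolerance = 0.05)
{
    const double limit = median_cycles(cycles) * (1.0 + tolerance);
    std::vector<int> slow;
    for (std::size_t id = 0; id < cycles.size(); ++id) {
        if (cycles[id] > limit) {
            slow.push_back(static_cast<int>(id));
        }
    }
    return slow;
}
```

The median is used instead of the mean so that the few slow threads do not shift the reference value themselves.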
By now I am quite perplexed.
In a parallel application with many equally loaded threads, the 20% slowdown will obviously propagate to all other threads at synchronization points, resulting in an overall loss of about 20%.
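The effect at a synchronization point can be sketched as follows, assuming every thread does the same work and all of them wait at a barrier. The function names `step_time` and `idle_fraction` are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Duration of one barrier-synchronized step: all threads wait for the
// slowest one, so the step takes as long as the maximum per-thread time.
double step_time(const std::vector<double>& per_thread_seconds)
{
    assert(!per_thread_seconds.empty());
    return *std::max_element(per_thread_seconds.begin(),
                             per_thread_seconds.end());
}

// Fraction of the total thread time (threads * step time) that is spent
// waiting at the barrier instead of computing.
double idle_fraction(const std::vector<double>& per_thread_seconds)
{
    const double t = step_time(per_thread_seconds);
    double busy = 0.0;
    for (double s : per_thread_seconds) {
        busy += s;
    }
    return 1.0 - busy / (t * per_thread_seconds.size());
}
```

With 58 threads that need 1.0 s and a single thread that needs 1.2 s, `step_time` returns 1.2, so the whole step runs 20% longer and roughly 16% of the available thread time is spent waiting.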
Any hints? Thanks,
Simon
[cpp]
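#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <math.h>

// Timestamp counter frequency in Hz, used to convert cycles to seconds.
static const double TSC_HZ = 1052630000.0;

// Read the timestamp counter (rdtsc returns the low half in eax, the high half in edx).
static __inline__ unsigned long getCC()
{
    unsigned a, d;
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

int main(int argc, char *argv[])
{
    if (argc < 3)
    {
        fprintf(stderr, "usage: %s <iterations> <threads>\n", argv[0]);
        return 1;
    }
    int repeat = atoi(argv[1]);
    int threads = atoi(argv[2]);
    #pragma omp parallel num_threads(threads)
    {
        int id = omp_get_thread_num();
        // Optional manual pinning (makes no difference to the result):
        //kmp_affinity_mask_t mask;
        //kmp_create_affinity_mask(&mask);
        //kmp_set_affinity_mask_proc(4*id+1, &mask);
        //kmp_set_affinity(&mask);
        #pragma omp barrier
        double x = 1.0;
        unsigned long start = getCC();
        for(int r=0; r<repeat; r++)
        {
            x += sin(x);
        }
        unsigned long end = getCC();
        #pragma omp barrier
        // x is printed so that the compiler cannot optimize the loop away.
        printf("%02d x: %e cycles: %lu seconds: %lf\n", id, x, end-start, (double)(end-start)/TSC_HZ);
    }
    return 0;
}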
[/cpp]
Compile and run:
[bash]
# on the host:
icpc -openmp -mmic main.cc -o test.mic
# on the mic (parameters: iteration count, number of threads):
./test.mic 10240000 59
[/bash]
I can confirm your result on our card, which is stepping B1 (stepping ID 3) with 60 cores. Our software stack is:
OS Version: 2.6.32-220.el6.x86_64
Driver Version : 6720-13
MPSS Version : 2.1.6720-13
Maybe this issue has been fixed in a more recent hardware revision, which would explain why most people don't see the problem.