Hi there!
I've been running some experiments on Xeon Phi recently and I'm hitting a serious scalability issue. The idea is to run fully parallel code with no memory accesses on a growing number of cores, with every thread executing exactly the same number of operations. Since the per-thread workload is fixed, we should expect a constant execution time. Here is the code we are running:
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int i = 0;
    struct timespec start, end;
    uint64_t total;
    FILE *fd = stderr;

    // Make sure the OpenMP threads are started before we start our time measurement
    #pragma omp parallel
    {
        fprintf (stdout, "parallel %d\n", omp_get_thread_num ());
    }

    // Start the experiment
    clock_gettime (CLOCK_REALTIME, &start);
    #pragma omp parallel
    {
        for ( i = 0 ; i < 1024*1024*128 ; i++ )
        {
            asm volatile(
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                :
                :
                : "%eax"
            );
        }
    }

    // Stop the experiment
    clock_gettime (CLOCK_REALTIME, &end);
    total = ((uint64_t) end.tv_sec * 1000000000ULL + end.tv_nsec)
          - ((uint64_t) start.tv_sec * 1000000000ULL + start.tv_nsec);
    fprintf (fd, "time = %llu\n", (unsigned long long) total);
    return 0;
}
Here is some info about our environment:
- MPSS version: mpss_gold_update_3-2.1.6720-19 (released September 10, 2013)
- KMP_AFFINITY=scatter: this avoids possible hardware stalls due to Hyper-Threading contention, since we want the code to be fully parallel.
- The number of OpenMP threads is set via the OMP_NUM_THREADS environment variable.
Here are the results we got
For 4 cores, in avg = 21.09s (with 25 runs)
For 8 cores, in avg = 43.29s (with 25 runs)
For 12 cores, in avg = 65.37s (with 25 runs)
For 16 cores, in avg = 87.24s (with 25 runs)
For 20 cores, in avg = 109.95s (with 25 runs)
For 24 cores, in avg = 132.18s (with 25 runs)
For 28 cores, in avg = 152.79s (with 25 runs)
For 32 cores, in avg = 175.32s (with 24 runs)
For 36 cores, in avg = 196.47s (with 24 runs)
For 40 cores, in avg = 218.72s (with 24 runs)
For 44 cores, in avg = 241.10s (with 24 runs)
For 48 cores, in avg = 263.49s (with 24 runs)
For 52 cores, in avg = 285.33s (with 24 runs)
For 56 cores, in avg = 307.35s (with 24 runs)
We clearly see the lack of scalability here. So my question is: do these numbers look normal to you?
Jp
I think you should declare the variable "i" inside the parallel region. The way your code is at the moment, it will be shared between the threads, which is not what you want...
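For illustration, here is a minimal sketch of that change applied to the timed region of the benchmark above (the asm block is elided); declaring "i" inside the parallel region gives each thread its own private copy:

    #pragma omp parallel
    {
        int i;   /* declared inside the parallel region: each thread has its own counter */
        for ( i = 0 ; i < 1024*1024*128 ; i++ )
        {
            /* ... same asm volatile block as above ... */
        }
    }

An equivalent alternative is to keep the outer declaration and add a private(i) clause to the parallel pragma.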
It was indeed the problem. I thank you for your time and apologize for the inconvenience.
No problem at all. I'm glad the fix was that simple!
I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private.
This mistake is common and it is unfortunate that the OpenMP syntax allows it.
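For illustration, a minimal sketch of that variant; with a combined "parallel for" the loop index is implicitly private, although the iterations are then divided among the threads rather than each thread executing the whole loop:

    #pragma omp parallel for
    for ( i = 0 ; i < 1024*1024*128 ; i++ )   /* i is implicitly private in a worksharing loop */
    {
        /* ... same asm volatile block as above ... */
    }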
While we are discussing "unfortunate" features: Cilk(tm) Plus allows a shared cilk_for loop index by default in a .c file but not in a .cpp source file. This is probably considered too obvious to document, but it is still a point on which mistakes are easily made.
"I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private."
Indeed true, but the semantics would also have been completely different! (Sharing a fixed amount of work among the threads, as opposed to each thread doing all of the work.)
My personal preference (even in non-OpenMP code) is to declare and initialise variables in C/C++ where they are first required, unless they need a wider scope. That normally avoids the need to specify that they are "private" if/when you add OpenMP directives, as in the sketch below.
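Applied to the original loop, that style would look like this sketch, which keeps the original semantics of every thread executing the full iteration count:

    #pragma omp parallel
    {
        for ( int i = 0 ; i < 1024*1024*128 ; i++ )   /* declared at first use: private by scope, no clause needed */
        {
            /* ... same asm volatile block as above ... */
        }
    }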
John D. McCalpin wrote:
I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private.
This mistake is common and it is unfortunate that the OpenMP syntax allows it.
This is exactly what happened in this case. I use the "parallel for" directive a lot, and I automatically assumed the induction variable would be private here as well. I won't fall for this twice! Again, I am sorry for this unfortunate, simple mistake.
Another reason for using for (int i = ... is that the compiler's optimizer then knows that "i" goes out of scope at the end of the for statement, which means better opportunities for optimization (registerization) of i.
Jim Dempsey