Scalability issue with fully parallel code

Jean-Philippe_H_ — Fri, 18 Oct 2013 09:42:22 GMT

Hi there!

I've been running some experiments on Xeon Phi recently and I'm hitting a serious scalability issue. The idea is to run fully parallel code with no memory accesses on a growing number of cores, with every thread executing exactly the same number of operations. The workload per thread is fixed, so we should expect a constant execution time. Here is the code we are running:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int i = 0;
    struct timespec start, end;
    uint64_t total;
    FILE *fd = stderr;

    // Make sure the OpenMP threads are started before we start our time calculation
    #pragma omp parallel
    {
        fprintf (stdout, "parallel %d\n", omp_get_thread_num ());
    }

    // Start the experiment
    clock_gettime (CLOCK_REALTIME, &start);
    #pragma omp parallel
    {
        for ( i = 0 ; i < 1024*1024*128 ; i++ )
        {
            asm volatile(
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                :
                :
                :"%eax"
            );
        }
    }
    // Stop the experiment
    clock_gettime (CLOCK_REALTIME, &end);

    // Elapsed time in nanoseconds
    total = (end.tv_sec * 1e9 + end.tv_nsec)
          - (start.tv_sec * 1e9 + start.tv_nsec);

    fprintf (fd, "time = %llu\n", (unsigned long long) total);
    return 0;
}

Here is some info about our environment:

MPSS version: mpss_gold_update_3-2.1.6720-19 (released: September 10 2013)
KMP_AFFINITY=scatter: this avoids possible hardware stalls due to Hyper-Threading contention. We want our code to be fully parallel.
The number of OpenMP threads is set via the OMP_NUM_THREADS environment variable.

Here are the results we got
For 4 cores, in avg = 21.09s (with 25 runs)
For 8 cores, in avg = 43.29s (with 25 runs)
For 12 cores, in avg = 65.37s (with 25 runs)
For 16 cores, in avg = 87.24s (with 25 runs)
For 20 cores, in avg = 109.95s (with 25 runs)
For 24 cores, in avg = 132.18s (with 25 runs)
For 28 cores, in avg = 152.79s (with 25 runs)
For 32 cores, in avg = 175.32s (with 24 runs)
For 36 cores, in avg = 196.47s (with 24 runs)
For 40 cores, in avg = 218.72s (with 24 runs)
For 44 cores, in avg = 241.10s (with 24 runs)
For 48 cores, in avg = 263.49s (with 24 runs)
For 52 cores, in avg = 285.33s (with 24 runs)
For 56 cores, in avg = 307.35s (with 24 runs)

We clearly see the lack of scalability here. So my question is: Are these numbers normal to you?

James_C_Intel2 — Fri, 18 Oct 2013 10:05:20 GMT

I think you should declare the variable "i" inside the parallel region. As your code stands, "i" is shared between all the threads, which is not what you want...

Jean-Philippe_H_ — Fri, 18 Oct 2013 10:42:44 GMT

It was indeed the problem. I thank you for your time and apologize for the inconvenience.

James_C_Intel2 — Fri, 18 Oct 2013 10:56:57 GMT

No problem at all. I'm glad the fix was that simple!

McCalpinJohn — Fri, 18 Oct 2013 18:48:12 GMT

I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private.

This mistake is common and it is unfortunate that the OpenMP syntax allows it.

TimP — Sat, 19 Oct 2013 00:20:25 GMT

If discussing "unfortunate" features, Cilk(tm) Plus allows a default shared cilk_for loop index for a .c file but not for a .cpp source file. This is probably considered too obvious to document, but still a point on which mistakes are easily made.

James_C_Intel2 — Mon, 21 Oct 2013 09:26:38 GMT

"I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private."

Indeed true, however the semantics would also have been completely different! (Sharing a fixed amount of work between the threads as against doing all the work in each thread).

My personal preference (even in non-OpenMP code) is to declare and initialise variables in C/C++ where they are first required unless they need a wider scope. That normally avoids the need to specify that they are "private" if/when you add OpenMP directives.

Jean-Philippe_H_ — Mon, 21 Oct 2013 11:02:36 GMT

John D. McCalpin wrote:

I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private.

This mistake is common and it is unfortunate that the OpenMP syntax allows it.

This is exactly what happened in this case. I use the "parallel for" directive a lot, and I automatically assumed the induction variable would be private here too. I won't make this mistake twice! Again, I am deeply sorry for this unfortunate, simple mistake.

jimdempseyatthecove — Mon, 21 Oct 2013 12:31:52 GMT

Another reason for using for(int i=... is that the compiler's optimizer then knows that "i" goes out of scope after the for statement, which means better opportunities for optimization (registerization) of i.

Jim Dempsey