Hi.
I found a critical performance issue while programming on Xeon Phi.
To post the issue in this forum, I've simplified the code involved, as follows.
The code below is therefore just a reproduction case, not the real code.
#include <thread>
#include <chrono>
#include <math.h>
#include <wchar.h>

void threadProcedure()
{
    double *pArray = new double[3 * 100000 * 1000];
    auto begin = std::chrono::system_clock::now();
    for(int i = 0; i < 3; ++i)
    {
        for(int j = 0; j < 100000; ++j)
        {
            #pragma omp parallel for num_threads(2)
            for(int k = 0; k < 1000; ++k)
                pArray[(i * 100000 + j) * 1000 + k] = exp(i * 0.1 + j * 0.00001 + k * 0.001);
        }
    }
    wprintf(L"%lld ms\n", (long long)std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - begin).count());
    delete [] pArray;
}

int main()
{
    for(int i = 1; i <= 10; ++i)
    {
        wprintf(L"Thread %d Run Time : ", i);
        std::thread *pThread = new std::thread(threadProcedure);
        pThread->join();
        delete pThread;
    }
}
icpc -mmic -std=c++11 -openmp -o slowdown slowdown.cpp
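For reference, the binary can then be run natively on the coprocessor, e.g. by copying it over and launching it through ssh (assuming the card is reachable as mic0 and the MIC build of libiomp5.so has also been copied to the card):

scp slowdown mic0:/tmp
ssh mic0 "LD_LIBRARY_PATH=/tmp /tmp/slowdown"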
Thread 1 Run Time : 5277 ms
Thread 2 Run Time : 10307 ms
Thread 3 Run Time : 5273 ms
Thread 4 Run Time : 10480 ms
Thread 5 Run Time : 5290 ms
Thread 6 Run Time : 10358 ms
Thread 7 Run Time : 5294 ms
Thread 8 Run Time : 10503 ms
Thread 9 Run Time : 5290 ms
Thread 10 Run Time : 10357 ms
The even-numbered runs are always about twice as slow.
This issue doesn't occur when I compile the above code with the Microsoft Visual C++ compiler.
Therefore, my guess is that it's a bug in the Intel C++ compiler's OpenMP library.
Interesting; normally one does not spawn pthreads and then have each spawned thread create an OpenMP parallel region. Were any affinity-related environment variables set?
What's likely happening is that each spawned thread instantiates a new OpenMP thread team in round-robin logical-processor order (the sketch after the list below can be used to verify the placement):
Iteration 1: 0, 1, 2 (two compute threads on core 0)
Iteration 2: 0, 3, 4 (one compute thread on core 0, the other on core 1)
Iteration 3: 0, 5, 6 (two on core 1)
Iteration 4: 0, 7, 8 (one on core 1, the other on core 2)
Iteration 5: 0, 9, 10 (two on core 2)
...
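One way to check is to have each spawned thread report the logical CPU its OpenMP workers land on. A minimal sketch, assuming a Linux target and built with the same icpc -mmic -std=c++11 -openmp flags (sched_getcpu() is Linux-specific; on the coprocessor, logical CPUs 1-4 map to core 0, 5-8 to core 1, and so on):

#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux-specific
#include <stdio.h>
#include <thread>

void report()
{
    #pragma omp parallel num_threads(2)
    {
        // Each OpenMP worker prints the logical CPU it is running on.
        printf("OpenMP thread %d on logical CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
}

int main()
{
    // Mimic the reproducer: a fresh pthread per iteration, each
    // creating its own OpenMP parallel region.
    for(int i = 0; i < 4; ++i)
    {
        std::thread t(report);
        t.join();
        printf("---\n");
    }
}

If the round-robin placement above is what is happening, the reported CPU pairs will alternate between same-core and split-core from one iteration to the next.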
The Xeon Phi makes the best use of a core only when at least two threads run on that core (a single thread cannot issue instructions on back-to-back cycles).
Either do not keep spawning pthreads (keep them around for subsequent use), or stay entirely within OpenMP, as in the sketch below.
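For the reproducer above, the second option is just a matter of dropping the pthread and calling the procedure directly, so that the main thread's OpenMP worker team is created once and reused; a sketch (threadProcedure is the function from the original post):

int main()
{
    for(int i = 1; i <= 10; ++i)
    {
        wprintf(L"Run %d Time : ", i);
        // No pthread is spawned: the main thread owns a single OpenMP
        // team that the runtime reuses for every parallel region.
        threadProcedure();
    }
}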
Jim Dempsey
I wasn't aware that the Microsoft Visual C++ compiler generated code for the coprocessor. Are you sure that is what you were using, and not the Intel compiler running under the Microsoft Visual Studio IDE? Or, if it was the Microsoft compiler, was the code it generated actually running on the coprocessor?
Thanks for the comments.
I've fixed the OpenMP library code myself, so the issue no longer occurs.
The cause of the issue isn't thread affinity.
Of course, I didn't compile the code with the Microsoft Visual C++ compiler for Xeon Phi.
The issue also occurs on a Linux Xeon host.
So, to check whether it occurs with the Microsoft Visual C++ OpenMP library, I compiled the code with the Microsoft Visual C++ compiler for Xeon (NOT for Xeon Phi).
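For reference, the two host-side checks would be along these lines (flags illustrative; the Microsoft build needs a compiler version with C++11 <thread> support):

icpc -std=c++11 -openmp -o slowdown slowdown.cpp   # Linux Xeon host, Intel OpenMP: slowdown reproduces
cl /EHsc /O2 /openmp slowdown.cpp                  # Windows Xeon host, Microsoft OpenMP: no slowdown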
"I've fixed the OpenMP library code myself, so the issue no longer occurs."
Unless you intend to maintain your own version of the OpenMP runtime, perhaps you could submit a patch to the LLVM runtime (http://openmp.llvm.org) so that it can be reviewed and potentially incorporated into the Intel OpenMP runtime too?
Thanks
The symptoms you observed depend on whether the two OpenMP worker threads were created on the same core (faster) or on separate cores (slower). Contrary to your statement, the issue is related to the core placement of the threads, and that in turn is thread affinity.
Please note that your "fix" is not a complete fix. Consider the following scenarios:
a) You add two more worker pthreads that you start prior to launching your OpenMP parallel regions. If, as you claim, affinity is not involved, the configuration would be the same as in your failing situation: the two OpenMP worker threads would span cores and thus run slowly.
b) You change the number of OpenMP worker threads. In this case you might find it advantageous to have two threads per core, or three threads per core, but not necessarily four threads per core (think of expanding your number of pthreads, each with a two-thread OpenMP parallel region).
On a Xeon Phi with 60/61 cores you would not run a main that spawns a single pthread which in turn spawns a two-thread OpenMP parallel region; your configuration would grow much larger than this. To make full use of the processor, you really need to be aware of thread placement.
Because the cores are of different kinds (the Xeon Phi core is in-order, the Xeon core is out-of-order), you cannot expect the two behaviors to be the same without considering the placement of the threads.
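With the Intel runtime, placement can be controlled from the environment before the binary is launched; a sketch (the right mapping depends on how many pthreads and how many OpenMP workers the real application uses):

export KMP_AFFINITY=verbose,granularity=fine,compact
./slowdown

Here granularity=fine binds each worker to a single hardware thread, compact fills a core before moving to the next, and verbose prints the resulting bindings at startup.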
Jim Dempsey
What was your change to the OpenMP code?
