Hi all,
I'm new to OpenMP. I'm trying to implement a program that runs on a Xeon Phi card. The program consists of a master thread and worker threads.
The workers maintain hash tables, one each, and the master sends commands to the workers to insert new elements, find elements and delete elements.
I've looked through some documentation but there's an abundance of constructs and I got a bit lost.
I thought I'd be able to do something like this:
    #pragma offload target(mic)
    {
        // Shared message-passing object declarations
        #pragma omp parallel num_threads(NUM_THREADS) shared(<shared objects>)
        {
            if (omp_get_thread_num() == 0) {
                // Master thread:
                // do a predefined number of iterations; on each one,
                // construct a command and send it to all workers,
                // then wait for the worker responses and process them
            } else {
                // Worker thread:
                for (;;) {
                    // wait for a new command and process it
                    // return a response to the master
                }
            }
        }
    }
Every thread spends all its time looping inside its branch of the if. It's not very neat, but I thought it would do. Unfortunately, it doesn't: I'm having trouble with the code I've written, so I was wondering whether there's a more natural way to write this using OpenMP.
Any help is much appreciated :)
Thanks!
You are trying to use OpenMP like pthreads.
    #pragma offload target(mic)
    {
        // Shared message-passing object declarations
        for (int i = 0; i < NumberOfIterations; ++i) {
            // construct a command
            ...
            #pragma omp parallel num_threads(NUM_THREADS) shared(<shared objects>)
            {
                // process a slice of the command based on the OpenMP thread number
            }
        }
    }
The above is but one way to handle it.
If you can overlap your construct commands, consider using #pragma omp task
Jim Dempsey
Thanks for the reply.
It seems your solution has new threads created and destroyed on every iteration. I have a performance requirement, so in order to minimize memory accesses I need to bind threads to cores. Seems to me like this would not be possible with your code. Every iteration will incur many cache misses and memory accesses.
I think it would be possible to pre-construct all the commands I wish the threads to execute in their lifetime and have every thread iterate over all commands, but I am trying to find a more natural, memory-efficient way to do this.
Also, I can't overlap or reorder the commands. Every command affects the subsequent commands.
Thanks for the help :)
Then consider:
    #pragma offload target(mic)
    {
        // Shared message-passing object declarations
        #pragma omp parallel num_threads(NUM_THREADS) shared(<shared objects>)
        {
            #pragma omp master
            {
                for (int i = 0; i < NumberOfIterations; ++i) {
                    // construct a command (for all worker threads desired)
                    ...
                    // parcel out portions of the command to workers
                    for (int j = 0; j < nWorkerTasks; ++j) {
                        #pragma omp task [optional task clauses here]
                        {
                            // task to perform 1/nWorkerTasks of the work
                            ...
                        } // end task
                    } // end for (int j=
                    // if required, wait for all tasks to complete
                    #pragma omp taskwait
                } // end for (int i=
            } // end master
        } // end parallel
    }
Jim Dempsey
Then
    #pragma offload target(mic)
    {
        // Shared message-passing object declarations
        volatile int Phase = -1;            // or use an atomic type
        volatile int myPhase[NUM_THREADS];  // or use an atomic type
        for (int i = 0; i < NUM_THREADS; ++i)
            myPhase[i] = 0;
        #pragma omp parallel num_threads(NUM_THREADS) shared(<shared objects>)
        {
            int myThread = omp_get_thread_num(); // done once
            // all threads iterate
            for (int i = 0; i < NumberOfIterations; ++i) {
                if (myThread == 0) {
                    // master thread: wait for all workers to finish the prior phase
                    for (int j = 0; j < NUM_THREADS; ++j)
                        while (myPhase[j] <= Phase)
                            WAIT_A_BIT(); // either _mm_pause() or one of the delays on MIC
                    // construct a command
                    ...
                    Phase = Phase + 1;
                } // if (myThread == 0)
                while (myPhase[myThread] != Phase)
                    WAIT_A_BIT(); // either _mm_pause() or one of the delays on MIC
                // do work
                ...
                myPhase[myThread] = myPhase[myThread] + 1;
            } // for (int i=
            // wait for worker response and process it
        } // parallel
    }
Jim Dempsey
Thanks again!
I'll try your suggestions later today.
Meanwhile, would you mind explaining why my approach is wrong? I indeed was trying to mimic pthreads behavior by having all threads run in an infinite loop and passing messages through shared memory. All threads have the same code to run and I use the if to ensure one behaves as a master and the rest as workers. I implemented a signaling mechanism so that the master and the workers can signal each other when the new command is ready and when the results are ready. The signaling is done so:
    #pragma offload target(mic)
    {
        int master_flag = 0;
        int worker_flag = 0;
        struct Command cmd;
        struct Result result;
        #pragma omp parallel shared(master_flag, worker_flag, cmd, result)
        {
            if (omp_get_thread_num() == 0) {
                int local_master_flag = 0;
                int local_worker_flag = 0;
                for (;;) {
                    // prepare command
                    #pragma omp flush(cmd)
                    // Signal worker:
                    master_flag = !local_master_flag;
                    local_master_flag = master_flag;
                    #pragma omp flush(master_flag)
                    // Wait for signal from worker:
                    #pragma omp flush(worker_flag)
                    while (worker_flag == local_worker_flag) {
                        #pragma omp flush(worker_flag)
                    }
                    local_worker_flag = worker_flag;
                } // for
            } // if
            else {
                int local_master_flag = 0;
                int local_worker_flag = 0;
                for (;;) {
                    // Wait for signal from master:
                    #pragma omp flush(master_flag)
                    while (master_flag == local_master_flag) {
                        #pragma omp flush(master_flag)
                    }
                    local_master_flag = master_flag;
                    // process command
                    #pragma omp flush(result)
                    // Signal master:
                    worker_flag = !local_worker_flag;
                    local_worker_flag = worker_flag;
                    #pragma omp flush(worker_flag)
                } // for
            } // else
        } // parallel
    } // offload
For some odd reason, this mechanism makes sense to me. I thought this method would be good since it ties a thread to a core more easily, and it's the simplest thing I could think of at the time.
Thanks once again for all the help, it's greatly appreciated :)
The two toggle statements (master_flag = !local_master_flag; and worker_flag = !local_worker_flag;) should be using ~ not ! (even though ! works). If you want to use ! then define the flags as bool.
Your above code works with only one worker thread. As written, only one thread is doing anything of substance at any one time. Xeon Phi has 240 hardware threads (more or less, depending on the model).
Jim Dempsey
Yes, in my original code the flags are indeed bool. Also, at most only one worker at any given iteration returns a response.
Do you see something inherently wrong with using this thread structure?
Thanks :)
Your current thread structure (as illustrated in #6) is unsuitable: there is no parallelism, and it involves only two threads, each taking turns at meaningful work.
    Main Thread          Worker Thread
    loop:                loop:
    prepare command      wait for command
    issue command        wait for command
    wait for done        receive command
    wait for done        process command
    ...                  ...
    wait for done        process command
    wait for done        signal command done
    update flag
    goto loop            goto loop
If the #6 sample code is not what you intend to do, then please do not illustrate what you are not intending to do. This is counter-productive.
Jim Dempsey
The code in #6 is the code I wrote to implement my work.
I think there is parallelism, though. Consider a large number of threads. One of them constructs the command while the rest wait for the signal in the busy-wait while loop on master_flag. When the signal arrives, all of them access the command, which is shared, and do some work. The nature of the work is such that at most one worker will need to return a result into the result object, which is also shared (forgot to mention this detail earlier, sorry).
Am I missing something?
I'll rewrite it today to a more conventional structure and report back.
Much obliged :)
The problem you have is that as each of your worker threads completes its work, it signals the master thread by setting the global variable worker_flag to the next state. This means the master thread assumes all workers have completed as soon as the first worker completes (not when the last worker completes).
My post #5 has an array equivalent to your worker_flag, but it differs in that there is one flag per worker, and thus the master thread can verify when each, and more importantly when all, of the workers have finished.
Additionally, note that the "master thread" also participates in performing the work.
Jim Dempsey