I wrote a simple advection equation solver using pthreads which works (and scales) correctly on the host processor. However, when I compile it with the -mmic flag and run it on the coprocessor (using micnativeloadex), it uses just one thread (I am hard-coding it to use 200). From what I know, the code should work as-is. Am I missing something here?
The code is quite long and dirty, but I am posting it in full for completeness.
```cpp
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

int64_t TimeInMicros()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec*1000000 + tv.tv_usec;
}

struct Grid{
    int nx;
    double *u, *u_new, *f, *res;
    double a, cfl, dx;
    double tf;
};

struct ThreadData{
    int tid;
    struct Grid *grid;
    int maxthreads;
    pthread_barrier_t *barr;
};

void *solver(void *args){
    struct ThreadData *td = (struct ThreadData *)args;
    int tid = td->tid;
    struct Grid *grid = td->grid;
    pthread_barrier_t *barr = td->barr;
    double *u = grid->u;
    double *u_new = grid->u_new;
    double cfl = grid->cfl;
    double a = grid->a;
    double dx = grid->dx;
    double tf = grid->tf;
    int nx = grid->nx;
    double *f = grid->f;
    double *res = grid->res;
    double dt = cfl*dx/a;
    double t = 0.0;
    int chunk = nx/(td->maxthreads);
    int start = tid*chunk;
    int rc;
    while(t < tf){
        // sync here
        if(start == 0){
            f[start+1:chunk] = a*u[start:chunk];
        }
        else{
            f[start:chunk+1] = a*u[start-1:chunk+1];
        }
        // sync here
        rc = pthread_barrier_wait(barr);
        if(rc != 0 && rc != PTHREAD_BARRIER_SERIAL_THREAD) {
            printf("Could not wait on barrier\n");
            exit(-1);
        }
        if(start == 0){
            f[start] = f[nx-1];
        }
        res[start:chunk] = -(f[start+1:chunk] - f[start:chunk])/dx;
        // need to use update u_new for multiple threads
        u[start:chunk] += res[start:chunk]*dt;
        rc = pthread_barrier_wait(barr);
        if(rc != 0 && rc != PTHREAD_BARRIER_SERIAL_THREAD) {
            printf("Could not wait on barrier\n");
            exit(-1);
        }
        t += dt;
    }
    return NULL;
}

int main(int argc, char *argv[]){
    int nx = 100000;      //atoi(argv[1]);
    int nthreads = 200;   //atoi(argv[2]);
    if(nx%nthreads != 0){
        printf("ERROR: Number of cells should be integral multiple of number of threads \n");
        exit(1);
    }
    pthread_t *threads = new pthread_t[nthreads]();
    struct ThreadData td[nthreads];
    pthread_barrier_t barr;
    pthread_barrier_init(&barr, NULL, nthreads);
    double *u = new double[nx]();
    double *res = new double[nx]();
    double *f = new double[nx+1]();
    double dx = 1.0/nx;
    double cfl = 0.9;
    double a = 1.0;
    double tf = 1.0;
    int i;
    // initialize
    u[0:nx] = 0.0;
    u[nx/4:nx/2] = 1.0;
    f[0:nx+1] = 0.0;
    res[0:nx] = 0.0;
    struct Grid grid;
    grid.nx = nx; grid.a = a; grid.cfl = cfl; grid.dx = dx;
    grid.u = u; grid.u_new = u; grid.res = res; grid.f = f; grid.tf = tf;
    for(i = 0; i < nthreads; i++){
        td[i].tid = i;
        td[i].grid = &grid;
        td[i].maxthreads = nthreads;
        td[i].barr = &barr;
    }
    int64_t t1 = TimeInMicros();
    for(i = 0; i < nthreads; i++){
        pthread_create(&threads[i], NULL, solver, &(td[i]));
    }
    for(i = 0; i < nthreads; i++){
        pthread_join(threads[i], NULL);
    }
    int64_t t2 = TimeInMicros();
    printf("Execution time: %.10f\n", (t2-t1)*1e-6);
    FILE *outfile = fopen("results.txt", "w+");
    for(i = 0; i < nx; i++){
        fprintf(outfile, "%.10f %.10f\n", i*dx, grid.u[i]);
    }
    fclose(outfile);
    pthread_barrier_destroy(&barr);
    delete[] threads;
    delete[] u;
    delete[] res;
    delete[] f;
}
```
1) It may be worth checking the result of pthread_create
2) How do you know how many threads are being used? I see no code there that tells you (there are no printfs inside solver).
I'm definitely not a pthread expert so hopefully others who are can chime in.
I ran your program on a system with MPSS 3.1 and it appears to be creating all the threads. After starting the executable using micnativeloadex, I connected to the card in another window and issued the commands shown below.
```
[root@mic0 ~]# ps H -C a.out | head
  PID TTY      STAT   TIME COMMAND
65799 ?        Sl     0:00 /tmp/coi_procs/1/65799/a.out
65799 ?        Sl     0:00 /tmp/coi_procs/1/65799/a.out
65799 ?        Sl     0:03 /tmp/coi_procs/1/65799/a.out
65799 ?        Sl     0:03 /tmp/coi_procs/1/65799/a.out
65799 ?        Sl     0:03 /tmp/coi_procs/1/65799/a.out
65799 ?        Sl     0:03 /tmp/coi_procs/1/65799/a.out
65799 ?        Sl     0:03 /tmp/coi_procs/1/65799/a.out
65799 ?        Sl     0:03 /tmp/coi_procs/1/65799/a.out
65799 ?        Sl     0:03 /tmp/coi_procs/1/65799/a.out
[root@mic0 ~]# ps H -C a.out | wc -l
203
```
My post mysteriously did not appear, posting it again.
Hi James and Kevin,
Thanks for the inputs.
It turns out I was misinterpreting what I saw. I was looking at processor usage in micsmc, which does not really change when I run the code. Adding print statements confirms that the threads are indeed created. However, the code spends so much time waiting at the barrier that it barely uses any processing power.
1. It worked perfectly on the processor with the barrier function, so I was expecting similar behaviour on the coprocessor. Am I doing something wrong here? If I remove the barrier, it runs super fast, but it obviously gives wrong results.
2. On a different note, I can easily vectorize every statement using array notation. In that case, how do I run the executable with multiple threads? I cannot use Cilk because there are no for loops.
Thanks,
Anand
What happens if you rewrite to use OpenMP? Your code is short enough to do this and it may shed some light on your issues.
Jim Dempsey
Untested code
```cpp
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <omp.h>
#include <sys/time.h>
#include <unistd.h>

int64_t TimeInMicros()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec*1000000 + tv.tv_usec;
}

struct Grid{
    int nx;
    double *u, *u_new, *f, *res;
    double a, cfl, dx;
    double tf;
};

struct ThreadData{
    int tid;
    struct Grid *grid;
    int maxthreads;
};

void *solver(void *args){
    struct ThreadData *td = (struct ThreadData *)args;
    int tid = td->tid;
    struct Grid *grid = td->grid;
    double *u = grid->u;
    double *u_new = grid->u_new;
    double cfl = grid->cfl;
    double a = grid->a;
    double dx = grid->dx;
    double tf = grid->tf;
    int nx = grid->nx;
    double *f = grid->f;
    double *res = grid->res;
    double dt = cfl*dx/a;
    double t = 0.0;
    int chunk = nx/(td->maxthreads);
    int start = tid*chunk;
    while(t < tf){
        // sync here
        #pragma omp barrier
        if(start == 0){
            f[start+1:chunk] = a*u[start:chunk];
        }
        else{
            f[start:chunk+1] = a*u[start-1:chunk+1];
        }
        // sync here
        #pragma omp barrier
        if(start == 0){
            f[start] = f[nx-1];
        }
        res[start:chunk] = -(f[start+1:chunk] - f[start:chunk])/dx;
        // need to use update u_new for multiple threads
        u[start:chunk] += res[start:chunk]*dt;
        #pragma omp barrier
        t += dt;
    }
    return NULL;
}

int main(int argc, char *argv[]){
    int nx = 100000;      //atoi(argv[1]);
    int nthreads = 200;   //atoi(argv[2]);
    if(nx%nthreads != 0){
        printf("ERROR: Number of cells should be integral multiple of number of threads \n");
        exit(1);
    }
    double *u = new double[nx]();
    double *res = new double[nx]();
    double *f = new double[nx+1]();
    double dx = 1.0/nx;
    double cfl = 0.9;
    double a = 1.0;
    double tf = 1.0;
    int i;
    // initialize
    u[0:nx] = 0.0;
    u[nx/4:nx/2] = 1.0;
    f[0:nx+1] = 0.0;
    res[0:nx] = 0.0;
    struct Grid grid;
    grid.nx = nx; grid.a = a; grid.cfl = cfl; grid.dx = dx;
    grid.u = u; grid.u_new = u; grid.res = res; grid.f = f; grid.tf = tf;
    int64_t t1 = TimeInMicros();
    #pragma omp parallel num_threads(nthreads)
    {
        struct ThreadData td;
        td.tid = omp_get_thread_num();
        td.grid = &grid;
        td.maxthreads = omp_get_num_threads();
        if(nthreads != omp_get_num_threads()) {
            if(td.tid == 0) {
                printf("ERROR: Did not obtain the requested number of threads \n");
                exit(-1);
            }
        } else {
            solver(&td);
        }
    }
    int64_t t2 = TimeInMicros();
    printf("Execution time: %.10f\n", (t2-t1)*1e-6);
    FILE *outfile = fopen("results.txt", "w+");
    for(i = 0; i < nx; i++){
        fprintf(outfile, "%.10f %.10f\n", i*dx, grid.u[i]);
    }
    fclose(outfile);
    delete[] u;
    delete[] res;
    delete[] f;
}
```
Jim Dempsey
Hi Jim,
Thanks for the comments and code. This works correctly. Any idea why the pthread code was not working?
Thanks,
Anand
Anand,
Looking at the code, I see you also posted this on Stack Overflow. I'm not going to post anything there, so you can get input from two different communities.
Please post any solution you get from Stack Overflow to these forums, and any solution from this forum to Stack Overflow. It'll help the broader community of users.
Thanks!
Regards
--
Taylor
The function solver is essentially the same, although I added the barriers where your comments indicated a barrier ought to be.
Note, the first barrier I used (where your comment indicated one) may be superfluous.
If I were to guess, I would think something is wrong (different) with the pthread barrier. Are you using the .so compiled for MIC? Using a host version of a library can be problematic when there are requirements specific to an instruction set. There are some minor differences that may matter (perhaps PAUSE versus DELAY32/DELAY64).
Jim Dempsey
Anand,
From the little experience I've had with Xeon Phi, when you have 244 threads, barriers are expensive. You should do what you can to work around the barriers. A good example of how you can do this can be found in one of my blog articles on IDZ. It relates to a plesiochronous phasing barrier (loosely synchronous).
I haven't performed a time-domain analysis of your code, but it appears that it may be capable of incorporating the plesiochronous phasing barrier. What this means, in a nutshell, is that any given thread is dependent upon only one neighboring thread, with the exception of the thread with start == 0, which is dependent upon two threads (tid+1 and the last tid).
To implement it, the main (caller) would initialize the ThreadData objects to include a pointer to a progress array: an array of volatile ints of size equal to the number of threads. Prior to launching the thread team, the array is initialized to 0's. Then, as each thread reaches a milestone (completes a section of code), it increments its progress entry (++td->progressArray[tid]). Where your barrier was required, the thread instead consults the appropriate neighboring progress array entry (to the right, to the left, or last, as the case may be). In the article I used a conditional compile statement to define a macro that waits a short time: the host uses _mm_pause(), MIC uses _delay32(nnn).
This strategy may work well for you too. One suggestion, though: seeing how the first thread (tid==0) waits for the last thread, consider giving it more work to do (IOW, make its chunk larger).
Before you implement this, look at the time-domain dependencies. If they all run in one direction, you may experience significant speed-up. If the time-domain dependencies run both ways, all is not lost. The technique to employ there is to partition the work (one CEAN statement) into two statements: a short stub containing the last vector's worth of data at the tail end of the region of work, which is executed first, followed by an increment of the progress variable, and then the bulk of the remaining front end. This way the dependent thread can resume early while the other thread completes its work.
Jim Dempsey
