Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Strange parallel application behavior

Tudor
New Contributor I
Hello there,

I haven't visited the forums in a while. :)
I'm having a hard time understanding the behavior of a C app I wrote when running on a dual quad-core machine.

The application basically does a wavefront sweep of a matrix, assigning to each point the minimum of its value and its neighbours' values. Rows are assigned to threads in an interleaved fashion, and each row is divided into chunks guarded by critical sections to prevent the thread processing row i + 1 from getting ahead of thread i (since thread i + 1 needs the updated values of row i).

The problem I am seeing when testing the code is the following: the sequential execution takes about 16 seconds on the test data (a 12000 x 12000 matrix), while the parallel version takes anywhere between 2 and 8 seconds and varies from run to run.

I'm posting the code executed by each thread:

[cpp]void* threadTask(void* params)
{
  struct timeval start, stop;
  gettimeofday(&start, 0);
  int tid = (int)params;

  int i, j, k, l, add;
  add = 0;
  int* elements = (int*) malloc((global_offsets_len / 2 + 1) * sizeof(int));

  for(i = tid; i < global_height; i += numProcs)
  {
    while(!events[i]) // thread is released only when the previous thread has finished its first chunk
    {
      //asm volatile("":::"memory");
      //__sync_synchronize();
      sleep(0);
    }

    for(l = 0; l < global_width / grain; l++)
    {
      if(i > 0 && progress[i - 1] < global_width / grain)
      {
        while(progress[i] + 1 == progress[i - 1]) // wait for the previous thread to finish its current chunk
        {
          //asm volatile("":::"memory");
          //__sync_synchronize();
          sleep(0);
        }
      }

      pthread_mutex_lock(&locks[i]); // each row is striped into chunks of size 'grain'

      add = 0;
      if (l + 1 == global_width / grain)
      { // accommodate the division remainder
        add = global_width % grain;
      }

      for(j = l * grain; j < (l + 1) * grain + add; j++)
      {
        elements[0] = global_input[i][j];
        for(k = 0; k < global_offsets_len; k += 2)
        {
          int d1 = indexWrap(i + global_offsets[k], global_height);
          int d2 = indexWrap(j + global_offsets[k + 1], global_width);
          elements[k / 2 + 1] = global_input[d1][d2];
        }
        global_input[i][j] = global_function(elements, global_offsets_len / 2 + 1);
      }

      progress[i]++;
      if (progress[i] == 1 && i + cPoint[0] < global_height)
      {
        events[i + cPoint[0]] = 1; // release the next thread to begin computation
      }

      pthread_mutex_unlock(&locks[i]);
    }
  }

  gettimeofday(&stop, 0);
  int total = (stop.tv_sec - start.tv_sec);
  printf("Thread %d time %d sec\r\n", tid, total);

  return 0;
}
[/cpp]

At first I thought the problem was with the thread-to-CPU mapping: it would make sense to me to see degraded performance if threads processing consecutive rows are assigned to different physical processors and thus have to communicate through main memory to exchange data. So, I looked around in /proc/cpuinfo and used pthread_attr_setaffinity_np to set the affinity such that the threads processing the first four rows are assigned to cores 0 - 3 and the others to cores 4 - 7. Unfortunately, this did not solve anything. The pthread_attr_setaffinity_np function does indeed work, because setting all threads to a single core makes the application run very slowly. Code executed in the main thread:

[cpp]  int cpus[] = { 0, 2, 4, 6, 1, 3, 5, 7 };
  for(i = 0; i < numProcs; i++)
  {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpus[i], &cpuset);
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset);
    printf("Thread %d started\r\n", i);
    pthread_create(&threads[i], &attr, threadTask, (void*)i);
  }[/cpp]

The above cpus[] array is constructed based on info from /proc/cpuinfo: cores 0, 2, 4, 6 are on physical processor 0 and cores 1, 3, 5, 7 are on physical processor 1.
I really hope someone can help me because I've really run out of ideas.
Thank you!
jimdempseyatthecove
Honored Contributor III
I am confused by the mismatch between your problem statement:

>>The application basically does a wavefront sweep of a matrix assigning to each point the minimum between its value and the neighbours' values.

and your code.

-------------------

Excepting for boundary conditions (first row, last row, first col, last col), aren't you essentially wanting

for each i, j
  a'[i][j] = min(a[i-1][j-1], a[i-1][j], a[i-1][j+1],
                 a[i][j-1],   a[i][j],   a[i][j+1],
                 a[i+1][j-1], a[i+1][j], a[i+1][j+1]);
a[i][j] = a'[i][j];

If so, then I suggest you experiment with allocating a second array (a'), performing the calculation without locks, and copying the data back. I do not know whether your boundary conditions wrap or truncate.

Note that although you will be writing each variable twice, you will have neither stalls nor interlocked operations.
Also, instead of interleaving rows, I would suggest that each thread gets a slice of rows (as you would with #pragma omp parallel for iterating over rows, with the inner loop iterating over columns without slicing). This code would then be a good candidate for some vectorization and, more importantly, will be friendlier to the L1 cache.

If you find that system load (running something else) causes your threads not to finish at nearly the same time, then create more row slices than threads. With 12,000 rows and 8 cores, 120 rows per slice would provide for no worse than a 1% skew per thread preemption. In OpenMP you would use dynamic scheduling; with pthreads, just create an atomic integer and interlocked-increment it to get the next row slice number. Your unnecessary lock overhead would then be only 100-107 LOCKs (interlocked increments).
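For example, a minimal pthread sketch of that atomic slice counter (sliceWorker, computeRow, nRows and SLICE_SIZE are illustrative names, not from your program):

[cpp]#define SLICE_SIZE 120               // 12,000 rows / 120 = 100 slices

extern int  nRows;                    // total number of rows
extern void computeRow(int row);      // per-row stencil writing into the copy a'

static volatile int nextSlice = 0;    // shared slice counter

void* sliceWorker(void* arg)
{
  (void)arg;
  for(;;)
  {
    int slice = __sync_fetch_and_add(&nextSlice, 1); // atomically claim a slice
    int rowFrom = slice * SLICE_SIZE;
    if(rowFrom >= nRows)
      break;                                         // no slices left
    int rowTo = rowFrom + SLICE_SIZE;
    if(rowTo > nRows)
      rowTo = nRows;
    for(int row = rowFrom; row < rowTo; row++)
      computeRow(row);
  }
  return 0;
}[/cpp]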

// in QuickThread (TBB would be similar, excepting for the chunking used below)
// slice rows by 100 (minimizes thread completion skew when the system is under load)
parallel_for(nRows/100, [&](int iRowFrom, int iRowTo){
  for(int iRow = iRowFrom; iRow < iRowTo; ++iRow)
  {
    int iRowUp = (iRow == 0) ? nRows-1 : iRow-1;
    int iRowDown = (iRow == nRows-1) ? 0 : iRow+1;
    ... special case jCol = 0
    for(int jCol = 1; jCol < nCols-1; ++jCol)
    {
      ... min function here: a'[iRow][jCol] = ...
    }
    ... special case jCol == nCols-1
  } // for
}, //[&](int iRowFrom, int iRowTo)
0, nRows); // parallel_for
// all threads done here
// now copy a' back into a
parallel_for(nRows/100, [&](int iRowFrom, int iRowTo){
  for(int iRow = iRowFrom; iRow < iRowTo; ++iRow)
  {
    ...

Cilk++ would be easier to write/express. However, I suggest you not choose a threading model based on one example.

Jim Dempsey
Tudor
New Contributor I
Hello and thank you for your reply Jim.

I understand what you are saying, but I actually want to use the updated values at each step, not the old values of the matrix.

For example, if element a[0][0] gets the value min(a[0][0], neighbours), then at the next step a[0][1] will get the min(a[0][1], neighbours), but here we use the updated value of a[0][0]. Also, the neighbour below a[0][0] will get the updated value of a[0][0] in its calculation.

Therefore, a static mapping of rows to threads would practically be equivalent to executing sequentially, since for some point a[i][j] I cannot compute its value without the updated value of its neighbour a[i-1][j].

I hope the statement is more clear now.

Edit: I've found an older thread that I started a while back with basically the same problem. I think the problem is explained better there, in Dr. Clay's reply.
http://software.intel.com/en-us/forums/showthread.php?t=66937
Tudor
New Contributor I
Well, apparently the problem was solved by replacing pthread_attr_setaffinity_np with sched_setaffinity.
Now I'm getting the desired performance.
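For reference, the general shape of the call that worked, as a rough sketch (pinSelf is just an illustrative helper, not the exact code; on Linux, passing pid 0 pins the calling thread):

[cpp]#define _GNU_SOURCE
#include <sched.h>

// Sketch: each thread pins itself at the top of threadTask.
void pinSelf(int core)
{
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core, &cpuset);
  sched_setaffinity(0, sizeof(cpuset), &cpuset); // pid 0 = calling thread
}[/cpp]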
jimdempseyatthecove
Honored Contributor III
Your explanation is almost clear.

You are somewhat representing a "ripple effect" (what you formerly stated as a wave propagation).

Do you intend for your wave to start at [0][0] and ripple as an arc

step-1
[0][0] = min([0][0],[0][1],[1][0],[1][1])
step-2 (in parallel)
[0][1] = min([0][1]'s neighbors after step-1)
[1][0] = min([1][0]'s neighbors after step-1)
[1][1] = min([1][1]'s neighbors after step-1)
step-3 (in parallel)
[0][2] = min([0][2]'s neighbors after step-2)
[1][2] = min([1][2]'s neighbors after step-2)
[2][0] = min([2][0]'s neighbors after step-2)
[2][1] = min([2][1]'s neighbors after step-2)
[2][2] = min([2][2]'s neighbors after step-2)
step-4 (in parallel)
[0][3] = min([0][3]'s neighbors after step-3)
...
[3][3] = min([3][3]'s neighbors after step-3)

Or run row-by-row

-------------------------------------------------------------> (thread-0)
---------------------------------------------------------> (thread-1 trailing thread-0)
--------------------------------------------------> (thread-2 trailing thread-1)
...
---------------> (thread-n trailing thread-n-1)
-----------> (thread-x trailing thread-n; x is the first of the above threads to take a new row)
--> ...

The second route will require a fair amount of progress checking. The improvement strategy I would suggest for route two is to choose an appropriate interval number, say for example 120 (for your 12,000-sized array). Create a 100x100 array of char (assuming the processor supports cache-coherent write combining of char; otherwise use DWORD). Prior to the run, zero out this array. Then start the computation; threads acquire rows but check only the progress intervals above and to the right of their row/120, col/120 neighbors, and wait until those neighboring 120s are complete. When you complete your cell, write a 1 into the 100x100 array indicating that cell is done.

Assume:

thread-0 gets rows 0:119
thread-1 gets rows 120:239
...

Thread-0 computes a 120x120 tile origin-ed at 0,0 then writes 1 into cell 0,0 of progress array
Thread-1 waiting on progress cells 0,0 and 0,1
thread-2 waiting on progress cells 1,0 and 1,1
thread-3 waiting on progress cells 2,0 and 2,1
...

When thread-0 completes the 120x120 tile origin-ed at 0,120 and writes 1 into cell 0,1 of the progress array
then thread-1 begins working on its first tile origin-ed at 120,0

What this produces is staggered start times with periodic progress checking. You can reduce the size of the tile workspace from 120x120 at the expense of increased progress checking.
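A rough C sketch of that progress check for the row-by-row variant (processTile, computeTile and done[][] are illustrative names; each tile waits on the tile above and the tile above-right, as in the description above):

[cpp]#include <sched.h>

#define TILE 120
#define TILES_PER_SIDE 100                                  // 12,000 / 120

extern volatile char done[TILES_PER_SIDE][TILES_PER_SIDE];  // zeroed before the run
extern void computeTile(int tileRow, int tileCol);          // stencil over one 120x120 tile

void processTile(int tileRow, int tileCol)
{
  if(tileRow > 0)
  {
    while(!done[tileRow - 1][tileCol])        // wait for the tile above
      sched_yield();
    if(tileCol + 1 < TILES_PER_SIDE)
      while(!done[tileRow - 1][tileCol + 1])  // ... and the tile above-right
        sched_yield();
  }

  computeTile(tileRow, tileCol);

  __sync_synchronize();                       // publish the tile's writes
  done[tileRow][tileCol] = 1;                 // mark this tile as done
}[/cpp]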

The first method (arc ripple) would be similar, using 120x120 (or another selected tile size). The thread team starts; one thread wins the pick for the tile origin-ed at 0,0 and the others wait. When the winning thread completes the first tile and marks it done, it enters the competition for the next tile (origin-ed at 0,120). When that tile completes and is marked done, there are two tile candidates ready for picking (tiles origin-ed at 0,240 and 120,0). As each tile completes, two tiles are added to the pool of ready tiles (except for the completion of a bottom tile, in which case only one tile is added to the pool), so the number of potential ready tiles grows and the potential for concurrency increases. The benefit of this method is less adverse interaction should a thread get preempted by the OS (less so once the number of available tiles exceeds the number of worker threads).

Jim Dempsey
Tudor
New Contributor I
Thank you for the advice. Another question would be: considering a dual quad-core processor, what would be the most effective mapping of threads to cores to maximize the efficiency of this scheme?

Edit: it seems my problem was solved on Stack Overflow. The problem appears to have been a large number of cache misses; aligning the rows of the input matrix to 64-byte boundaries solved the issue.

http://stackoverflow.com/questions/6425706/parallel-application-has-random-behavior
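For anyone hitting the same thing, a minimal sketch of one way to get that alignment with posix_memalign (allocAlignedMatrix is just an illustrative helper; error handling omitted):

[cpp]#include <stdlib.h>

// Allocate each row on its own 64-byte (cache line) boundary.
int** allocAlignedMatrix(int height, int width)
{
  int** rows = (int**) malloc(height * sizeof(int*));
  for(int i = 0; i < height; i++)
  {
    void* p = NULL;
    posix_memalign(&p, 64, width * sizeof(int)); // 64-byte aligned row
    rows[i] = (int*) p;
  }
  return rows;
}[/cpp]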