Software Archive
Read-only legacy content
17061 Discussions

cilk_spawn skips function calls (?)

Christos_T_
Beginner
1,141 Views

Hello there. I'm a student and i'm trying some experiments with CilkPlus of icc 15. I'm using Ubuntu 12.04 with x64 Intel Processor.

The following code is an implementation of a radix sorting algorithm of an octree using points' morton codes. The problem is that it seems that though cilk decides not to spawn a new thread in one of the 8 recursive calls, it also skips calling the function serially. This results in producing a non-complete sorted index vector, whose size is less than the original index vector's size and thus it doesn't apply sorting to all points. This is not happening if i implement serially the bin splitting and i apply cilk_for only to the recursive calls. Can you explain to me what's happening? Is there an alternative implementation? Should i correct something?

    #include <cstdlib>
    #include <cstdio>
    #include <vector>
    #include <cilk/cilk.h>
    #include <cilk/reducer_vector.h>

    #define MAXBINS 8


    typedef std::vector<unsigned int> UIVector;
    typedef std::vector<unsigned long int> ULIVector;
    typedef cilk::reducer< cilk::op_vector<unsigned int> > UIVectorReducer;
    typedef cilk::reducer< cilk::op_vector<unsigned long int> > ULIVectorReducer;

    void truncated_radix_sort(const ULIVector& morton_codes,
                              ULIVector* sorted_morton_codes,
                              const UIVector& index,
                  UIVector* permutation_vector,
                  unsigned int *level_record,
                  int population_threshold,
                  int sft, int lv)
    {
      int N = index.size();
      if (N <= 0)
      {
        return;
      }
      else if (N <= population_threshold || sft < 0)
      { // Base case. The node is a leaf
        level_record[0] = lv; // record the level of the node
        *permutation_vector = index;
        *sorted_morton_codes = morton_codes;
        return;
      }
      else
      {
        int i, j;
        level_record[0] = lv;
        
        /* Place point to a bin according to its morton code */
        UIVectorReducer* bins_reducer = new UIVectorReducer[MAXBINS];
        ULIVectorReducer* bin_codes_reducer = new ULIVectorReducer[MAXBINS];
        cilk_for (j = 0; j < N; j++)
        {
          unsigned int ii = (morton_codes>>sft) & 0x07;
          (bins_reducer[ii])->push_back(index);
          (bin_codes_reducer[ii])->push_back(morton_codes);
        }

        UIVector* sorted_bins = new UIVector[MAXBINS];
        ULIVector* sorted_codes = new ULIVector[MAXBINS];
        int offsets[MAXBINS];
        offsets[0] = 0;
        for (i = 1; i < MAXBINS; i++)
        {
          offsets = offsets[i-1] +
            bins_reducer[i-1].get_value().size();
        }

        /* Call the function recursively to split the lower levels */
        for (i = 0; i < MAXBINS; i++)
        {
          cilk_spawn truncated_radix_sort(
              bin_codes_reducer.get_value(), &sorted_codes,
              bins_reducer.get_value(), &sorted_bins,
              &level_record[offsets],
              population_threshold,
              sft-3, lv+1);
        }
        cilk_sync;

        /* Merge sorted vectors */
        permutation_vector->reserve(N);
        sorted_morton_codes->reserve(N);
        for (i = 0; i < MAXBINS; i++)
        {
          permutation_vector->insert(permutation_vector->end(), sorted_bins.begin(), sorted_bins.end());
          sorted_morton_codes->insert(sorted_morton_codes->end(), sorted_codes.begin(), sorted_codes.end());
        }

        delete[] sorted_bins;
        delete[] sorted_codes;
        delete[] bins_reducer;
        delete[] bin_codes_reducer;
      }
    }

Thank you in advance.

 

0 Kudos
1 Solution
Jim_S_Intel
Employee
1,141 Views

It seems like the following code looks like it might be extracting the value of a reducer inside a parallel region?   If a steal of the continuation of the spawn happens, then it seems like the dereference of bins_code_reducer in 70 and 71 would get a value from an empty identity view instead of the merged view.   
 

066     /* Call the function recursively to split the lower levels */
067     for (i = 0; i < MAXBINS; i++)
068     {
069       cilk_spawn truncated_radix_sort(
070           bin_codes_reducer.get_value(), &sorted_codes,
071           bins_reducer.get_value(), &sorted_bins,
072           &level_record[offsets],
073           population_threshold,
074           sft-3, lv+1);
075     }
076     cilk_sync;
077  

 

Cheers,

Jim

View solution in original post

0 Kudos
7 Replies
Bradley_K_
New Contributor I
1,141 Views

The most likely explanation is that there is a bug in your code.  Since you didn't include a complete test case, there is no way to tell.

 

Did you run the cilkscreen race detector on your code?

 

-Bradley

0 Kudos
Christos_T_
Beginner
1,141 Views

Cilkscreen does not find any race conditions. I am providing a characteristic dataset below. Before that sorted_morton_codes and permutation_vector are the output vectors. You should only feed a pointer of an empty vector. morton_codes and index are equal sized input vectors. level_record is a pointer to an index-sized array and it does not concern the functionality of this function actually. population_threshold is how many nodes should a leaf have and it means to stop the recursion. sft is where we should look to our morton code in order to partition locally our space and fill each bin-partition with a point. lv denotes the level of partition.

Typical and small input:

int N = 25;
std::vector<unsigned long int> morton_codes;
morton_codes[0] = 0;
morton_codes[1] = 17;
morton_codes[2] = 274;
morton_codes[3] = 299;
morton_codes[4] = 457;
morton_codes[5] = 144;
morton_codes[6] = 312;
morton_codes[7] = 466;
morton_codes[8] = 366;
morton_codes[9] = 352;
morton_codes[10] = 365;
morton_codes[11] = 76;
morton_codes[12] = 278;
morton_codes[13] = 209;
morton_codes[14] = 323;
morton_codes[15] = 110;
morton_codes[16] = 340;
morton_codes[17] = 186;
morton_codes[18] = 338;
morton_codes[19] = 355;
morton_codes[20] = 216;
morton_codes[21] = 89;
morton_codes[22] = 268;
morton_codes[23] = 346;
morton_codes[24] = 452;
std::vector<unsigned long int> sorted_morton_codes;
std::vector<unsigned int> index(25);
for (int i = 0; i < 25; i++) index = i;
std::vector<unsigned int> permutation_vector;
unsigned int level_record[25];
int pop_thres = 2;
int sft = 6;
int lv = 0;

// Call function
truncated_radix_sort(morton_codes, &sorted_morton_codes,
			 index, &permutation_vector,
                         level_record,
			 pop_thres, sft, lv);

Printing bins's sizes before and after execution reveals the problem.

#include <cstdlib>
#include <cstdio>
#include <vector>
#include <cilk/cilk.h>
#include <cilk/reducer_vector.h>

#define MAXBINS 8


typedef std::vector<unsigned int> UIVector;
typedef std::vector<unsigned long int> ULIVector;
typedef cilk::reducer< cilk::op_vector<unsigned int> > UIVectorReducer;
typedef cilk::reducer< cilk::op_vector<unsigned long int> > ULIVectorReducer;

void truncated_radix_sort(const ULIVector& morton_codes,
                          ULIVector* sorted_morton_codes,
                          const UIVector& index,
			  UIVector* permutation_vector,
			  unsigned int *level_record,
			  int population_threshold,
			  int sft, int lv)
{
  int N = index.size();
  if (N <= 0)
  {
    return;
  }
  else if (N <= population_threshold || sft < 0)
  { // Base case. The node is a leaf
    level_record[0] = lv; // record the level of the node
    *permutation_vector = index;
    *sorted_morton_codes = morton_codes;
    return;
  }
  else
  {
    int i, j;
    level_record[0] = lv;
    
    /* Place point to a bin according to its morton code */
    UIVectorReducer* bins_reducer = new UIVectorReducer[MAXBINS];
    ULIVectorReducer* bin_codes_reducer = new ULIVectorReducer[MAXBINS];
    cilk_for (j = 0; j < N; j++)
    {
      unsigned int ii = (morton_codes>>sft) & 0x07;
      (bins_reducer[ii])->push_back(index);
      (bin_codes_reducer[ii])->push_back(morton_codes);
    }

    UIVector* sorted_bins = new UIVector[MAXBINS];
    ULIVector* sorted_codes = new ULIVector[MAXBINS];
    int offsets[MAXBINS];
    offsets[0] = 0;
    for (i = 1; i < MAXBINS; i++)
    {
      offsets = offsets[i-1] +
        bins_reducer[i-1].get_value().size();
    }

    /* Printing bins before sorting */
    for (i = 0; i < MAXBINS; i++)
    {
      printf("lv %d, bin %d size: %d\n", lv, i, bins_reducer.get_value().size());
    }

    /* Call the function recursively to split the lower levels */
    for (i = 0; i < MAXBINS; i++)
    {
      cilk_spawn truncated_radix_sort(
          bin_codes_reducer.get_value(), &sorted_codes,
          bins_reducer.get_value(), &sorted_bins,
          &level_record[offsets],
          population_threshold,
          sft-3, lv+1);
    }
    cilk_sync;

    /* Printing respective bins after sorting */
    for (i = 0; i < MAXBINS; i++)
    {
      printf("lv %d, sorted %d size: %d\n", lv, i, sorted_bins.size());
    }

    /* Merge sorted vectors */
    permutation_vector->reserve(N);
    sorted_morton_codes->reserve(N);
    for (i = 0; i < MAXBINS; i++)
    {
      permutation_vector->insert(permutation_vector->end(),
          sorted_bins.begin(), sorted_bins.end());
      sorted_morton_codes->insert(sorted_morton_codes->end(),
          sorted_codes.begin(), sorted_codes.end());
    }

    delete[] sorted_bins;
    delete[] sorted_codes;
    delete[] bins_reducer;
    delete[] bin_codes_reducer;
  }
}

Print output:

lv 0, bin 0 size: 2
lv 0, bin 1 size: 3
lv 0, bin 2 size: 2
lv 0, bin 3 size: 2
lv 0, bin 4 size: 5
lv 0, bin 5 size: 8
lv 0, bin 6 size: 0
lv 0, bin 7 size: 3
lv 1, bin 0 size: 0
lv 1, bin 1 size: 1
lv 1, bin 2 size: 0
lv 1, bin 3 size: 1
lv 1, bin 4 size: 0
lv 1, bin 5 size: 1
lv 1, bin 6 size: 0
lv 1, bin 7 size: 0
lv 1, sorted 0 size: 0
lv 1, sorted 1 size: 1
lv 1, sorted 2 size: 0
lv 1, sorted 3 size: 0
lv 1, sorted 4 size: 0
lv 1, sorted 5 size: 0
lv 1, sorted 6 size: 0
lv 1, sorted 7 size: 0
lv 0, sorted 0 size: 2
lv 0, sorted 1 size: 1
lv 0, sorted 2 size: 0
lv 0, sorted 3 size: 0
lv 0, sorted 4 size: 0
lv 0, sorted 5 size: 0
lv 0, sorted 6 size: 0
lv 0, sorted 7 size: 0

Even if printf cannot be called by more that one thread, (sorted) bin's sizes after recursive execution as numbers suggest shouldn't be 0. Also, i have checked the vector merging and it works fine.

- Christos

 

0 Kudos
Bradley_K_
New Contributor I
1,141 Views

If you want help, you need to provide a complete test case, not one that I can assemble with enough ingenuity.  ("You feed a pointer to an empty vector to ...." doesn't cut it.)

The instructions should be like "copy the enclosed code to foo.cc, compile it as icc foo.c -o foo.  Expect output to be `3', but it's not."

-Bradley

0 Kudos
Jim_S_Intel
Employee
1,141 Views

Does your code run correctly with CILK_NWORKERS=1?  If not, then that is an easier case to debug.
If the code does work with CILK_NWORKERS=1, but not more than 1 worker,  my  first suspect would be something to do the arrays of reducer vectors + the combination of recursion, since it looks like you could be getting sizes from intermediate views.   In general, you shouldn't look up the value of a reducer until all the parallel strands you care about have synced.  
Cheers,

Jim

0 Kudos
Christos_T_
Beginner
1,141 Views

Thank you Jim for the suggestion. Code runs correctly for CILK_NWORKERS=1 and that's what is troubling me. I think that dynamically allocating reducers and using them like so (lines 41-48) is fine and works. I know that because each bin is filled correctly.

Also, at lines 52-64 <reducer>.get_value() is used after cilk_for, at the end of which i think an implicit taskwait resides. Thus using like so shouldn't be a problem (i think).

@Bradley, my apologies i thought that compiling this code should not be referenced as test instructions. Also, forgive my typo,  corrected it.

The expected output for starters in a sorting algorithm should have equal length to the input. So when we work in parallel, and run the above snippet one should assert at first that:

assert(morton_codes.size() == sorted_morton_codes.size());
assert(index.size() == permutation_vector.size());

Ground truth for large inputs is hard to be written by hand (for which i am sorry too). Since you do want a complete test case for this function, i will soon provide you with a gtest test case that compares the output of a serialization to the output of this one.

0 Kudos
Jim_S_Intel
Employee
1,142 Views

It seems like the following code looks like it might be extracting the value of a reducer inside a parallel region?   If a steal of the continuation of the spawn happens, then it seems like the dereference of bins_code_reducer in 70 and 71 would get a value from an empty identity view instead of the merged view.   
 

066     /* Call the function recursively to split the lower levels */
067     for (i = 0; i < MAXBINS; i++)
068     {
069       cilk_spawn truncated_radix_sort(
070           bin_codes_reducer.get_value(), &sorted_codes,
071           bins_reducer.get_value(), &sorted_bins,
072           &level_record[offsets],
073           population_threshold,
074           sft-3, lv+1);
075     }
076     cilk_sync;
077  

 

Cheers,

Jim

0 Kudos
Christos_T_
Beginner
1,141 Views

Jim thank you very much! It was just as you said! I was hoping that by dereferencing in the loop, i could pass immediately the reference to the next level of recursion. Nice!

0 Kudos
Reply