Sigurhjs
Beginner

Offloading Cilk code causes an extreme slowdown

I hope this is in the correct forum for this.

I am doing some tests on offloading code to the Xeon Phi through both Cilk and OpenMP. I have the OpenMP code working quite well and it gives decent results (or results as expected).

I am, however, having big issues with my Cilk code, which is what I'd like to get working properly. On a very short test burst, the version that hasn't been offloaded runs in about 8 seconds, while the same code with the actual calculations offloaded takes around 30 seconds.

On a proper run, the best result I've gotten is 68 seconds without offloading, compared to a best of 311 seconds with offloading (depending on the number of threads, offloading is usually 4x-6x slower).

The hardware event counts seem to be close to the same, so I am very much unsure why.

The code is very heavy on I/O operations; they are not offloaded, but the code does use shared memory, as suggested in the book by Jim Jeffers and James Reinders. The book suggests that Cilk should be more suitable for I/O-heavy workloads, yet my OpenMP-based code handles the same run in as little as 9 seconds when offloaded.

I am not sure which files I can include, as the VTune logs are about 500 MB in size. I have included a screenshot of both the summary and the event counts for the offloaded and non-offloaded code here: http://imgur.com/a/eSH5v.

If someone could point me in the right direction as to what I should look at, I'd greatly appreciate it.

9 Replies
Hansang_B_Intel
Employee

Can you briefly explain your workload?

For example, is it something like an OpenMP parallel for vs. _Cilk_for with exactly the same work?

It would be great if you could post some code samples or pseudocode.

Sigurhjs
Beginner

In short, we are doing tests on a very naive video encoder. Most of the execution time is spent in motion estimation, which in short takes an 8x8 block and searches a 4x16x16 grid around it for the closest matching pixel block. I've included a very simplified code block below for more details:

void me_8x8_block(int mb_x, int mb_y, uint8_t *orig, uint8_t *ref) {

  int range = 16;

  int left = mb_x * 8 - range;
  int top = mb_y * 8 - range;
  int right = mb_x * 8 + range;
  int bottom = mb_y * 8 + range;

  _Cilk_for (y = top; y < bottom; ++y)
    {
      int x;
      for (x = left; x < right; ++x)
      {
        //The sad block is just a sum of absolute differences calculation
        sad_block_8x8(orig + mb_y*8*width+mb_x*8, ref + y*width+x); 
      }
    }
}
void motion_estimate() {
  _Cilk_for (mb_y = 0; mb_y < rows; ++mb_y)
  {
    int mb_x;
    for (mb_x = 0; mb_x < cols; ++mb_x)
    {
      me_block_8x8(mb_x, mb_y, curr_frame, prev_frame);
    }
  }
}

The runs are fairly large, with arrays up to 3840x2160 in size. The timings above compare running the Cilk code natively on the CPU against running on the CPU but offloading the actual encoding to a Xeon Phi.

Both runs use exactly the same code otherwise, and therefore the same workload. Our assumption is that Cilk's random work-stealing algorithm is causing the poor results, but the problem is understanding why the difference between offloading and not offloading is so drastic (offloading is about 5x slower on average). We have compared this to OpenMP runs, and OpenMP does do better, but we expected that, as static work-sharing scheduling should fare better than work stealing here.

Hansang_B_Intel
Employee

The first thing you can try is to serialize the cilk_for loop in me_8x8_block (me_block_8x8?). It really depends on what is being done within sad_block_8x8, but it is worth trying, since the amount of work done by each worker can be very small: the grain size defaults to min(2048, ceil(trip_count / (8 * total_workers))), which is one here, and the inner loop's trip count is quite small (32).

Another thing you can try is to use array notation if there is any vectorizable code in your workload. Your description of sad_block_8x8 makes it sound like a good candidate for that.
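As a concrete sketch of the kind of kernel meant here (the real sad_block_8x8 isn't posted, so the signature and `stride` parameter are assumptions), a SAD written with a simple inner loop is usually easy for the compiler to vectorize; the Cilk Plus array-notation variant is shown in a comment:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of an 8x8 sum-of-absolute-differences kernel.
   `stride` is the frame width in pixels; both pointers address the
   top-left pixel of their 8x8 block. */
int sad_block_8x8(const uint8_t *orig, const uint8_t *ref, int stride) {
    int sad = 0;
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            /* uint8_t operands promote to int, so abs() is well-defined */
            sad += abs(orig[i * stride + j] - ref[i * stride + j]);
        }
    }
    return sad;
    /* With Cilk Plus array notation the row loop body could instead read:
       sad += __sec_reduce_add(abs(orig[i*stride:8] - ref[i*stride:8]));   */
}
```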

I also recommend working on the CPU performance first, since 68 seconds (Cilk on host) vs. 9 seconds (OpenMP on target) indicates something is not working well even on the CPU.

Sigurhjs
Beginner

Yeah, sorry; I rewrote a lot of code for the sample for simplicity and I see that the function name got a bit mixed up. The SAD calculation basically goes like this:

int i, j, result = 0;

for(i = 0; i < 8; i++) {
  for(j = 0; j < 8; j++) {
    result += abs(array1[i*width + j] - array2[i*width + j]);
  }
}

This is just a basic SAD calculation, as can be seen (the code actually in use has some improvements over this, including vectorization). However, removing the parallelization inside me_8x8_block results in about the same runtime as with it included.

I did some more testing after I found an environment variable that makes shared-memory updates less frequent, and this appears to account for the biggest part of the increased time in the offloaded code. We set MYO_CONSISTENCE_PROTOCOL=HYBRID_UPDATE_NOT_SHARED, and with that, runtimes dropped to about double those of the non-offloaded code, instead of the massive increase we had before.
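For reference, the run setup looked roughly like this (the encoder binary and file names are placeholders; OFFLOAD_REPORT is a separate, real offload-runtime variable that prints per-offload transfer and timing data, handy for confirming the effect):

```shell
# MYO_CONSISTENCE_PROTOCOL controls how aggressively the MYO shared-memory
# layer synchronizes pages between host and coprocessor.
export MYO_CONSISTENCE_PROTOCOL=HYBRID_UPDATE_NOT_SHARED
# Print per-offload data-transfer and timing info for each offload region.
export OFFLOAD_REPORT=2
# ./encoder input.yuv output.c63   (hypothetical binary and file names)
```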

The expectation is that Cilk should perform worse than OpenMP, because Cilk uses a random work-stealing algorithm while OpenMP uses work sharing, and our code is easily divisible into chunks before runtime. If we change the OpenMP scheduling from static to dynamic, the CPU runtime is about the same, which fits our theory. Is it then reasonable to assume that the remaining part of the increased runtime can also be attributed to shared memory, or should I look at other things as well?

Jim_S_Intel
Employee

Are the 1-worker-thread runs for Cilk and OpenMP comparable to the run with no parallel keywords at all? In the past, we have observed that performance differences are sometimes also caused by differences in compiler optimization of the body of the parallel loop, rather than by the work-stealing algorithm itself. Experimenting with single-worker executions can be useful in isolating such effects.
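One way to set up that comparison, assuming separately built binaries (the binary names are hypothetical; the environment variables are the standard runtime knobs):

```shell
# Pin each runtime to a single worker so the three builds differ only in
# how the compiler optimized the loop body, not in scheduling.
export CILK_NWORKERS=1     # Cilk Plus runtime worker count
export OMP_NUM_THREADS=1   # OpenMP team size
# ./encoder_serial in.yuv out.c63   (hypothetical binaries)
# ./encoder_cilk   in.yuv out.c63
# ./encoder_omp    in.yuv out.c63
```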

Cheers,

Jim


Sigurhjs
Beginner

Somewhat. If the code is run with a single thread and parallel loops, through either pragmas (with dynamic scheduling) or _Cilk_for, we see about a 10% increase in runtime compared to the code without any parallel keywords. If we use static scheduling in OpenMP with a single thread, we see a speedup of about 4x. I am assuming that this speedup is due to some vectorization optimizations (it should be noted that the single-thread parallel version is compiled with -qopenmp while the non-parallel one is not).

jimdempseyatthecove
Black Belt

IMHO your problem is placing your parallelization constructs too deep in your code. You'd be better off lifting them up a level (or several).

You state your code works on fairly large arrays, up to 3840x2160, yet your search is a tiny 8x8 pattern within a 4x16x16 grid. I assume this means 4 tiles of 16x16 in the 2D array (3840x2160).

If you are tracking multiple items, then your parallelization should be performed item by item, not by partitioning sub-sections of 4x16x16 arrays.

On the other hand, if you are tracking one individual item within a single 4x16x16 section of the 2D array (3840x2160), where the 4x16x16 subsection may migrate between frames, then use OpenMP: create a parallel region and, depending on the number of threads, prepare the 4x16x16 array offsets and extents. Create a shared array of results for each team member, and an array of the frame number processed by each team member.

The master thread (team member 0) is in charge of advancing the frame state, inside the parallel region. The other team members spin-wait for the master thread to update the read-frame number; essentially this is used as a barrier. You could use an omp barrier, but a user-written spin-wait will be more efficient. As each thread observes the read-frame number change, it exits the spin-wait, processes its subsection of the 4x16x16 array, places its best-match row, col, and min value into its slot in the team results array, and lastly updates its slot in the frames-processed array.

The master thread, upon completing the search of its own subsection, then consults the thread results array, waiting for each thread's frame-completed number to match the master thread's own. When necessary, the master thread updates the best-match location and value, then continues waiting on and processing the remaining threads' results against the new best match.

Note that the read-frame number has not advanced at this point, therefore the non-master threads will enter the spin-wait for the next frame to be read.

The master thread will use the new best match and update the global last known position of whatever is being searched for. Once this is done, the master thread can advance to the next frame.

The thread spin-waits (which you write) check for an exit flag while spinning. When the master thread determines it is time to exit, it sets the flag and then exits the parallel region.

Efficient frame processing would be double buffered.

Jim Dempsey


Jim_S_Intel
Employee

Am I parsing the text above correctly: with OpenMP, static schedule, and OMP_NUM_THREADS=1, you are seeing a 4x speedup over versions with either (a) no parallel keywords, or (b) Cilk keywords?

If the OpenMP version with a single thread is already 4 times faster than the Cilk version with a single thread (and also the non-parallel version, which by definition has a single thread), then I would focus first on figuring out why single-thread performance is not almost the same. If you are already 4x slower when running with a single thread (e.g., because one version is not vectorizing), then I wouldn't expect to do any better than 4x slower once you add multiple threads into the mix. My suspicion would be that, for some reason, the compiler is vectorizing the body of the loop iteration in the OpenMP case but not in the other cases. There are compiler flags that will produce reports describing which parts of the code were vectorized.
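For the Intel compiler, the flags in question are roughly these (the file name is hypothetical; this is the icc 15+ syntax, older versions used -vec-report=n instead):

```shell
# Compile with a per-loop optimization report restricted to vectorization;
# the report is written to a .optrpt file next to the object file.
icc -O2 -qopt-report=2 -qopt-report-phase=vec -c motion_estimate.c
grep -i "loop was vectorized" motion_estimate.optrpt
```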

Another thing you might try is creating a separate serial function for the body of the loop iteration (ideally compiled independently), and then comparing the OpenMP and Cilk parallel loops with calls to that function as the loop body. This should help isolate differences in compiler optimization of the loop body from the effects of runtime scheduling on 1 or more threads.
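A sketch of that isolation technique, with assumed names and a noinline attribute (GCC/ICC syntax) to keep the compiler from re-optimizing the body differently at each call site:

```c
#include <stdint.h>

/* Hypothetical sketch: one serial function holds the loop body, so the
   OpenMP and Cilk drivers (shown in the comment below) differ only in
   their loop keyword. Returns the best SAD for one row of candidates. */
__attribute__((noinline))
static int body_one_row(const uint8_t *orig, const uint8_t *ref,
                        int y, int left, int right, int width) {
    int best = 1 << 30;
    for (int x = left; x < right; ++x) {
        int sad = 0;
        for (int k = 0; k < 8; ++k)
            for (int j = 0; j < 8; ++j) {
                int d = orig[k * width + j] - ref[(y + k) * width + (x + j)];
                sad += d < 0 ? -d : d;
            }
        if (sad < best) best = sad;
    }
    return best;
}

/* OpenMP driver (compile with -qopenmp):
     #pragma omp parallel for
     for (int y = top; y < bottom; ++y)
         best[y - top] = body_one_row(orig, ref, y, left, right, width);

   Cilk driver:
     _Cilk_for (int y = top; y < bottom; ++y)
         best[y - top] = body_one_row(orig, ref, y, left, right, width);  */
```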

Also, as a minor nit, the loop iteration variables in the example above (y and mb_y) do not appear to be declared.   I assume that is a typo? 

Cheers,

Jim

Sigurhjs
Beginner

Good posts, Jim and Jim.

To start in chronological order with Jim Dempsey's reply: I might be misunderstanding your thoughts on the search pattern, and they might actually be spot on, but I am going to detail it a bit more just to be sure.

The search extends 16 steps in all directions; that is, we have a 32x32 grid around our block that we search. We step the search by 1, so in the end we do 32x32 = 1024 searches, where each search computes a SAD over an 8x8 block. So each 8x8 block in the array does 1024 searches.

As you stated, things like double-buffered frame processing are more efficient, but the goal of the project isn't to create a very efficient encoder. We also have no optimizations on the reading and writing done by the encoder, nor any kind of frame pipeline. The base encoder is never going to reach the point of being called efficient. Our main goal is to study the various effects of work scheduling.

I really like your thoughts on selecting where we should create parallel regions; there are some good points in there I can use. I've tried both 

 

And to reply to Jim Sukha: I am sorry, but I made a terrible mistake. I had included a vector optimization in the static run that I missed. When that is removed, we get the same effect as with dynamic and guided, which is about a 10% runtime increase compared to the original when we use just 1 thread.

I also excluded the declarations of y and mb_y (the functions have more code, and the declarations are therefore higher up) as I was trying to keep the sample as brief as possible.

I'll add the separate compilation of the loop body to the list of things we test; that is a very nice idea.
