topic I assume that "process the in Software Archive

_cilkrts_hyperobject_dealloc is expensive

erling_andersen — Wed, 28 Jan 2015 13:08:46 GMT

Intel Vtune shows that _cilkrts_hyperobject_dealloc is a very expensive operation. Is it because I spawning too much?

I have fairly deep recursion so the stack grows quite large. Maybe that is the issue.

Spawning doesn't cause

Pablo_H_Intel — Thu, 29 Jan 2015 19:17:43 GMT

Spawning doesn't cause excessive calls to _cilkrts_hyperobject_dealloc; stealing does. Excessive stealing occurs when you spawn units of work that are too small, especially if the spawns occur one-at-a-time rather than in a parallel divide-and-conquer pattern. I would need to understand the pattern of your spawns to see if that could be the issue. The other thing that causes _cilkrts_hyperobject_dealloc to be called frequently is the use of a large number of reducers, such as, perhaps, an array of reducers. It is usually better to use other mechanisms, such as a (custom) reducer of array rather than an array of reducer. Finally, there may be nothing wrong with your code. You might simply have run into one of those areas of the Cilk runtime that have not been well optimized. __cilkrts_hyperobject_dealloc is simply a call to free(). Your pattern might be one where we could stand to improve performance of the runtime library by using a custom allocation scheme for hyperobjects.

If there is any chance that you could share a santitized version of your code (keep it small and no proprietary information, please), I might be able to help you further.

Regards,
Pablo

FYI I am new to Cilk and

erling_andersen — Fri, 30 Jan 2015 08:01:08 GMT

FYI I am new to Cilk and learning how it works. Initially I just tried to create many small tasks so there was a lot to parallelism. It seems I overdid it.

It is very simple to describe what I am doing. I processing a tree starting from the root recursively as follows.

processnode(root of tree)

where

func processnode(x)

begin

for y in child of x

cilk_spawn processnode(y)

cilk_sync

do some work x usually small towards the bottom

end

In fact what I did was to stop recursion at some point so instead of processing nodes I process whole subtrees. That reduced the time of allocation but it is was still significant. Ideas along this line is obvious.

You already helped me quite a bit. If needed I think I can create a simple example that simulates what I do you could have that does not relveal anything significant. But now let me first work more with cilk.

Btw I have no reducers. I still have not figured out whether they are useful for me and how I can use them from pure C code. All your examples seems C++ based.

Btw my company MOSEK (mosek.com) does mathematical optimization and we are looking into whether cilk can be used to improve the parallelization of our code i.e. better scalability. We have used Openmp earlier and now native threads but cilk seems nicer.

Btw now I will try the

erling_andersen — Fri, 30 Jan 2015 11:31:41 GMT

Btw now I will try the reverse e.g. process the tree starting at the leaf nodes. This has the drawbacks that I have to keep track of whether all the children of a node has been processed because then I can start processing the node. Also I have to have stack of nodes ready to process. This means I have to introduce locks when working on that info. Now locks does not seem to be a part of cilk which is a pain because Linux and Windows do not provide a unified lock.

Well, we have made a wrapper for locks but it is pain anyway.

I assume that "process the

Bradley_K_ — Fri, 30 Jan 2015 12:42:09 GMT

I assume that "process the tree starting at the leaf nodes" means that you plan to implement a queue or something of things to work on, and then implement your own scheduler. I recommend against that path for the reasons you have mentioned (you would have to introduce locks), but also because your scheduler is unlikely to be as good as the Cilk scheduler.

The problem sounds like you may not have much parallelism: (you have indicated that your tree is very deep). Have you run cilkview to find out what is the parallelism?

-Bradley

Jim Sukha wrote an excellent

Barry_T_Intel — Fri, 30 Jan 2015 15:16:52 GMT

Jim Sukha wrote an excellent article on how to determine why your program isn't speeding up: Why is Cilk Plus not speeding up my program? I recommend it highly.

- Barry

I reorganized the

erling_andersen — Mon, 02 Feb 2015 13:47:41 GMT

I reorganized the computations but still get things like:

Top hotspots with 4 threads

func@0x78ea15fa 0.523s

_cilkrts_hyperobject_dealloc 0.337s

Top Hotspots with 2 threads
Function CPU Time
run 0.265s
do_something 0.223s
func@0x78ea15fa 0.172s

in Vtune. I have no idea what the anonymous function with @ is. Since I do not use reducers at all then it is excessive stealing that must be the issue. Note dealloc does not feature for 2 workers but does for 4 workers.

However, I have cut down the number of spawns a lot e.g. something like 2*NWORKERS spawns. Then I hoped I would not see the dealloc function but I do for NWORKERS>1 and particular for 4. It is a quad core CPU.

My conclusion is load balancing is the issue. Sounds reasonable?

My original hope with cilk was that load balancing issues should be less severe but maybe that is not the case.

To Pablo:

erling_andersen — Tue, 03 Feb 2015 19:51:10 GMT

To Pablo:

I am close to try something different than cilk because even if I reduce the number spawns to to very little, t Then Vtune tells me that some anonymous function and the dealloc function mentioned in my original post uses a lot of time. Maybe that is nothing to worry about and an artificate of how Cilk or Vtune works but it makes me fell that there is an inefficiency somewhere that I cannot get at. In other words cilk has not provided much benefit over our existing native threads based code.

I can provide an *.exe linked to cilk and some example data if you want to profile it.

Thanks for the help.

Hi,

Jim_S_Intel — Tue, 03 Feb 2015 21:06:34 GMT

Hi,

A few questions/comments based on my quick reading of the post. Something seems a little bit fishy with the report from VTune + Cilk Plus; if your program does not use reducers, then I'm not sure why it would spend a lot of time in __cilkrts_hyperobject_dealloc. I am wondering maybe if the symbols in the Cilk Plus runtime aren't being read properly, and the time is showing up in the wrong functions?

Some other wild guesses on my part:

1. Do you know how deep the nesting of spawned functions is? In the current implementation, there is a limit of ~ 1000 for nested of cilk_spawn nodes. It might be possible that if you are nesting deeper than that, or perhaps if you are overflowing the 1MB stacks that Cilk Plus uses, then some data structure could be corrupted / something unexpected could be happening? But I would have expected the program to crash if something is being overflowed, and it sounds like you changed that already...

2. Do you know which model CPU you are running on --- is it 4 full cores, or 2 cores with SMT/hyperthreading turned on? Running Cilk Plus worker threads on the SMT threads doesn't always work well. In that case, I've seen behaviors where the worker threads which are idle will interfere with the other workers.

Having a binary to test and profile seems like it would helpful for tracking down potential runtime issues. Are you compiling / running on Windows, (or Linux)?

Cheers,

Jim

I agree that the Vtune

erling_andersen — Wed, 04 Feb 2015 09:07:01 GMT

I agree that the Vtune results are fishy and not trust worthy. You conclusion about Vtune and symbols is likely to be case I would say but it makes profiling hard. Do you have any suggestion for "fixing" this i.e. getting reliable Vtune results? Are there any information about using Vtune on cilk applications?

I run on Windows and has disabled the hyperthreading and because my code is very floating point intensive. Actually it does a lot of calls to seq. MKL. THe CPU is a E3-1270 v2 which to my understanding has 4 cores.

I could potentially do a very deep nesting for certain large problems but my current test examples does not have it. But is really NICE to know that could be issue that is better avoided.

I have not tried on Linux that but the code is build to work on Linux.

I have wished for a way to

TimP — Wed, 04 Feb 2015 12:37:21 GMT

I have wished for a way to restrict workers to 1 per core, particularly on the recent platforms which lack a bios set-up option.

The PDB that is shipped with

Barry_T_Intel — Wed, 04 Feb 2015 13:58:07 GMT

The PDB that is shipped with cilkrts20.dll is stripped - it only has exported symbols. This usually allows you some clue about why you're spending time in the Cilk runtime, but it can get confused. If you're seeing a large offset from the start of the routine, the symbol resolution is almost always confused.

We'll need to get the program in-house to figure out where you're spending your time.

- Barry

Thanks. Let me work some more

erling_andersen — Thu, 05 Feb 2015 07:43:46 GMT

Thanks. Let me work some more and have vacation next week before I give you a binary if needed.