Detecting a steal without holder (and why are hypermap lookups expensive?)

tfk · ‎11-13-2013

I would like to be able to detect when a worker has stolen work without the use of a holder. Ideally I would like to to add a "hook" into the runtime system so that a function I define is called upon a successful steal, or have access to a worker-local "successful steal" counter that is incremented each time a worker executes a successful steal. Are either of these things possible using documented or undocumented features of the cilk runtime system in gcc or icc?

A higher level question I have, in addition, is regarding the cost of hypermap lookups. Why are they so expensive? I have a benchmark using a single reducer for which __cilkrts_hyper_lookup takes ~35% of the time using cilk gcc 4.9 and ~25% of the time when using icc 13.1.1.

Cheers,

Tim

Jim_S_Intel · ‎11-13-2013

We don't have statistics collection enabled in the existing Cilk Plus runtime. But if you are willing to compile your own runtime, it would be straightforward to add the hooks you describe for your own experimentation purposes.

You might look at CILK_PROFILE flag in stats.h, and the "INTERVAL_STEAL_SUCCESS" event, which counts successful steals.
There is a "dump_stats" function that prints out statistics.
The file "local_state.h" defines some of the worker_local state, if you want to add a function pointer as a field in each worker.

I'm not sure what your microbenchmark for reducer lookups is, but I believe a reducer lookup requires access to thread-local storage to lookup the current worker as well as a hash table lookup. That can be expensive compared to an access to a normal variable. The last time I checked, the TLS-lookup was most of the expense? It would be nice if we could make it faster...

Cheers,

Jim

tfk · ‎11-14-2013

Thanks for the tip. Tweaking the runtime system was surprisingly painless. I didn't bother enabling stats, instead I just added an integral field to each worker's local state and increment it upon a successful steal. The change gives me ~1.5x speedup on my benchmark (both on the 12 core time and on the 1 core time) over my previous code that detected steals using a holder-like object. My holder may have done a bit more work than the canonical one, so milage may vary. One issue I observe is that the worker-local state is destroyed upon exiting a parallel region of the code. I'd prefer for this to not happen, but I could probably just store the counter elsewhere (or copy it when a worker is destroyed).

I'd bet that a simple caching scheme can improve the lookups for general reducers --- e.g. maintain 2 size R worker-local arrays, one for pointers to local views, and another indicating whether the view is 'valid' and only perform a hypermap lookup when the view is invalid. The invalid/valid state could be maintained efficiently using worker local steal counts.

Cheers,

Tim

Jim_S_Intel · ‎11-14-2013

I'm not sure what you mean by saying that "worker-local state" is destroyed upon exiting a parallel region. When a cilk_for loop completes, it may return on a different worker thread from the worker thread that the loop started on. (It would be more expensive to require it to return on the same worker thread). Does that explain the effect you are seeing?

"R" is the number of reducers that you have in the system? It would be interesting to experiment with different implementations of a reducer maps. I think there is probably still some room for optimizing the reducer map implementations.

If the majority of the overhead of the reducer lookup is the lookup of the worker from thread-local storage, then the scheme you describe wouldn't save that much however. Managing the worker local arrays might also be as interesting as creating a hash table. Also, reducers of different types and sizes get more interesting...

Cheers,

Jim

tfk · ‎11-14-2013

I'm not sure what you mean by saying that "worker-local state" is destroyed upon exiting a parallel region. When a cilk_for loop completes, it may return on a different worker thread from the worker thread that the loop started on. (It would be more expensive to require it to return on the same worker thread). Does that explain the effect you are seeing?

No I don't think that explains it. What I observe is that the pointer returned by __cilkrts_get_tls_worker() is sometimes null (seems to be when invoked from a region of code with no parent frame). I guessed that this was caused by a call to destroy_worker when returning from the initial frame, but I'm a bit unfamiliar with the terminology used in the runtime source comments so I may be misinterpreting something.

If the majority of the overhead of the reducer lookup is the lookup of the worker from thread-local storage, then the scheme you describe wouldn't save that much however.

Reducer lookup seems (from my experiments) to be a lot more expensive than looking up a worker's identifier, at least on gcc. The cost of looking up a worker's id/steal-count isn't negligible, but it is < 2% of the total runtime in my benchmark.

Managing the worker local arrays might also be as interesting as creating a hash table. Also, reducers of different types and sizes get more interesting...

Right, or a SPA table if you know the identifiers in advance (as with reducer arrays). The specific benchmark I'm working with is, in fact, working with reducer-like objects of "different types and sizes." We should chat offline next time you're around CSAIL.

Cheers,

Tim

Barry_T_Intel · ‎11-14-2013

I'm not sure what you mean by saying that "worker-local state" is destroyed upon exiting a parallel region. When a cilk_for loop completes, it may return on a different worker thread from the worker thread that the loop started on. (It would be more expensive to require it to return on the same worker thread). Does that explain the effect you are seeing?

No that doesn't explain it. What I observe is that the pointer returned by __cilkrts_get_tls_worker() is sometimes null (seems to be when invoked from a region of code with no parent frame).

When you leave the outermost parallel region, the worker is "unbound". That is, the worker is no longer associated with the thread. This way it can be associated with the next thread that enters a parallel region. The is usually the same thread, but if you're using other threading packages, it can be a different thread.

- Barry

tfk · ‎11-14-2013

I see. So it appears that unbinding a thread doesn't actually call destroy_worker unless it was an 'overflow worker.' Cool.

Tim