Hello,
I would like to understand run-time execution in Cilk a little better.
I have downloaded Intel Cilk run-time release (cilkplus-rtl-003365 - released 3-May-2013).
On 09/09/2013 I had asked a question seeking to figure out which is the last function executed before Cilk run-time ends assuming execution went without any problems.
Barry suggested looking at “__cilkrts_c_return_from_initial()” in scheduler.c, and indeed that was what I needed at the time.
I noticed that “__cilkrts_c_return_from_initial()” in scheduler.c invokes another function called “__cilkrts_unbind_thread()”, also in scheduler.c.
My question is the following. When I run ./fib 30, “__cilkrts_unbind_thread()” is invoked only once, at the end of execution, BUT when I run some other benchmarks (for example PBBS), “__cilkrts_unbind_thread()” seems to be invoked multiple times.
I’m missing some connection here. I would honestly appreciate any hint, advice, or recommendation.
My ultimate goal is to set a flag in shared memory upon run-time exit assuming execution of any application went without any problems.
Tx, Haris
PS. I want to sincerely thank you for patiently answering my questions since I have been on this forum.
OK, let's get into the bits!
Most of this information is spelled out in the Cilk ABI (Application Binary Interface) available at http://www.cilkplus.org/download#open-specification. We're going to be discussing structures and functions defined in include/internal/abi.h. Note that some of these functions are versioned. That is, instead of __cilkrts_enter_frame(), the compiler really calls __cilkrts_enter_frame_1() to indicate that the __cilkrts_stack_frame it's been passed includes the fields for ABI version 1. I'm going to ignore these details because it gets pedantic and I'm sure I'll forget to include the trailing number at least once. :o)
Let's start with some terms. In a Cilk application there are two kinds of threads: user worker threads and system worker threads (usually referred to as "user workers" and "system workers"). The major difference for this discussion is that system workers are created by the Cilk runtime, and user workers come from somewhere else. Every worker thread has a __cilkrts_worker structure associated with it using thread-local storage. A worker is "bound" when it's associated with a __cilkrts_worker structure. System workers are bound when they're created.
As part of its initialization, the Cilk runtime allocates an array of __cilkrts_worker structures. The first P-1 (where P is the total number of workers) are reserved for the system workers. The remaining __cilkrts_worker entries are available for user workers. By default we allocate an array with space for 3P __cilkrts_workers, but you can override that. We use a simple array because it's fast to index into when looking up a worker by index - for example, when we're selecting a worker to steal from.
A function containing a _Cilk_spawn is referred to as a "spawning function." Every spawning function allocates a __cilkrts_stack_frame structure on the stack. This structure must be initialized by a call to __cilkrts_enter_frame(), though some compilers (like icc/icl) inline it. As part of that initialization, the function will attempt to look up the __cilkrts_worker that has been assigned to the thread. If there isn't one, __cilkrts_bind_thread() is called to initialize the Cilk runtime (if that hasn't already happened) and allocate a __cilkrts_worker from the array. We then use thread-local storage to associate the __cilkrts_worker with the thread, "binding" the __cilkrts_worker to the thread. The function that binds the __cilkrts_worker to the thread has the CILK_FRAME_LAST bit set in the __cilkrts_stack_frame flags field, indicating that it's the topmost function.
As part of the epilog of a spawning function, __cilkrts_leave_frame() is called. If CILK_FRAME_LAST is set, then __cilkrts_leave_frame() will call __cilkrts_c_return_from_initial(). Among other things, it will break the association between the __cilkrts_worker and the thread by setting the TLS location to NULL. This is called "unbinding."
Now that you have all of that background, we can explain what's happening. If you call a spawning function in a loop, the __cilkrts_worker will be bound and unbound each time around the loop. In addition, the Cilk runtime suspends the system worker threads and then resumes them each time around the loop.
- Barry
Hi
In the Cilk ABI document it is mentioned that "By default the number of workers is the number of cores on the system". My processor is dual-core; I want to know how Cilk distributes worker threads across the cores and how sync works in that case. Is there a way to identify which core the spawned functions are executing on? I am using Pin to instrument my Cilk code and extract relevant information, and the per-core statistics are very much required for my project.
Thanking you in advance
The Cilk runtime does nothing to distribute the threads. It depends on the OS to execute each worker on an idle core and assumes that the OS will distribute them appropriately.
You can use OS-specific functions to determine which core your workers are currently working on, but they may migrate at any time.
- Barry
Hi Barry
Thank you for the answer.
I have implemented several parallel programs using all the Cilk constructs (cilk_for, cilk_spawn, cilk_sync) and have instrumented them using a Pintool. In all cases only 2 threads are detected, 0 and 1. If I don't explicitly create a thread, then apart from those two threads no other threads are created by the Cilk runtime. The manual says the number of threads for Cilk = number of cores. My processor is dual-core, so does this mean the scalability of Cilk programs will be limited to 2 (threads)?
Hi,
Yes, your programs will have a maximum speedup of 2 on your dual-core machine. I guess HT is turned off on your machine, since the runtime checks the number of logical processors when deciding how many workers to create. Anyhow, this doesn't necessarily mean the speedup of your programs is limited to 2 on other machines with 2+ logical processors, provided the programs have enough work for all 2+ workers.
To expand a bit on Hansang's comment, the number of cores used by a Cilk program defaults to the number of cores that the OS sees.
This default can be overridden for all Cilk programs by setting the CILK_NWORKERS environment variable to the number of workers you want. Usually you'll want to restrict the run to a subset of the available cores, but you can force the runtime to oversubscribe your system if you'd like. For example, to use 4 workers on your 2 core system, use
set CILK_NWORKERS=4 (on Windows)
or
export CILK_NWORKERS=4 (on a Unix using the Bash shell)
Finally, the number of workers used can be overridden on a per-program basis by calling __cilkrts_set_param("nworkers", "n"), where "n" is the number of workers as a string. This call must be made before the Cilk runtime is initialized. It's usually best to do this before calling any function containing a cilk_spawn or cilk_for.
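Putting that together, a per-program override might look like the sketch below. It assumes a Cilk Plus-enabled compiler (icc, or a gcc/clang build with Cilk support) and links against libcilkrts; the `do_parallel_work` function is a hypothetical placeholder for your own spawning code.

```c
#include <cilk/cilk_api.h>   /* __cilkrts_set_param, __cilkrts_get_nworkers */
#include <stdio.h>

int main(void) {
    /* Must run before the runtime initializes, i.e. before the
       first cilk_spawn or cilk_for executes. */
    if (__cilkrts_set_param("nworkers", "4") != __CILKRTS_SET_PARAM_SUCCESS) {
        fprintf(stderr, "could not set worker count\n");
        return 1;
    }
    printf("workers: %d\n", __cilkrts_get_nworkers());

    /* do_parallel_work();   hypothetical function containing the
                             cilk_spawn / cilk_for work */
    return 0;
}
```

Note that CILK_NWORKERS, if set in the environment, is read at runtime startup as well; check cilk_api.h and globals_state.cpp (mentioned above) for the exact precedence.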
This is all documented in cilk_api.h. You can see all the gory details in globals_state.cpp.
- Barry
CILK_NWORKERS=3 seems to give the best performance on a dual-core with HT under Windows 8.1. I guess it maximizes the time with at least 1 worker per core.
I don't see consistent gains beyond CILK_NWORKERS=118 on a 61-core MIC, even though I never found out how to spread workers evenly among the cores.
Thank You All very much
Will check and try the same