
Darin_O_ · ‎07-29-2013

I am seeing strange program states in gdb: the backtrace I would have expected for my main thread (thread 1 by gdb's numbering) shows up on another thread, and the backtrace for thread 1 appears to be that of a cilk thread.

For example:

    (gdb) thread 1
    [Switching to thread 1 (Thread 0x2aaaaaafd900 (LWP 26329))]
    #0 0x00002aaaad28c1d7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
    81    ../sysdeps/unix/syscall-template.S: No such file or directory.
    (gdb) where
    #0 0x00002aaaad28c1d7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
    #1 0x00002aaaaf4ca944 in __cilkrts_scheduler ()
       from /home/dohashi/sandbox/main/4/internal/bin.X86_64_LINUX/debug/libcilkrts.so.5
    #2 0x00002aaaaf4c5c80 in worker_user_scheduler ()
       from /home/dohashi/sandbox/main/4/internal/bin.X86_64_LINUX/debug/libcilkrts.so.5
    #3 0x00002aaaaf4c5baf in __cilkrts_sysdep_import_user_thread ()
       from /home/dohashi/sandbox/main/4/internal/bin.X86_64_LINUX/debug/libcilkrts.so.5
    #4 0x00002aaabe728a68 in ?? ()
    #5 0x00002aaaaab00020 in ?? ()
    #6 0x00002aaaaab00020 in ?? ()
    #7 0x0000000000000001 in ?? ()
    #8 0x00007ffffffc7530 in ?? ()
    #9 0x00000001fffc7590 in ?? ()
    #10 0x00002aaaab8332b2 in atTrySearchOrInsertInArray (t=0x5600000003, table=0x2aaab288e828,
        dag=0x2aaab2b08ae8, hash=252201579132749424)
        at /home/dohashi/sandbox/main/4/internal/src/atomicTable.c:344
    Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Whereas for thread 6 (in this case; I've seen it pop up on other threads):

    (gdb) thread 6
    [Switching to thread 6 (Thread 0x2aaab33fb700 (LWP 26335))]
    #0 0x00002aaaad1f8037 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
    56    in ../nptl/sysdeps/unix/sysv/linux/raise.c
    (gdb) where
    #0 0x00002aaaad1f8037 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
    #1 0x00002aaaad1fb698 in __GI_abort () at abort.c:90
    #2 0x00002aaaab867688 in FailAssertion (
        fileName=0x2aaaab9f83d0 "/home/dohashi/sandbox/main/4/internal/src/eval.c", sourceLine=1567,
        assertion=0x2aaaab9f8bf4 "I( s[1] ) < LENGTH( Actvlocals )")

    .

    .

    .

    #169 0x0000000000405630 in main (argc=4, argv=0x7fffffffc858)
        at /home/dohashi/sandbox/main/4/internal/oem/main.c:389
    #170 0x00002aaaad1e2ea5 in __libc_start_main (main=0x4047c4 <main>, argc=4, ubp_av=0x7fffffffc858,
        init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffc848)
        at libc-start.c:260
    #171 0x00000000004020b9 in _start ()

Is gdb getting confused or is cilk doing something I'm not expecting?

Thanks

Darin

Barry_T_Intel · ‎07-29-2013

A thread is always a system thread or a user thread. It can't switch mid-stream.

Intel Cilk Plus uses "stack switching". That is, when a steal occurs, the runtime will allocate a stack (a fiber on Windows), modify the saved jump buffer to reference the new stack, and execute a longjmp to move execution of the continuation onto the new stack. Note that we don't modify the frame pointer (EBP or RBP), so you're in a weird state, with the frame pointer (and all your local variables for the spawning function) on the old stack. This can confuse debuggers, which expect simple, linear stacks. I know that the Windows debugger will not traverse beyond the bounds of the current stack. I believe there is a way to force GDB to ignore the stack bounds, but I don't recall it at the moment.

The Cilk runtime guarantees that only one worker can be executing on a stack at any given instant. Of course, if you pass the address of a stack variable into a spawned function, you may have created a race on that variable. But that's what race detectors are for.

After a cilk_sync, you're guaranteed to be on the same stack that was being used when the function was entered.

There used to be a limitation that user threads could not steal. This was removed with the release of Composer 12.1.

- Barry

Darin_O_ · ‎07-29-2013

In my case, this appears to be happening after a cilk_sync. The backtrace on thread 6 (which appears to be the main thread) does not show the functions that use cilk. All the other threads appear to be idle. I added the following assertion in my code, and it is failing:

    if ( l > 1 )
    {
        pthread_t TID = pthread_self(), newTID;
        for ( i = 1; i < l; i += 2 ) {
            cilk_spawn simplTask( s, i, posteval );
        }

        cilk_sync;
        newTID = pthread_self();
        ASSERTION( TID == newTID );
    }

Could my code in simplTask be doing something to mess this up?

Thanks

Darin

Jim_S_Intel · ‎07-29-2013

In general, in a spawning function (a function that has a cilk_spawn), there is no guarantee that execution after a cilk_sync will return on the same pthread as the pthread that started the spawning function.   As Barry mentioned, the stack is guaranteed to be the same, but the pthread might be different.   That could explain why you are failing the assert.

It turns out, for a variety of technical reasons, that requiring every function to return on the same thread that it started on imposes unnecessary restrictions (and thus overheads, either in time or space) on the scheduler.   Thus, this is a deliberate design choice in Cilk Plus.

The only exception to this situation is if your spawns were at "top-level," e.g., in the outermost spawning function.   In that case, the top-level spawning function will start and finish on the same user thread.

Cheers,

Jim

jimdempseyatthecove · ‎07-30-2013

>>there is no guarantee that execution after a cilk_sync will return on the same pthread as the pthread that started the spawning function.

Then Cilk Plus excludes the use of thread local storage. I notice the Intel Cilk Plus document cautions about critical sections (Windows) where the releasing thread must be the locking thread, but I see no mention/caution about using thread local storage.

This also means inefficiencies relating to scalable allocators (these perform better when the deallocating thread is the same thread as the allocating thread).

Jim Dempsey

Darin_O_ · ‎07-30-2013

Thanks. Actually, I realized that the assertion was failing when run within cilk, as opposed to only being checked at the top level, which is what I intended. When I fixed it to make that work, the assertion stopped failing. Also, I figured out what was causing my original problem: there is some error handling code that was longjmping out of the cilk functions. I wanted to post a reply to my reply last night, but my first reply was stuck in moderation until after I left work.

Does cilk have any support for error handling? That is, if an error occurs during the parallel execution and I want to terminate the current run and push some sort of error up to the top level?

Thanks

Darin

Barry_T_Intel · ‎07-30-2013

We've always cautioned against using Thread Local Storage. You might want to look at "holders" which are part of the Cilk reducer library. They provide "worker local" storage.

Scalable memory allocators such as TBBmalloc will work with Cilk Plus. The thread-local cache of memory works just fine, as long as the allocator handles returning memory to the proper thread.

What sort of error handling are you looking for? You can always return from a spawned function with an error code. And Cilk Plus supports throwing C++ exceptions up the stack. There's no built-in way to abort sibling strands, but you're free to implement your own "all_is_well" flag, have your code check it occasionally, and abort if necessary. We've discussed having a mechanism to abort sibling threads and decided that left too many openings for leaked resources.
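The "all_is_well" flag idea can be sketched roughly as follows. This is only an illustration, not something provided by the Cilk Plus runtime: `process_item` and `process_all` are made-up names, a negative input stands in for "an error occurred", and the `__cilk` macro test is an assumption about how the compiler signals that Cilk Plus is enabled. Without a Cilk compiler the keywords are defined away, so the code runs as its serial elision.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Serial elision: without a Cilk Plus compiler the keywords expand to nothing.
   Assumption: the compiler defines __cilk when Cilk Plus is enabled. */
#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

/* Hypothetical shared flag: true while no strand has reported an error. */
static atomic_bool all_is_well = true;

/* Hypothetical task: handles items[i]; a negative value counts as an error. */
static void process_item(const int *items, int i, atomic_int *processed)
{
    if (!atomic_load(&all_is_well))
        return;                              /* an error is pending: skip the work */
    if (items[i] < 0) {
        atomic_store(&all_is_well, false);   /* tell the sibling strands to stop */
        return;
    }
    atomic_fetch_add(processed, 1);          /* stands in for the real work */
}

/* Returns 0 on success, -1 if any strand reported an error.
   *processed_out receives the number of items that were fully handled. */
int process_all(const int *items, int n, int *processed_out)
{
    assert(n >= 0);
    atomic_int processed = 0;
    atomic_store(&all_is_well, true);
    for (int i = 0; i < n; i++)
        cilk_spawn process_item(items, i, &processed);
    cilk_sync;
    *processed_out = atomic_load(&processed);
    return atomic_load(&all_is_well) ? 0 : -1;
}
```

Strands that have already started are not interrupted; they only notice the flag the next time they check it, so long-running tasks would need to poll it inside their own loops. When run in parallel, the number of items handled before the error is noticed can vary from run to run.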

- Barry

Darin_O_ · ‎07-30-2013

I guess what I'd like is a system that had a __cilkrts_error function (or something like that) that I could call and pass in some arbitrary data, void * or something like that. It would flag the nodes in the cilk spawn tree such that no new strands would start executing. Strands that are already executing would be allowed to finish (hopefully they are relatively short), and the spawn tree would collapse up to the root. After the top level cilk_sync, there would be a __cilkrts_was_error_raised function that I could call. It would say if an error was raised, and if it was, allow me access to the data passed in to the error function. Of course there is a difficulty here differentiating between the top level and recursive cases. Also the error function should only affect the cilk spawn tree that the error call was in, and not a parallel top level call into the cilk runtime from a separate user thread.

If the error function is called by multiple strands, then whichever strand called it first would take effect. This probably means that the error function should return true or false to let the strand know if its error data was accepted or not, in case the error data needs to be cleaned up if it is not accepted.

I can see what you mean about leaked resources though, if a strand returns dynamically allocated memory it will get lost in my scheme. Of course, using the error function would be optional, so you could always just document that using the error function could lead to these sorts of situations, but that's pretty unfriendly, and of course, people tend not to read documentation anyway.

I'll have to experiment with rolling my own error handling, I'm just concerned that it will introduce more overhead to my code.
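A roll-your-own version of the "first error wins" part of this scheme might look something like the sketch below. None of these names (`error_slot`, `error_slot_raise`, and so on) exist in the Cilk Plus runtime; they are invented for illustration, and the sketch assumes C11 atomics are available and that error data is never NULL. It does not try to stop new strands from starting; strands would have to poll `error_slot_is_raised` themselves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error slot, one per top-level parallel region. */
typedef struct {
    _Atomic(void *) data;   /* NULL while no error has been raised */
} error_slot;

void error_slot_init(error_slot *slot)
{
    atomic_init(&slot->data, NULL);
}

/* Try to record an error. Returns true if this call was the first to raise
   one (the slot now holds data). Returns false if another strand got there
   first; the caller still owns data and should clean it up. */
bool error_slot_raise(error_slot *slot, void *data)
{
    assert(data != NULL);   /* NULL is reserved for "no error" */
    void *expected = NULL;
    return atomic_compare_exchange_strong(&slot->data, &expected, data);
}

/* Cheap check for strands to poll before starting new work. */
bool error_slot_is_raised(error_slot *slot)
{
    return atomic_load(&slot->data) != NULL;
}

/* After the top-level sync: returns the accepted error data, or NULL. */
void *error_slot_get(error_slot *slot)
{
    return atomic_load(&slot->data);
}
```

Passing a separate slot to each top-level call keeps independent spawn trees from seeing each other's errors. Note that "first" here means first in time, so which error is accepted can differ between runs.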

Thanks

Darin

Barry_T_Intel · ‎07-30-2013

Why reinvent it? We support C++ error handling. You can throw an error from a strand, and it will be caught by the catch clause of a try block, just like in regular C++. If multiple strands throw an error, the left-most (earliest if you were executing sequentially) is the one that is delivered.

The only thing we don't give you is a way to cancel any new spawned functions. What should they return to indicate that they've been cancelled by a pending error?

It might be possible for us to look up the spawn tree and see if any errors are pending, but there's currently no function to do that.

- Barry

Nick_M_3 · ‎07-30-2013

Interesting. I sent a message to the Email for comments, and would have made it stronger if I had seen some of these responses first. In the light of C++11, the CilkPlus specification needs to say that using C++ (or POSIX) threading facilities is Asking For Trouble. Even using the atomics is problematic, because of the memory ordering rules, and using locking is out of sight; thread-local storage is in between.

A general comment to the OP here is that, the more that you make the threading explicit, the less scalable and portable you make your program. This applies for OpenMP, too. The ideal is that you can develop it with (say) 8 threads, and merely recompile for totally different systems with 1 or 512 threads, and it would get good efficiency on all of them.

Barry_T_Intel · ‎07-30-2013

Using locks or atomic instructions is sometimes required to avoid races. But you need to understand that you've created a region that must be sequential and may slow down your code if there's a lot of contention for the region. This is true regardless of what type of threads you're using.

Cilk Plus provides reducers which can avoid this performance problem, since each parallel strand gets its own view of the reducer which it can update without using locks. One of the very cool things about reducers is that if your reducer implements an associative operation and you code it correctly, you maintain the ordering of the operations. For example, if you use a list reducer, you end up with the list in the same order as if the program ran sequentially. See the Reducers section of the Cilk Plus tutorial for the code: http://cilkplus.org/tutorial-cilk-plus-reducers . This makes testing easy, since the result is always the same, regardless of how many workers you use.
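The ordering property can be illustrated in plain C without the reducer library. The sketch below is not the Cilk reducer API; `view`, `view_reduce`, and `build_letters` are hypothetical names. Each half of the index range appends to its own private view, and the views are then merged left to right with an associative operation (concatenation), so the result does not depend on where the range was split.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VIEW_CAP 64

/* Hypothetical "view": a private buffer that one strand appends to without locks. */
typedef struct {
    char   text[VIEW_CAP];
    size_t len;
} view;

void view_append(view *v, char c)
{
    if (v->len + 1 < VIEW_CAP) {          /* keep room for the terminator */
        v->text[v->len++] = c;
        v->text[v->len] = '\0';
    }
}

/* Associative reduce step: left = left followed by right. */
void view_reduce(view *left, const view *right)
{
    for (size_t i = 0; i < right->len; i++)
        view_append(left, right->text[i]);
}

/* Append one letter per index in [lo, hi). */
void fill_letters(view *v, int lo, int hi)
{
    for (int i = lo; i < hi; i++)
        view_append(v, (char)('a' + i % 26));
}

/* Split [0, n) at `split`, give each half its own view, then merge in
   left-to-right order. The two fills could run as separate strands. */
void build_letters(char *out, size_t out_cap, int n, int split)
{
    assert(0 <= split && split <= n);
    view left  = { "", 0 };
    view right = { "", 0 };
    fill_letters(&left, 0, split);
    fill_letters(&right, split, n);
    view_reduce(&left, &right);
    if (out_cap > 0) {
        strncpy(out, left.text, out_cap - 1);
        out[out_cap - 1] = '\0';
    }
}
```

Because concatenation is associative, merging the views in left-to-right order gives the same string as a single sequential pass, whichever split point is chosen.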

- Barry

Nick_M_3 · ‎07-30-2013

And you need to ensure that you aren't going to create a deadlock. I agree that reductions are the right way to avoid most uses of atomics and sometimes locking.

system worker threads becoming user threads?