Introducing a fiber-based UMS

George_Kotorlis · ‎01-09-2009

Hello,

during my private research on scheduling methods for multicore processors I've implemented a user mode scheduler based on fibers. What's special about this UMS is, it provides synchronization primitives for fibers like CriticalSection, Event or Semaphore. This seems to be something nobody has tried before...

You can find more on my web site at http://www.fiberpool.de/en/index.html (sorry for my poor English)

The idea behind is to parallelize serial code with linear or cyclic data dependencies.

Example (pseudocode):
Stack stack(5);
Producer p;
Concumer c;
p.Produce(stack, 1000);
c.Consume(stack, 1000);
SyncPoint;
DoSomethingElse();

Having only one processor the scheduler will start executing p.Produce(). When the stack is full, p.Produce() will block. The scheduler starts executing c.Consume(). When the stack is empty, the scheduler resumes p.Produce() because of the sync point. Without the sync point, the scheduler would execute DoSomethingElse().

This codecan be scaled now if two processors or cores are available.

Having more than two processors and by removing the sync point(if possible) this code would scale even better!

The whole magic resides inside the synchronization primitive (here: the Stack object) which communicates with the scheduler.

Since I have worked on this for some months all on my own, I'd like to get some feedback if this may be the right way to go.

Thanks and regards
George

Dmitry_Vyukov · ‎01-16-2009

Do I get it right that you are creating fiber per task? And then instead of blocking thread (on critical section) you are just calling scheduler, blocking current fiber and scheduling another runnable fiber?

Well, such scheme can suite perfectly, for example, for server software. Although the problem here is that every (or at least running) task now requires dedicated stack, i.e. at least for example 64Kb instead of for example 64b per task with traditional approach (such as TBB). I.e. 1000-fold degradation in memory consumption.

How are you going to cope with this?

The only system that gracefully copes with the problem of memory consumption by fiber stacks, I am aware of, is Capriccio. It really allows one to serve 100'000 connections with "fiber per connection" approach. And if you will dedicating 64Kb stack per fiber, well, you need 6,4GB of memory just for stacks... Definitely trade-off decision...

However UMS based on fibers gracefully solves the problem of system-induced deadlocks.
Here is the solution I was proposing for TBB:
http://software.intel.com/en-us/forums/showpost.php?p=72514
It does basically the same: when task must block on a mutex, it calls scheduler and starts executing another runnable task instead. However with tasks (as opposed to fibers) such solution is amenable to system-induced deadlocks (which is not good indeed).

George_Kotorlis · ‎01-16-2009

As you said,a blocking task suspends the running fiber. The sync object switches to the scheduling fiber then which selects the next task to run on a fiber. This can be a new task or a ready-fiber (resumed task).

The task does not need a stack as long as it's not scheduled for execution so it's initial size is about 32 Bytes. Usually it is scheduled to run on an existing fiber which already has a stack. New fibers (and new stacks) have to be created only if all existing fibers are suspended (blocked tasks).

Coming to your example you would need 100,000 fibers only if 99,999 would block at the same time. But this would rather be a design problem then. Since you cannot run that number of tasks in parallel on a limited number of processors/cores anyway my recommendation would be to replace the flat model which consumes too much memory with a multi-stage execution model (which is possible with fibers) where memory consumptionis much lower.

About your solution and TBB in general: Does switching to work stealing mode mean that the stack of the blocking task is (re)used for executing another task? If yes, the blocking task can be resumed only if all tasks on its stack have terminated. Is this right?

Dmitry_Vyukov · ‎01-16-2009

Quoting - georgetm

As you said,a blocking task suspends the running fiber. The sync object switches to the scheduling fiber then which selects the next task to run on a fiber. This can be a new task or a ready-fiber (resumed task).

And how many fiber switches there is between task executions? One, right?
Preference is given to the resumed tasks?

George_Kotorlis · ‎01-16-2009

Quoting - Dmitriy Vyukov

And how many fiber switches there is between task executions? One, right?
Preference is given to the resumed tasks?

There's a switch from the blocking fiber to the scheduling fiber. The scheduling fiber has to notify the task object that its fiber has really suspended. After that, the scheduler first tries to resume a task before it tries to execute a new task. Resuming/executing a new task is another switch.

Dmitry_Vyukov · ‎01-16-2009

Quoting - georgetm

The task does not need a stack as long as it's not scheduled for execution so it's initial size is about 32 Bytes. Usually it is scheduled to run on an existing fiber which already has a stack. New fibers (and new stacks) have to be created only if all existing fibers are suspended (blocked tasks).

Coming to your example you would need 100,000 fibers only if 99,999 would block at the same time. But this would rather be a design problem then. Since you cannot run that number of tasks in parallel on a limited number of processors/cores anyway my recommendation would be to replace the flat model which consumes too much memory with a multi-stage execution model (which is possible with fibers) where memory consumptionis much lower.

No, it's not design problem, it's just different usage pattern. If I have 100'000 TCP connections to my server, then it's natural to create 100'000 "threads per connection", if threading library supports that of course. For example, with Erlang I can do that. And I can simulate that with TBB (plain tasks w/o own stack) + continuations. And I can do that with Capriccio.

Dmitry_Vyukov · ‎01-16-2009

Quoting - georgetm

About your solution and TBB in general: Does switching to work stealing mode mean that the stack of the blocking task is (re)used for executing another task? If yes, the blocking task can be resumed only if all tasks on its stack have terminated. Is this right?

Yes, stack of the task is reused. Yes, blocking task can be resumed only after all tasks on it's stack are terminated.
Yes, this is right. And yes, this can lead to system-induced deadlocks and suboptimal resource usage.

Dmitry_Vyukov · ‎01-16-2009

Quoting - georgetm

There's a switch from the blocking fiber to the scheduling fiber. The scheduling fiber has to notify the task object that its fiber has really suspended. After that, the scheduler first tries to resume a task before it tries to execute a new task. Resuming/executing a new task is another switch.

Usually it's possible to accomplish task switching with only ONE fiber switch. Usually it's possible to do something like:
1. When task is going to block, mark it as blocked
2. Execute scheduler code directly on current fiber
3. Scheduler will pick up next task to run, create/get from pool fiber for that task (if it's new task)
4. Switch directly to that fiber

I.e. scheduler really does NOT need dedicated fiber to run:

[cpp]void mutex_lock(mutex_t* m)
{
    while (false == m->try_lock())
        call_scheduler();
}

void call_scheduler()
{
    scheduler_t* s = get_current_thread_scheduler();
    fiber_t* f = s->pick_up_next_fiber();
    f->switch_to();
}


[/cpp]

When I was implementing fiber scheduler for Relacy Race Detector:
http://groups.google.com/group/relacy

initially I've also implemented "two-phase" switching (with dedicated fiber for scheduler), but then I've switched to "direct" switching (scheduler parasitizes on user fibers) and noticed some 30% performance improvement, because number of fiber switches was effectively halved.

George_Kotorlis · ‎01-16-2009

Quoting - Dmitriy Vyukov

Usually it's possible to accomplish task switching with only ONE fiber switch. Usually it's possible to do something like:
1. When task is going to block, mark it as blocked
2. Execute scheduler code directly on current fiber
3. Scheduler will pick up next task to run, create/get from pool fiber for that task (if it's new task)
4. Switch directly to that fiber

I understand but this will not work in a multithreaded environment: After you mark a task as blocked another task running on another thread may release the mutex and unblock the task. The next thing to do is to resume the task if possible, but you don't have the information of the blocking task's fiber if it's still running or suspended.

That's why I switch to the scheduling fiber first: After the switch I'm sure the task's fiber is suspended and I can tell it to the sync object. After getting the notification the sync object can immediately resume the task if meanwhile its state has changed.

Dmitry_Vyukov · ‎01-31-2009

Quoting - georgetm

I understand but this will not work in a multithreaded environment: After you mark a task as blocked another task running on another thread may release the mutex and unblock the task. The next thing to do is to resume the task if possible, but you don't have the information of the blocking task's fiber if it's still running or suspended.

That's why I switch to the scheduling fiber first: After the switch I'm sure the task's fiber is suspended and I can tell it to the sync object. After getting the notification the sync object can immediately resume the task if meanwhile its state has changed.

I still don't see why it's impossible to implement task switch with only 1 fiber switch...
It seems that you are mixing up two things - fiber state (that of little interest) and task state (which is completely your abstraction and has nothing to do with current fiber). You can manipulate (check/update) task state from any fiber (scheduler's fiber or current task's fiber).
There is a race between mutex acquisition and mutex release. You just have to provide correct synchronization (mutual exclusion) between those activities. If order of events is (1) mutex release -> (2) successful mutex acquisition, then task continues to execute. If order of events is (1) failed mutex acquisition -> (2) mutex release, then task is descheduled on (1) and putted back to runnable queue on (2). How current fiber (whether it's task's fiber or scheduler's fiber) affects this?

Dmitry_Vyukov · ‎01-31-2009

Quoting - Dmitriy Vyukov

I still don't see why it's impossible to implement task switch with only 1 fiber switch...
It seems that you are mixing up two things - fiber state (that of little interest) and task state (which is completely your abstraction and has nothing to do with current fiber). You can manipulate (check/update) task state from any fiber (scheduler's fiber or current task's fiber).
There is a race between mutex acquisition and mutex release. You just have to provide correct synchronization (mutual exclusion) between those activities. If order of events is (1) mutex release -> (2) successful mutex acquisition, then task continues to execute. If order of events is (1) failed mutex acquisition -> (2) mutex release, then task is descheduled on (1) and putted back to runnable queue on (2). How current fiber (whether it's task's fiber or scheduler's fiber) affects this?

Btw, here are some papers on fiber-based UMS:
http://www.hpclab.ceid.upatras.gr/home.php?action=publications_details&language=2
For example see "A Time and Memory Efficient Implementation of the Nano-Threads Programming Model", authors described how they implement switching between tasks with 1 fiber switch (and even with direct stack reuse, if current task is already finished).

And here is similar recently announced library (fiber-based UMS):
http://groups.google.com/group/asynclib?hl=en

George_Kotorlis · ‎01-31-2009

Quoting - Dmitriy Vyukov

I still don't see why it's impossible to implement task switch with only 1 fiber switch...
It seems that you are mixing up two things - fiber state (that of little interest) and task state (which is completely your abstraction and has nothing to do with current fiber). You can manipulate (check/update) task state from any fiber (scheduler's fiber or current task's fiber).
There is a race between mutex acquisition and mutex release. You just have to provide correct synchronization (mutual exclusion) between those activities. If order of events is (1) mutex release -> (2) successful mutex acquisition, then task continues to execute. If order of events is (1) failed mutex acquisition -> (2) mutex release, then task is descheduled on (1) and putted back to runnable queue on (2). How current fiber (whether it's task's fiber or scheduler's fiber) affects this?

The problem is that descheduling needs two actions:
1. Tell (the mutex) that the fiber is going to suspend
2. Switch to another fiber

If a thread context switch occurs between 1 and 2 the mutex block queue will already contain the fiber. If the mutex is released at this moment from another thread it will put the fiber back to the runnable queue although the fiber is still running and the fiber state is not saved.

But I think there is another way to make a task switch with only one fiber switch:
A fiber switch is saving the current fiber's state and restoring the next fiber's state. Between these two actions it's possible to notify the sync object that the fiber has been suspended. What do you think?

The Greek universities are a good source for papers about fiber-based UMS. I tried to contact one of the authors a few weeks ago with no feedback yet...

Dmitry_Vyukov · ‎01-31-2009

Quoting - georgetm

The problem is that descheduling needs two actions:
1. Tell (the mutex) that the fiber is going to suspend
2. Switch to another fiber

If a thread context switch occurs between 1 and 2 the mutex block queue will already contain the fiber. If the mutex is released at this moment from another thread it will put the fiber back to the runnable queue although the fiber is still running and the fiber state is not saved.

But I think there is another way to make a task switch with only one fiber switch:
A fiber switch is saving the current fiber's state and restoring the next fiber's state. Between these two actions it's possible to notify the sync object that the fiber has been suspended. What do you think?

Damn! I completely missed the point that another thread may start executing the fiber while it's still being executed by current thread (it will be problematic to execute the fiber, if it's context is not yet completely saved).

Both SwitchToThread() and swapcontext() make both actions - save current context and restore new context - so, I think, it's impossible to make anything between these two actions. However, I think, it's perfectly Ok to notify sync object right after the switch to the new fiber. Basically, first part of scheduler's code (choose of next fiber) will be executed by current thread, and second part of scheduler's code will be executed by new fiber.

Dmitry_Vyukov · ‎02-03-2009

Quoting - Dmitriy Vyukov

Damn! I completely missed the point that another thread may start executing the fiber while it's still being executed by current thread (it will be problematic to execute the fiber, if it's context is not yet completely saved).

Both SwitchToThread() and swapcontext() make both actions - save current context and restore new context - so, I think, it's impossible to make anything between these two actions. However, I think, it's perfectly Ok to notify sync object right after the switch to the new fiber. Basically, first part of scheduler's code (choose of next fiber) will be executed by current thread, and second part of scheduler's code will be executed by new fiber.

If we are rejecting the idea of dedicated scheduler fiber, this also allows us to directly reuse the fiber when current task is finished (important case).
I am thinking about something like (very crude code-sketch):

[cpp]void main_fiber_proc(thread_desc* th)
{
    // current and next fibers can communicate via 'th' object
    for (;;)
    {
        if (th->sync_object_to_notify)
        {
            th->sync_object_to_notify->notify(th->prev_fiber);
            th->sync_object_to_notify = 0;
        }
        if (th->next_task_to_execute == 0)
            break;
        th->next_task_to_execute->execute();
        th->next_task_to_execute = scheduler(th);
        // next task is executed directly on current thread/fiber w/o any switches
    }
}

void block_on_sync_object(thread_desc* th, sync_object* obj)
{
    th->next_task_to_execute = scheduler(th);
    th->sync_object_to_notify = obj;
    th->prev_fiber = th->curr_fiber;
    th->curr_fiber = get_fiber_from_pool(th);
    switch_fiber(th->prev_fiber, th->curr_fiber);
    if (th->sync_object_to_notify)
    {
        th->sync_object_to_notify->notify(th->prev_fiber);
        th->sync_object_to_notify = 0;
    }
}[/cpp]

Alsoin some cases it's possible to execute several tasks on single stack (fiber) w/o sacrificing any properties. For example, it there are parent-child relations between tasks, then child task can be executed directly on parent's stack (not consuming additional stack, and not causing additional fiber switch). If fork/join (divide/conquer) decomposition is used, then the optimization can reduce memory consumption basically to 1 stack (fiber) per worker thread.

Dmitry_Vyukov · ‎02-03-2009

Quoting - Dmitriy Vyukov

Damn! I completely missed the point that another thread may start executing the fiber while it's still being executed by current thread (it will be problematic to execute the fiber, if it's context is not yet completely saved).

Both SwitchToThread() and swapcontext() make both actions - save current context and restore new context - so, I think, it's impossible to make anything between these two actions. However, I think, it's perfectly Ok to notify sync object right after the switch to the new fiber. Basically, first part of scheduler's code (choose of next fiber) will be executed by current thread, and second part of scheduler's code will be executed by new fiber.

If we are rejecting the idea of dedicated scheduler fiber, this also allows us to directly reuse the fiber when current task is finished (important case).
I am thinking about something like (very crude code-sketch):

[cpp]void main_fiber_proc(thread_desc* th)
{
    // current and next fibers can communicate via 'th' object
    for (;;)
    {
        if (th->sync_object_to_notify)
        {
            th->sync_object_to_notify->notify(th->prev_fiber);
            th->sync_object_to_notify = 0;
        }
        if (th->next_task_to_execute == 0)
            break;
        th->next_task_to_execute->execute();
        th->next_task_to_execute = scheduler(th);
        // next task is executed directly on current thread/fiber w/o any switches
    }
}

void block_on_sync_object(thread_desc* th, sync_object* obj)
{
    th->next_task_to_execute = scheduler(th);
    th->sync_object_to_notify = obj;
    th->prev_fiber = th->curr_fiber;
    th->curr_fiber = get_fiber_from_pool(th);
    switch_fiber(th->prev_fiber, th->curr_fiber);
    if (th->sync_object_to_notify)
    {
        th->sync_object_to_notify->notify(th->prev_fiber);
        th->sync_object_to_notify = 0;
    }
}[/cpp]

Alsoin some cases it's possible to execute several tasks on single stack (fiber) w/o sacrificing any properties. For example, it there are parent-child relations between tasks, then child task can be executed directly on parent's stack (not consuming additional stack, and not causing additional fiber switch). If fork/join (divide/conquer) decomposition is used, then the optimization can reduce memory consumption basically to 1 stack (fiber) per worker thread.

George_Kotorlis · ‎02-03-2009

Quoting - Dmitriy Vyukov

If we are rejecting the idea of dedicated scheduler fiber, this also allows us to directly reuse the fiber when current task is finished (important case).
I am thinking about something like (very crude code-sketch):

[cpp]void main_fiber_proc(thread_desc* th)
{
    // current and next fibers can communicate via 'th' object
    for (;;)
    {
        if (th->sync_object_to_notify)
        {
            th->sync_object_to_notify->notify(th->prev_fiber);
            th->sync_object_to_notify = 0;
        }
        if (th->next_task_to_execute == 0)
            break;
        th->next_task_to_execute->execute();
        th->next_task_to_execute = scheduler(th);
        // next task is executed directly on current thread/fiber w/o any switches
    }
}

void block_on_sync_object(thread_desc* th, sync_object* obj)
{
    th->next_task_to_execute = scheduler(th);
    th->sync_object_to_notify = obj;
    th->prev_fiber = th->curr_fiber;
    th->curr_fiber = get_fiber_from_pool(th);
    switch_fiber(th->prev_fiber, th->curr_fiber);
    if (th->sync_object_to_notify)
    {
        th->sync_object_to_notify->notify(th->prev_fiber);
        th->sync_object_to_notify = 0;
    }
}[/cpp]

Alsoin some cases it's possible to execute several tasks on single stack (fiber) w/o sacrificing any properties. For example, it there are parent-child relations between tasks, then child task can be executed directly on parent's stack (not consuming additional stack, and not causing additional fiber switch). If fork/join (divide/conquer) decomposition is used, then the optimization can reduce memory consumption basically to 1 stack (fiber) per worker thread.

The code is very similar to my implementation. The main difference is that the sync object notification is done by the scheduler fiber before switching to the next fiber.

I can move this into my own "SwitchToFiber" implementation which will reduce the number of switches to 1.

The scheduler fiber still makes sense: When the queue runs out of tasks the executing fiber can suspend by switching to it. The scheduler fiber saves the supended fiberand suspends the thread then.

This switch is done at the position where the break-statement is in your code. Since fibers are controlled by the scheduler it's no problem to let them run in an endless loop. Btw, terminating the fiber is fatal.

Fibers are saved and reused when possible. This helps predicting the maximum number of stacks (fibers) needed for one process in some way. It mainly depends on the number of worker threads and the maximum possible block count at a time.

Dmitry_Vyukov · ‎02-05-2009

> The code is very similar to my implementation. The main difference is that the sync object notification is done by the > scheduler fiber before switching to the next fiber.
> I can move this into my own "SwitchToFiber" implementation which will reduce the number of switches to 1.

I think that for general-purpose library it's worth doing, especially taking into consideration that the improvement will not so much complicate the implementation.

> The scheduler fiber still makes sense: When the queue runs out of tasks the executing fiber can suspend by switching
> to it. The scheduler fiber saves the supended fiber and suspends the thread then.

Additional fiber switch before blocking is perfectly Ok, I was saying only about common path.

> This switch is done at the position where the break-statement is in your code. Since fibers are controlled by the
> scheduler it's no problem to let them run in an endless loop. Btw, terminating the fiber is fatal.

Damn! 'break' must be replaced with 'switch_to_scheduler_fiber()'.

Dmitry_Vyukov · ‎02-05-2009

Btw, here is interesting thing. In Windows 7 user-space THREAD scheduler is introduced. It basically allows to combine advantages of both threads and fibers. From fibers - lightweight user space scheduling, from threads - ability to make any calls (including blocking).
Here is the links:
http://blogs.msdn.com/nativeconcurrency/archive/2009/02/04/concurrency-runtime-and-windows-7.aspx
http://channel9.msdn.com/shows/Going+Deep/Dave-Probert-Inside-Windows-7-User-Mode-Scheduler-UMS

Although, since it will be available only on Windows 7, it's of little practical interest to developers of portable libraries...

George_Kotorlis · ‎02-06-2009

Quoting - Dmitriy Vyukov

Btw, here is interesting thing. In Windows 7 user-space THREAD scheduler is introduced. It basically allows to combine advantages of both threads and fibers. From fibers - lightweight user space scheduling, from threads - ability to make any calls (including blocking).
Here is the links:
http://blogs.msdn.com/nativeconcurrency/archive/2009/02/04/concurrency-runtime-and-windows-7.aspx
http://channel9.msdn.com/shows/Going+Deep/Dave-Probert-Inside-Windows-7-User-Mode-Scheduler-UMS

Although, since it will be available only on Windows 7, it's of little practical interest to developers of portable libraries...

Indeed, ConcRT is very similar to Fiber Pool. So the answer to my initial question "if this may be the right way to go" comes directly from the Microsoft team.

Now, that everybody should be convinced (by MS), the discussion about "real" scalable scheduling (which is my research) can start:

UMS as presented by ConcRT or Fiber Pool is ok if you have one single process in the system using it. But if you have two or more processes using UMS, scheduling becomes more complicated.

Example:
The system: A 4-core processor
Process P1: A GUI thread that handles a window message with 4 tasks
Process P2: A video encoder thread with 8 tasks
Process P3: Another video encoder thread with 8 tasks

Let's take P1 aside for a moment: From the system's point of view P2 and P3 are not required to run concurrently so scheduler's algorithm could be implemented that way to complete P2's tasks first before executing P3's tasks. In this case, preemptive scheduling is not necessary so the more efficient cooperative scheduling can be used here.

Now, while P2's and P3's tasks are scheduled cooperatively P1 gets the message. Processing the message cannot wait for P2 and P3 to complete so some running tasks have to be preempted in order to execute P1's tasks.

The question here is, how many tasks have to be preempted in order to execute P1'sthread properly? The answer is not that simple.

The term "thread" does not fit here because its tasks may be parallelized but I will continue to use it for serial programs from the logical point of view.

The number of parallelizable tasksin a thread's task queueis the thread's scalability degree at a specific time. With this important value the scheduler can perform the actual scaling on system level.

The worst case would be to assign only one core to the thread which means serial execution of its tasks (which is the behaviour of an actual thread), e.g. if the system load is high. With decreasing load the scheduler can assign more and more cores to the thread.

What we get is a task scheduler for "thread" scaling depending on the system load.

A quick overview of what I do in my free time...