issue with nested parallel_for

xavbuf · ‎01-24-2011

Hi all!

While moving my mutli-threaded application (rendering engine) from pthreads to TBB I am facing an issue with nested parallel_for.
In the application the main process is first splitted to several tasks via a parallel_for (image tiles)
While computing this specific tasks a common data may be lazily evaluated (image texture loaded from disk and color corrected), so that the first task asking for it involves its evaluation. The lazy evaluation is protected by a mutex.
Because the evaluation of this data may be expensive, a secondary parallel_for is used.

I ran the application several times and it seemed to perfectly work, until I got the application to freeze.

Here is a very simplified and naive example of what I am doing:

[cpp]#include 
#include 
#include 
#include 


struct ProcessBuffer
{
    ProcessBuffer( size_t* buffer ) : _buffer( buffer ) {}
    ProcessBuffer( const ProcessBuffer& p ) : _buffer( p._buffer ) {}

    void operator()( tbb::blocked_range& range ) const
    {
        for ( size_t i = range.begin(); i < range.end(); ++i )
        {
            _buffer = i;
        }
    }

    size_t* _buffer;
};


struct Texture
{
     Texture() : _buffer( 0x0 ) {}
    ~Texture() { delete[] _buffer; }

    const size_t* buffer()
    {
        tbb::mutex::scoped_lock lock(_mutex);
        if ( _buffer == 0x0 )
        {
            const size_t SIZE = 2048*1556*4;
            size_t* p = new size_t[SIZE];
            tbb::parallel_for( tbb::blocked_range( 0, SIZE ), ProcessBuffer( p ) );
            _buffer = p;
        }
        return _buffer;
    }

    size_t* _buffer;
    tbb::mutex _mutex;
};


struct ProcessTiles
{
    ProcessTiles( Texture* texture ) : _texture( texture ) {}
    ProcessTiles( const ProcessTiles& p ) : _texture( p._texture ) {}

    void operator()( tbb::blocked_range& range ) const
    {
        for ( size_t i = range.begin(); i < range.end(); ++i )
        {
            const size_t* p = _texture->buffer();
            const size_t SIZE = 2048*1556*4;
            std::cerr << std::accumulate( p, p + SIZE, 0 ) << std::endl;
        }
    }

    Texture* _texture;
};


int main( int, const char*[] )
{
    tbb::task_scheduler_init init;
    Texture* texture = new Texture;
    tbb::parallel_for( tbb::blocked_range( 0, 1000 ), ProcessTiles( texture ) );
    delete texture;
    return 0;
}
[/cpp]

I realized in the debugger that all the threads went finally locked by the mutex.
It seems that as the lazy evaluation is done, the thread asks for a new task in the main task pool, but get locked itself:

[bash]#0  0x00007f0f7b0c6384 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00007f0f7b0c1bf0 in _L_lock_102 () from /lib/libpthread.so.0
#2  0x00007f0f7b0c14fe in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x0000000000401b08 in tbb::internal::start_for<:BLOCKED_RANGE>, ProcessTiles, tbb::auto_partitioner>::execute ()
#4  0x00007f0f7bfefa34 in tbb::internal::custom_scheduler<:INTERNAL::INTELSCHEDULERTRAITS>::local_wait_for_all ()
#5  0x00007f0f7bfedd3f in tbb::internal::generic_scheduler::local_spawn_root_and_wait ()
#6  0x00007f0f7bfedc4a in tbb::internal::generic_scheduler::spawn_root_and_wait ()
#7  0x0000000000401a6b in tbb::parallel_for<:BLOCKED_RANGE>, ProcessBuffer> ()
#8  0x0000000000401c5d in tbb::internal::start_for<:BLOCKED_RANGE>, ProcessTiles, tbb::auto_partitioner>::execute ()
#9  0x00007f0f7bfefa34 in tbb::internal::custom_scheduler<:INTERNAL::INTELSCHEDULERTRAITS>::local_wait_for_all ()
#10 0x00007f0f7bfedd3f in tbb::internal::generic_scheduler::local_spawn_root_and_wait ()
#11 0x00007f0f7bfedc4a in tbb::internal::generic_scheduler::spawn_root_and_wait ()
#12 0x000000000040192b in tbb::parallel_for<:BLOCKED_RANGE>, ProcessTiles> ()
#13 0x00000000004015f5 in main ()
[/bash]

I tryed to modify the task scheduler to make the second process totaly independant from the main one (non-preemptive scheduling?), but without any success.

Any help or suggestion will be appreciated.
Thanks in advance,

Xavier

jimdempseyatthecove · ‎01-24-2011

Xavier,

The parallel_for in main will result in each task performing operator() on a block range independent of the other tasks. IOW, only one thread will process any one buffer during the execution of the main parallel_for.

Therefore, the mutex need not be called.

Subsequently to build/allocation, in your eventual application

When the main level runs a new parallel_for for the advancement of the state of the tiles, this too will not require a mutex.

Only should you have two parallel_for's concurrently working on the same list of tiles, would you have a requirement for using the mutex. e.g. you may have a display engine and seperate state advance engine.

When the user manipulates some sort of control, which say changes texture, then consider making the texture as part of your state advancement. IOW your standard state advancement loop, in which each thread (via seperate task with seperate blocked range) has non-interference access to each tile, the state advancement could test for the texure change, and perform the change as a direct function call.

Also, for your initialization, consider the following:

[cpp]const size_t* buffer()   
{   
  if ( _buffer == 0x0 )   
  {   
     tbb::mutex::scoped_lock lock(_mutex);   
     if ( _buffer == 0x0 )   
     {   
         const size_t SIZE = 2048*1556*4;   
         size_t* p = new size_t[SIZE];   
         tbb::parallel_for( tbb::blocked_range( 0, SIZE ), ProcessBuffer( p ) );   
         _buffer = p; 
     }  
  }   
  return _buffer;   
}[/cpp]

Note, the initialization does not require the mutex.

***
Also note, in your current code, the mutex is not held during normal operator() work.
As to if it needs to acquire the mutex, this would depend on if some other concurrent task will relocate and/or disruptively modify the buffer.

Jim Dempsey

xavbuf · ‎01-24-2011

Jim,

Thanks for your quick answer.

Do you mean I don't need any mutex?
Well, as I mentioned, my code example is really naive, in the real case Textures are loaded on-demand as the final image is rendered and geometry discovered, and shared on memory for all threads/tasks.
The main point is as one thread/task needs the texture it asks for it. The process of loading the texture and processing it (ie. color correction) is managed by another library. I know this library also use TBB and parallel_for.

What I am surprise about is that as the texture image processing is done, the scheduler tryes to execute one of the main tasks.

What I don't understand is the backtrace log of a locked thread, moreover the two bolded lines:

#00x00007f0f7b0c6384in__lll_lock_wait()from/lib/libpthread.so.0
#10x00007f0f7b0c1bf0in_L_lock_102()from/lib/libpthread.so.0
#20x00007f0f7b0c14feinpthread_mutex_lock()from/lib/libpthread.so.0
#30x0000000000401b08intbb::internal::start_for<:BLOCKED_RANGE>,ProcessTiles,tbb::auto_partitioner>::execute()
#40x00007f0f7bfefa34intbb::internal::custom_scheduler<:INTERNAL::INTELSCHEDULERTRAITS>::local_wait_for_all()
#50x00007f0f7bfedd3fintbb::internal::generic_scheduler::local_spawn_root_and_wait()
#60x00007f0f7bfedc4aintbb::internal::generic_scheduler::spawn_root_and_wait()
#70x0000000000401a6bintbb::parallel_for<:BLOCKED_RANGE>,ProcessBuffer>()
#80x0000000000401c5dintbb::internal::start_for<:BLOCKED_RANGE>,ProcessTiles,tbb::auto_partitioner>::execute()
#90x00007f0f7bfefa34intbb::internal::custom_scheduler<:INTERNAL::INTELSCHEDULERTRAITS>::local_wait_for_all()
#100x00007f0f7bfedd3fintbb::internal::generic_scheduler::local_spawn_root_and_wait()
#110x00007f0f7bfedc4aintbb::internal::generic_scheduler::spawn_root_and_wait()
#120x000000000040192bintbb::parallel_for<:BLOCKED_RANGE>,ProcessTiles>()
#130x00000000004015f5inmain()

How a ProcessTiles may inderectly call another ProcessTiles?
I expected it to wait until the end of the ProcessBuffer.

Xav

RafSchietekat · ‎01-25-2011

ProcessTiles may seem to indirectly call itself because of how stealing works.

If that's all there is to it, a reentrant lock might do the trick, but I have not looked closely at the problem yet.

Jim's answer doesn't seem to say anything about a diamond dependency that seems to be involved (different tiles using the same texture), and I think that's where a solution should be sought, but I'm not surethe general technique (setting yourself up as a continuation for adummy task handed off to the commondescendent)will work from inside a parallel_for where you're not setting up your own tasks. Merely locking to wait for the other side to finish wastes an entire hardware thread, if it works at all (TBB is about optional parallelism, and required concurrency can be a hard problem).

Perhaps somebody else can take it from here?

xavbuf · ‎01-25-2011

Thanks Raf,
I think you pointed out the problem ("because of how stealing works")

I made a new search in the forum with "reentrant lock" as keywords, and I found this interesting thread ("using multiple tbb containers in code") on January 2010 , which seems to describe my problem.
http://software.intel.com/en-us/forums/showthread.php?t=71082&o=a&s=lr

There are some explanations from Alexey Kukanov and you:

"How is that going to work if the inner loop's work is stolen? I didn't consider the circumstance that anyone might want to do that, but if I have to, then I'd say this would in fact be a substantial problem with that approach."

"You are right, using a reentrant lock is more of a mediocre workaround than a good solution, and I should write more about it.
The dealock will be avoided, but the behavior may still be incorrect. The thread will simultaneouslyexecute the critical section for two iterations of the outer loop, with the first one being kind of "paused" in the middle (at the point of call to the inner parallel loop). Indeed executing another iteration by the same thread is not expected and can change the program state for the "paused" one, which can then break when resumed. So one should be careful when trying reentrant lock for deadlock avoidance in the described case, and understand well how the things (don't) work to not fall into anothertrouble."

I will consider to use a reentrant lock and see how it may fix the problem.

Thanks,
Xav

RafSchietekat · ‎01-25-2011

"I will consider to use a reentrant lock and see how it may fix the problem."
No promises, I just wanted to know if you have considered this. I didn't even take the time to find out that it's called "recursive" lock, and the earlier topics you found (thanks for that) don't seem to bode well.

But the generic solution for a diamond relationship should work after all, if the parallel_for task sets up a child and grandchild (both empty_task), with the grandchild handed over to the task that's currently computing the pattern (by way of the pattern itself), and then waits for the child to complete. When the other code is finished with the pattern, it deletes the grandchild, which spawns the child, which lets the first task return from spawn_and_wait_for_all(). That way you don't need locks, except briefly for registering the grandchild, and no hardware threads are kept idle (during spawn_and_wait_for_all the thread will steal any other tasks currently pending). It's like a condition variable tailored for tasks, because a traditional condition variable also ties up the waiting thread. Corrections welcome.

(Added) I'm not sure if it can be done with just an empty_task child that's deleted by the pattern, but perhaps that's possible too.

xavbuf · ‎01-25-2011

Well, I tryed the reentrant lock solution. There is no more deadlock but the behaviour of the Texture initialization is hard to control and I got cases were it was allocated more than once.
One problem fixed, but other troubles...

So I tryed the solution suggested by Jim and Raf.
As I am not confortable at all with the task scheduler I am not sure if I have found the best solution.
But the new code seems to work now, no more deadlock, and only a single lazy-evaluation of the Texture buffer.

EDIT: In fact, no, after more new tests, I still have deadlocks :(

I still have a mutex because I didn't find a way to remove it and to keep the insurance that the lazy-evaluation is done once.

Here are the changes:

[cpp]struct BufferTask : public task
{
    size_t*& _buffer;
    BufferTask( size_t*& p ) : _buffer(p) {}

    task* execute()
    {
        std::cerr << "buffer initialization" << std::endl;
        size_t* p = new size_t[BUFFER_SIZE];
        parallel_for( blocked_range( 0, BUFFER_SIZE ), ProcessBuffer( p ) );
        _buffer = p;
        return 0x0;
    }
};


struct Texture
{
    size_t* _buffer;
    mutex _mutex;
     Texture() : _buffer( 0x0 ) {}
    ~Texture() { delete[] _buffer; }

    const size_t* buffer()
    {
        if ( _buffer == 0x0 )
        {
            mutex::scoped_lock lock(_mutex);
            if ( _buffer == 0x0 )
            {
                task& child = *new( task::allocate_root() ) empty_task;
                child.set_ref_count(2);
                task& grandchild = *new( child.allocate_child() ) empty_task;
                task& t = *new( grandchild.allocate_child() ) BufferTask(_buffer);
                task::spawn(t);
                child.wait_for_all();
                task::destroy(child);
            }
        }
        return _buffer;
    }
};
[/cpp]

RafSchietekat · ‎01-25-2011

So locks don't work in this situation...

Give the Texture a locked list of task pointers. If you're the first to compute one, mark it busy, compute it using TBB, and mark it complete, then delete all task pointers from the list. If you're not the first, add a chiild (or grandchild if a child cannot be deleted during wait_for_all(), I'd have to check) to the locked list of task pointers, then wait_for_all() and the current thread may actually participate in the computation of the Texture (or another Texture, or another tile...).

Note that the double-checked locking will only work with an added fence, so you'd have to use an atomic for _buffer to have a portable solution.

Sorry for only sketching the outline of a solution, I hope it will still be helpful.

RafSchietekat · ‎01-26-2011

I think that a lot of lock-based code would have broken with the move from depth-based scheduling to single deques (2.1), because the depth limitation wasa remedy against deadlock. And of course recursive locks are linked to threads, based on the assumption that a logical stack is tied to a single thread, whereas TBB programs may branch out and jump across threads (when tasks are stolen), so the context of the lock should be changed from a thread identity to a task descendants identity, possibly interrupted on the stack when a task is stolen. It seems that locks should not be used to control code that may contain TBB-based concurrency.

(Edited 2011-11-26) Changed "2.2?" to "2.1" after looking at the CHANGES file.

jimdempseyatthecove · ‎01-26-2011

There is a lot untold of your application with respect to activities

In particular, what nullifies _buffer (other than ctor during initialization)?

There is no indication in your code (coments) is once only or is a multiple occurance (i.e. during changing texture).

When can _buffer be nullified? Are there associated race conditions (i.e. can one thread be using old _buffer while another thread is changing the texture (deleting old _buffer, adding new _buffer)?

When texture changes, does it apply to one (a few) objects or all objects?

Other than for texture change can (will)more than one thread have access to the buffer (and/or dependents of the buffer)?

Jim Dempsey

xavbuf · ‎01-26-2011

Hi,

The real application is a photoreaslitic rendering engine used for feature films and digital special effects.
The process for computing a single frame is not real time at all but can take hours, and neither is interactive: once started, you can't change parameters or data.

The main steps are:

split the image to compute in small buckets (ie. tiles of 16x16 pixels) so that the user can see the progression. Here we use a parallel_for on the bucket range.
for each tile we compute the geometry visibility of the 3D objects, and we evaluate the surface color of the objects, which can be defined by a single color or by an image texture (pictures, digital paintings, ...)
the textures are not present in memory but loaded the first time the engine asks for them (several Gb of texture images are used for complex scenes, we load them on demand because of network access and memory issues)
once the texture is loaded, it may be processed (color corrected, resized, etc.) to fit the rendering needs.
the texture is never replaced until the end of the image. Once loaded and processed it doesn't change.
We don't manage how the texture is processed, another library is in charge for that. But I know this image processing library also use TBB and its loop features.

I don't know if this short description makes things clearer now...

Maybe the question should be:
How can I safely use a parallel_for loop on tasks requiring the value of a singleton which is lazy-evaluated (and protected by a mutex?) maybe itself by other parallel_for loops ?

Thank you for your help and your patience,
Xavier

RafSchietekat · ‎01-26-2011

"Maybe the question should be:
How can I safely use a parallel_for loop on tasks requiring the value of a singleton which is lazy-evaluated (and protected by a mutex?) maybe itself by other parallel_for loops ?"
Leave out the "how": the answer seems to be "no". I've sketched the outline of what you should use instead of a mutex (which should only be used on predictably short waits), assuming everything is still compute-bound. Do you need more detail?

xavbuf · ‎01-26-2011

Thanks Raf,
I followed your suggestions but without success for now.
As I told you before, I am not confortable with the task scheduler. I am reading both the Tutorial and Reference guides to know more about it.

For the moment I changed my code this way, but I still have deadlock. Sure I wrote something wrong, but I couldn't figure out what.

[cpp]struct LockedList
{
	tbb::task_list			_list;
	tbb::mutex				_mutex;

	void push_back( task& t )
	{
		mutex::scoped_lock lock(_mutex);
		_list.push_back( t );
	}

	void delete_all()
	{
		mutex::scoped_lock lock(_mutex);
		while ( !_list.empty() )
		{
			task::destroy(_list.pop_front());
		}
	}
};


struct Texture
{
	tbb::atomic	_buffer;
	tbb::atomic		_init;
	tbb::mutex				_mutex;
	LockedList				_list;

	Texture() : _buffer(), _init() {}
	~Texture() { delete[] _buffer; }

	const size_t* buffer()
	{
		if ( !_buffer )
		{
			bool first = false;
			if ( !_init )
			{
				mutex::scoped_lock lock(_mutex);
				if ( !_init )
				{
					_init = true;
					first = true;
				}
			}

			if ( first )
			{
				std::cerr << "buffer initialization" << std::endl;
				size_t* p = new size_t[BUFFER_SIZE];
				parallel_for( blocked_range( 0, BUFFER_SIZE ), ProcessBuffer( p ) );
				_buffer = p;

				_list.delete_all();
			}
			else
			{
				task& child = *new( task::allocate_root() ) empty_task;
				child.set_ref_count(2);
				task& grandchild = *new( child.allocate_child() ) empty_task;

				_list.push_back( grandchild );

				child.wait_for_all();
			}
		}
		return _buffer;
	}
};
[/cpp]

I will take more time reading the papers in depth.

Thanks!
Xav

RafSchietekat · ‎01-26-2011

Ignoring ultimate performance for now, I would still recommend against the use of mutex, which is only appropriate for long waits by threads that can safely go to sleep in the kernel. Use spin_mutex instead.

task::destroy() will not put the parent in the ready pool if its refcount becomes zero, but I don't recall what the right form was, so you might just spawn() it instead. Maybe you can dispense with the child/grandchild nonsense by directly manipulating reference counts (in the calling parallel_for body function set the reference count to 1 (representing only the wait function), insert task::self() into the task_list of the texture, and call wait_for_all(); when the computation of the texture is complete, call decrement_ref_count() on all the captive tasks, and clear() the task_list, without destroying its contents), but I would have to check if this is valid, so you could perhaps first try the spawn() first with the code you already have.

(Added) At first sight neither task::destroy() of the child nor decrement_ref_count() on the childless parallel_for task will let the parallel_for task return from wait_for_all() (putting the parent in the ready pool would be another situation), so my best advice at this point is to have just the child (no grandchild) and spawn it. The Tutorial has a General Acyclic Graphs of Tasks example that shows use of decrement_ref_count() with explicit test for zero and a conditional spawn(), but that's not the right way to liberate the wait_for_all(). Continuing my search...

(Added) Did you reduce the incomplete double-checked locking to just an atomics test? That's not what I meant: double-checked locking needs the atomic release semantics to be portable, they go together.

(Added) Oh yes, the reference count to be set if just decrement_ref_count() would work as well is 2: 1 for the wait_for_all(), 1 for the decrement_ref_count(). It might just work...

(Added after #18) Er, of course the double-checked locking had to be removed, so please ignore the second "added" above. :-)

jimdempseyatthecove · ‎01-26-2011

Xavier,

Thanks for the description. It helps understand this alot.

From your description, the texture buffer will get loaded (with possible modification at/just after load), then remain static for the remainder of the frame. (is this also for the remainder of all frames?)

** Due to an object having a possibility of being not visible during processing of the first frame, is it safe to assume that the texture buffer is not loaded for non-visible objects? And conversely, the texture buffer is loaded only on the first occurance of an object becomming visible?

If the above assumptions are true then my observations are

If your main processing state advancement is performed by a parallel_for, then (using TBB) the main process state advancement blocks untilall tasks of the currentmain processing state advancement parallel_for completes. IOW your main loop parallel_for cannot be concurrent with subsequent main loop parallel_for.

Because main processing state advancement is performed by a parallel_for, the only one thread will have access to any one object at any one time (and no other threads in the current parallel_for team will have access to the same object until the current parallel_for completes and the next parallel_for begins).

Therefore, when the main processing state advancement discovers a NULL texture buffer pointer, it can directly call the library that pulls in the texture and adjusts it. This call may TBB (from your declaration it does) and spawn any number of tasks.

Note, excepting for your main processing state advancementthread that discovered the NULL texture buffer pointer, the other threads executing your parallel_for (while not in discovery of a NULL texture buffer pointer) will be compute bound and not available to run task for the texture library pull-in and precondition tasks. IOW only one thread will be immediately available (thread making NULL texture buffer pointer), but the other threads running your main processing state advancementwill become available as they complete their slice of the parallel_for in the main processing state advancement.

Therefore, you need not have a mutex

*** However,

Because you are observing a lockup on this mutex, this implies that two of your main processing state advancementthreadsareaccessing the same texture buffer.

Therefore is it correct to assume multiple objects may reference the same texture buffer?

If this is correct, and you only require one thread to perform the initialization of thesame texture buffer then

volatile intptr_t _texture; // NULL == not allocated
volatile intptr_t initializingTextureBuffer; // 0 == not initializing
...

while(!_texture)
{
if(exchangeAndAdd(&initializingTextureBuffer, 1) == 0)
{
if(!_texture)
_texture = initYourBuffer();
}
else
{
while(!_texture)
this_tbb_thread::yield();
}
exchangeAndAdd(&initializingTextureBuffer, -1)
}
return _texture;

Note, you can replace volatiles with atomic if you desire. The exchangeAndAdd function my be called something elsefor your compiler.

Jim Dempsey

RafSchietekat · ‎01-26-2011

Jim, my understanding is that several tiles can share a texture, which takes a while to compute and possible uses TBB itself, so I'm describing a way to wait for the texture to become available and do useful work at the same time.

jimdempseyatthecove · ‎01-26-2011

The yield(); should do that.... (looking at code)

In looking at your TBB code, the TBB yield() does not yield to TBB thread pool. Instead it does an OS SwitchToThread() in Windows or similar function in Linux.

Unfortunately, it is not yielding task execution time to other tasks in the TBB thread pool (as the way it is done with QuickThread). So you will need the hoop jumping as you have outlined. This hoop jumping is not unusual and should be available as a standard TBB API (perhaps it is). Something along the line of

while(somethingNotReady())
tbb::StealOneTask(); // IOW Yield();

The code you are writing essentially does this.

Jim

Alexey-Kukanov · ‎01-26-2011

Quoting xavbuf

Maybe the question should be:
How can I safely use a parallel_for loop on tasks requiring the value of a singleton which is lazy-evaluated (and protected by a mutex?) maybe itself by other parallel_for loops ?

[The post editedto better express the details and ideas]
Congratulations (and apologies)! You have hit a famous unfortunate scenario that indeed leads to a deadlock in TBB, and the one which is really hard for us to fix. And, you correctly identified the scenario - it is nested parallelism with a lock in between, i.e. the inner level executed under a lock acquired at the outer level. And, that old forum discussion you found does relate to the same problem I believe.

Replacing outer-level parallel_for with direct use of tasks is of no help if the lock remains in use, and as long as wait_for_all() is called at the *inner* level (which you even seem to have no control over). What happens in the task scheduler is that, in case of nesting parallelism, tasks from the outer and the inner level may alternate on the stack of a thread. If this happens, it creates a dependency that never existed in a serial execution: the inner level task on the stack depends on completion of (i.e. is blocked by) a completely unrelated stolenouter level task being executed on the same stack. If outer level tasks depend on each other (as in case of using a lock at that level), it creates a circular dependency which may deadlock the program.

Possible ways out of that situation:
- do not use any form of a lock/wait between the iterations at the outer level; instead, find a way to postpone thetask that needs the texture till the latter is ready, and move on to another job. Seems Raf is suggesting something along that way, though I did not learn it in detail.
- do texture initialization serially (i.e. drop parallelism at the inner level);
- remove nesting.

The way to remove nesting is to start a new thread for texture initialization, and wait for its completion. It helps because in TBB 3.0 the work submitted by different external threads (masters) is isolated from each other; any TBB worker thread can only do work for one master at a time. So when youstart a separate thread for texture initialization, you make that work appear as independent for the TBB scheduler, andalternation of outer and inner work on the stack (as I described above) will not happen.This approach is obviously suboptimal for performance (thread creation time etc), but seems easier to implement than avoidance of waiting (which seems to require getting rid of parallel_for at the outer level altogether, and replacing it with non-trivial task machinery).

RafSchietekat · ‎01-26-2011

I see no problem here that can't be solved without going serial or using multiple masters, it's just a dag (directed acyclic graph) of tasks instead of a tree, and that's a bit awkward without assurance that decrement_ref_count() of a captive task will let it return from wait_for_all(), although it seems quite plausible. Can you provide that assurance, and perhaps some API sugar?

xavbuf · ‎01-27-2011

Hi!
I tryed the solution suggested by Alexey. It works well both in my simple example, and with the real rendering engine. I didn't find any performance issue and it is easy to code:

[cpp]struct Texture
{
     Texture() : _buffer(0x0), _init() {}
    ~Texture() { delete[] _buffer; }

    const size_t*           buffer();

    private:

    void                    initBuffer();

    size_t*                 _buffer;
    tbb::atomic       _init;
    tbb::spin_mutex         _mutex;

};

void Texture::initBuffer()
{
    std::cerr << "buffer initialization" << std::endl;
    _buffer = new size_t[BUFFER_SIZE];
    tbb::parallel_for( tbb::blocked_range( 0, BUFFER_SIZE ), ProcessBuffer( _buffer ) );
}

const size_t* Texture::buffer()
{
    if ( !_init )
    {
        tbb::spin_mutex::scoped_lock lock(_mutex);
        if ( !_init )
        {
            tbb::tbb_thread(
                boost::bind(
                    &Texture::initBuffer,
                    this ) ).join();
            _init = true;
        }
    }
    return _buffer;
}
[/cpp]

That solution is just perfect for my needs and with my knowledge with TBB!

Thanks to all!
I found your different ideas very valuable and helpfull for using TBB in a more efficient way.

Xavier