Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

TBB Task Scheduler: Integrating Data Tasks for Data Parallelism (need help)

AJ13
New Contributor I
1,704 Views

I have spent the past semester researching how to combine data parallelism with TBB, so that TBB can offload data tasks to a GPU when available. My work is nearly complete, and now I need to ask the TBB community for some help. My approach is to separate data parallelism from task parallelism, which involves the creation of a new type of task: the data task. For the purposes of my work, a data task represents an operation which can utilize data parallelism: for example, multiplying a large vector by a scalar, or a matrix by a scalar.

There are two work pools in my design. There is the task pool, which remains completely controlled by TBB. I have also added a single shared data task pool, from which the GPU will consume data tasks. If a CPU worker has no work to do, that is, no normal TBB tasks to consume, it will steal data tasks from the data task pool. Ordinary TBB tasks can only be executed on the CPU, whereas data tasks may be executed on either the GPU or the CPU. A data task is very simple; it only provides a virtual execute() function for now.

I would like to integrate data tasks with the TBB task scheduler in a simple way. Data tasks can be run from within TBB tasks. I would like to allocate a task that spawns the data tasks, putting them into the data task pool. Rather than blocking until the data tasks are complete, I would like the worker thread (which ran the tbb::task that spawned the data tasks) to continue to process other TBB tasks as normal, and steal tasks as required. When all data tasks are complete, the allocated parent task is ready to be re-executed.

So if there weren't a mixture of data tasks and tbb::tasks, I would simply allocate a parent, spawn children, and wait for all to complete. This problem is slightly more complex, because some of these children are of a different type, and are placed into a data task pool instead of the task pool used by TBB. If a program contains a large amount of data parallelism which the GPU can help with, TBB worker threads should steal data tasks for themselves too.
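
Not TBB code, just an illustrative sketch of the design described above: a shared data task pool that either a GPU driver thread or an idle CPU worker can steal from, plus an atomic pending counter so a spawning thread can check for completion without blocking. All names here are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared pool of "data tasks": work that either the GPU
// driver thread or an idle CPU worker may pull and execute.
class DataTaskPool {
public:
    void push(std::function<void()> f) {
        std::lock_guard<std::mutex> lk(m_);
        tasks_.push_back(std::move(f));
        ++pending_;
    }
    // Returns true if a task was stolen and run (there may be more work).
    bool steal_and_run() {
        std::function<void()> f;
        {
            std::lock_guard<std::mutex> lk(m_);
            if (tasks_.empty()) return false;
            f = std::move(tasks_.front());
            tasks_.pop_front();
        }
        f();                       // execute outside the lock
        pending_.fetch_sub(1);     // atomic: last completer drops this to zero
        return true;
    }
    // A "parent" can poll this instead of blocking on its children.
    bool all_done() const { return pending_.load() == 0; }
private:
    std::mutex m_;
    std::deque<std::function<void()>> tasks_;
    std::atomic<int> pending_{0};
};
```

A spawning worker would push its data tasks, then go back to processing ordinary tasks, re-running the parent once all_done() reports completion.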

Thanks for your input, and as always I'm on #tbb on irc.freenode.net for a chat.

AJ

41 Replies
RafSchietekat
Valued Contributor III
592 Views

"It works..." Not quite what I suggested, though! I think you've been lucky, even though it seems quite predictable what's happening: empty_task is probably immediately executed on the same thread (an obvious implementation of spawn_and_wait_for_all()), letting the ref_count drop to 2. Then "t->set_ref_count(t->ref_count() - 1);" only works because it is the only thread manipulating the ref_count (set_ref_count() is not thread safe), letting it drop to 1, which is spawn_and_wait_for_all()'s cue to return. empty_task is only a diversion here (it would be equivalent to forgetting all about it and calling wait_for_all() instead); in my suggestion it is a tool to decrement the ref_count atomically, as suggested in the introduction to "8.3.3 Explicit task Destruction" in "Reference Manual (Open Source).pdf", from an additional child (destroy() is a non-static method, so it's not even a choice anyway).

RafSchietekat
Valued Contributor III
592 Views

A bit more explicitly: "because it is the only thread manipulating" -> "because, thanks to the most likely synchronous execution by spawn_and_wait_for_all() vs. the long delay in Terminator, when it executes it is almost certainly the only thread still manipulating". I think that even the appearance of a race should be avoided.

AJ13
New Contributor I
592 Views

Okay, here it is again without the empty_task.

#include <iostream>

#include <tbb/task.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/tbb_thread.h>
#include <tbb/tick_count.h>


class Terminator
{
public:
    void operator()(tbb::task* t)
    {
        std::cout << "Sleeping..." << std::endl;
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(10.0) );

        std::cout << "Decrementing ref_count" << std::endl;
        t->set_ref_count(t->ref_count() - 1);
    }
};


class Waiter : public tbb::task
{
public:
    tbb::task* execute()
    {
        set_ref_count(2);

        Terminator terminator;
        tbb::tbb_thread thread(terminator, this);
        wait_for_all();

        std::cout << "Done Waiting." << std::endl;
        return 0;
    }
};

int main()
{
    tbb::task_scheduler_init init;

    // Allocate a parent task
    Waiter& parent = *new (tbb::task::allocate_root()) Waiter();
    tbb::task::spawn_root_and_wait(parent);

    return 0;
}

RafSchietekat
Valued Contributor III
592 Views

Sure, but you should primarily have a more disciplined way of decrementing ref_count, as described. This code may happen to work (if you only want to have one GPU task per parent!), but set_ref_count() is not meant to be thread safe.

AJ13
New Contributor I
592 Views

If I wrap set_ref_count() in a spin_lock will you be happy?

RafSchietekat
Valued Contributor III
592 Views

I would be very disappointed...

AJ13
New Contributor I
592 Views

Actually I might just modify the TBB source to make set_ref_count atomic... do you think I'd get a performance hit vs. an external spin_lock in my code on ref_count access?

RafSchietekat
Valued Contributor III
592 Views

I'm curious: why this insistence not to use what is available?

AJ13
New Contributor I
592 Views

I'm not entirely sure what you mean, or are suggesting... if ref_count() and set_ref_count are not thread safe, I need to either 1) Make them thread safe or 2) Lock the task before calling them.

RafSchietekat
Valued Contributor III
592 Views
Quoting - AJ

I'm not entirely sure what you mean, or are suggesting... if ref_count() and set_ref_count are not thread safe, I need to either 1) Make them thread safe or 2) Lock the task before calling them.

No, you should 0) destroy a stand-in empty_task from within an additional task, as described and motivated earlier. You had an empty_task, but it was immediately thrown away by spawn_and_wait_for_all(), and then you removed all traces of it; instead, you should have kept the allocation but changed spawn_and_wait_for_all() to just wait_for_all(), resulting in a semi-excess ref_count (that was the speculative side of my proposal). Instead of directly manipulating ref_count when the GPU work has completed, you should allocate_additional_child_of() a task that will destroy() the stand-in empty_task that was waiting in the wings; in the structure describing the GPU work you should keep a reference to the "parent" for use in allocate_additional_child_of() and a reference to the empty_task to set the target for destroy().
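
For contrast, why the read-then-set approach is unsafe, and what a single atomic decrement (which, per Raf's description, is the effect of destroy()-ing the stand-in child inside TBB) guarantees, can be sketched in plain C++. This is not TBB code; the names are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each completing "child" must decrement the parent's reference count
// exactly once, even when completions race. fetch_sub is one atomic
// read-modify-write, so no decrement can be lost; by contrast,
// ref_count() followed by set_ref_count(old - 1) is two separate steps,
// and two racing completers can both read the same old value.
struct ParentCount {
    std::atomic<int> ref_count;
    explicit ParentCount(int children) : ref_count(children + 1) {}  // +1 wait guard
    void child_done() { ref_count.fetch_sub(1); }                    // race-free decrement
    bool wait_satisfied() const { return ref_count.load() == 1; }    // only the guard remains
};
```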

In the "Additions to atomic" patch I have made ref_count atomic because I was suspicious of supposedly atomic operations on non-atomic (and even non-volatile!) variables (and with reason, as it turned out, even if it was not exactly this that failed), but even there I would not recommend directly manipulating the ref_count, not even with an atomic decrement operation, because it serves no purpose to second-guess the existing API if an equivalent official alternative is readily available. There is much that needs to be changed about TBB (especially now that its leadership has declared its intentions for retrograde progress), but otherwise it seems better to leave well enough alone.

RafSchietekat
Valued Contributor III
592 Views

For some more clarity (hopefully):

I wrote "I would be very disappointed..." That was of course an intentionally silly response to "AJ"'s "will you be happy", in case anyone missed that.

"AJ" wrote "if ref_count() and set_ref_count are not thread safe, I need to either 1) Make them thread safe or 2) Lock the task before calling them." That is invalid, because it assumes ref_count is merely a task-local property that may change asynchronously. It's not (maybe the Reference Manual should make this clearer...), and thread safety is not a single indivisible concept. In this case, just see the documentation for destroy(task&), which insists on making clear how it behaves differently, with regard to ref_count, from, e.g., a last child returning from execute() (my counterexample).

My intention was to offer a necessarily hacked solution with absolutely minimal assumptions, for maximum chance of success, and, despite a moment of confusion on my side, all elements are required. There's no use in taking more risks than this.

Anyway, that's all from my side: I can only explain, but of course you're free to draw your own conclusions.

jimdempseyatthecove
Honored Contributor III
592 Views

AJ,

I wish to apologize for taking so long to get back with you on this.

Let me preface this by saying that it looks as if you have made good headway in converting TBB to your purposes (combined GPGPU and TBB), and this post is not intended to dissuade you from taking that route.

I took the time to adapt a sample program in a programmer's tool I'm working on to do just what you asked to do :). Unfortunately it is not a commercial application just yet (looking for alpha testers).

The test platform is an Intel Q6600 Windows XP x64 with MS VS 2005 C++
I have the Intel C++ compiler too, but have not run the test using that just yet.

For the GPU I have one AMD FireStream 9170 running Brook+/CAL 1.2.1

The test program was taken from their double_precision_matmult.br sample program. The original program illustrated and compared the performance of a double precision matrix multiply on the GPU versus a single-threaded process.

The modifications I made were to add parallel programming to the CPU portion of the code, and to combine the execution of the parallel CPU code with the GPU;
i.e. parallel CPU in parallel with the GPU.

To simulate an application, the sample program was modified further so that the number of iteration loops is reinterpreted as the number of Objects to create. Each Object contains three matrices (two input and one output). So the simulation test run loops through performing a matrix multiplication on a list of Objects.

The tuning parameters were set to use 100 objects, each with a matrix size of 256 x 256. Varying the size of the matrix greatly affects the relative performance. On the GPU side, the smaller the matrix, the higher the proportion of overhead to start and stop the GPU. On the CPU side, much larger matrices exceed the size of the L2 cache. The matrix sizes were chosen so as not to overly diminish the performance of either GPU or CPU.

Here are the relative performance figures:

Calculate performance with GPU 4.862223 Gflops
Calculate performance with CPU 0.824801 Gflops
Calculate performance with parallel CPUs 2.655731 Gflops
Calculate performance with GPU and parallel CPUs 6.217411 Gflops

GPU 5.895x that of 1 core
GPU 1.831x that of 4 cores
4 cores 3.22x faster than 1 core
GPU + 4 cores 1.279x faster than GPU

The following file is the C++ side of the sample application showing the threading technique. The Brook+ side is not included since it is not pertinent to how to implement data parallelism in a multi-threaded application.

Jim Dempsey

[cpp]// DoQtInit.cpp
#include 
#include 
#include 

#include "common.h"
#include "QuickThread.h"
#include "parallel_for.h"

using namespace qt;
extern struct infoRec cmd;

int do_main(void* context);

int ObjectAllocationError = 0;

// Setup Affinity bitmasks for qtControls
// Firestream Brook+/CAL has the quirk that
// the first thread to touch the GPU owns the GPU.
//
// qtControl for GPU
qtControl qtControlGPU;
// qtControl for CPUs
qtControl qtControlCPU;
// qtControl for CPU exclusive of GPU
qtControl qtControlCPUxGPU;

// Setup_qtControls() called by the main thread
// which will be an arbitrary thread from the thread pool.
// Figure out what thread we are and setup the qtControl's
void	Setup_qtControls()
{
	// Setup qtControlGPU to be accessed only by current thread
	qtControlGPU.Affinities.SelectAffinities(_L0_);
	// Setup qtControlCPUxGPU to be accessed by all threads
	// (but enqueued by my thread)
	// First set to all threads
	qtControlCPUxGPU.Affinities.SelectAffinities(_M3_);
	// Now exclude current thread on AffinityDeQueue
	qtControlCPUxGPU.AffinityDeQueue &= ~qtControlGPU.AffinityQueue;
	// And set the AffinityQueue to any but the current thread
	qtControlCPUxGPU.AffinityQueue = qtControlCPUxGPU.AffinityDeQueue;
}

struct Object
{
    double* inputA;
    double* inputB;
    double* output;
    unsigned int Width;
    unsigned int Height;
	Object()
	{
	  inputA = NULL;
	  inputB = NULL;
	  output = NULL;
	  if(!ObjectAllocationError)
	  {
        Height = cmd.Height;
        Width = cmd.Width;
	    /////////////////////////////////////////////////////////////////////////
        // Allocate memory 
	    /////////////////////////////////////////////////////////////////////////
        inputA = allocate_mat_d(Height, Width);
        inputB = allocate_mat_d(Width, Height);
        output = allocate_mat_d(Height, Height);
        if (!inputA || !inputB || !output)
		  ObjectAllocationError = -1;
	  }
	};
	~Object()
	{
	  if(inputA) free(inputA);
	  if(inputB) free(inputB);
	  if(output) free(output);
	};
};

Object* ObjectList = NULL;

double* inputA(int iObj)
{
	return ObjectList[iObj].inputA;
}

double* inputB(int iObj)
{
	return ObjectList[iObj].inputB;
}
double* output(int iObj)
{
	return ObjectList[iObj].output;
}
unsigned int Width(int iObj)
{
	return ObjectList[iObj].Width;
}
unsigned int Height(int iObj)
{
	return ObjectList[iObj].Height;
}

int AllocateObjects()
{
	ObjectList = new Object[cmd.Iterations];
	return ObjectAllocationError;
}
void DeleteObjects()
{
    if (ObjectList)
		delete[] ObjectList;
}

void FillObjects()
{
	for (unsigned int i = 0; i < cmd.Iterations; ++i)
	{
      fill_mat_d(
		  ObjectList[i].inputA,
		  ObjectList[i].Height,
		  ObjectList[i].Width,
		  1, RANDOM);
      fill_mat_d(
		  ObjectList[i].inputB,
		  ObjectList[i].Width,
		  ObjectList[i].Height,
		  1, RANDOM);
	}
}

// stub called from Brook+ to initialize
// and start up QuickThread (using defaults)
int DoQtInit()
{
	qtInit	qtInit;
	return qtInit.QueueMain(do_main, NULL);
}

// 1 core matmult
void matmultCPU(double* a, double* b, double* c, int m, int k, int n)
{
    int y;
    for (y = 0; y < m; y++)
    {
        int x;
        for (x = 0; x < n; x++)
        {
            double temp = 0;
            int z;
            for (z = 0; z < k; z++)
            {
                temp += a[y * k + z] * b[z * n + x];
            }

            c[y * n + x] = temp;
        }
    }
}

// Partial matmult stripe
void  matmultCPUparallel_y(
	intptr_t yBegin, intptr_t yEnd,
	double  *a, double  *b, double  *c, int  k, int  n)
{
    intptr_t  x, y;
	for(y = yBegin; y < yEnd; ++ y)
	{
		for (x = 0; x < n; x++)
		{
		  double  temp = 0;
		  int  z;
		  for (z = 0; z < k; z++)
		  {
			temp += a[y * k + z] * b[z * n + x];
		  }
		  c[y * n + x] = temp;
		}
	}
}
#define USE_CPUparallel_L2
#ifdef USE_CPUparallel_L2
// perform blocking parallel_for across all cores sharing L2
void  matmultCPUparallel(
	double  *a, double  *b, double  *c, int  m, int  k, int  n)
{
	// schedule on our thread's L2
	parallel_for(_L2_, matmultCPUparallel_y, 0, m, a, b, c, k, n);
}
#else
// perform blocking parallel_for across all cores
void  matmultCPUparallel(
	double  *a, double  *b, double  *c, int  m, int  k, int  n)
{
	parallel_for(matmultCPUparallel_y, 0, m, a, b, c, k, n);
}
#endif
// perform non-blocking parallel_for
// on all cores except the core driving the GPU
void  matmultCPUxGPUparallel(
	double  *a, double  *b, double  *c, int  m, int  k, int  n)
{
	parallel_for(qtControlCPUxGPU, matmultCPUparallel_y, 0, m, a, b, c, k, n);
	qtControlCPUxGPU.WaitTillDone();
}

// 1 core pick an Object and then matmult
void  DoObjectCPU(intptr_t iObj)
{
	Object& Obj = ObjectList[iObj];
    matmultCPU(Obj.inputA, Obj.inputB, Obj.output, Obj.Height, Obj.Width, Obj.Height);
}

// all cores pick an Object and then matmult
void DoObjectCPUparallel(intptr_t iObj)
{
	Object& Obj = ObjectList[iObj];
    matmultCPUparallel(Obj.inputA, Obj.inputB, Obj.output, Obj.Height, Obj.Width, Obj.Height);
}

// all cores except core driving GPU pick an Object and then matmult
void DoObjectCPUxGPUparallel(intptr_t iObj)
{
	Object& Obj = ObjectList[iObj];
    matmultCPUxGPUparallel(Obj.inputA, Obj.inputB, Obj.output, Obj.Height, Obj.Width, Obj.Height);
}

// template for Brook+ routine to perform GPU matmult
void  double_precision_simple_matmult (const float  Width,
		::brook::stream A,
		::brook::stream B,
		::brook::stream result);

// pick an object and perform matmult on GPU
void DoObjectGPUstream(
	::brook::stream& A, ::brook::stream& B, ::brook::stream& C, intptr_t iObj)
{
	Object& Obj = ObjectList[iObj];
	/////////////////////////////////////////////////////////////////////////
    // Brook code block
	/////////////////////////////////////////////////////////////////////////
    {
		streamRead(A, Obj.inputA);
		streamRead(B, Obj.inputB);

		// Run the brook program 
		double_precision_simple_matmult(
			(float)Obj.Width,
			A,
			B,
			C);
		// Write data back from stream 
		streamWrite(C, Obj.output);
    }
}

// Hack, save copy of Brook+ stream pointers
::brook::stream* pStreamA;
::brook::stream* pStreamB;
::brook::stream* pStreamC;


void DoObjectGPUstreams(void* A, void* B, void* C)
{
	pStreamA = (::brook::stream*)A;
	pStreamB = (::brook::stream*)B;
	pStreamC = (::brook::stream*)C;
}

// pick an object and pass on to GPU for matmult
void DoObjectGPU(intptr_t i)
{
	DoObjectGPUstream(*pStreamA,*pStreamB,*pStreamC,i);
}

// Run through all objects and pass on to GPU
void	DoObjectsGPU()
{
    for (intptr_t i = 0; i < cmd.Iterations; ++i)
    {
		DoObjectGPU(i);
    }
}

// Run through all objects and pass on to CPU (1 core)
void	DoObjectsCPU()
{
    for (intptr_t i = 0; i < cmd.Iterations; ++i)
    {
		DoObjectCPU(i);
    }
}

#ifdef USE_CPUparallel_L2
// Run through all objects and pass on to CPU (n cores in parallel)
// This is run without the GPU so all cores are available
void	DoObjectsCPUparallel()
{
	// Create 1 or more instances of DoObjectCPUparallel
	// split up across (for each) L2 cache
	parallel_for_each(_OneEach_L2_, 0, cmd.Iterations, DoObjectCPUparallel);
}
#else
// Run through all objects and pass on to CPU (n cores in parallel)
// This is run without the GPU so all cores are available
void	DoObjectsCPUparallel()
{
    for (intptr_t i = 0; i < cmd.Iterations; ++i)
    {
		DoObjectCPUparallel(i);
    }
}
#endif
// asynchronous "task" performing GPU matmult on object list
// in competition with CPU in parallel.
void	DoObjectsGPU_each(qt::parallel_for_iterator* p_itr)
{
	// Next perform iterations for our thread
	// The parallel CPU tasks will take some of the iterations
	// Because the GPU is faster than CPU we want to make sure
	// that the GPU finishes up last. To do this we make sure
	// the GPU iteration space reserves some object numbers in
	// advance of calculation. Depending on array sizes the
	// optimal number will vary.
	// Get the GPU first object number
	intptr_t i=p_itr->next();
	// reserve next iteration
	intptr_t iNext1=p_itr->next();
	// reserve next iteration
	intptr_t iNext2=p_itr->next();
	// while not done
	while(i < p_itr->end())
	{
		DoObjectGPU(i);			// process an object on the GPU
		i = iNext1;				// percolate the reserved object numbers
		iNext1 = iNext2;
		iNext2=p_itr->next();	// reserve one more object number
	}
}

// Perform the parallel CPU matmults on the object list
// in competition with the GPU. Use only the CPUs that are
// not the one driving the GPU.
void	DoObjectsCPUxGPU_each(qt::parallel_for_iterator* p_itr)
{
	// Next perform iterations for our thread
	// reserve the last few for the GPU
	for(intptr_t i=p_itr->next(); i+4 < p_itr->end(); i=p_itr->next())
	{
		DoObjectCPUxGPUparallel(i);
	}
}

// Fire up the tests that perform the parallel CPU
// in parallel with GPU working on one iteration space
// of the object list. 
//
void	DoObjectsGPUCPUparallel()
{
	// create an iterator across the number of objects
	// Note, cmd.Iterations was used to specify number of objects.
	parallel_for_iterator	itr(0,cmd.Iterations);
	// make local qtControl
	qtControl	qtControl;
	// local qtControl to assume affinity of qtControlCPUxGPU
	// which will run on any of the CPUs that is not the one
	// designated to drive the GPU
	qtControl.Affinities = qtControlCPUxGPU.Affinities;
	// queue up the parallel CPU task twice
	qtControl.QueueWork(DoObjectsCPUxGPU_each, &itr);
	qtControl.QueueWork(DoObjectsCPUxGPU_each, &itr);
	// Instead of queuing DoObjectsGPU_each as a task
	// we can avoid the overhead and call it directly
	DoObjectsGPU_each(&itr);
}
[/cpp]
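
The heart of the scheme above, a single iteration space consumed first-come-first-served by the GPU driver thread and the CPU workers, can be sketched portably in plain C++ (hypothetical names, and without the look-ahead reservation trick that DoObjectsGPU_each adds on top):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical portable sketch of a shared iteration space: one "fast"
// consumer (standing in for the GPU driver thread) and several "slow"
// consumers (CPU cores) claim object indices from a single atomic
// counter, so each object is processed exactly once, first come first
// served.
struct SharedRange {
    std::atomic<long> next{0};
    long end;
    explicit SharedRange(long e) : end(e) {}
    long take() { return next.fetch_add(1); }  // claim the next index
};
```

Each consumer loops `for (long i = r.take(); i < r.end; i = r.take())`, so faster consumers naturally take more of the objects.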
AJ13
New Contributor I
592 Views

Thank you for the detailed response. I have to go through the TBB scheduler code again. I'll simplify my question.

I have a single function with prototype: bool steal_data_task();

This function will try to steal a data task, and run it on the CPU. It will return true if a task was stolen and executed, hinting that there might be more work. False otherwise.

I want to insert this single call into the TBB scheduler code, so that the worker threads will attempt to steal work from the data task pool via this function call.

I also realized that I needed to write a function void wakeup_all_workers(); that will wake up any sleeping workers, so that when I add data tasks, the workers can start stealing tasks.

Any input on where the steal_data_task() call should go is appreciated. I have to trace through the scheduler code a few more times to actually understand it.
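
The two functions described above can be sketched in portable C++ (this is not the TBB scheduler; the names steal_data_task and wakeup_all_workers follow the post, everything else is a hypothetical stand-in): workers that find no work block on a condition variable, and adding a data task notifies them.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::deque<std::function<void()>> data_tasks;
std::atomic<bool> shutting_down{false};

// Adding a data task plays the role of wakeup_all_workers():
// sleeping workers are notified that there may be work to steal.
void add_data_task(std::function<void()> f) {
    { std::lock_guard<std::mutex> lk(m); data_tasks.push_back(std::move(f)); }
    cv.notify_all();
}

// Returns true if a data task was stolen and executed (hinting that
// there might be more work), false otherwise.
bool steal_data_task() {
    std::function<void()> f;
    {
        std::lock_guard<std::mutex> lk(m);
        if (data_tasks.empty()) return false;
        f = std::move(data_tasks.front());
        data_tasks.pop_front();
    }
    f();  // run it on the CPU, outside the lock
    return true;
}

// A worker's idle path: drain available data tasks, then sleep until
// new work arrives or shutdown is requested.
void worker() {
    for (;;) {
        while (steal_data_task()) {}
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, []{ return !data_tasks.empty() || shutting_down.load(); });
        if (shutting_down.load() && data_tasks.empty()) return;
    }
}
```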

RafSchietekat
Valued Contributor III
592 Views

"I also realized that I needed to write a function void wakeup_all_workers(); that will wake up any sleeping workers, so that when I add data tasks, the workers can start stealing tasks."

See GenericScheduler::mark_pool_full(), but, while you may need it to run full throttle and/or to get any result if the GPU/CPU task was added from an independent thread while TBB was sound asleep, you still have to add steal_data_task() near "fail:" first.

Dmitry_Vyukov
Valued Contributor I
592 Views

Here are the relative performance figures:

Calculate performance with GPU 4.862223 Gflops
Calculate performance with CPU 0.824801 Gflops
Calculate performance with parallel CPUs 2.655731 Gflops
Calculate performance with GPU and parallel CPUs 6.217411 Gflops

Jim, this example indeed shows the power and flexibility of your library.

How many threads did you start in the parallel CPU+GPU example? 4 or 5? Are calls to Brook+/CAL blocking? If so, then I think you must start 5 threads. From the provided numbers, it's possible that only 3 CPU threads were running.

jimdempseyatthecove
Honored Contributor III
592 Views
Quoting - Dmitriy Vyukov

Jim, this example indeed shows the power and flexibility of your library.

How many threads did you start in parallel CPU+GPU example? 4 or 5? Are calls to Brook+/CAL blocking? If so then I think you must start 5 threads. From provided numbers it's possible that only 3 CPU threads was running.


Dmitriy,

I am still a newbie at Brook+ GPGPU programming. CAL (underlying Brook+) has completion routine capability. Brook+ coding (from my limited work with it) appears to wait implicitly. I am assuming at this time that it is in a compute loop watching a shared memory variable; whether this loop contains mm_wait or SwitchToThread I do not know. The coding model I chose, therefore, was (on a Q6600 with 4 HW threads) to launch 4 affinitized compute threads with one of the threads designated as the GPU control thread. Therefore, the CPU portion of the parallel-CPU-in-parallel-with-GPU run was using only 3 threads.

The sample program has a #define variable that can be used to specify whether the concurrent CPU portion of the parallel programming was to include or exclude the GPU-designated thread. As I recall, this #define did not extend the number of compute-bound threads by 1 (to 5). I believe better results will come from making the GPU-controlling thread an I/O class thread. Some experimentation needs to be done to determine the best juxtapositioning of threads. Also, tweaking Brook+ to use SwitchToThread would be in order. I wanted to get sample code out for AJ to review and comment on.

I haven't used CUDA as I do not have an nVidia adapter on my test systems.

Jim Dempsey

robert_jay_gould
Beginner
592 Views


Although it won't really help measure real performance, the CUDA SDK comes with an emulator that simulates the parallel processing and expected gains, etc.

If you are really curious, you could try that out before committing to buy an nVidia card or not.

jimdempseyatthecove
Honored Contributor III
592 Views



Thanks for the tip. The CUDA emulator might help with initial testing. My main development system (Q6600) has only one PCIx16 slot and I'd rather not be popping cards in and out. Also the ATI and nVidia might not live well together even if I had dual PCIx16 slots.

Jim Dempsey

it_d_balajigmail_com
592 Views
Hi,

I am currently looking at this work, and I am curious about what happened to it. How much of it got completed? Could you please direct me to publications, if any, based on this work?

Thanks,
Balaji

jimdempseyatthecove
Honored Contributor III
590 Views

Balaji,

I do not know how far AJ has progressed on his work integrating CUDA with TBB. AJ is quite competent and may have successfully done this (i.e. it is bug free, or at least produces usable code while using documented coding practices).

I have successfully integrated ATI Brook+ into test applications using my threading tool QuickThread. I have not attempted this with CUDA (I haven't downloaded the SDK and do not have a recent nVidia card), but I do not see why CUDA could not be integrated as easily as Brook+ was.

With Brook+ and the AMD FireStream card I have, at least on the version of the SDK I have, the CPU <-> kernel code worked fine as long as the CPU thread remained constant. In my case I also found it wanted to be the main thread of the application as well. This is not a problem when coding in QuickThread, as tasks can easily be directed at one thread or a subset of threads. The test application I ran was a matrix multiply operation performed row-by-row in the outer loop, where the rows are picked by availability of CPU thread and GPU; i.e. some rows are handled by the GPU and others by cores on the CPU in a first come, first served manner.

You might be able to get AJ's email address and communicate with him directly.
My email is (squish out the spaces and change 'at' to @)

jim _ dempsey at ameritech . net

Jim
AJ13
New Contributor I
590 Views
I'm still around, but I check messages very infrequently nowadays. As for "what happened" to my research, it's still ongoing. As research code, none of it is ready for production usage. I have begun to implement real, production code that draws on my experiences with past research versions. However, this is slow going, and I doubt anything will be ready for another year. I have had some big problems deciding what work should be offloaded to the GPU and what should be kept on the CPU. There are also a lot of technologies involved, and hence many possible approaches to integrating GPUs with TBB.

Feel free to drop by IRC and discuss my work, I'm there a lot of the time... on #tbb on irc.freenode.net.

Research publications will be coming soon, I'm under a lot of pressure to start to publish some of the preliminary results. When these publications are ready I'll post a link on the forum :-)