Re: parallel Depth first search

pramodhremo · ‎10-21-2009

im an undergraduate student. i tired to implement depth first search parallelly using intel tbb. i didnt know which part of the algorithm to parallelize so i decided to run serial recursive DFS with 8 processors at a time. my idea here is that 8 processors run their individual DFS at same time on the graph. i used a concurrent_vector to note the visited nodes and im printing the visited nodes.
But i dont think im making use of all the processors here. when i run "top" command to see the cpu usage only 1 processor completes the job.
As im new to TBB i didnt know any other data structures or parallel algorithms to use.

im attaching my code here.....ur ideas+suggestions+corrections will help me sooooo much ...thankQ in adv.

robert-reed · ‎10-21-2009

Quoting - pramodhremo

im an undergraduate student. i tired to implement depth first search parallelly using intel tbb. i didnt know which part of the algorithm to parallelize so i decided to run serial recursive DFS with 8 processors at a time. my idea here is that 8 processors run their individual DFS at same time on the graph. i used a concurrent_vector to note the visited nodes and im printing the visited nodes.
But i dont think im making use of all the processors here. when i run "top" command to see the cpu usage only 1 processor completes the job.
As im new to TBB i didnt know any other data structures or parallel algorithms to use.

im attaching my code here.....ur ideas+suggestions+corrections will help me sooooo much ...thankQ in adv.

Sorry to say, there's lots wrong with this code so I'm not surprised that you never saw more than one HW thread executing concurrently in it. Here's a few hints.

The selected blocked_range, with an interval [1,8) and a grain size 2, is only providing enough task division to support about 4 HW threads. (With TBB 2.2 the default partitioner becomes the auto-partitioner, which might give a better result than I suggest.)
I bet a substantial part of the program is the function initial(), which creates the graph serially and while waiting for file I/O. It would be interesting to run a hot spot analysis and see how much time is actually spent in the DFS.
The concurrent_vector here serves no purpose since it is never concurrently grown. The only concurrent accesses happen inside dfs(), where you have innocuous data races as nodes are marked as visited and some possibility of false sharing conflicts (thrashing)as different HW threads try to mark nodes as visited and may displace each other's cache line copies.
The DFS basically has no work, so most of the time in this algorithm will be spent in management overhead.
Both m and visited are global data structures but are passed as parameters as if that provides some different kind of access, but m is passed into the first dfs call and then just passed down the recursion chain (as p) without ever being changed. You can have that access just from the global.

The big thing to consider when trying to write a parallel version of algorithms is what bits can be done without interfering with other bits. This dependence analysis helps to figure out what data structures can be safely and concurrently accessed and which cannot. That's where the art of parallel programming lives these days. If you've got the funds, I'd invest in a parallel programming book to learn the basics. I'm sure there are several people on this forum who can suggest specific titles.

pramodhremo · ‎10-22-2009

Quoting - Robert Reed (Intel)

Sorry to say, there's lots wrong with this code so I'm not surprised that you never saw more than one HW thread executing concurrently in it. Here's a few hints.

The selected blocked_range, with an interval [1,8) and a grain size 2, is only providing enough task division to support about 4 HW threads. (With TBB 2.2 the default partitioner becomes the auto-partitioner, which might give a better result than I suggest.)

I bet a substantial part of the program is the function initial(), which creates the graph serially and while waiting for file I/O. It would be interesting to run a hot spot analysis and see how much time is actually spent in the DFS.

The concurrent_vector here serves no purpose since it is never concurrently grown. The only concurrent accesses happen inside dfs(), where you have innocuous data races as nodes are marked as visited and some possibility of false sharing conflicts (thrashing)as different HW threads try to mark nodes as visited and may displace each other's cache line copies.

The DFS basically has no work, so most of the time in this algorithm will be spent in management overhead.

Both m and visited are global data structures but are passed as parameters as if that provides some different kind of access, but m is passed into the first dfs call and then just passed down the recursion chain (as p) without ever being changed. You can have that access just from the global.

The big thing to consider when trying to write a parallel version of algorithms is what bits can be done without interfering with other bits. This dependence analysis helps to figure out what data structures can be safely and concurrently accessed and which cannot. That's where the art of parallel programming lives these days. If you've got the funds, I'd invest in a parallel programming book to learn the basics. I'm sure there are several people on this forum who can suggest specific titles.

ThanQ first of all for taking time in reading my stupid code and the hints you have given.

the hint #3 was expected but i didnt know how to stop this. I thought concurrent_vector in TBB would take care of it.
I tired with different gain sizes in blocked_range but the threads are not executing concurrently. I dont understand why i couldnt see threads executing concurrently even after not using any locks.
my first interest here is to make all threads run concurrently even though they give an unexpected output. so please help me with this part.
Should i try and modify a NON-RECURSIVE SERIAL DFS using stacks for each thread to run parallelly rather than a recursive one ??
if you can help me correcting and modifying the previous code will be of great help.

I dont have much practical experience with parallel programing. I have brought books on parallel programing from college library. If you can suggest a particular book it will be of help to me.

vu64 · ‎10-22-2009

About books, the Nutshell Intel TBB book and "The Art of concurrency" by Clay Beshears would be nice. Also look out for blogs, forums and online articles.