This post describes some experiences and experiments from using TBB to parallelize part of a large bioinformatics genome assembly program, one that involves large amounts of disk I/O and a large memory footprint. I invite any and all feedback: similar experiences, criticisms of the methodology, or of the conclusions. I know it's long, so bear with me.
I started this work with two existing phases of the larger program, which amount to graph building and tip removal. The genetic data used as input arrives as a large file (or set of files) of relatively homogeneous data, consisting of millions of short "reads" from the original DNA exemplar. There is no inherent order to the input data - the reads can be processed in any order, and the final graph is deterministic (modulo layout in memory, of course). Interesting data sets are several gigabytes and larger (e.g. a yeast genome with 50x coverage as a single file is 3.5GB on disk). The graph-building phase stitches the genome back together, read by read. The second phase does "tip removal", where weak connections in the full-sized graph are removed, in an effort to eliminate singleton, "one-off" errors in the input data (which occur because the reads are collected by a chemical process that introduces some corrupt reads).
Computationally, the graph building involves building a large data structure in memory from on-disk input data. This is I/O and memory intensive, but the actual computation is fairly minimal: some hashing, some counter updates, and lots of pointer chasing. A hash table is used to allow (relatively) cheap lookup to discover whether a node already exists for a read that has been seen before. The second phase is similar, but operates only on in-memory data, shrinking the structure by un-linking small connections and removing extremely rarely touched sections. This regularly yields a factor-of-ten reduction in the size of the graph in memory. Neither phase has meaningful cache locality, touching data across the entire graph as building and removal proceed. There is some locality possible in the hash table, but aliasing concerns push this table beyond what reasonably fits even in the outer levels of cache. In a sense, if we could guess a priori which nodes should be co-located in memory, we would have answered the original question of what order the reads were connected in.
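To give a flavor of the inner loop, the lookup-or-insert step is roughly shaped like the sketch below (sequential form only; the names Node, find_or_insert and so on are illustrative, not the actual code - the parallel version needs the atomic operations discussed next):

#include <cstddef>

// Illustrative sketch: hash a read's key, chase the bucket chain, bump a
// counter if the node already exists, otherwise link in a new node.
struct Node {
    unsigned long key;    // hashed identity of the read
    unsigned int  count;  // how many times this read has been seen
    Node*         next;   // chain within the hash bucket
    // ... graph links omitted ...
};

Node* find_or_insert(Node** table, size_t nbuckets, unsigned long key) {
    size_t b = key % nbuckets;
    for (Node* n = table[b]; n != NULL; n = n->next)
        if (n->key == key) { ++n->count; return n; }  // seen before: cheap update
    Node* n = new Node();                             // first sighting: insert it
    n->key = key; n->count = 1; n->next = table[b];
    table[b] = n;
    return n;
}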
To parallelize the above, I ended up using a tbb::pipeline for the graph-building phase and a tbb::parallel_for loop for the tip-removal phase. I also leverage the tbb::atomic data types in some cases, but usually I had to fall back on compiler-specific intrinsics for fetch_and_phi and compare_and_swap operations. As an example of why, consider the atomic type and a linked data structure. For a linked list, it's common to have a struct that looks something like this:
struct T { ...; struct T* next; };
You can't use the atomic pointer type here, because one of the initialization steps for that object is to take the size of the templated type, to set up pointer arithmetic, and of course the type is not yet complete. I could have used a void*, which skips this step, but that would litter my code with static_cast calls and generally make the same mess that fetch_and_phi macros do. C'est la vie.
Edit: And of course, this particular example is a problem fixed in an update later than the one I started with.
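For concreteness, the shape of the problem and the workaround I fell back on look roughly like this (illustrative names; the GCC builtin is shown, other compilers need their own equivalent):

struct T {
    int data;
    // tbb::atomic<T*> next;  // what I wanted: rejected by the TBB version I
    //                        // started with, because T is still incomplete here
    T* next;                  // what I used instead: a plain pointer...
};

// ...updated with a compiler intrinsic when pushing onto a shared list.
void push_front(T** head, T* node) {
    T* old_head;
    do {
        old_head = *head;
        node->next = old_head;
    } while (!__sync_bool_compare_and_swap(head, old_head, node));
}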
Graph building:
The original implementation I built for this phase was a parallel_for loop over a vector of input files, which got memory-mapped and faulted in as needed. This implementation sucked - it didn't get close to the disk's throughput potential, couldn't handle imbalanced input files well (nor single input files at all), and always ran slower than the sequential version. I also discovered that my system was spending an enormous amount of time waiting on disk seeks, which I traced back to having destroyed the sequential access pattern to the data files - a pattern the file system had handled well enough on its own. The pipeline version addressed all of the above problems: a serial "pre-fault" filter maps in a "chunk" of the next input file and touches each page in sequence to ensure that the data really is in memory, then passes it to a parallel "graph-build" filter, which takes the large character buffer of reads and adds those reads to the graph. This restored the sequential access pattern presented to the file system, getting us within a factor of 2 of the maximum disk throughput (measured against the results of the Linux utility hdparm, which is an ideal case).
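Stripped of the application details, the pipeline has roughly the shape sketched below; Chunk, map_next_chunk() and add_reads_to_graph() are stand-ins for the real code.

#include <cstddef>
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"

struct Chunk { char* base; size_t len; };

Chunk* map_next_chunk();                           // mmap the next slice of an input file (stand-in)
void   add_reads_to_graph(char* buf, size_t len);  // parse reads and insert them (stand-in)

class PreFaultFilter : public tbb::filter {
public:
    PreFaultFilter() : tbb::filter(serial_in_order) {}
    void* operator()(void*) {
        Chunk* c = map_next_chunk();
        if (!c) return NULL;                       // no more input: end the pipeline
        char sink = 0;
        for (size_t off = 0; off < c->len; off += 4096)
            sink += c->base[off];                  // touch each page so the data is resident
        (void)sink;
        return c;
    }
};

class GraphBuildFilter : public tbb::filter {
public:
    GraphBuildFilter() : tbb::filter(parallel) {}
    void* operator()(void* item) {
        Chunk* c = static_cast<Chunk*>(item);
        add_reads_to_graph(c->base, c->len);       // the only filter that runs on many threads
        delete c;
        return NULL;
    }
};

void build_graph(size_t ntokens) {
    tbb::task_scheduler_init init;                 // default number of worker threads
    PreFaultFilter prefault;
    GraphBuildFilter build;
    tbb::pipeline p;
    p.add_filter(prefault);
    p.add_filter(build);
    p.run(ntokens);                                // ntokens = chunks allowed in flight
    p.clear();
}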
Results:
On a 4-socket Core 2-based system with a large RAIDed disk array, we also saw good scaling up to 12 threads (roughly a 6x speed improvement), after which point memory bandwidth became the new bottleneck. Serialized pre-faulting vs. faulting the data in as you go makes a factor of 2-3 difference in run time, holding all else constant, while still below the memory-bandwidth ceiling. The live token count in the pipeline doesn't have much effect on performance beyond the scaling factor, in that there is little practical difference between 12 tokens (where the speed-up tops out on 24 cores) and 24 tokens (which would allow work to proceed on all cores). Dropping to 10 tokens has a sizable negative effect on run time, but anywhere from 24 to 200 tokens gives similar run times.
As a one-off experiment, I also tried running P+1 threads on P processors, with P live tokens, in an attempt to force one thread to dedicate itself to doing I/O alone, but this had little effect on run times. Given the memory bandwidth limit at this number of worker threads, this isn't too surprising, but it seemed like an interesting design point.
Tip removal:
The final implementation divides the hash table across a parallel_for loop, using the auto_partitioner, and each task walks through its assigned range of hash buckets doing two things. First, any time the end of a tip is detected (graph connections in only one direction), the task walks down to the base of the tip (looking for the connection point to a "trunk", i.e. a useful non-tip) and marks all of the nodes along the way as "disconnected". Second, each task is responsible for actually de-allocating any node marked as disconnected that sits in a hash bucket it owns. In this way, any thread may mark a node as removable, but only one thread will ever try to actually remove that node. This approach requires the same number of passes over the graph as the original, serial version of the code (i.e. removing one tip may reveal another one), and avoids data races. Passes continue until no changes are made, at which point the phase is over.
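In outline, each pass looks something like the following; Bucket, Node, and the helper functions are stand-ins for the real data structures.

#include <cstddef>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

struct Node { Node* next; bool disconnected; /* ... graph links ... */ };
struct Bucket { Node* head; };

bool is_tip_end(const Node* n);        // connections in one direction only? (stand-in)
void mark_tip_path(Node* n);           // walk back to the trunk, marking nodes (stand-in)
void free_node(Bucket* b, Node* n);    // unlink and deallocate within this bucket (stand-in)
bool any_changes();                    // did this pass mark or remove anything? (stand-in)

struct TipRemovalBody {
    Bucket* buckets;
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t b = r.begin(); b != r.end(); ++b) {
            for (Node* n = buckets[b].head; n != NULL; ) {
                Node* next = n->next;          // grab next before we might free n
                if (is_tip_end(n))
                    mark_tip_path(n);          // any thread may mark, even in other buckets
                if (n->disconnected)
                    free_node(&buckets[b], n); // but only the bucket's owner ever frees
                n = next;
            }
        }
    }
};

void tip_removal(Bucket* buckets, size_t nbuckets) {
    do {
        TipRemovalBody body;
        body.buckets = buckets;
        tbb::parallel_for(tbb::blocked_range<size_t>(0, nbuckets),
                          body, tbb::auto_partitioner());
    } while (any_changes());               // removing one tip may expose another
}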
Results:
The auto_partitioner and affinity_partitioner show little difference in their effect on performance, probably because of the lack of cache locality noted before. Simple partitioning makes no sense on a 16-million-entry hash table, and it shows. Over-committing worker threads appears to have only a small effect on performance: up to 32x the suggested number of worker threads, the overhead stays within the noise of the timing data, less than 5%.
"Edit: And of course, this particular example is a problem fixed in an update later than the one I started with."
Or in my patch more than five months ago...
Are there any other limitations that would force you to bypass atomic? Would you inherit from atomic to implement fetch_and_phi, or should it take a function object? Any other ideas?
"10 tokens has a sizable effect on run-time, but anywhere from 24 to 200 tokens have similar run-times."
Would you still bother with tuning the pipeline if you didn't have to, e.g., because of something like the example's circular buffer? I have submitted a proposal for a pipeline reimplementation that should tune itself, although I'm not yet confident that it actually works. :-)
"Over-commitment of worker threads appears to have only a small effect on performance. Up to a factor of 32x the suggested number of worker threads has overhead within the noise of timing data, less than 5%."
Maybe because cache locality plays no role here?
"Or in my patch more than five months ago..."
As you say, though I don't have five months of history examining details of the TBB source. I'm still pretty new here. :)
"Are there any other limitations that would force you to bypass atomic? Would you inherit from atomic to implement fetch_and_phi, or should it take a function object? Any other ideas?"
The only reason I'd be concerned about re-tooling to use the atomic primitives is that a lot of the code is arranged to make building with TBB optional. I use my own wrapper function throughout the code to allow fetch_and_phi operations to either be a compiler-specific macro, or just "do the op" if I'm not building a parallel version of the code. Setting that aside, I can't see any reason I couldn't go back and use atomic variables.
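Something like this, in spirit (my names here, not the exact code; BUILD_PARALLEL is a hypothetical build flag):

#ifdef BUILD_PARALLEL
static inline int fetch_and_add(int* p, int v) {
    return __sync_fetch_and_add(p, v);   // GCC builtin; other compilers get their own macro
}
#else
static inline int fetch_and_add(int* p, int v) {
    int old = *p;                        // serial build: no other threads touch *p
    *p += v;
    return old;
}
#endif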
The other issue that I ran into was trying to dump the value of a tbb::atomic with a printf; yes, I'm a little behind the times, using C-style output, but the following hack shouldn't really be necessary.
printf("Atomic int value: %dn", atomic_int + 0);
"Would you still bother with tuning the pipeline if you didn't have to, e.g., because of something like the example's circular buffer?"
I was only testing the sensitivity out of my own curiosity. I was interested in the idea of dealing with blocked worker threads by over-commitment. Further, it was interesting to contrast the token count's control over the number of active threads vs. adjusting the worker thread count manually.
"Maybe because cache locality plays no role here?"
This is certainly the case, but I had expected to see more overhead from tasks getting 'trapped' on de-scheduled threads, or perhaps some overhead coming from having diluted the victim pool for random work stealing.
"The other issue that I ran into was trying to dump the value of a tbb::atomic with a printf; yes, I'm a little behind the times, using C-style output, but the following hack shouldn't really be necessary.
printf("Atomic int value: %dn", atomic_int + 0);"
You could also just cast; I thought a load() operation might be nice to have as well (especially to be able to specify non-default memory semantics).
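For instance, assuming atomic_int is a tbb::atomic<int>:

printf("Atomic int value: %d\n", (int)atomic_int);   // the explicit cast performs the load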
"I was interested in the idea of dealing with blocked worker threads by over-commitment. Further, it was interesting to contrast the token count's control over the number of active threads vs. adjusting the worker thread count manually."
Only task_scheduler_init's argument (the number of threads) can create over-commitment; pipeline::run()'s argument merely limits the number of tasks deployed by the pipeline: of the items not waiting in a serial filter's input buffer, only as many as TBB decides will be scheduled to execute, and then the kernel has its final say, of course.
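To illustrate the two knobs (values purely for illustration):

#include "tbb/task_scheduler_init.h"
#include "tbb/pipeline.h"

void run_overcommitted(tbb::pipeline& pipeline, int P) {
    tbb::task_scheduler_init init(2 * P);  // thread count: the only source of over-commitment
    pipeline.run(P);                       // token count: caps items in flight, not threads
}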
printf("Atomic int value: %dn", atomic_int + 0);"
You could also just cast; I thought a load() operation might be nice to have as well (especially to be able to specify non-default memory semantics).
"I was interested in the idea of dealing with blocked worker threads by over-commitment. Further, it was interesting to contrast the token count's control over the number of active threads vs. adjusting the worker thread count manually."
Only task_scheduler_init's argument can create over-commitment (number of threads); pipeline::run()'s argument merely limits the number of tasks deployed by the pipeline: of those not waiting in a serial filter's input buffer, only as many as TBB decides would be scheduled to execute, and then the kernel has its final say of course.