Dear all,
I'm porting another simple I/O-intensive piece of code to TBB, but my two versions differ hugely in their results. The serial version uses an unordered_map from strings to ints, and the parallel one accordingly uses a concurrent_hash_map. The pipeline reads strings and counts their substrings concurrently, as you will see, making use of std::atomic.
The serial code is as follows:
// Start parsing at the beginning of the file
seqRawDataIterator it = data_.begin();
std::string read, s(length_, '-');
read.reserve(maxReadLen_);
std::unordered_map<std::string, std::size_t> map;
map.reserve(size_);
std::size_t i = 0, j = 0, counter = 0;
while (it != data_.end()) {
    it = nextRead(it, read);
    if (read == "")
        break;
    for (j = 0; j <= read.length() - length_; j++) {
        std::copy_n(read.begin() + j, length_, s.begin());
        map[s] += 1;
        counter++;
    }
    i++;
}
My equivalent parallel code is as follows:
seqRawDataIterator it = data_.begin();
tbb::concurrent_hash_map<std::string, std::size_t> map;
map.rehash(size_);
std::size_t i = 0;
std::atomic<std::size_t> counter(0);

auto reader = [&](tbb::flow_control& fc) -> std::string {
    std::string read;
    it = nextRead(it, read);
    if (it != data_.end() && read != "") {
        i++;
        return read;
    }
    // Stop
    fc.stop();
    return "";
};

auto parser = [&](std::string s) -> void {
    std::string kmer(length_, '-');
    for (int j = 0; j <= s.length() - length_; j++) {
        std::copy_n(s.begin() + j, length_, kmer.begin());
        tbb::concurrent_hash_map<std::string, std::size_t>::accessor accessor;
        map.insert(accessor, s);
        accessor->second += 1;
        counter++;
    }
};

tbb::parallel_pipeline(100,
    tbb::make_filter<void, std::string>(tbb::filter::serial, reader) &
    tbb::make_filter<std::string, void>(tbb::filter::parallel, parser));
Basically, I'm counting all the substrings of length k of a set of strings.
The serial results show:
Sequences: 31,178 (31.18 K)
Num kmers: 943,107 (943.11 K)
Substring: 1,122,408 (1.12 M)
The parallel results are different:
Sequences: 31,177 (31.18 K)
Num kmers: 27,824 (27.82 K)
Substring: 1,122,372 (1.12 M)
There is something profoundly wrong, maybe with how I'm using the hash map, or maybe the pipeline behaves differently, but I can't see where.
Can you see my errors?
Thanks!
Are we assuming that nextRead cannot fetch fewer than length_ characters? If it can, you have unsigned integer overflow in your sequential code: read.length() - length_ wraps around, since j is a size_t.
My guess for your problem is the pipeline stop condition. If we assume nextRead advances it and returns the next position, your last sequence won't be processed because of the condition: after fetching the final read, it points to data_.end(), so you call fc.stop() and that read is discarded.
if (it != data_.end() && read != "")
On a side note, you should consider testing for string emptiness with the member function empty() instead of comparing with "".
Thanks Michel,
Yes, nextRead reads at most maxReadLen_ characters, and maxReadLen_ > length_ by definition. You spotted an error in the number of read strings, so I corrected it. Thanks!
However, I might be really tired, because I keep seeing the code not as it is, but as I think it is.
The serial code starts from begin() and enters the loop: it reads a string and advances the iterator; if the string is empty (I will change the code to use read.empty()), it exits; otherwise the string is split into substrings, and each substring is added to the hash map, incrementing the mapped value.
The parallel reader starts with begin() and reads a string, advancing the iterator, so I need to check it before reading; if the string is non-empty, I return it, passing it to the next parallel filter. The parallel filter uses the same loop and, with an accessor, each substring is added to the hash map, incrementing the mapped value.
if (it != data_.end())
    it = nextRead(it, read);
if (read != "") {
    i++;
    return read;
}
// Stop
fc.stop();
return "";
Now the read string number (31,178) is in line with the serial code, as well as the substring count (1,122,408).
However, why do the hash map sizes differ? I need to know how many distinct substrings are in the hash map, so I am using map.size(), with both std::unordered_map and tbb::concurrent_hash_map.
Am I missing something?
PS. Again thanks for spotting the bug! :)
Maybe you should insert 'kmer' instead of 's'.
I feel ashamed I didn't spot that... Man, I need a rest!