Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Serial vs Parallel with hashmap & pipeline: discrepancy

Sensei_S_
Beginner
449 Views

Dear all,

I'm porting another simple I/O intensive piece of code to TBB, but my two versions differ hugely in their results. The serial version uses a unordered_map of strings to ints, and the parallel one accordingly uses a concurrent_hash_map. The pipeline reads strings, and counts them concurrently, as you will see, making use of std::atomic.

The serial code is as follows:

    // Start parsing at the beginning of the file
    seqRawDataIterator it = data_.begin();

    std::string read, s(length_, '-');
    read.reserve(maxReadLen_);
    
    std::unordered_map<std::string, std::size_t> map;
    map.reserve(size_);
    
    std::size_t i = 0, j = 0, counter = 0;

    while (it != data_.end())
    {
        it = nextRead(it, read);
        
        if (read == "") break;
        
        for (j = 0; j <= read.length() - length_; j++)
        {
            std::copy_n(read.begin() + j, length_, s.begin());
        map[s] += 1;
            counter++;
        }

        i++;
    }

My equivalent parallel code is as follows:

    seqRawDataIterator it = data_.begin();
    
    tbb::concurrent_hash_map<std::string, std::size_t> map;
    map.rehash(size_);
        
    std::size_t i = 0;
    std::atomic<std::size_t> counter(0);

    auto reader = [&](tbb::flow_control& fc) -> std::string {
            std::string read;
            
            it = nextRead(it, read);
                  
            if (it != data_.end() && read != "")
            {
                i++;                
                return read;
            }

            // Stop
            fc.stop();
            return "";
        };

    auto parser = [&](std::string s) -> void {
        std::string kmer(length_, '-');

        for (int j = 0; j <= s.length() - length_; j++)
        {
            std::copy_n(s.begin() + j, length_, kmer.begin());

            tbb::concurrent_hash_map<std::string, std::size_t>::accessor accessor;
            map.insert(accessor, s);
            accessor->second += 1;

            counter++;
        }

    };
    
    tbb::parallel_pipeline(100, tbb::make_filter<void, std::string>(tbb::filter::serial, reader) &
                                tbb::make_filter<std::string, void>(tbb::filter::parallel, parser));

Basically, I'm counting all the substrings of length k of a set of strings.

The serial results show:

Sequences: 31,178 (31.18 K)
Num kmers: 943,107 (943.11 K)
Substring: 1,122,408 (1.12 M)

The parallel results are different:

Sequences: 31,177 (31.18 K)
Num kmers: 27,824 (27.82 K)
Substring: 1,122,372 (1.12 M)

There is something profoundly wrong with how I'm using the hash map, maybe, or even the pipeline differs, but I can't see where.

Can you see my errors?

Thanks!

1 Solution
MLema2
New Contributor I

Maybe you should insert 'kmer' instead of 's'


4 Replies
MLema2
New Contributor I

Are we assuming that nextRead cannot fetch fewer than length_ characters? If it can, read.length() - length_ underflows in your sequential code (the subtraction is done in unsigned arithmetic, and j is a size_t).

My guess is that the problem is the pipeline's stop condition. If we assume nextRead advances it and returns the next position, your last sequence won't be processed: after fetching the last read, it points to data_.end(), so the condition fails and you call fc.stop() even though read still holds a valid sequence.

    if (it != data_.end() && read != "")

On a side note, consider testing for string emptiness with the member function empty() instead of comparing against "".

Sensei_S_
Beginner

Thanks Michel,

Yes, nextRead reads at most maxReadLen_ characters, and maxReadLen_ > length_ by definition. You spotted an error in the number of read strings, so I corrected it. Thanks!

However, I might be really tired, so I am not really looking at the code as it is, but as I think it is.

The serial code starts from begin() and enters the loop: it reads a string and advances it; if the string is empty (I will change the code to use read.empty()), it exits; otherwise the string is split into substrings, and each substring is added to the hash map, incrementing the mapped value.

The parallel reader starts with begin() and reads a string, advancing it, so I need to check it before reading; if the string is non-empty, I return it, passing it on to the next parallel filter. The parallel filter uses the same loop, and with an accessor each substring is added to the hash map, incrementing the mapped value.

            if (it != data_.end()) it = nextRead(it, read);

            if (read != "")
            {
                i++;
                
                return read;
            }

            // Stop
            fc.stop();
            return "";

Now the read string number (31,178) is in line with the serial code, as well as the substring count (1,122,408).

However, why do the hash map sizes differ? I need to know how many distinct substrings are in the hash map, so I am using map.size() with both std::unordered_map and tbb::concurrent_hash_map.

Am I missing something?

PS. Again thanks for spotting the bug! :)

 

MLema2
New Contributor I

Maybe you should insert 'kmer' instead of 's'

Sensei_S_
Beginner

I feel ashamed I didn't spot that... Man, I need a rest!
