I am having trouble with applying TBB to an apparently very simple-sounding problem: "process huge number of files in parallel" (whole file gets processed by 1 thread, it is not divided into pieces to be processed).
Current sequential code looks like "std::for_each(file, file_end, file_processor(statistics))". Important to note is that "file" is an InputIterator (it is a boost::filter_iterator<:FILESYSTEM::DIRECTORY_ITERATOR>). Also important to mention is that I/O consumes less then 5% time, the heavy lifting is really done on the computational side (decompressing, XML parsing and document statistics building into a central concurrent_hash_map) - that's why I think it's OK for the I/O (reading of file) to be part of the parallel processing.
The solutions I am/was thinking about are:
- tbb::parallel_for - fail, since it requires a splittable range and to be able to know size of the range, both of which would be easy to calculate if "file" was a RandomAccessIterator, but it's impossible with InputIterator.
- tbb::parallel_for, with list of files in std::vector - I would first do std::copy(file, file_end, std::back_inserter(file_list)), then make a range class with template parameter of the file_list's const_iterator, which, being Random access iterator, satisfies requirements on Value template parameter of blocked range. This is a solution, but not an ideal (more like a workaround), and also I don't like the fact that the processing does not start right away, but waits for list of files to be found and built. Also, the operator() of body must be const, but I am modifying a huge concurrent_hash_map based on data in the files - that does not go well together (unless I use "mutable") - I am not sure if it's just a formal issue, or (more likely) a guard against misuse. Let's look for something better.
- tbb::parallel_do - it's signature is just like std::for_each, and requirements on the iterator as well (input iterators suffice), so it seems like I can simply replace std::for_each in my present code with tbb::parallel_do. But there is a catch: "The parallelism in parallel_do is not scalable if all of the items come from the input
stream. To achieve scaling, design your algorithm such that the body often adds more
than one piece of work, or use parallel_for instead." The only source of work are the input iterator-specified items. I could make the body nearly empty and use it to feed the actual work, but that would add only one item per body call (more than one is needed for some speedup). And again, "operator() (...)_const_". Btw. did I understand correctly that the citation from TBB's documentation says that parallel_do in my case would behave with the same performance as a serial std::for_each? Or was it meant that the performance will still increase with more cores, just the "n", where "n" is number of cores when the performance increase will stop, will be smaller?
- tbb::pipeline, with first serial stage producing the path to be processed, and second parallel stage to do all the work (i.e. 1 token = 1 file path). Seems like this is a "clean" solution, even though not as straightforward as a theoretical parallel version of std::for_each would be.
Now how to do the job in the best way possible: 2. is not really scalable in general (i.e. if the purpose of the iterator being InputIterator was gradual reading and processing of an otherwise huge input, i.e. input much bigger than amount of RAM, then std::copy would be a performance killer, if the program would not be killed by the user/OS for consuming too much memory). While it is not really an issue in my case (I have 7000 files to process), I would like still to apply the "right" approach, so it seems that my final conslusion goes with 4.
That being said, parallelizing std::for_each (1 line of code) results in much more lines of boilerplate code (2 tbb::filter-derived class definitions), even though trivial.
So the final questions accompanying above text are:
Am I missing some better way?
Isn't there a hidden and shy tbb::parallel_for_each hiding in the corner (with the same semantics as std::for_each, just parallelized and out-of-order)? If not, is it planned for some future release?
Am I wrong anywhere above in the analysis (i.e. parallel_do scalability question)?
What about operator() (...) being const in 1.-3., when I need to update the central concurrent_hash_map (kept by reference)?
Any comments on my analysis would be greatly appreciated.
Thanks for your response.
In the meantime, I implemented solution 4, on a 2-core testing machine, processing gets from 15.8s for 1 core down to 8.5s with 2 cores (for 3000 files), that's approx. 1.85x speed-up. I am not sure whether the remaining x0.15 is because intensively updating the same concurrent_hash_map (for each word in input document, a record is either inserted or updated), or whether the remaining x0.15 come because of waiting on I/O (reminds me of a PDC talk about Visual Studio 2010 and its ability to answer exactly this question.)
Also I will try to run this on the production 4-core machine to see how it scales (after hunting down a parallal-only bug).
I would also try parallel_do (solution 3) as you suggest - but what about void operator()(...) being const? (I do need to update the central hash_map, kept as a non-const reference private member of the body.)
if I was given a task that you described, parallel_do would be my first attempt, with pipeline coming second.
As you said, the signature of parallel_do is just like for_each. Though the potential scalability issue we described in documentation might sound scary, what we really wanted to say is that if you have a sequential interface, like InputIterator, you should know that scalability will be limited (remember Amdahl's law). But it does not mean that performance is limitedto that of serial execution; it will still increase with number of cores, and since you said most of cycles is spent on real processing, my guess would be that parallel_do would work good enough for 4 cores. Though it is never bad to think of future scalability :), soyou might try to estimate it using the data for 1, 2, 4 cores and Amdahl's law. As for accessing concurrent_hash_map, either make it mutable data member (this way you basically say that it is thread safe on its own) or access via a pointer like Raf suggested - both makes perfect sense to me.
As for speedup achieved with the pipeline, it sounds pretty good. I'd attribute remaining percents to the synchronization inside concurrent_hash_map, since you said it is used extensively. For better understanding of what's going on, you could try performance analyzers. Intel provides a few for different needs, from Performance Tuning Utilities on WhatIf.intel.com, to Intel Parallel Amplifier Beta.
"As for accessing concurrent_hash_map, either make it mutable data member (this way you basically say that it is thread safe on its own) or access via a pointer like Raf suggested - both makes perfect sense to me." Well, no, I only suggested the use of "mutable" in that other thread because the person who asked the question made a specific effort to hide the exact purpose or context: I didn't feel like speculating, and therefore decided to take the questionliterally until given more information. Here I already have enough information to know that only a pointer or reference will do (a reference is equivalent to a const pointer (to be distinguished from pointer to const)), because parallel_do will create several Body instances that are supposed to all share a single hash map.
I only meant that you suggested pointer, and I suggested mutable on my own (in this thread). But in fact, you are right thatnon-const reference should work just fine. From Boris's post starting the thread, I got impression that he already tried it and it does not work; but possibly it was a wrong guess of mine. On a simplified test, I checked that accessing non-const methods via non-const reference to another object (e.g concurrent_hash_map&) work just fine in a const method of a given object (e.g Body of parallel_do).
Meanwhile declaring mutable references is non-standard, while using reference (or pointer) is essential to access the same global hash map. Thus my suggestion of using mutable was just wrong, and I withdraw it.