Today I've been experimenting some more and have run into a few more things that puzzle me. I was wondering if anyone had any hints.
I took yesterday's problem and transformed it from a pipeline to a parallel_while, as either works for me in this case. So my problem is now:
while not end of file
read object from file
calculate attributes on object
The calculate and save steps are the body of the while loop, and the read step is the pop_if_present function.
Overall, the execution times for parallel_while and pipeline are the same, and again I'm getting no speedup when I use more threads. I know the I/O is a small portion of the time, but I did a quick test on a small subset of my data. I changed the parallel_while to something like this:
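For readers following along, the loop structure described above can be sketched roughly as follows. This is not TBB code: it uses plain std::thread so that it stands alone, and Record, RecordStream, calculate and process_all are hypothetical names invented for illustration. The only point it tries to show is that the fetch step (the pop_if_present role) is serialized by a lock while the calculation runs concurrently.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical record type standing in for the "object" read from the file.
struct Record { int id; long attribute; };

// Serialized input: only one thread at a time may fetch the next item,
// which mirrors the role pop_if_present plays in the description above.
class RecordStream {
public:
    explicit RecordStream(int count) : remaining_(count) {}

    bool pop_if_present(Record& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (remaining_ == 0) return false;  // stands in for "end of file"
        out.id = --remaining_;              // stands in for the real fread()
        out.attribute = 0;
        return true;
    }

private:
    std::mutex mutex_;
    int remaining_;
};

// Stand-in for "calculate attributes on object".
inline long calculate(const Record& r) {
    return static_cast<long>(r.id) * r.id;
}

// Runs the loop on `threads` workers and returns the sum of all attributes.
long process_all(int count, unsigned threads) {
    assert(threads > 0);
    RecordStream stream(count);
    std::vector<long> partial(threads, 0);  // one accumulator per worker
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&stream, &partial, t] {
            Record r;
            while (stream.pop_if_present(r)) {
                partial[t] += calculate(r);  // runs outside the lock
            }
        });
    }
    for (auto& worker : pool) worker.join();
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

Calling process_all(100, 1) and process_all(100, 4) should return the same total. If the fetch step dominates, or holds its lock for most of each iteration, adding workers will not help much in a structure like this.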
first read all of my objects outside the parallel_while
while not at the end of my object list
calculate attributes on object
So this is the same as the previous loop, except I read everything first and then process the list in memory (this only works for a small subset of my data, as normally I have multi-gigabyte files to process). This way I do get a speedup from multithreading.
Here are the timings (in ms) on a dual-core AMD machine running Windows.
For the first way, where the file I/O is in the parallel_while:
1 Thread: 1173
2 Threads: 1217
3 Threads: 1228
4 Threads: 1208
For the second way, where the file I/O is done first and then the parallel_while loops over the objects in memory:
Serial: 882 (file I/O accounts for the missing 191 ms from the above)
1 Thread: 955
2 Threads: 659
3 Threads: 680
4 Threads: 665
So from the timing, we see that the file I/O is a small portion of the total time (about 20% in the serial case), and everything else should be perfectly parallel. But in the example where I read the file inside the parallel_while loop, I see no speedup. So the I/O looks suspicious to me. I'm using standard C buffered fread() for my I/O, and since the I/O is done in the pop_if_present routine, it should be serial and should not thrash as I increase the number of threads.
Even for serial code, the OS should be overlapping some of the I/O and computing. Hence subtracting times might not reveal the time spent on I/O. If you remove all the compute stuff and do only the I/O portion, what is the running time?
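One way to run that experiment is a loop that does nothing but fread() and records the wall-clock time. The sketch below is only an illustration: ReadTiming and time_read_only are made-up names, and the chunk size is an arbitrary parameter rather than anything taken from the original program. Note that a second run over the same file may be served from the OS file cache, so the first and later runs can differ considerably.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical result type: bytes read and elapsed wall-clock time.
struct ReadTiming {
    std::size_t bytes;
    double milliseconds;
};

// Reads the whole file in chunks with buffered fread() and does no other work,
// so the elapsed time approximates the I/O cost alone.
ReadTiming time_read_only(const char* path, std::size_t chunk_size) {
    assert(chunk_size > 0);
    ReadTiming result{0, 0.0};
    std::FILE* file = std::fopen(path, "rb");
    if (!file) return result;  // missing file: report zero bytes

    std::vector<char> buffer(chunk_size);
    auto start = std::chrono::steady_clock::now();
    std::size_t n = 0;
    while ((n = std::fread(buffer.data(), 1, buffer.size(), file)) > 0) {
        result.bytes += n;
    }
    auto stop = std::chrono::steady_clock::now();
    std::fclose(file);

    result.milliseconds =
        std::chrono::duration<double, std::milli>(stop - start).count();
    return result;
}
```

Comparing the returned milliseconds against the full run gives a direct I/O figure instead of one inferred by subtraction.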
Arch, in Mike's other dialog on his pipeline attempt he breaks down I/O times into reads and writes, though it's not clear how much overlap might exist even between these figures. Given that writing generally takes more time than reading, his proportions (10% read, 7% write) suggest a data reduction operation of some sort.
Mike, if we assume the I/O proportions are correct (even so, I hope you'll perform Arch's suggested experiment to truly separate the I/O and processing times), there should be room to scale at least to 4 processors (83/17 ≈ 4.9). Both pipelines and parallel_while have inherent serialization that limits their scaling, but within those constraints some scaling is possible. My blog series on TBB pipeline for overlapping I/O with processing demonstrates the possibility. I'm not sure why your results are so dissimilar. Perhaps if you can reveal more details of your code organization, we might be able to ferret out possible performance issues.
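To make that arithmetic concrete, here is a small sketch of two simple scaling models. Both are idealized assumptions and not measurements: amdahl_speedup assumes the serial I/O never overlaps with computation, while overlapped_speedup_bound assumes it overlaps perfectly, so throughput is capped either by the thread count or by the serial stage. The function names are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Classic Amdahl's law: the serial fraction cannot be parallelized
// and does not overlap with the parallel work.
double amdahl_speedup(double serial_fraction, int threads) {
    assert(threads > 0);
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads);
}

// Optimistic bound when the serial stage overlaps fully with processing:
// speedup is limited by the thread count or by 1 / serial_fraction.
double overlapped_speedup_bound(double serial_fraction, int threads) {
    assert(threads > 0);
    assert(serial_fraction > 0.0 && serial_fraction <= 1.0);
    return std::min(static_cast<double>(threads), 1.0 / serial_fraction);
}

// Ratio of parallel work to serial work, i.e. how many workers the
// parallel part could keep busy per unit of serial work (83/17 above).
double parallel_to_serial_ratio(double serial_fraction) {
    assert(serial_fraction > 0.0 && serial_fraction <= 1.0);
    return (1.0 - serial_fraction) / serial_fraction;
}
```

With a serial fraction of 0.17, parallel_to_serial_ratio gives roughly 4.9, and even the pessimistic amdahl_speedup model predicts a noticeable gain at 2 threads, which is what makes a flat timing curve surprising.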
Thanks for the replies,