Even for serial code, the OS should be overlapping some of the I/O and computing. Hence subtracting times might not reveal the time spent on I/O. If you remove all the compute stuff and do only the I/O portion, what is the running time?
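For what it's worth, here is a minimal sketch of that I/O-only experiment, assuming the job reads one file in chunks and writes one file. The name `time_io_only`, the `IoTiming` struct, and the 1 MB default chunk size are made up for illustration and are not taken from Mike's code. It writes back everything it reads, so if the real job reduces the data, the write side here will be overstated.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical result type: wall-clock time and byte count of an I/O-only pass.
struct IoTiming {
    double seconds;     // elapsed wall-clock time for the read/write loop
    std::size_t bytes;  // bytes read from the input (and written to the output)
};

// Read in_path in fixed-size chunks and write each chunk to out_path unchanged,
// with no computation in between. This is an assumed stand-in for "remove all
// the compute stuff", not the original program.
IoTiming time_io_only(const std::string& in_path,
                      const std::string& out_path,
                      std::size_t chunk_size = 1 << 20) {
    assert(chunk_size > 0);
    std::ifstream in(in_path, std::ios::binary);
    std::ofstream out(out_path, std::ios::binary);
    std::vector<char> buf(chunk_size);
    std::size_t total = 0;

    auto start = std::chrono::steady_clock::now();
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();  // may be a partial chunk at end of file
        if (got <= 0) break;
        out.write(buf.data(), got);
        total += static_cast<std::size_t>(got);
    }
    out.flush();
    auto stop = std::chrono::steady_clock::now();

    return {std::chrono::duration<double>(stop - start).count(), total};
}
```

Calling `time_io_only("input.dat", "output.dat")` (file names hypothetical) and comparing `seconds` against the full run gives a direct I/O figure rather than one obtained by subtraction. The OS file cache can make a second run look much faster than the first, so the numbers are only comparable under the same cache conditions.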
Arch, in Mike's other dialog on his pipeline attempt, he breaks down the I/O time into reads and writes, though it's not clear how much overlap might exist even between these figures. Given that writing generally takes more time than reading, his proportions (10% read, 7% write) suggest a data reduction operation of some sort.
Mike, if we assume the I/O proportions are correct (even so, I hope you'll perform Arch's suggested experiment to truly separate the I/O and processing times), there should be room to scale at least to 4 processors (83/17 ≈ 4.9). Both pipelines and parallel_while have inherent serialization that limits their scaling, but within those constraints some scaling is possible. My blog series on TBB pipeline for overlapping I/O with processing demonstrates the possibility. I'm not sure why your results are so dissimilar. Perhaps if you can reveal more details of your code organization, we might be able to ferret out possible performance issues.
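To make the 83/17 arithmetic concrete, here is a rough sketch of the estimate under an idealized model that I am assuming for illustration: the I/O fraction runs serially, the compute fraction divides evenly across the workers, and the two overlap perfectly. The function names are hypothetical, and a real pipeline will fall short of these figures.

```cpp
#include <algorithm>
#include <cassert>

// Number of compute workers the serial I/O stage can keep busy:
// compute fraction divided by I/O fraction (0.83 / 0.17 for the figures above).
double saturating_workers(double io_fraction) {
    assert(io_fraction > 0.0 && io_fraction < 1.0);
    return (1.0 - io_fraction) / io_fraction;
}

// Idealized speedup over the serial run with `workers` compute threads.
// Assumption: with perfect overlap, elapsed time is set by whichever is slower,
// the serial I/O stage or the compute stage split across the workers.
double ideal_pipeline_speedup(double io_fraction, int workers) {
    assert(io_fraction > 0.0 && io_fraction < 1.0);
    assert(workers >= 1);
    double compute_time = (1.0 - io_fraction) / workers;
    return 1.0 / std::max(io_fraction, compute_time);
}
```

With an I/O fraction of 0.17, `saturating_workers` comes out a little under 5, which is where the "at least 4 processors" figure comes from. Beyond that point the serial I/O stage is the bottleneck in this model, and `ideal_pipeline_speedup` stops improving no matter how many workers are added.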