Intel® oneAPI Threading Building Blocks

Parallel work coming from serial input

Yesterday I wrote about a problem I was having where my pipeline didn't speed up like I thought it would:

Today, I've been playing around more, and have learned some more stuff that is puzzling to me and I was wondering if anyone had any hints.

I took yesterday's problem and transformed it from a pipeline to a parallel_while, as either works for me in this case. So my problem is now:

while not end of file
    read object from file
    calculate attributes on object
    save attributes

The calculate and save steps are the body of the while loop, and the read step is the pop_if_present function.
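For readers who want to see the shape of this loop in code, here is a minimal sketch in standard C++ rather than TBB. It is not the poster's code: `Object`, `SerialSource`, `calculate`, and `run` are hypothetical names, a mutex-guarded `pop_if_present` stands in for the serial read step, and plain `std::thread` workers stand in for the parallel_while body.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical record type; the real objects are not shown in the post.
// `slot` is the position where this object's result is saved.
struct Object {
    std::size_t slot;
    int value;
};

// Serial input stage: hands out one object at a time. The mutex makes the
// "read" step execute serially no matter how many workers call it.
class SerialSource {
public:
    explicit SerialSource(std::vector<Object> items) : items_(std::move(items)) {}

    bool pop_if_present(Object& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (next_ == items_.size()) return false;  // "end of file"
        out = items_[next_++];
        return true;
    }

private:
    std::mutex mutex_;
    std::vector<Object> items_;
    std::size_t next_ = 0;
};

// Stand-in for "calculate attributes on object".
int calculate(const Object& o) { return o.value * o.value; }

// Each worker repeatedly pulls an object (serial) and processes it (parallel).
std::vector<int> run(SerialSource& source, std::size_t count, unsigned n_threads) {
    assert(n_threads > 0);
    std::vector<int> results(count);
    auto worker = [&] {
        Object o;
        while (source.pop_if_present(o)) {
            results[o.slot] = calculate(o);  // "save": each slot has one writer
        }
    };
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n_threads; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return results;
}
```

The point of the sketch is only that the read step is a serial section every worker must pass through, while the calculate and save steps can overlap across threads.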

Overall, the execution times for parallel_while and pipeline are the same, and again I'm getting no speedup when I use more threads. I know the I/O is a small portion of the time, but I did a quick test on a small subset of my data. I changed the parallel_while to something like this:

first read all of my objects outside the parallel_while
while not at the end of my object list
    calculate attributes on object
    save attributes

So this is the same as the previous loop, except I read everything first and then process the list in memory (this only works for a small subset of my data, as normally I have multi-gigabyte files to process). This way I do get a speedup from multithreading.
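A rough sketch of this second variant, again in standard C++ and not the poster's code: the objects are already in memory, so the workers only need a shared counter to claim the next index. `process_preloaded` and `calculate` are hypothetical names, and an atomic index stands in for whatever work distribution parallel_while does internally.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for "calculate attributes on object".
int calculate(int value) { return value * value; }

// Process a list that was read into memory beforehand. No I/O happens here;
// the only shared state is the atomic index used to claim the next object.
std::vector<int> process_preloaded(const std::vector<int>& objects, unsigned n_threads) {
    assert(n_threads > 0);
    std::vector<int> attributes(objects.size());
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);  // claim one object
            if (i >= objects.size()) break;           // end of the list
            attributes[i] = calculate(objects[i]);    // "save attributes"
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return attributes;
}
```

Compared with the first variant, the serial section shrinks to a single atomic increment per object, which is one reason the in-memory version is a useful baseline when hunting for the real bottleneck.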

Here's the timing (times in ms) on a dual core AMD running Windows

For the first way, where the file I/O is inside the parallel_while:

Serial: 1073
1 Thread: 1173
2 Threads: 1217
3 Threads: 1228
4 Threads: 1208

For the second way, where the file I/O is done first and then the parallel_while loops over the objects in memory:

Serial: 882 (file I/O accounts for the missing 191 ms from the above)
1 Thread: 955
2 Threads: 659
3 Threads: 680
4 Threads: 665

So from the timing, we see that the file I/O is a small portion of the total time (about 20% in the serial case), and everything else should be perfectly parallel. But in the example where I read the file inside the parallel_while loop, I see no speedup. So the I/O is suspicious to me. I'm using standard C buffered fread() for my I/O, and since the I/O is done in the pop_if_present routine, it should be serial and should not be thrashing as I increase the number of threads.

Any Hints?



Even for serial code, the OS should be overlapping some of the I/O and computing. Hence subtracting times might not reveal the time spent on I/O. If you remove all the compute stuff and do only the I/O portion, what is the running time?

Arch, in Mike's other dialog on his pipeline attempt he breaks down I/O times into reads and writes, though it's not clear how much overlap might exist even between these figures. Given that writing generally takes more time than reading, his proportions (10% read, 7% write) suggest a data reduction operation of some sort.

Mike, if we assume the I/O proportions are correct (even so, I hope you'll perform Arch's suggested experiment to truly separate the I/O and processing times), there should be room to scale at least to 4 processors (83/17 ≈ 4.9). Both pipelines and parallel_while have inherent serialization that limits their scaling, but within those constraints some scaling is possible. My blog series on TBB pipeline for overlapping I/O with processing demonstrates the possibility. I'm not sure why your results are so dissimilar. Perhaps if you can reveal more details of your code organization, we might be able to ferret out possible performance issues.
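To make the 83/17 arithmetic concrete, here is a small sketch of two idealized bounds, assuming a serial I/O fraction of 17% as quoted above. The function names are hypothetical. The first is plain Amdahl's law (I/O never overlaps with compute); the second assumes the serial I/O fully overlaps with the parallel compute, which is the best case a pipeline could reach. Neither models real overheads such as allocation or scheduling.

```cpp
#include <algorithm>
#include <cassert>

// Amdahl's law: the serial part is never overlapped with the parallel part.
double amdahl_speedup(double serial_fraction, int threads) {
    assert(serial_fraction > 0.0 && serial_fraction < 1.0 && threads >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads);
}

// Idealized pipeline: the serial stage runs concurrently with the parallel
// stage, so run time is set by whichever of the two takes longer.
double overlapped_speedup(double serial_fraction, int threads) {
    assert(serial_fraction > 0.0 && serial_fraction < 1.0 && threads >= 1);
    return 1.0 / std::max(serial_fraction, (1.0 - serial_fraction) / threads);
}

// Number of compute threads at which the serial stage becomes the limit.
double saturation_threads(double serial_fraction) {
    assert(serial_fraction > 0.0 && serial_fraction < 1.0);
    return (1.0 - serial_fraction) / serial_fraction;
}
```

With a serial fraction of 0.17, `saturation_threads` gives roughly 4.9, matching the 83/17 figure: under the fully overlapped model, adding compute threads keeps helping until about the fifth one, after which the serial I/O stage dominates.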

Just wanted to update the thread in case anyone else has this problem. (By the way, I did read your series on the TBB pipeline before I even started this whole project. Good stuff.) I did some more profiling and discovered that the issue was memory allocation. The code that I was trying to run in parallel was calling malloc for a lot of scratch space (I didn't write that code, so I ignored the details at first). I replaced the mallocs with scalable_malloc and saw only a tiny improvement in speed. However, that was with the tbb20_010 release. I upgraded to the tbb20_014 release and now see the kind of scalability I would predict. (That means I need to go back and look through the docs and see what changed between releases...) I have done the I/O tests too, and my times were consistent with just I/O and no processing, so I do trust my numbers.
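For anyone attempting the same swap, here is a hedged sketch of what routing scratch allocations through scalable_malloc could look like. The wrapper names `scratch_alloc` and `scratch_free`, the macro `USE_TBB_SCALABLE_MALLOC`, and the `sum_of_squares` compute step are all hypothetical and not from the original code. `scalable_malloc` and `scalable_free` are declared in `tbb/scalable_allocator.h` and need TBB's malloc library at link time; without the macro, the sketch falls back to the standard allocator so it builds anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

#ifdef USE_TBB_SCALABLE_MALLOC          // hypothetical build switch
#include <tbb/scalable_allocator.h>     // scalable_malloc / scalable_free
#endif

// Single place to choose the allocator used for scratch space.
inline void* scratch_alloc(std::size_t bytes) {
#ifdef USE_TBB_SCALABLE_MALLOC
    return scalable_malloc(bytes);
#else
    return std::malloc(bytes);
#endif
}

// Must match scratch_alloc: memory from one allocator cannot be freed by the other.
inline void scratch_free(void* p) {
#ifdef USE_TBB_SCALABLE_MALLOC
    scalable_free(p);
#else
    std::free(p);
#endif
}

// Stand-in for a compute step that allocates temporary scratch space per call.
long long sum_of_squares(int n) {
    assert(n >= 0);
    if (n == 0) return 0;
    const std::size_t count = static_cast<std::size_t>(n);
    long long* scratch =
        static_cast<long long*>(scratch_alloc(count * sizeof(long long)));
    if (scratch == nullptr) throw std::bad_alloc();

    for (std::size_t i = 0; i < count; ++i) {
        scratch[i] = static_cast<long long>(i) * static_cast<long long>(i);
    }
    long long total = 0;
    for (std::size_t i = 0; i < count; ++i) total += scratch[i];

    scratch_free(scratch);
    return total;
}
```

Keeping the allocator choice behind one pair of wrappers makes it easy to time both builds against each other, which is how a difference like the one between releases described above would show up.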

Thanks for the replies,