Intel® oneAPI Threading Building Blocks

Not achieving expected performance when using parallel_for

I'm new to TBB, and I'm currently evaluating the latest version of the library on VS 2008. I'm fairly certain that I'm compiling with Microsoft's compiler, but I'm not 100% sure. I'm also not sure how to change that, or whether it is desirable to use the Intel compiler with TBB (is performance significantly better?). I'm doing all testing in release mode.

I'd like to use the TBB with a network server, and I'm currently just experimenting with the toolkit to see if it meets my needs, if I can comprehend it, and to see whether it does make my life easier.

One of the first things I wanted to try is running a for loop that creates a SHA-256 checksum of every file in the C:\Windows\System32 directory.

I have a function that returns a vector of all file names in a directory. That vector is passed to the ParallelGenerateChecksum() function, which in turn runs parallel_for().

I'm comparing the run time of a serial for loop with that of the parallel_for loop, and I see only modest speed improvements on an Intel i7 quad-core workstation.

When generating the checksums, the serial version takes about 10 seconds, whereas the parallel version takes about 6 seconds on average. While this is obviously a good improvement, I was expecting the parallel version to take at most half the time of the serial for loop. After all, I have 4 cores, and I can see that multiple cores are being utilized while parallel_for runs. I wouldn't expect this to take more than 3 seconds.

I'm also reading the O'Reilly book on TBB, and using that as a reference and for examples.

As far as the code is concerned:


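#include <string>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

using namespace tbb;

void DoComplicatedWork( const std::string& fileName ); // defined below

class ApplyFoo
{
    std::string* const my_a;
public:
    void operator( )( const blocked_range<size_t>& r ) const
    {
        std::string *a = my_a;
        for( size_t i = r.begin(); i != r.end( ); ++i )
            DoComplicatedWork( a[i] ); // one checksum per file in this sub-range
    }
    ApplyFoo( std::string a[] ) : my_a(a) {}
};

// I have also tried manual grain sizes of 100 and 10000 without any significant improvement
void ParallelGenerateChecksum( std::string a[], size_t n )
{
    parallel_for( blocked_range<size_t>(0, n), ApplyFoo(a), auto_partitioner() );
}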


// FileChecksum is a wrapper that supports creating checksums; it uses
// CryptoPP
void DoComplicatedWork(const std::string& fileName)
{
    std::string fileNameWithPath("C:\\Windows\\System32\\");
    fileNameWithPath += fileName; // assumed step: the posted code never appends the file name

    FileChecksum fc(fileNameWithPath);
    std::string checksum = fc.getChecksum();
}


The main() looks like this:

FileEnumerator fe("C:\\windows\\system32");
std::vector<std::string> files = fe.getFiles();
ParallelGenerateChecksum(&files[0], files.size());

At this point I'm not sure if I am doing something wrong, or if that is just the best the algorithm can do?

Thank you.

2 Replies
Your program may be file I/O bound.
I suggest you instrument your code to determine the compute portion and the I/O portion (stall time) for each thread (or, in this case, for each file, since measuring per-thread time in a tasking system is not pertinent).

Also measure the time spent outside the parallel code region (in other words, the time to perform fe.getFiles()).
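
As a rough illustration of the per-file measurement described above, here is a minimal sketch that times the read phase and the hash phase separately. It assumes a C++11 compiler for std::chrono (on VS 2008, tbb::tick_count::now() could serve the same purpose). The names FileTiming, timeOneFile, ioFraction and placeholderChecksum are hypothetical, and the FNV-1a hash only stands in for the real SHA-256 call, so the compute share it reports will differ from the CryptoPP one.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Time spent on one file, split into the read (I/O) and hash (compute) phases.
struct FileTiming {
    double readSeconds;
    double hashSeconds;
    std::uint64_t digest;
};

// Hypothetical stand-in for the real SHA-256 call: 64-bit FNV-1a over the buffer.
std::uint64_t placeholderChecksum(const std::vector<char>& data) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < data.size(); ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Reads the whole file into memory, then hashes it, timing each phase on its own.
FileTiming timeOneFile(const std::string& path) {
    assert(!path.empty());
    typedef std::chrono::steady_clock Clock;
    FileTiming t = {0.0, 0.0, 0};

    Clock::time_point t0 = Clock::now();
    std::ifstream in(path.c_str(), std::ios::binary);
    std::vector<char> data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    Clock::time_point t1 = Clock::now();
    t.digest = placeholderChecksum(data);
    Clock::time_point t2 = Clock::now();

    t.readSeconds = std::chrono::duration<double>(t1 - t0).count();
    t.hashSeconds = std::chrono::duration<double>(t2 - t1).count();
    return t;
}

// Fraction of the total measured time spent reading files (0 if nothing was measured).
double ioFraction(const std::vector<FileTiming>& timings) {
    double read = 0.0, hash = 0.0;
    for (std::size_t i = 0; i < timings.size(); ++i) {
        read += timings[i].readSeconds;
        hash += timings[i].hashSeconds;
    }
    const double total = read + hash;
    return total > 0.0 ? read / total : 0.0;
}
```

Calling timeOneFile() for each file in the serial loop and passing the results to ioFraction() gives a first estimate of how much of the run is spent waiting on reads. Keep in mind that a warm operating-system file cache makes reads look much cheaper on a second run than on the first.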

As an additional note, you may find that structuring this as a parallel_pipeline works better.
Have the input pipe do as little work as possible (e.g., just open and read each file for each token).
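
To make the pipeline idea concrete without depending on a particular TBB version, here is a sketch of the same structure in plain C++11 threads: one serial stage that only reads files, feeding a bounded queue that several hashing workers drain. This is not tbb::parallel_pipeline itself, just an illustration of the serial-input, parallel-compute shape. Token, TokenQueue, pipelineChecksums and placeholderChecksum are hypothetical names, and the FNV-1a hash again stands in for SHA-256.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One token flowing through the pipeline: a file index plus the file's raw bytes.
struct Token {
    std::size_t index;
    std::vector<char> bytes;
};

// Bounded hand-off queue between the serial reader and the parallel hashers.
class TokenQueue {
public:
    explicit TokenQueue(std::size_t capacity) : capacity_(capacity), closed_(false) {}

    void push(Token&& t) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push(std::move(t));
        notEmpty_.notify_one();
    }

    // Returns false once the queue has been closed and fully drained.
    bool pop(Token& out) {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return true;
    }

    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        notEmpty_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::queue<Token> q_;
    std::size_t capacity_;
    bool closed_;
};

// Hypothetical stand-in for the real SHA-256 call: 64-bit FNV-1a over the buffer.
std::uint64_t placeholderChecksum(const std::vector<char>& data) {
    std::uint64_t h = 14695981039346656037ULL;
    for (std::size_t i = 0; i < data.size(); ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;
    }
    return h;
}

std::vector<std::uint64_t> pipelineChecksums(const std::vector<std::string>& paths,
                                             unsigned workers) {
    assert(workers > 0);
    std::vector<std::uint64_t> digests(paths.size());
    TokenQueue queue(2 * workers);  // caps how many files sit in memory at once

    // Parallel stage: each worker hashes tokens and writes to its own result slot.
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&queue, &digests] {
            Token t;
            while (queue.pop(t))
                digests[t.index] = placeholderChecksum(t.bytes);
        });
    }

    // Serial input stage: do as little as possible, just open and read each file.
    for (std::size_t i = 0; i < paths.size(); ++i) {
        std::ifstream in(paths[i].c_str(), std::ios::binary);
        Token t;
        t.index = i;
        t.bytes.assign(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
        queue.push(std::move(t));
    }
    queue.close();
    for (std::size_t w = 0; w < pool.size(); ++w) pool[w].join();
    return digests;
}
```

The bounded queue plays roughly the role of the token limit in a TBB pipeline: it keeps the reader from running far ahead of the hashers and filling memory with file contents. Whether this layout beats parallel_for depends on how the disk handles one sequential reader versus several concurrent ones, which is worth measuring rather than assuming.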

Jim Dempsey
There will definitely be some I/O, since the files have to be read from disk before the checksum is generated. However, judging by the Disk Queue Length in the Windows performance monitor (that's the best I can do at the moment to measure this), file I/O doesn't look like the issue.

Do the Intel tools offer instrumentation? Sorry, I have not done this before and I'm not sure how to do that.

Getting the file list (getFiles()) shouldn't be an issue; I wasn't including that code when I measured the elapsed time.

I will look into the pipeline and see if that works better. I want to get familiar with the pipeline anyway.

Thank you.