The problem appears to be linked to the g++ -O3 optimisation switch. If I remove this switch and also replace the call to zlib's compress function with a plain busy loop (incrementing a local variable 10^9 times), then the pthread version runs at the same speed as either of the TBB versions. Hmm, very interesting...
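
For anyone wanting to reproduce the busy-loop test, here is a minimal sketch of what such a stand-in might look like. It is not the original benchmark code: the names busy_loop, WorkerArgs, worker and time_pthreads are made up for illustration, and the thread count and iteration count are left as parameters. One assumption worth flagging: the counter is declared volatile, because once -O3 is switched back on the compiler is free to delete a loop that only increments an unused local, which would make the timing meaningless.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <pthread.h>
#include <vector>

// Hypothetical stand-in for the zlib compress() call: increment a local
// counter `iterations` times. volatile stops the optimiser from removing
// the loop when optimisation is enabled.
static std::uint64_t busy_loop(std::uint64_t iterations) {
    volatile std::uint64_t counter = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        counter = counter + 1;
    }
    return counter;
}

// Per-thread input and output (illustrative only).
struct WorkerArgs {
    std::uint64_t iterations;
    std::uint64_t result;
};

// pthread entry point: run the busy loop and store the final count.
static void* worker(void* p) {
    WorkerArgs* args = static_cast<WorkerArgs*>(p);
    args->result = busy_loop(args->iterations);
    return nullptr;
}

// Start num_threads pthreads that each run the busy loop, wait for all of
// them, and return the elapsed wall-clock time in seconds.
static double time_pthreads(std::size_t num_threads, std::uint64_t iterations) {
    std::vector<pthread_t> threads(num_threads);
    std::vector<WorkerArgs> args(num_threads, WorkerArgs{iterations, 0});

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < num_threads; ++i) {
        int rc = pthread_create(&threads[i], nullptr, worker, &args[i]);
        assert(rc == 0);
        (void)rc;  // silence unused-variable warning when NDEBUG is defined
    }
    for (std::size_t i = 0; i < num_threads; ++i) {
        pthread_join(threads[i], nullptr);
    }
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling something like time_pthreads(4, 1000000000ULL) from main and building once with g++ -pthread and once with g++ -O3 -pthread should give a pthread-only baseline to set against the TBB timings. The thread count of 4 is only an example.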