I would suggest that you first try TBB without the patch to see if it already does what you need, and then compare the results with the patched version. Note that for allocations larger than about 8 kB, TBB's overhead is between about 0 and 16 kB, so the relative savings would matter most at the beginning of that range. Also, if you have more information than just the pointer at deallocation time (the allocation size, for example), you can use that information to go directly to malloc/free instead.
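To illustrate that last point, here is a minimal sketch that assumes the caller still knows the allocation size when it frees the block. The names sized_allocate, sized_deallocate, small_alloc, small_free and kLargeThreshold are made up for this example and are not part of TBB or the patch. The 8 kB cutoff is only the approximate figure mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical cutoff, taken from the "about 8 kB" figure in the post.
constexpr std::size_t kLargeThreshold = 8 * 1024;

// Stand-ins for the small-object allocator. In a TBB build these would
// forward to scalable_malloc/scalable_free from tbb/scalable_allocator.h;
// here they use the C runtime so the sketch compiles without TBB.
inline void* small_alloc(std::size_t n) { return std::malloc(n); }
inline void  small_free(void* p)        { std::free(p); }

// True if a request of this size should bypass the scalable allocator.
inline bool is_large(std::size_t n) { return n >= kLargeThreshold; }

// Route large requests straight to malloc, everything else to the
// small-object allocator.
inline void* sized_allocate(std::size_t n) {
    assert(n > 0);  // the sketch does not handle zero-size requests
    return is_large(n) ? std::malloc(n) : small_alloc(n);
}

// The caller must pass the same size it used for the allocation, so that
// the block goes back to the allocator it came from.
inline void sized_deallocate(void* p, std::size_t n) {
    if (p == nullptr) return;
    if (is_large(n)) {
        std::free(p);
    } else {
        small_free(p);
    }
}
```

With this, a call like sized_allocate(64 * 1024) never touches the scalable allocator, so its large-block overhead does not apply to that block. Whether that is a net win depends on how well the system malloc scales for your workload.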
If you want to try the patch anyway, first ask yourself whether you can build TBB from the source distribution. If you feel up to that, get the patch version of 2008-10-12, unpack it, read ./README.atomic, get the base version it indicates, merge the two, and build. The result should work as a drop-in replacement, i.e., your program should continue to work without changes.
Note that the patch aims to offer the following: more memory semantics for atomics, support for more hardware architectures (nine for g++ if you don't distinguish between, e.g., x86 and x64), support for more operating systems, and some memory relief, i.e., lower allocator overhead.
Sorry, I was too quick there: for allocations above about 8 kB, TBB's overhead is always about 16 kB. If I remember correctly, with the current settings the patch tries to get that down to about 4 kB plus some wasted space for smaller allocation sizes, but I have not measured it myself.
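To put those figures in proportion, here is a small sketch that computes the overhead as a percentage of the requested size. The helper overhead_percent is made up for this example. The 16 kB and 4 kB inputs are just the approximate, unmeasured numbers quoted above, and the "wasted space" the patch adds for smaller sizes is not modeled.

```cpp
#include <cassert>
#include <cstddef>

// Overhead as a whole-number percentage of the requested size
// (integer division, so the result is rounded down).
inline std::size_t overhead_percent(std::size_t request, std::size_t overhead) {
    assert(request > 0);
    return overhead * 100 / request;
}
```

For an 8 kB request, overhead_percent(8 * 1024, 16 * 1024) gives 200 and overhead_percent(8 * 1024, 4 * 1024) gives 50. For a 1 MB request, the 16 kB figure is already down to about 1%, which is why the savings matter mostly near the start of the range.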