TBB is agnostic of CUDA (and of any other third-party library, apart from the language-support runtime libraries), so it does not deliberately do anything that would prevent your asynchronous CUDA computation from running. Honestly, I have no idea why the setup you described does not work as expected. Out of curiosity, what happens if you replace the parallel_for call with a long do-nothing loop?
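To make that experiment concrete, here is a minimal sketch of a do-nothing loop you could drop in where the parallel_for call is now. The name `spin` and the iteration count are made up for illustration; the `volatile` counter is there only so the compiler cannot optimize the loop away.

```cpp
#include <cstdint>

// Hypothetical helper: burns CPU time on the calling thread without using TBB.
// The volatile counter forces every iteration to actually execute.
std::uint64_t spin(std::uint64_t iterations) {
    volatile std::uint64_t counter = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        counter = counter + 1;
    }
    return counter;
}
```

If the CUDA work overlaps with something like `spin(2000000000ULL)` but not with parallel_for, TBB is involved somehow; if it does not overlap in either case, the cause is probably elsewhere.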
The idea of using a separate thread to make the CUDA call makes a lot of sense to me. As others noted, since that call will presumably block, you should "oversubscribe" the system. The most natural way to do that, I think, is not tbb::task::enqueue() or tbb::parallel_invoke() but std::thread (TBB provides an implementation in case your compiler does not yet support this C++11 feature). In this case you don't have to oversubscribe TBB itself, because its workers are not affected.
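A rough sketch of the std::thread approach, with stand-ins so it compiles without CUDA or TBB. Here `blocking_gpu_call`, `cpu_work`, `Results` and `run_overlapped` are hypothetical names, not anything from your code: the first just sleeps where your blocking CUDA call would go, and the second sums a vector where your tbb::parallel_for would go.

```cpp
#include <atomic>
#include <chrono>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for the blocking CUDA call (for example a kernel
// launch followed by a synchronizing copy). It only sleeps and stores a value.
void blocking_gpu_call(std::atomic<int>& result) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    result.store(42);
}

// Hypothetical stand-in for the CPU-side work; in the real program this is
// where tbb::parallel_for would be called.
long long cpu_work(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}

struct Results {
    int gpu;
    long long cpu;
};

Results run_overlapped() {
    std::atomic<int> gpu_result{0};

    // The dedicated thread blocks inside the "GPU" call; no TBB worker is
    // tied up, so the TBB pool itself does not need to be oversubscribed.
    std::thread gpu_thread([&gpu_result] { blocking_gpu_call(gpu_result); });

    // Meanwhile the calling thread does the CPU work.
    std::vector<int> data(1000, 1);
    long long cpu_result = cpu_work(data);

    // Wait for the blocking call to finish before reading its result.
    gpu_thread.join();
    return Results{gpu_result.load(), cpu_result};
}
```

The extra thread spends nearly all of its time blocked, so having one thread more than there are cores should cost very little.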