While it is certainly true that traversing a std::list walks through memory and may displace cache lines that will be needed later, forcing them out prematurely, expanding a std::vector eventually requires relocating it in memory, which invalidates all cached copies at its old location, so it poses its own problems for optimal cache line reuse. Any design that accumulates data in some structure for later processing is a possible site for cache cooling. Some of these cases may be unavoidable, such as the structures you might consider for representing the simulation network.
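When the eventual size of such an accumulation can be estimated, reserving capacity up front avoids most of the relocations. A minimal sketch, where the element type and the size estimate are just placeholders:

    #include <cstddef>
    #include <vector>

    struct Sample { double value; int id; };

    void accumulate(std::vector<Sample>& out, std::size_t expected) {
        // Reserving once avoids repeated reallocation, so the vector's storage
        // stays put and previously cached lines of it remain valid.
        out.reserve(expected);
        for (std::size_t i = 0; i != expected; ++i)
            out.push_back(Sample{static_cast<double>(i), static_cast<int>(i)});
    }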
Does using cache_aligned_allocator make containers cache friendly? No, it makes them cache line aligned. This can be important if you're doing things like applying vector operations to the contents, where the penalties for unaligned accesses may be severe. It doesn't do much for "cache friendly" concerns, if by that you mean either keeping needed data from going cold through cache line eviction or avoiding cache line thrashing via such mechanisms as false sharing.
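For concreteness, here is the shape of that usage; the element type and count are arbitrary, and the header path is the classic TBB one:

    #include <vector>
    #include <tbb/cache_aligned_allocator.h>

    int main() {
        // The vector's storage starts on a cache line boundary, which matters
        // mainly when the contents are fed to SIMD operations that penalize
        // unaligned accesses; it does not keep the data from being evicted later.
        std::vector<float, tbb::cache_aligned_allocator<float>> samples(1024, 0.0f);
        return static_cast<int>(samples.size());
    }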
If you are doing object creation across a set of threads, using the scalable allocator will improve performance by trying to keep much of the memory block management local to the thread that is using it. Going one step further to the cache_aligned_allocator still gains you the benefits of the scalable allocator (which underlies the cache_aligned_allocator), but it only makes sense if there is some specific advantage to aligning the data.
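As an illustration (not from the original post), this is roughly what per-thread object creation with the scalable allocator looks like; note that scalable_allocator lives in the tbbmalloc library, which needs to be linked in:

    #include <thread>
    #include <vector>
    #include <tbb/scalable_allocator.h>

    // Each thread creates and grows its own container. The scalable allocator
    // keeps most block management in per-thread pools, so the threads are not
    // all serializing on one global heap.
    void worker(std::size_t n) {
        std::vector<int, tbb::scalable_allocator<int>> local;
        local.reserve(n);
        for (std::size_t i = 0; i != n; ++i)
            local.push_back(static_cast<int>(i));
    }

    int main() {
        std::vector<std::thread> pool;
        for (int t = 0; t != 4; ++t)
            pool.emplace_back(worker, std::size_t(1) << 20);
        for (auto& th : pool)
            th.join();
        return 0;
    }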
Linked structures are generally not friendly to modern processors, because for large data structures that do not fit in cache, following each link usually requires loading a new cache line. Therefore, I always try to get by with vectors instead of lists if I can still get the same asymptotic running time. (I.e., if a linked list delivers O(n) time, and the array is O(n^2), that's a bad deal.) The attraction of linked lists is often that you can append an item in O(1) time. Most implementations of std::vector can append an item in O(1) amortized time, so it usually does just as well asymptotically, and is much friendlier to cache.
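To make the comparison concrete, here is a sketch of the same O(n) build-and-traverse job done both ways (the workload is just a stand-in):

    #include <list>
    #include <numeric>
    #include <vector>

    // Both containers accept n appends in O(n) total time, but the vector's
    // elements end up contiguous, so traversing them touches far fewer cache
    // lines than chasing the list's per-node pointers.
    long long sum_via_list(std::size_t n) {
        std::list<int> xs;
        for (std::size_t i = 0; i != n; ++i) xs.push_back(static_cast<int>(i));
        return std::accumulate(xs.begin(), xs.end(), 0LL);
    }

    long long sum_via_vector(std::size_t n) {
        std::vector<int> xs;
        for (std::size_t i = 0; i != n; ++i) xs.push_back(static_cast<int>(i));  // amortized O(1)
        return std::accumulate(xs.begin(), xs.end(), 0LL);
    }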
A related example that Scott Meyers pointed out to me is searching. He noted that in many cases, binary search over an array beats hashing with linked buckets, because of the cost of following links.
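The array-based side of that comparison is simply a sorted vector probed with binary search; a minimal sketch, with an int key standing in for whatever the real key type is:

    #include <algorithm>
    #include <vector>

    // Lookup in a sorted vector: O(log n) comparisons, but every probe lands in
    // contiguous storage rather than at the end of a pointer chain, as it would
    // with hash buckets implemented as linked lists.
    bool contains(const std::vector<int>& sorted_keys, int key) {
        return std::binary_search(sorted_keys.begin(), sorted_keys.end(), key);
    }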
The TBB reference manual has ParallelMerge as an example. If results can be collected separately on each thread, then I would use std::vector instead of tbb::concurrent_vector. tbb::concurrent_vector is of interest if you really do need to have multiple threads accessing and growing the vector at the same time. But to do that, it definitely does more gymnastics than std::vector.
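Here is a sketch of the shared-growth case that concurrent_vector is for; the workload and the function name are placeholders:

    #include <tbb/blocked_range.h>
    #include <tbb/concurrent_vector.h>
    #include <tbb/parallel_for.h>

    // Many threads append to one shared container, so the container itself must
    // synchronize its growth.
    tbb::concurrent_vector<int> collect_shared(int n) {
        tbb::concurrent_vector<int> out;
        tbb::parallel_for(tbb::blocked_range<int>(0, n),
                          [&](const tbb::blocked_range<int>& r) {
                              for (int i = r.begin(); i != r.end(); ++i)
                                  out.push_back(i * i);  // thread-safe append
                          });
        return out;
    }

If each thread can instead fill a private std::vector and the pieces are stitched together afterwards, the plain container avoids that synchronization entirely.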
The developer source release for today contains some significant space/speed improvements to tbb::concurrent_vector.
I'd first profile the code carefully before pushing cache optimizations. If you do determine that cache is an issue, here are some things to try: