cache_aligned_allocator will pad each matrix to fill out a cache line. You'll have to decide whether this is a waste of space or a bargain for avoiding false sharing.
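To put a number on that trade-off, here is a minimal sketch of the padding arithmetic. It does not call TBB. It assumes a 64-byte cache line and uses a hypothetical 16-byte payload (Small) in place of a matrix; kCacheLine, PaddedSmall, padding_per_object and check_layout are illustrative names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines (common on x86-64, not guaranteed elsewhere).
constexpr std::size_t kCacheLine = 64;

// Hypothetical 16-byte payload standing in for a small matrix.
struct Small {
    float v[4];
};

// The same payload forced onto its own cache line, which is roughly the
// effect of allocating each object with a cache-aligned allocator.
struct alignas(kCacheLine) PaddedSmall {
    float v[4];
};

// Bytes of padding spent when an object is rounded up to whole cache lines.
constexpr std::size_t padding_per_object(std::size_t object_size) {
    return (kCacheLine - object_size % kCacheLine) % kCacheLine;
}

static_assert(sizeof(Small) == 16, "payload is 16 bytes");
static_assert(sizeof(PaddedSmall) == kCacheLine, "padded to one full line");

// Runtime check: neighbouring padded objects sit exactly one line apart.
inline void check_layout() {
    PaddedSmall a[2];
    const char* first = reinterpret_cast<const char*>(&a[0]);
    const char* second = reinterpret_cast<const char*>(&a[1]);
    assert(second - first == static_cast<std::ptrdiff_t>(kCacheLine));
}
```

Under these assumptions a 16-byte object costs 48 bytes of padding, so three quarters of each line is spent buying isolation from other threads.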
tbb::scalable_allocator might be the right thing to use. If each thread uses tbb::scalable_allocator to allocate 16-byte objects, each thread's objects will be allocated consecutively with no extra padding, and different threads will allocate on different cache lines. The 16 is just an example: the same holds for any power-of-2 size between 8 bytes and the cache line size.
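The packed layout described here can be modelled without TBB. The sketch below only illustrates the address arithmetic, again assuming a 64-byte cache line; it says nothing about how tbb::scalable_allocator is implemented. cache_line_of, objects_per_line, packs_evenly and check_packing are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Index of the cache line that contains address p.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// Number of objects that share one line when packed with no padding.
constexpr std::size_t objects_per_line(std::size_t object_size) {
    return kCacheLine / object_size;
}

// True for the sizes the text names: powers of 2 from 8 up to the line size.
constexpr bool packs_evenly(std::size_t object_size) {
    return object_size >= 8 && object_size <= kCacheLine &&
           (object_size & (object_size - 1)) == 0;
}

// Model one thread's consecutive 16-byte objects in a line-aligned buffer.
inline void check_packing() {
    alignas(64) unsigned char buf[2 * 64];
    // Objects 0..3 (offsets 0, 16, 32, 48) share the first line.
    assert(cache_line_of(buf) == cache_line_of(buf + 48));
    // Object 4 (offset 64) starts the next line.
    assert(cache_line_of(buf + 64) == cache_line_of(buf) + 1);
}
```

With 16-byte objects, four of them share a line. That sharing is harmless as long as all four belong to the same thread, which is the property the paragraph above relies on.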