Maybe you chose an inappropriate level for parallelization. Parallelization is usually applied to:
1. a single big task, or
2. many small tasks.
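If you have a single small task, it may simply not be worth parallelizing: the cost of starting and synchronizing threads can exceed the work itself. And if you have many small tasks, consider parallelizing at the "inter-task" level rather than the "intra-task" level. I.e. you have 8 threads, and each thread multiplies its own matrices. This should scale better, because the threads share no data and never need to synchronize in the middle of a multiplication.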
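
To make the "inter-task" idea concrete, here is a minimal sketch in Python. The names `matmul` and `multiply_all` and the 2x2 workload are made up for illustration; they are not from your code. Each multiplication stays single-threaded, and the pool of 8 workers is only used to run different multiplications at the same time. One caveat: in CPython, threads running pure-Python arithmetic are serialized by the GIL, so this shows the structure rather than a real speedup. With a process pool, or a numeric library that releases the GIL, the same layout should actually scale.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain single-threaded product of two matrices given as lists of rows."""
    cols_b = list(zip(*b))  # columns of b, so each entry is a row-by-column dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def multiply_all(pairs, workers=8):
    """Inter-task parallelism: each worker multiplies whole (a, b) pairs by itself."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands one complete multiplication to a worker at a time and
        # returns the products in the same order as the input pairs.
        return list(pool.map(lambda pair: matmul(*pair), pairs))


# Hypothetical workload: 100 small, independent 2x2 multiplications.
pairs = [([[i, 1], [0, 1]], [[1, 2], [3, 4]]) for i in range(100)]
products = multiply_all(pairs)
```

Note that nothing inside `matmul` knows about threads at all. That is the point: no work is split within a single multiplication, so there is nothing to coordinate between workers.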