a) My compiler is Intel C++ v10.1
b) Compiler switches are optimized for vectorization
c) Matrix sizes are 100x100
d) OS is SUSE Linux 11.0
Try the same test on much bigger matrices, so that the running time with one thread is at least a few seconds (or better, tens of seconds). A 100x100 multiplication is only about 10^6 multiply-adds, so it finishes so quickly that threading costs can easily dominate the measurement.
Then we will see whether the problem is in your code or in the associated overheads (thread creation, thread destruction, thread blocking, thread signalling).
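To pick a matrix size that runs long enough, something like the sketch below could be used to time a single-threaded multiply at increasing sizes. It is only an illustration: the names `multiply` and `time_multiply` are made up here, the kernel is a plain triple loop rather than your actual code, and it uses `std::chrono`, which needs a newer compiler than Intel C++ v10.1.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Naive row-major multiply, C = A * B, for n x n matrices (hypothetical kernel).
void multiply(const std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& c, std::size_t n) {
    c.assign(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}

// Returns wall-clock seconds for one single-threaded multiply of size n.
// The first element of the result is written to *checksum so the compiler
// cannot discard the computation.
double time_multiply(std::size_t n, double* checksum) {
    assert(n > 0);
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c;
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    multiply(a, b, c, n);
    const std::chrono::steady_clock::time_point stop =
        std::chrono::steady_clock::now();
    *checksum = c[0];
    return std::chrono::duration<double>(stop - start).count();
}
```

Call `time_multiply` with n = 100, 200, 400, ... and stop at the first size where the returned time reaches a few seconds; then rerun your threaded version at that size.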
Also, watch out for [false] sharing: when threads write to variables that sit on the same cache line, it will totally destroy performance and scaling.
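As an illustration of avoiding false sharing, here is a minimal sketch in which each thread accumulates into its own slot, and the slots are padded so that no two of them land on the same cache line. The 64-byte line size is an assumption (typical for x86, not guaranteed), the names `PaddedSum` and `sum_padded` are made up for this example, and I am assuming OpenMP for the threading, which may not match what you use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed cache line size in bytes; check your CPU, 64 is only typical.
const std::size_t kCacheLine = 64;

// One accumulator per thread, padded to a full (assumed) cache line so that
// neighbouring accumulators are at least kCacheLine bytes apart.
struct PaddedSum {
    double value;
    char pad[kCacheLine - sizeof(double)];
};

// Sums data using `threads` independent slices, one padded slot per slice.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
double sum_padded(const std::vector<double>& data, int threads) {
    assert(threads > 0);
    std::vector<PaddedSum> partial(threads);  // value-initialized to zero
    #pragma omp parallel for num_threads(threads)
    for (int t = 0; t < threads; ++t) {
        const std::size_t begin = data.size() * t / threads;
        const std::size_t end = data.size() * (t + 1) / threads;
        for (std::size_t i = begin; i < end; ++i) {
            partial[t].value += data[i];
        }
    }
    double total = 0.0;
    for (int t = 0; t < threads; ++t) {
        total += partial[t].value;
    }
    return total;
}
```

If `PaddedSum` were replaced by a plain `double`, eight accumulators would fit in one 64-byte line and every `+=` would bounce that line between cores. Accumulating into a local variable and storing it once at the end of the slice works too.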
Maybe you chose an inappropriate level for parallelization. Parallelization is usually applied to:
1. a single big task, or
2. many small tasks
If you have a single small task, maybe it's just not worth parallelizing. And if you have many small tasks, then you can consider parallelization at the "inter-task" level, not the "intra-task" level. I.e. you have 8 threads, and each thread multiplies its own matrices. This should scale better.
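A rough sketch of the inter-task approach, assuming OpenMP is available (the Intel compiler supports it) and that you have a batch of independent multiplications to do. The names `Matrix`, `multiply_one` and `multiply_all` are invented for this example, and the kernel is a plain triple loop standing in for your real one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Matrix;  // n x n, row-major

// Single-threaded multiply of one pair, C = A * B (hypothetical kernel).
void multiply_one(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    c.assign(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}

// Inter-task parallelism: the loop over independent multiplications is split
// across threads, and each multiplication itself stays single-threaded.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
void multiply_all(const std::vector<Matrix>& as, const std::vector<Matrix>& bs,
                  std::vector<Matrix>& cs, std::size_t n) {
    assert(as.size() == bs.size() && bs.size() == cs.size());
    const int count = static_cast<int>(as.size());
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < count; ++t) {
        multiply_one(as[t], bs[t], cs[t], n);
    }
}
```

Each thread writes only to its own output matrices, so there is no synchronization inside the loop and the threads are created once for the whole batch rather than once per 100x100 product.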