performance loss with increasing number of threads

rdabra · ‎09-25-2008

I programmed a threaded matrix multiplication mat_A * mat_B in which each thread takes a single row of mat_A and multiplies it by all columns of mat_B. Starting from one thread, performance improved only up to three threads (running concurrently). Beyond 3 threads, multiplication times increased dramatically. I have a Dell Precision T5400 machine with two quad-core 64-bit Xeon processors. My question is: shouldn't performance get worse only after 8 threads?

Obs:

a) My compiler is Intel C++ v10.1

b) compiler switches are optimized for vectorization

c) matrix sizes are 100x100

d) OS is Linux (SUSE 11.0)

TimP · ‎09-25-2008

It's quite difficult to get parallel scaling to many cores on a moderate-sized matrix multiplication. I suppose the only suggestion that can be made in a reasonable number of words is to use the MKL library and see how many threads it can use effectively for your problem. Compare the code it uses with another commercial library (though ACML had a requirement that you agree not to do this), with netlib ?GEMM, and with your modifications of it.

Dmitry_Vyukov · ‎09-26-2008

Try the same test on much bigger matrices, such that the running time with one thread is at least a few seconds (or better, tens of seconds).

Then we will see whether the problem is in your code, or in the attendant overheads (thread creation, thread destruction, thread blocking, thread signalling).
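The experiment suggested above could be sketched roughly as follows. This is only a minimal sketch, not the original poster's code: it assumes C++11 `std::thread`, square row-major matrices stored in `std::vector<double>`, and the helper names `multiply_rows` and `timed_multiply` are made up for illustration. The measured time deliberately includes thread creation and joining, since that is the overhead in question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n (assumed layout)

// Accumulate rows [row_begin, row_end) of A * B into C.
void multiply_rows(const Matrix& a, const Matrix& b, Matrix& c,
                   std::size_t n, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}

// Compute C = A * B with num_threads threads, each owning a contiguous
// block of rows. Returns elapsed seconds, including thread start and join.
double timed_multiply(const Matrix& a, const Matrix& b, Matrix& c,
                      std::size_t n, unsigned num_threads) {
    assert(num_threads > 0);
    assert(a.size() == n * n && b.size() == n * n);
    c.assign(n * n, 0.0);

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = n * t / num_threads;
        const std::size_t end = n * (t + 1) / num_threads;
        workers.emplace_back(multiply_rows, std::cref(a), std::cref(b),
                             std::ref(c), n, begin, end);
    }
    for (auto& w : workers) {
        w.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `timed_multiply` for 1 to 8 threads, first with n = 100 and then with a size large enough that the single-thread run takes seconds, should give a rough idea of whether the slowdown follows the problem size (overhead) or persists regardless (the code itself).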

Also, watch out for [false] sharing; it will totally destroy performance/scaling.
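To illustrate what false sharing looks like and one common way to avoid it, here is a small sketch. It is not taken from the code under discussion: it assumes C++11 threads and a 64-byte cache line (typical on x86, but an assumption), and `PaddedSum` and `parallel_sum` are hypothetical names. Each thread writes only to its own slot; the padding keeps those slots on different cache lines so the cores do not keep invalidating each other's copies.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common on x86 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// One accumulator per thread, padded out to a full cache line so that
// neighbouring accumulators are kCacheLine bytes apart.
struct alignas(kCacheLine) PaddedSum {
    double value = 0.0;
};
static_assert(sizeof(PaddedSum) == kCacheLine, "slot should fill one line");

// Sum data with num_threads threads; each thread updates only its own slot.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<PaddedSum> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t n = data.size();

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = n * t / num_threads;
        const std::size_t end = n * (t + 1) / num_threads;
        workers.emplace_back([&data, &partial, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                partial[t].value += data[i];  // no other thread touches this slot
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    double total = 0.0;
    for (const PaddedSum& p : partial) {
        total += p.value;
    }
    return total;
}
```

With a plain `std::vector<double> partial(num_threads)` instead, up to eight accumulators would sit in one cache line. Accumulating into a local variable and storing it once at the end is another common fix.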

Dmitry_Vyukov · ‎09-30-2008

Maybe you chose an inappropriate level for parallelization. Parallelization is usually applied to:

1. single big task, or

2. many small tasks

If you have a single small task, maybe it's just not worth parallelizing. And if you have many small tasks, then you can consider parallelization at the "inter-task" level rather than the "intra-task" level, i.e. you have 8 threads, and each thread multiplies its own matrices. This should scale better.
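The "inter-task" approach might look something like the sketch below. It rests on several assumptions not stated in the thread: C++11 `std::thread`, square row-major matrices in `std::vector<double>`, a batch of independent products to compute, and the hypothetical names `multiply` and `multiply_batch`. Each thread runs whole single-threaded multiplications, so no synchronization is needed inside a product.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n (assumed layout)

// Plain single-threaded product; this is the "small task".
Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}

// Compute as[i] * bs[i] for every i. Thread t handles tasks
// t, t + num_threads, t + 2 * num_threads, ... and no two threads
// ever write to the same result matrix.
std::vector<Matrix> multiply_batch(const std::vector<Matrix>& as,
                                   const std::vector<Matrix>& bs,
                                   std::size_t n, unsigned num_threads) {
    assert(as.size() == bs.size());
    assert(num_threads > 0);
    std::vector<Matrix> results(as.size());
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < as.size(); i += num_threads) {
                results[i] = multiply(as[i], bs[i], n);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return results;
}
```

The threads are created once per batch rather than once per product, so for 100x100 matrices the startup cost is spread over many multiplications.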