double-threaded LU matrix factorization

rdabra · ‎11-19-2008

I wrote a double-threaded LU matrix factorization. At work, on a Windows XP 32-bit system with an Intel Core 2 Duo E6750 2.66 GHz CPU, the program ran in half the time of a single-threaded factorization routine. Both factorization routines were compiled with a Microsoft C++ compiler. At home, on a 64-bit SUSE Linux machine with two Intel Xeon quad-core CPUs, the double-threaded routine is two times slower than the single-threaded one. There I used the Intel C compiler v10.1 for Linux. My question is: what are the possible reasons for these different behaviours? Thanks in advance.

jimdempseyatthecove · ‎11-19-2008

Check your option switches, in particular those relating to vectorization (the SSE... series of instructions). Note that the switches on Linux are not necessarily the same as on Windows.

The second thing to check: your LU matrix factorization routine may run best when both cores share the same L2 cache. Experiment with affinity settings (force the application to run on only 2 CPUs of your choice).

Jim Dempsey
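
For the affinity experiment suggested above, a minimal Linux-only sketch in C++ might look like the following. It assumes glibc's `pthread_setaffinity_np` and `sched_getaffinity`; the helper names `first_allowed_cpu` and `pin_current_thread_to_cpu` are made up for illustration and are not part of the original program. Which CPU numbers actually share an L2 cache depends on the machine, so the numbers you pass in have to be checked against your own topology.

```cpp
// Linux-only sketch: pin the calling thread to a single logical CPU.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: lowest-numbered CPU the calling thread is currently
// allowed to run on, or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Hypothetical helper: restrict the calling thread to exactly one CPU.
// Returns true on success.
bool pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Each of the two factorization threads would call `pin_current_thread_to_cpu` with a different CPU number at startup. Timing one run with the two CPUs on the same L2 cache and another with CPUs on different caches (or different sockets) should show whether cache sharing explains the slowdown.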

TimP · ‎11-19-2008

We can't regurgitate the entire book on possible threading problems. On the Core 2 Duo, you have only a single L2 cache shared by both cores, so problems of the false sharing type, or races on variables not properly made thread private, may not show up as a slowdown. On a Core 2 Quad, you would have to pin the 2 threads to cores that share a cache in such a case, so you would be unable to scale beyond 2 threads.

Ordinary optimizations that one compiler performs and the other does not may expose or hide threading problems.
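
To make the false sharing point concrete, here is a small C++ sketch, not taken from the original LU code. Two threads each update their own counter; in `PackedCounters` the two counters most likely sit in the same cache line, while `PaddedCounters` places each on its own line. The 64-byte line size and all names here are assumptions for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Assumed cache line size; 64 bytes is common on x86 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Both counters are adjacent, so they probably share one cache line.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Each counter is aligned to its own cache line, avoiding false sharing.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

// Runs two threads; each increments only its own counter `iters` times.
// Returns the sum of both counters after the threads have finished.
template <typename Counters>
long run_two_threads(Counters& c, long iters) {
    std::thread t1([&c, iters] {
        for (long i = 0; i < iters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&c, iters] {
        for (long i = 0; i < iters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

Both layouts give the same result; only the run time differs. Timing `run_two_threads` on each layout, with the threads pinned to cores on the same cache and then on different caches, is one way to see whether the penalty depends on cache topology as described above.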