Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL versus number of threads on two Xeon 5160 core 2 duo's

john3
Beginner
The table below shows the performance of an application which does mostly cgemm matrix multiplications. Only the MKL portion is threaded. The problem LU-factored a compressed 157,000 x 157,000 matrix in compressed blocks of size 2500 x 2500.

Why is there such a poor improvement in run time as the number of threads increases from 1 to 4?

And why does Tcpu increase with the number of threads (although not in a linear fashion)?

The platform is a PC workstation with two Xeon 5160 dual-core processors (3.0 GHz), four cores total. There was no disk I/O for these runs as everything was in core. The OS was Windows XP Pro x64, the compiler IVF 9.1, and the library MKL 9.0.

Overall, this workstation with the Xeon 5160s and MKL is very much faster than my old Pentium 4 (3.2 GHz) single processor. But these results suggest that dual processors are not cost effective (although one dual-core processor certainly is).


John

Number of Threads | Wall Time (hr) | CPU Time (hr) | Twall / Tcpu | Speed Improvement
                4 |           1.28 |          4.38 |         0.29 |              1.28
                3 |           1.33 |          3.50 |         0.38 |              1.23
                2 |           1.38 |          2.51 |         0.55 |              1.19
                1 |           1.64 |          1.64 |         1.00 |              1.00


TimP
Honored Contributor III
MKL controls the number of threads internally, according to rules about problem size. OMP_NUM_THREADS sets an upper limit, but does not guarantee that limit will be reached. On much smaller blocks, I've seen MKL 8.1 use more threads than 9.0. With blocks as large as you mention, you should be watching for cache miss and eviction activity, and perhaps trying smaller blocks to optimize threaded performance. To get effective cache sharing between 2 threads on the same socket, the typical target is a block size which causes each thread to use 40% of L2. If you aren't using taskset or a similar means to improve cache affinity, that could make a significant improvement.
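As a minimal sketch (assuming an OpenMP-enabled program linked against MKL), the cap can also be set programmatically with the standard OpenMP call instead of the environment variable; MKL may still choose fewer threads than this cap for a given problem size:

/* Minimal sketch: set an upper bound on the threads available to MKL.
 * Equivalent to setting OMP_NUM_THREADS; MKL decides internally how
 * many of these it actually uses for a given problem size. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);                         /* upper bound only */
    printf("thread cap: %d\n", omp_get_max_threads());
    /* ... call the cgemm / LU factorization code here ... */
    return 0;
}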
You could argue that multiple cores aren't cost effective if you are testing nothing but front side bus performance. Even for that purpose, you must ensure that the threads distribute evenly between sockets and don't jump often between them, generating extra bus traffic. If you are restricted to legacy Windows versions, your options are limited, but you could try the affinity check boxes to optimize 1- and 2-thread performance.
If you have a snoop filter, you should test with it both enabled and disabled (BIOS setup option). For optimum performance with snoop filter off, you might experiment with changing the preferred order of mapping threads to cores, as you could do with taskset.
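On Windows, the programmatic equivalent of the Task Manager affinity check boxes is SetProcessAffinityMask. A minimal sketch follows; the mask 0x5 (logical processors 0 and 2) is only an example, and which pair of bits corresponds to one socket depends on how the BIOS numbers the cores:

/* Minimal sketch: pin the whole process to two logical processors on
 * Windows, the programmatic equivalent of the Task Manager affinity
 * check boxes. The mask value is illustrative only. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 0x5;   /* bits 0 and 2 -> logical processors 0 and 2 */
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());
    /* ... run the benchmark ... */
    return 0;
}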
jimdempseyatthecove
Honored Contributor III

Suggestion,

Set up the application to run with a single thread. Experiment with your block size, reducing it from 2500 x 2500 and monitoring the effect on wall time. At some point you will find a "sweet spot" where you achieve the best performance.

Next, set up the application to run with 2 threads and restrict the threads to processors 0 and 1 (you can do this with Task Manager). Adjust the block size until you identify the sweet spot for 2 cores on the same processor.

Last, enable all 4 cores. You might not need to adjust for a new sweet spot (a sketch of this kind of block-size sweep follows below).
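A minimal sketch of such a block-size sweep, assuming MKL's CBLAS interface (cblas_cgemm) and OpenMP wall-clock timing (omp_get_wtime); the candidate sizes in the list are illustrative, not recommendations:

/* Minimal sketch: time a single-precision complex block multiply (cgemm)
 * for a few candidate block sizes to locate the "sweet spot". */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>
#include <omp.h>

int main(void)
{
    const int sizes[] = { 500, 1000, 1500, 2000, 2500 };
    const MKL_Complex8 one  = { 1.0f, 0.0f };
    const MKL_Complex8 zero = { 0.0f, 0.0f };

    for (int i = 0; i < 5; ++i) {
        int n = sizes[i];
        MKL_Complex8 *a = malloc((size_t)n * n * sizeof *a);
        MKL_Complex8 *b = malloc((size_t)n * n * sizeof *b);
        MKL_Complex8 *c = malloc((size_t)n * n * sizeof *c);
        if (!a || !b || !c) return 1;
        /* contents do not matter for timing; zero-fill keeps them defined */
        for (size_t k = 0; k < (size_t)n * n; ++k) a[k] = b[k] = c[k] = zero;

        double t = omp_get_wtime();
        cblas_cgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, &one, a, n, b, n, &zero, c, n);
        t = omp_get_wtime() - t;
        printf("block %5d x %-5d : %8.3f s\n", n, n, t);

        free(a); free(b); free(c);
    }
    return 0;
}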

If you do not achieve a significant improvement, then you will have to dissect your program using VTune or another performance analysis tool. I would expect your problem to attain a 3x improvement using 4 cores. If the application is memory-bandwidth bound and cannot be fixed by getting better utilization out of the cache, then you might be stuck with the abysmal performance gain.

Don't be too quick to give up.

Jim Dempsey

TimP
Honored Contributor III
On the Xeon 5160 BIOS versions I've seen, processors 0 and 1 are on separate sockets. You can figure this out by checking /proc/cpuinfo. Jim's advice makes sense if you choose the right combination, e.g.
taskset -c 0,2 yourbenchmark (syntax for the Linux 2.6 kernel)
It's also possible that you would get the best performance by optimizing block size for 1 thread per socket, so that each thread gets the full 4 MB cache to itself. You would always require taskset -c 0,1 (or 2,3) to schedule properly, unless EL4_U4 is able to do it automatically. This way, you should get a speedup of at least 1.8.
I was not able to infer what OS was in use in the original post; now I am assuming Jim figured this out correctly.
jimdempseyatthecove
Honored Contributor III

Tim, you are quite right. The architecture of the system, as well as the OS, determines how the cores are numbered. The system I use has two AMD Opteron dual-core processors (4 cores), and the organization appears to be ((0,1),(2,3)). I would be surprised if the organization were ((0,2),(1,3)), but it is not completely outside the realm of someone's thinking. The former organization, ((0,1),(2,3)), gives adjacent processor affinity bits higher affinity, i.e. they share the same L2/L3 cache, and theoretically a thread switch between an adjacent pair could experience favorable cache hits. Separating the affinity, ((0,2),(1,3)), makes sense only if your application is somewhat brain-dead: then, by saying "run on 2 processors (0,1)", you get better performance if the threads each saturate their own cache, and worse if the two threads could run inside one unified cache (i.e. they would run better on the same chip). You win some, you lose some. I think that when you had 2 processors, each with HT, the ((0,2),(1,3)) numbering made sense.

The processor sequencing may be a BIOS issue as well as an operating system issue. Looking at the BIOS information may help.

When he makes the 2-core test he should make two such tests, (0,1) and (0,2). It might not hurt to also try the other permutations: (0,3), (1,2), (1,3), (2,3).

Additional information.

The original post stated there was no I/O. There was no indication whether the application does heavy video updates (some users do not think of this as I/O). If there is moderate video activity, it is best not to use an on-board video adapter. Moderate video would include browsing the internet. Heavy video would be displaying the interim results in real time or watching a video.

On my 4-core system, running with on-board video, opening a CMD window (DOS box) and issuing "DIR *.* /S" (a directory listing of the entire disk) would saturate one processor's computing capability, i.e. the system showed 25% utilization with one processor's CPU time at 100%. After installing a relatively inexpensive video card (e-GeForce MX 4000 PCI with 128 MB), the processor time dropped to between 0% and 1% - I could hardly see the 1% on the display. Additionally, the directory listing completed about twice as fast. Because I use 2 displays, I opted to install two of those video cards. It runs very nicely now.

The parallel application I run uses several to many big things rather than one humongous thing, so I can distribute the work by thing. I can achieve over a 3x improvement in my application (~85% of 4x = 3.4). I spent a significant amount of time tuning the application. It would be nice if I could afford a 2 x 4-core system (or 4 x 4-core), but that will have to wait.

Jim Dempsey

TimP
Honored Contributor III
I've noticed that it is common on Opteron and Itanium systems for the BIOS to assign 0 and 1 to the same socket, etc. The reason for assigning them to separate sockets on Xeon 51xx is that you can reboot Windows or Linux with 2 CPUs active (or do the same by enabling 2 cores in BIOS setup). Then you get maximum performance for 2 threads: each thread gets a full 4 MB cache and a separate front side bus. If running Linux, taskset can accomplish the job for you in the situation we are discussing, but you must be aware of the way the BIOS numbers the cores. As Opteron and Itanium don't have a shared cache, they choose the arrangement which Jim noted.
jimdempseyatthecove
Honored Contributor III

When you are on a Windows platform you can call GetLogicalProcessorInformation to retrieve a table of information about the arrangement of processors on the system and how they relate to the affinity bitmask used by the OS. See:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getnumaproximitynode.asp
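A minimal sketch of using GetLogicalProcessorInformation to print which logical processors share a core, cache, or package (error handling is kept to a minimum, and the buffer is sized by calling the function twice; the call is available on Windows XP x64 / Server 2003 SP1 and later):

/* Minimal sketch: dump which logical processors share a core, cache,
 * or package, so you can see how the BIOS numbered the cores. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);        /* query size */
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len)) return 1;

    DWORD count = len / sizeof *info;
    for (DWORD i = 0; i < count; ++i) {
        switch (info[i].Relationship) {
        case RelationProcessorCore:
            printf("core     : mask 0x%llx\n",
                   (unsigned long long)info[i].ProcessorMask);
            break;
        case RelationCache:
            printf("L%u cache : mask 0x%llx\n",
                   info[i].Cache.Level,
                   (unsigned long long)info[i].ProcessorMask);
            break;
        case RelationProcessorPackage:
            printf("package  : mask 0x%llx\n",
                   (unsigned long long)info[i].ProcessorMask);
            break;
        default:
            break;
        }
    }
    free(info);
    return 0;
}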

Jim
