Why such a poor improvement in run time as the number of threads increases from 1 to 4?
And why does Tcpu increase with the number of threads (although not in a linear fashion)?
The platform is a PC workstation with two Xeon 5160 dual-core processors (3.0 GHz), four cores in total. There was no disk I/O for these runs, as everything was in core. The OS was Windows XP Pro x64, the compiler was IVF 9.1, and the math library was MKL 9.0.
Overall, this workstation with the Xeon 5160's and MKL is very much faster than my old Pentium 4 (3.2 GHz) single processor. But these results suggest that dual processors are not cost effective (but certainly one dual core processor is worthwhile).
John
Number of Threads | Wall Time (hr) | CPU Time (hr) | Twall / Tcpu | Speed Improvement
4 | 1.28 | 4.38 | 0.29 | 1.28
3 | 1.33 | 3.5 | 0.38 | 1.23
2 | 1.38 | 2.51 | 0.55 | 1.19
1 | 1.64 | 1.64 | 1.0 | 1.0
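For readers who want to check the arithmetic behind the table, here is a minimal sketch in Python. The timings are copied from the post; the function names are made up for illustration, and using Amdahl's law to back out a "parallel fraction" is only an assumed model (it ignores memory-bandwidth and bus contention, which may well be the real limit here).

```python
# Wall and CPU times in hours, copied from the table in the post:
# threads -> (wall_time, cpu_time)
runs = {1: (1.64, 1.64), 2: (1.38, 2.51), 3: (1.33, 3.5), 4: (1.28, 4.38)}


def speedup(threads):
    """Speed improvement relative to the single-thread wall time."""
    return runs[1][0] / runs[threads][0]


def efficiency(threads):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(threads) / threads


def amdahl_parallel_fraction(threads):
    """Parallel fraction p implied by Amdahl's law, S = 1 / ((1 - p) + p / n).

    Assumed model only; not defined for a single thread.
    """
    s = speedup(threads)
    return (1 - 1 / s) / (1 - 1 / threads)


for n in sorted(runs):
    wall, cpu = runs[n]
    line = f"{n} threads: speedup {speedup(n):.2f}, Twall/Tcpu {wall / cpu:.2f}"
    if n > 1:
        line += f", implied parallel fraction {amdahl_parallel_fraction(n):.2f}"
    print(line)
```

Under that simple model, the 4-thread run behaves as if less than a third of the work scaled with thread count, which is consistent with the complaint in the post but says nothing about the cause.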
You could argue that multiple cores aren't cost effective, if you are testing nothing but front side bus performance. Even for that purpose, you must ensure that the threads distribute evenly between sockets and don't jump often between them, generating extra bus traffic. If you are restricted to legacy Windows versions, your options are limited, but you could try the affinity check boxes to optimize 1 and 2 thread performance.
If you have a snoop filter, you should test with it both enabled and disabled (BIOS setup option). For optimum performance with snoop filter off, you might experiment with changing the preferred order of mapping threads to cores, as you could do with taskset.
Suggestion:
Set up the application to run with a single thread. Experiment with your block size, reducing it from 2500 x 2500 and monitoring the effect on wall time. At some point you will find a "sweet spot" where you achieve the best performance.
Next, set up the application to run with 2 threads and restrict the threads to processors 0,1 (you can do this with Task Manager). Manipulate the block size until you identify the sweet spot for 2 cores on the same processor.
Last, enable all 4 cores. You might not need to adjust for new sweet spot.
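The block-size sweep described above can be sketched roughly as follows. This is only an illustration: the real application and its kernel are not shown in the thread, so a small pure-Python blocked matrix multiply stands in for it, and `blocked_matmul`, `sweep`, and the sizes used are hypothetical. In pure Python the timings mostly reflect interpreter overhead rather than cache effects; the point is the shape of the experiment (time the same work at several block sizes, keep the fastest).

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) in block x block tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def sweep(n, block_sizes):
    """Time the stand-in kernel once per block size; returns {block: seconds}."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in block_sizes:
        start = time.perf_counter()
        blocked_matmul(a, b, n, block)
        timings[block] = time.perf_counter() - start
    return timings


timings = sweep(64, [8, 16, 32, 64])
best_block = min(timings, key=timings.get)
for block in sorted(timings):
    print(f"block {block:3d}: {timings[block]:.4f} s")
print("fastest block size in this run:", best_block)
```

For a real measurement you would repeat each timing several times and run the sweep separately for the 1-thread, 2-thread, and 4-thread configurations.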
If you do not achieve significant improvement then you will have to dissect your program using VTune or another performance analysis tool. I would expect your problem to attain a 3x improvement using 4 cores. If the application is memory bandwidth bound, and if that cannot be fixed by getting better utilization out of the cache, then you might be stuck with the abysmal performance gain.
Don't be too quick to give up.
Jim Dempsey
taskset -c 0,2 yourbenchmark (syntax for the Linux 2.6 kernel)
It's also possible that you would get the best performance by optimizing block size for 1 thread per socket, so that each thread gets the full 4MB cache to itself. You would always require taskset -c 0,1 (or 2,3) to schedule properly, unless EL4_U4 is able to do it automatically. This way, you should get speedup by at least 1.8.
I was not able to infer what OS was in use in the original post; now I am assuming Jim figured this out correctly.
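As a rough in-process counterpart to the taskset command above, a program can also pin itself. The sketch below uses Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which are available on Linux only; the helper names `choose_cpus` and `pin_current_process` are made up for this example, and which CPU numbers share a socket is an assumption you would need to verify on your own machine.

```python
import os


def choose_cpus(requested, available):
    """Return the requested CPUs that the process is actually allowed to use."""
    chosen = set(requested) & set(available)
    if not chosen:
        raise ValueError(f"none of {sorted(requested)} is in {sorted(available)}")
    return chosen


def pin_current_process(requested):
    """Restrict the calling process to the requested CPUs (Linux only)."""
    available = os.sched_getaffinity(0)  # 0 means "this process"
    os.sched_setaffinity(0, choose_cpus(requested, available))
    return os.sched_getaffinity(0)


if __name__ == "__main__" and hasattr(os, "sched_setaffinity"):
    # Assumption: CPUs 0 and 2 sit on different sockets, as in "taskset -c 0,2".
    try:
        print("now restricted to CPUs:", sorted(pin_current_process([0, 2])))
    except ValueError as err:
        print("could not pin:", err)
```

Threads started after the call inherit the restricted mask, so this has to run before the worker threads are created.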
Tim, you are quite right. The architecture of the system, as well as the OS, determines how the cores are sequenced. The system I use has two AMD Opteron dual-core processors (4 cores). The organization appears to be ((0,1),(2,3)). I would be surprised if the organization were ((0,2),(1,3)), but it is not completely outside the realm of someone's thinking. In the former organization, ((0,1),(2,3)), adjacent processor affinity bits have higher affinity, i.e. they share the same L2/L3 cache, and theoretically a thread switch between the adjacent pair could experience favorable cache hits. Separating the affinity, ((0,2),(1,3)), makes sense only if your application is somewhat brain dead: then, by saying "run on 2 processors (0,1)", you will get better performance if the threads saturate each cache. You get worse performance if the two threads can run inside the unified cache (i.e. they would run better on the same chip). You win some, you lose some. I think when you had 2 processors, each with HT, the ((0,2),(1,3)) organization makes sense.
The processor sequencing may be a BIOS issue as well as an operating system issue. Looking at the BIOS information may help.
When he makes the 2-core test he should make two such tests, (0,1) and (0,2). It might not hurt to also test the other permutations: (0,3), (1,2), (1,3), (2,3).
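The six two-core combinations listed above, and the affinity bitmask that corresponds to each, can be generated with a few lines of Python. This is just a convenience sketch; `core_pairs` and `affinity_mask` are names invented here, and the masks follow the usual convention that bit N stands for processor N.

```python
from itertools import combinations


def core_pairs(core_count=4):
    """All distinct pairs of core indices, e.g. (0, 1), (0, 2), ..."""
    return list(combinations(range(core_count), 2))


def affinity_mask(cores):
    """Bitmask with one bit set per core index (bit 0 = processor 0)."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


for pair in core_pairs():
    print(f"cores {pair}: affinity mask 0x{affinity_mask(pair):X}")
```

Running the same benchmark once per pair and comparing wall times should show which pairs share a chip, since those are the pairs that compete for (or benefit from) the same cache.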
Additional information.
The original post stated there was no I/O. There was no indication whether the application had high video updates (some users do not think of this as I/O). If there is moderate video activity it is best not to use an on-board video adapter. Moderate video would include browsing the internet. High video would be displaying the interim results in real time or watching a video.
On my 4-core system, running with on-board video, opening a CMD window (DOS box) and issuing "DIR *.* /S", which is a directory listing of the entire disk, would saturate one processor's computing capability, i.e. the system showed 25% utilization with one processor's CPU time at 100%. After I installed a relatively inexpensive video card (e-GeForce MX 4000 PCI w/ 128MB), the processor time dropped to between 0% and 1%; I could hardly see the 1% on the display. Additionally, the directory listing completed about twice as fast. Because I use 2 displays I opted to install two of those video cards. It runs real nice now.
The parallel application I run uses several to many big things and not one humongous thing, so I can distribute work by things. I can achieve over 3x improvement in my application (~85% of 4x = 3.4). I spent a significant amount of time tuning the application. It would be nice if I could afford a 2 x 4-core system (or 4 x 4-core), but that will have to wait.
Jim Dempsey
When you are on a Windows platform you can call GetLogicalProcessorInformation to retrieve a table of information regarding the arrangement of processors on the system and how they relate to the affinity bitmask set up by the OS. See:
Jim
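A minimal sketch of such a query, written in Python with ctypes rather than in Fortran or C, might look like the following. Treat it as an untested illustration: the structure declaration collapses the Win32 union into its reserved 16 bytes, the helper names `mask_to_cpus` and `query_packages` are invented for this example, and whether package-level entries are reported depends on the Windows version.

```python
import ctypes
import sys

# Values of the LOGICAL_PROCESSOR_RELATIONSHIP enumeration.
RELATION_PROCESSOR_CORE = 0
RELATION_PROCESSOR_PACKAGE = 3


class SystemLogicalProcessorInformation(ctypes.Structure):
    """Simplified SYSTEM_LOGICAL_PROCESSOR_INFORMATION (union kept opaque)."""
    _fields_ = [
        ("ProcessorMask", ctypes.c_size_t),    # ULONG_PTR affinity bitmask
        ("Relationship", ctypes.c_int),        # what the entry describes
        ("Reserved", ctypes.c_ulonglong * 2),  # stands in for the 16-byte union
    ]


def mask_to_cpus(mask):
    """Decode an affinity bitmask into a sorted list of processor indices."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def query_packages():
    """Return one list of processor indices per physical package (Windows only)."""
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    length = ctypes.c_ulong(0)
    # First call with no buffer: expected to fail and report the size needed.
    kernel32.GetLogicalProcessorInformation(None, ctypes.byref(length))
    count = length.value // ctypes.sizeof(SystemLogicalProcessorInformation)
    buffer = (SystemLogicalProcessorInformation * count)()
    if not kernel32.GetLogicalProcessorInformation(buffer, ctypes.byref(length)):
        raise ctypes.WinError(ctypes.get_last_error())
    return [mask_to_cpus(entry.ProcessorMask) for entry in buffer
            if entry.Relationship == RELATION_PROCESSOR_PACKAGE]


if sys.platform == "win32":
    for index, cpus in enumerate(query_packages()):
        print(f"package {index}: processors {cpus}")
```

On a two-socket dual-core machine, a result such as two packages holding processors 0,1 and 2,3 would confirm the ((0,1),(2,3)) organization discussed earlier in the thread.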