Performance Gain on multi-core/processor system?

Dost__Conrad_W · ‎02-06-2011

I have converted a CPU-bound application to use multiple threads in order to gain performance when running on multicore machines. I found that it actually runs a bit slower with 2 threads than with 1 thread on the same machine. I am trying to determine why this happens and whether I can expect my threaded application to run faster on a different system.

I am relatively positive that the problem is not access to shared regions. There is a very small critical region and the threads access the critical region infrequently.

The application is a large, memory-intensive EDA program, so two threads should increase the memory bandwidth requirements. I suspect that my machine does not have enough throughput to handle this increased demand. I am running on my MacBook Pro, which has the following specs:

Model Identifier: MacBookPro7,1

Processor Name: Intel Core 2 Duo

Processor Speed: 2.4 GHz

Number Of Processors: 1

Total Number Of Cores: 2

L2 Cache: 3 MB

Memory: 4 GB

Bus Speed: 1.07 GHz

I am considering purchasing a Mac Pro with 8 or 12 cores and 2 processors. Assuming that the threads are completely independent and each spends a lot of time accessing memory, would multiple threads on one of these machines run faster than a single thread?

Ilnar · ‎02-06-2011

You don't need to purchase a new computer right now, because the memory bandwidth will be similar, perhaps a little higher, but not twice as high or more.
It seems your program needs to be optimized in the following ways:
- eliminate false sharing of cache lines, if it is present
- make the algorithm implementation cache-oblivious
- optimize your program for a smaller cache, because the cache size per thread becomes smaller
- hm, something else, ...
- purchase a new computer with a processor that has an integrated memory controller (first- or second-generation Core i7). If possible, get a computer with 2 processors with integrated memory controllers (Xeon E55** or E56** series, where ** is 20 or higher) in order to use several memory banks simultaneously

timintel · ‎02-07-2011

You won't get value from a dual CPU box unless you solve basic threading problems. It seems unlikely that your threads are "completely independent." If they are nearly so, it's barely conceivable that each thread requires most of the cache, so that they ruin each other's cache locality, but, as the other response indicated, more basic problems are likely. The biggest upgrade which would be useful until you solve such problems would be Core i7. I have no idea about Mac adoption of Core i7-2.

Ilnar · ‎02-07-2011

Try to figure out the bottlenecks using Parallel Inspector on a Windows machine.

gaston-hillar · ‎02-07-2011

Hey conradca,

It's impossible to write an answer without details about your parallelized algorithm.

Can you provide additional details?

Cheers,

Gaston

Dost__Conrad_W · ‎02-08-2011

My program is an ATPG that reads a description of a chip and generates test vectors that test it for manufacturing defects.

The areas where locks are required are memory allocation, file access and the shared fault list. The former two are not significant, as the threads do not make many memory allocation or file I/O calls. I am relatively certain that access to the shared fault list is not a concern either, because of the way I have designed access to it. I will try changing the way the fault list is accessed so that each thread has its own independent fault list, to eliminate this concern.

While it's likely that the code could be cached effectively, I suspect that the threads would be bound by accesses to physical memory. There is a huge amount of data that the threads must access, which I think would overwhelm the cache. That is why I think the threads are limited by the throughput available for memory access. If that is the case, then adding an extra thread to take advantage of the second core would only slow down the test generation process.
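The reasoning above can be put into a rough back-of-the-envelope model. The sketch below is only an illustration, not a measurement of the ATPG: it assumes the work splits into a compute part that divides across threads and a memory-traffic part that goes over one shared bus. The function name and every number in it are hypothetical.

```python
def estimated_runtime(compute_seconds, bytes_moved, bandwidth_bytes_per_s, threads):
    """Crude model: run time is the larger of the compute time (which divides
    across threads) and the memory-transfer time (which does not, because all
    threads share the same path to memory)."""
    compute_time = compute_seconds / threads
    memory_time = bytes_moved / bandwidth_bytes_per_s
    return max(compute_time, memory_time)


# Hypothetical numbers, chosen only to show the shape of the effect:
# 10 s of computation, 80 GB of memory traffic, 8 GB/s of usable bandwidth.
one_thread = estimated_runtime(10.0, 80e9, 8e9, threads=1)
two_threads = estimated_runtime(10.0, 80e9, 8e9, threads=2)
print(one_thread, two_threads)
```

In this model a second thread gives no gain once the memory time dominates. The model does not by itself explain a slowdown; that would need an extra cost, such as the two threads evicting each other's data from the shared L2 cache.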

Here are the two top-end Apple Mac Pros:

Two 2.66GHz 6-Core Intel Xeon Westmere processors

One 2.8GHz Quad-Core Intel Xeon "Nehalem" processor

Ilnar · ‎02-08-2011

1. try to use the scalable memory allocator from TBB
2. try to use an SSD as file storage, and pay attention to throughput on 4-8 KB reads/writes (not on 256 KB, which is just marketing); read the articles with benchmarks
3. choose a system with processor(s) with an integrated memory controller and more memory channels

Dost__Conrad_W · ‎02-20-2011

The program performs the vast majority of I/O and memory allocations before the threads are created. Once they are created, the threads do a little bit of I/O: each writes 1 line of report to a file and the screen, writes each test vector (maybe 1k bytes) to a file, and allocates a few small blocks of memory to support these activities. I can't believe that this limits performance.

The threads share access to the fault list, so it is possible they're waiting to access it. However, the locks are on a per-fault basis and each thread uses a different pseudo-random access method. So there are about 50,000 faults, each locked and accessed independently. I doubt that this is the problem.
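Whether the per-fault locks are ever contended can be counted rather than guessed. The sketch below is an illustration in Python of the general technique (try the lock without blocking first, and count the failures); in a C or C++ program the same idea would use something like pthread_mutex_trylock. The class name, the fault count of 1,000 and the empty critical section are all placeholders, not the real program.

```python
import random
import threading


class CountingLock:
    """Wraps a lock and counts how often an acquire had to wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.contended = 0

    def __enter__(self):
        # Try without blocking; failure means another thread holds the lock.
        waited = not self._lock.acquire(blocking=False)
        if waited:
            self._lock.acquire()
        # The counters are only updated while the lock is held, so this is safe.
        self.acquisitions += 1
        if waited:
            self.contended += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def worker(locks, seed):
    order = list(range(len(locks)))
    random.Random(seed).shuffle(order)  # each thread visits faults in its own order
    for i in order:
        with locks[i]:
            pass  # the per-fault work would go here


locks = [CountingLock() for _ in range(1000)]  # one lock per (hypothetical) fault
threads = [threading.Thread(target=worker, args=(locks, seed)) for seed in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(lock.acquisitions for lock in locks)
contended = sum(lock.contended for lock in locks)
print(f"{contended} of {total} acquisitions had to wait")
```

If the contended count stays near zero in the real program, the fault list locks can be ruled out and attention can move to memory behaviour.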

I suspect my MacBook Pro is the problem. While it has 2 cores it is not designed for performance and I suspect that there is not enough bandwidth to memory or a large enough cache to keep 2 CPU/Memory intensive threads running at full speed. So your third suggestion is the one I need to try.

What is "Try to use scalable memory allocator from TBB" ?

Ilnar · ‎02-20-2011

It means that memory allocations in a multithreaded environment require synchronization inside the memory allocation functions. There are many allocators that scale in multithreaded programs, such as TBB's scalable allocator and affinity allocator.

By the way, could you please present some data about:
- how much time does your program spend in the serial stage (before the threads are created)? Just measure it.
- how much time does it spend in the parallel stage?
- how much time does it spend in I/O? You can use the strace utility or something like it on your system.
I tend to think that your program spends a lot of time in I/O.
To check that, you can create an in-memory file system and use it for microbenchmarking.
If I/O is the bottleneck, there are solutions like an SSD, maybe in RAID.
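The measurements asked for above could be collected with a small timing helper, and the serial fraction then fed into Amdahl's law to see the best speedup the parallel stage could give. This is a minimal sketch: the `stage` helper, the stage names and the two stand-in workloads are made up for illustration and would be replaced by the real setup and test generation code.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def stage(name):
    """Accumulate wall-clock time spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


def amdahl_speedup(serial_fraction, threads):
    """Upper bound on speedup when only the non-serial part is parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


with stage("serial"):
    data = [i * i for i in range(100_000)]  # stand-in for reading the chip description
with stage("parallel"):
    total = sum(data)                       # stand-in for the threaded test generation

serial_fraction = timings["serial"] / (timings["serial"] + timings["parallel"])
print(f"serial fraction: {serial_fraction:.2f}")
print(f"best case on 2 threads: {amdahl_speedup(serial_fraction, 2):.2f}x")
```

If the measured 2-thread run is well below this bound, the loss comes from the parallel stage itself (contention, cache or memory bandwidth) and not from the serial setup.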

If you want to buy a new computer, you can select the brand-new Sandy Bridge i7-2600 for a single socket, or a Xeon 5620 or higher for two sockets.

Ilnar · ‎02-20-2011

Also, you can test your program on different new hardware through the Intel Partner Program.
What operating system do you use?

Ilnar · ‎02-20-2011

Sorry, I misunderstood about the I/O earlier.

As I understand it, the initialization time should be small, and there are no large memory blocks.
Could you please clarify a little more about the fault lists?
- how many lists per thread?
- something about the list declaration and element size
- min, avg, max length of the lists (is that the 50,000?)
- what kind of synchronization do you use?

Dost__Conrad_W · ‎02-23-2011

My system is an Apple MacBook Pro which has the following spec:

Model Name: MacBook Pro

Model Identifier: MacBookPro7,1

Processor Name: Intel Core 2 Duo

Processor Speed: 2.4 GHz

Number Of Processors: 1

Total Number Of Cores: 2

L2 Cache: 3 MB

Memory: 4 GB

Bus Speed: 1.07 GHz

It runs Apple's version of Unix.

Dost__Conrad_W · ‎03-08-2011

I am using a MacBook Pro to develop and run my program. It runs Mac OS X Snow Leopard, which is a variant of UNIX. I ported the program from a Sun system without much trouble.

Does the Intel Developer program have machines that run Linux or Solaris?

dwms · ‎03-16-2011

Is it possible to split your data set and run two single-threaded processes? Even if this won't produce correct output, it'd be interesting to see if it alleviates the performance problem you're seeing. I suppose it depends on the nature of your application whether this comparison would make sense.
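The comparison suggested here could be scripted roughly as follows. This is a sketch under assumptions: the `WORKLOAD` one-liner stands in for the real single-threaded program working on one half of the data, and the function names are made up for illustration.

```python
import subprocess
import sys
import time

# Stand-in for the single-threaded program working on half of the data set.
WORKLOAD = "sum(i * i for i in range(300_000))"


def run_sequential(n):
    """Run n copies one after another and return the total wall-clock time."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", WORKLOAD], check=True)
    return time.perf_counter() - start


def run_concurrent(n):
    """Start n copies at once, wait for all, and return the wall-clock time."""
    start = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", WORKLOAD]) for _ in range(n)]
    codes = [p.wait() for p in procs]
    elapsed = time.perf_counter() - start
    if any(codes):
        raise RuntimeError(f"a child process failed: {codes}")
    return elapsed


seq = run_sequential(2)
con = run_concurrent(2)
print(f"one after another: {seq:.2f} s, side by side: {con:.2f} s")
```

Separate processes share no locks or allocator state, so if the side-by-side time is close to the one-after-another time, the two copies are competing for something in the hardware, such as cache or memory bandwidth, rather than for anything in the program's own synchronization.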

jimdempseyatthecove · ‎03-17-2011

I agree with Gaston in that without seeing or profiling the application it is difficult to determine which system to use and where the bottlenecks are located.

It would appear from your description (ATPG) that this is an app that performs mostly integer/logical arithmetic. Further, my guess is that the app is memory-access intensive. Therefore, look for a processor + motherboard combination that provides the larger number of memory channels (multiplied by their speed) and a CPU with the number of cores needed to saturate the memory bus (plus one or more cores if something else is happening on the system).

Using a profiler will tell you where the hot spots are, but it will generally not inform you of design issues that may cause performance problems. You might want to look at underlying design issues such as Array of Structures vs. Structure of Arrays, and at other issues such as how you allocate your vectors (proximity of interrelated data).

This is not necessarily an easy process, since it takes a full understanding of how the data accesses interrelate; with this knowledge you can then organize the data for more efficient access (packing cache lines, packing virtual memory pages to reduce TLB pressure).
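As a rough illustration of why the layout matters, the arithmetic below counts how many cache lines a scan of a single field touches under each layout. It is a sketch with assumed sizes: a 64-byte cache line, 50,000 records of 64 bytes each, and a 4-byte field are hypothetical numbers, and the arrays are assumed to start on a cache line boundary with no field straddling two lines.

```python
CACHE_LINE_BYTES = 64  # assumed line size


def ceil_div(a, b):
    return (a + b - 1) // b


def lines_touched_aos(num_records, record_bytes):
    """Array of Structures: reading one field per record walks the whole array,
    so every line of the array is pulled in (at most one line per record)."""
    return min(num_records, ceil_div(num_records * record_bytes, CACHE_LINE_BYTES))


def lines_touched_soa(num_records, field_bytes):
    """Structure of Arrays: the scanned field is stored contiguously, so only
    that one array is pulled in."""
    return ceil_div(num_records * field_bytes, CACHE_LINE_BYTES)


aos = lines_touched_aos(50_000, record_bytes=64)
soa = lines_touched_soa(50_000, field_bytes=4)
print(f"AoS: {aos} lines, SoA: {soa} lines, ratio {aos / soa:.0f}x")
```

With these assumed sizes the same scan moves 16 times less data in the SoA layout, which matters most when the threads are already limited by memory bandwidth.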

I am not familiar with how the Mac Pros structure their RAM (number of sockets, NUMA, type, etc.). If this system will be a workhorse, then consider a dual-socket Nehalem (e.g. Dell R610/R710...).

Jim Dempsey

TimP · ‎03-17-2011

Current NUMA mode dual CPU machines are specifically designed to provide larger aggregate memory bandwidth when using both CPUs, if the application is arranged so that each thread uses the memory local to its CPU.