It seems your program needs to be optimized in the following ways:
- eliminate false sharing of cache lines, if it is present
- make the algorithm implementation cache-oblivious
- optimize your program for a smaller cache size, because the cache size per thread becomes smaller
- hm, something else, ...
1. purchase a new computer with a processor that has an integrated memory controller (first- or second-generation Core i7). If possible, get a computer with 2 processors with integrated memory controllers (Xeon E55** or E56** series, where ** is 20 or higher) so that the memory banks of both can be used simultaneously
2. try to use an SSD as file storage; pay attention to throughput on 4-8 KB reads/writes (not on 256 KB, that is just marketing). Just read the articles with benchmarks
3. choose a system with processor(s) that have an integrated memory controller and more memory channels
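For the false sharing point, here is a minimal sketch of the usual fix: give each thread's hot data its own cache line. The 64-byte line size, the four-thread count and the names `PackedCounters` and `PaddedCounter` are assumptions for illustration only, nothing here is taken from the actual program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size. 64 bytes is typical for current x86 CPUs,
// but check the value for your own processor.
constexpr std::size_t kCacheLine = 64;

// Prone to false sharing: four 8-byte counters sit in one cache line,
// so a write by one thread invalidates the line for the other threads.
struct PackedCounters {
    std::uint64_t hits[4];
};

// One counter per cache line: alignas rounds both the alignment and
// the size of the struct up to kCacheLine.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t hits = 0;
};

// Hypothetical per-thread table: thread t writes only to per_thread[t].
struct PaddedCounters {
    PaddedCounter per_thread[4];
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one cache line");
```

The price is memory: the padded table takes 256 bytes instead of 32, so it is only worth doing for data that several threads really write to at the same time.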
By the way, could you please present some data about:
- how much time does your program spend in the serial stage (before the threads are created)? Just measure it.
- how much time ..... in the parallel stage?
- how much time is spent ... in IO? You can use the strace utility or something like it on your system.
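The serial and parallel stage timings asked for above could be collected with something like this sketch. `time_stage` and `serial_fraction` are hypothetical helper names, and the stage you pass in stands for whatever your program does before and after the threads are created.

```cpp
#include <chrono>
#include <utility>

// Runs `stage` once and returns its wall-clock duration in seconds.
// steady_clock is used because it never jumps backwards.
template <typename Stage>
double time_stage(Stage&& stage) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Stage>(stage)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Share of the measured run time that was spent in the serial stage.
// Returns 0 when nothing was measured, to avoid dividing by zero.
inline double serial_fraction(double serial_seconds, double parallel_seconds) {
    const double total = serial_seconds + parallel_seconds;
    return total > 0.0 ? serial_seconds / total : 0.0;
}
```

Wrap the setup code in one `time_stage` call and the thread creation plus join in another, then print both numbers. This measures wall-clock time only, so time spent waiting on IO inside a stage is included in that stage.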
I tend to think that your program spends a lot of time in IO.
To check that, you can create an in-memory file system and try to use it for microbenchmarking.
If IO is the bottleneck, there are solutions like an SSD, maybe in RAID.
If you want to buy a new computer, you can select a brand new Sandy Bridge i7-2600 for a single socket, or a Xeon 5620 or higher for two sockets.
Sorry, I misunderstood about the IO earlier.
As I understand it, the initialization time should be small, and there are no large memory blocks.
Could you please clarify a little more about the fault lists?
- how many lists per thread?
- something about the list declaration and the element size
- min, avg, max length of the lists == 50,000?
- what kind of synchronization do you use?
Model Name: MacBook Pro
Model Identifier: MacBookPro7,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2.4 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 3 MB
Memory: 4 GB
Bus Speed: 1.07 GHz
It runs Apple's version of Unix.
It would appear from your description (ATPG) that this is an app that performs mostly integer/logical arithmetic. Further, my guess is that the app is memory-access intensive. Therefore, look for a processor + motherboard combination that provides the larger number of memory channels (multiplied by speed) and a CPU with the number of cores needed to saturate the memory bus (plus one or more cores if something else is happening on the system).
Using a profiler will tell you where the hot spots are, but it will generally not inform you of design issues that may cause performance problems. You might want to look at underlying design issues such as Array of Structures vs. Structure of Arrays, and at other issues such as how you allocate your vectors (proximity of interrelated data).
This is not necessarily an easy process, since it takes a full understanding of how the data accesses interrelate. With this knowledge you can then organize the data for more efficient access (packing cache lines, packing virtual memory pages to reduce TLB pressure).
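To make the Array of Structures vs. Structure of Arrays point concrete, here is a rough sketch. The fault record and its fields (`gate_id`, `detected`, `name`) are invented for illustration, since I have not seen your actual data structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Array of Structures: hot and cold fields share the same cache lines,
// so a scan over `detected` also drags the unused `name` bytes through
// the cache. (Hypothetical record layout.)
struct FaultAoS {
    std::uint32_t gate_id;
    std::uint8_t  detected;
    char          name[64];  // cold data, rarely read in the hot loop
};

// Structure of Arrays: each field is stored contiguously, so a scan
// over `detected` touches only densely packed flag bytes.
struct FaultsSoA {
    std::vector<std::uint32_t> gate_id;
    std::vector<std::uint8_t>  detected;
    // cold data such as names would live in its own separate array
};

// Same query on both layouts: how many faults are still undetected?
inline std::size_t count_undetected(const std::vector<FaultAoS>& faults) {
    std::size_t n = 0;
    for (const FaultAoS& f : faults) n += (f.detected == 0);
    return n;
}

inline std::size_t count_undetected(const FaultsSoA& faults) {
    std::size_t n = 0;
    for (std::uint8_t d : faults.detected) n += (d == 0);
    return n;
}
```

With this made-up layout the AoS scan strides through memory in steps of `sizeof(FaultAoS)` bytes per fault, while the SoA scan reads one byte per fault. Whether that matters for your app depends on which fields the hot loops actually touch together, which is exactly the access-pattern knowledge mentioned above.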
I am not familiar with how the Mac Pros structure their RAM (number of sockets, NUMA, type, etc.). If this system will be a workhorse, then consider a dual-socket Nehalem (e.g. Dell R610/R710...).