Optimizing on multi-core architectures

terrysb · ‎02-05-2009

We have a highly multi-threaded server application and would like some recommendations regarding how we can optimize this application for large configurations consisting of a minimum of 8 cores and 256GB RAM running under Linux or Windows 2008 (both 64-bit). Are there techniques we should consider that would result in better runtime or memory efficiency? Are there specific recommendations that we can adopt that have been used by some of the larger software vendors such as Oracle?

srimks · ‎02-07-2009

Hi

The question doesn't give much detail. In brief, since this is a multi-threaded server application, I can suggest the following (assuming some of the code is still sequential):
(a) Analyze memory leaks and static-analysis warnings.
(b) Analyze the behaviour of the threads using Intel Thread Checker.
(c) Profile the code.

After (a), (b), and (c), you should have a clearer picture of what needs to be done.

(d) Consider parallelization and vectorization where needed.

After (d), things should be much clearer.

~BR

fraggy · ‎02-07-2009

Quoting - terrysb

We have a highly multi-threaded server application

What kind of application is it? Is it already multithreaded? Why?

Do you want to do some software optimisation on the application?

robert-reed · ‎02-07-2009

16 GB per core suggests your server application is very memory intensive. The first thought given such a memory requirement in a multi-threaded application is whether your memory allocation is efficient. There are a couple of multi-thread aware multi-platform memory allocators that might benefit your app: Hoard and Intel's own tbb_malloc. Both offer efficient allocation to multiple threads, though I've heard that tbb_malloc edges out Hoard in some memory tests.

Beyond that, it really depends a lot on the nature of your service application. If there's a fixed latency component to your service, back-end database access or intense computational processing for example, you'd want to break any dependency chains between the individual service requests and try to maximize parallelism as a latency-hiding technique. If the amount of computation per request is small, you'd want to increase the chance that such requests get handled immediately so as to avoid unnecessary thrashing in and out of caches on multiple cores.

I'd start by doing a basic hot-spot analysis to find the most active areas of your code. Focusing on efficiency in those areas is likely to yield the most benefit. Improvement may come from figuring out how to do the work more efficiently, or parallelizing it better, or, best yet, avoiding it altogether. Have you tried using Intel's VTune Analyzer to study your code yet? That might be a good place to start.

terrysb · ‎02-09-2009

Our application is memory intensive. To put it simply, it's like an in-memory database, where there are dozens of big data sets (100s of MB to 10s of GB each) and each data set is shared by 10 or more threads. Each request currently works on only one of the data sets (this could change soon), and each request will likely need a full scan of the data set (filtering and then aggregation). We do have various indices built into each data set for such computations. Different requests are processed on different threads.

Our app is multithreaded: hundreds of threads work simultaneously, some of them on the data sets while others handle other types of requests and the preparation of data sets.

We are trying different kinds of tools, including VTune (both Analyzer and Thread Profiler in evaluation stage), and we are using SmartHeap by MicroQuill as the memory manager.

Our basic questions:

In general, for this kind of application, where are the typical bottlenecks from the hardware point of view, especially when all CPUs are running above 80%?

What is the largest RAM configuration that an Intel CPU can support, for example, Nehalem/Westmere with the Boxboro Ex Platform? In other words, if we run such an application needing 100s of GB of RAM as described above, what kind of hardware configuration (existing and future) would Intel recommend?

robert-reed · ‎02-09-2009

Quoting - terrysb

In general, for this kind of application, where are the typical bottlenecks from the hardware point of view, especially when all CPUs are running above 80%?

What is the largest RAM configuration that an Intel CPU can support, for example, Nehalem/Westmere with the Boxboro Ex Platform? In other words, if we run such an application needing 100s of GB of RAM as described above, what kind of hardware configuration (existing and future) would Intel recommend?

The bottlenecks will depend on the memory modification requirements of your application. You mention that query threads will filter and aggregate results while other threads will be busy preparing other data sets. If the filter and aggregate activities only do private writes (buffering local to the thread) then that might be done without contention. Keeping the threads out of each other's way is the best way to maximize throughput.
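To make the "private writes" idea concrete, here is a minimal C++ sketch, not a description of your actual code. The `Row` layout, the `filter_and_sum` name, and the 64-byte cache-line size are all assumptions made up for the example. Each worker scans its own contiguous slice, accumulates into a stack-local variable, and writes its result exactly once into a padded slot that no other thread touches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical record layout; the real data sets are not described in the thread.
struct Row {
    std::int32_t key;
    std::int64_t value;
};

// One accumulator per thread, aligned to a cache line so that neighbouring
// slots never share a line (64 bytes is an assumption about the CPU).
struct alignas(64) PartialSum {
    std::int64_t sum = 0;
    std::size_t matches = 0;
};

// Scan `rows` with `num_threads` workers. Each worker reads a contiguous slice
// and writes only to its own PartialSum; results are merged after the joins.
PartialSum filter_and_sum(const std::vector<Row>& rows, std::int32_t wanted_key,
                          unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<PartialSum> partials(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (rows.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(rows.size(), t * chunk);
        const std::size_t end = std::min(rows.size(), begin + chunk);
        workers.emplace_back([&rows, &partials, wanted_key, t, begin, end] {
            PartialSum local;  // stack-local: no shared writes inside the loop
            for (std::size_t i = begin; i < end; ++i) {
                if (rows[i].key == wanted_key) {
                    local.sum += rows[i].value;
                    ++local.matches;
                }
            }
            partials[t] = local;  // single write to this thread's own slot
        });
    }
    for (auto& w : workers) w.join();

    // Merge step: runs on the calling thread after all workers have finished.
    PartialSum total;
    for (const auto& p : partials) {
        total.sum += p.sum;
        total.matches += p.matches;
    }
    return total;
}
```

The shared data set is only read during the scan, so the workers need no locks; the only synchronization is the join before the merge.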

It also sounds like this might be a memory-bandwidth-bound application. The size of the data sets you describe surely exceeds the available cache sizes. Filter/aggregate processes are often data streamers: they visit each datum once and move on to the next. In such apps, there's probably not much to be gained from data reuse (taking advantage of data already in cache), but if these data sets are contiguous (or at least piecewise contiguous) in virtual memory, their regular memory access order may enable the hardware to prefetch the streams in advance. There is a limit to the number of simultaneous streams the hardware can keep track of, and piling everything onto one server is likely to exacerbate the effects of that.
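A quick back-of-envelope check can tell you whether bandwidth is the limit. The sketch below is only an idealized lower bound: the function name is hypothetical, it assumes concurrent scans split the sustained memory bandwidth evenly, and the bandwidth figure has to come from measuring your own machine, not from this thread.

```cpp
#include <cassert>

// Idealized lower bound, in seconds, on the time for one full scan of a data
// set when `concurrent_scans` equally sized scans share the memory bandwidth.
// Assumes an even split and no reuse from cache; both are simplifications.
double min_scan_seconds(double dataset_bytes, double bandwidth_bytes_per_sec,
                        int concurrent_scans) {
    assert(bandwidth_bytes_per_sec > 0.0);
    assert(concurrent_scans > 0);
    return dataset_bytes * concurrent_scans / bandwidth_bytes_per_sec;
}
```

For example, with a made-up sustained bandwidth of 20 GB/s, `min_scan_seconds(10e9, 20e9, 1)` gives 0.5 seconds for a single 10 GB scan, and `min_scan_seconds(10e9, 20e9, 10)` gives 5 seconds each when ten such scans run at once. If your measured scan times are already close to this bound, adding threads will not help.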

For computational parallelism we usually recommend a thread pool no larger than the number of hardware threads (for some applications, this even means turning off SMT, Simultaneous Multi-Threading). Many server apps, however, have lots of latency to hide, and running multiple threads per CPU can sometimes hide it by processing a new request while an older request waits for some resource. Even so, driving average CPU utilization up to 80% on a server app seems a little high. Programs generally need some headroom to keep response times below a threshold: the less headroom, the longer the response time. I assume you're using one thread per request?
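For reference, a pool capped at the hardware thread count could look roughly like the following. This is a minimal sketch using only the C++ standard library; `FixedThreadPool` and `default_worker_count` are names invented for the example, and a production server would also want error handling, futures for results, and probably a bounded queue.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Worker count no larger than the number of hardware threads.
// hardware_concurrency() may return 0 when unknown, so fall back to 1.
inline unsigned default_worker_count() {
    const unsigned hw = std::thread::hardware_concurrency();
    return hw == 0 ? 1u : hw;
}

class FixedThreadPool {
public:
    explicit FixedThreadPool(unsigned workers = default_worker_count()) {
        assert(workers > 0);
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }

    // Lets the workers drain the queue, then joins every one of them.
    ~FixedThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& t : threads_) t.join();
    }

    FixedThreadPool(const FixedThreadPool&) = delete;
    FixedThreadPool& operator=(const FixedThreadPool&) = delete;

    // Queue a request; any idle worker picks it up.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

    std::size_t size() const { return threads_.size(); }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reached once stopping_ is set
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so workers do not serialize
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

With a pool like this, requests queue up instead of each getting its own thread, so the number of runnable threads stays at or below the number of hardware threads. If requests block on I/O or locks, a somewhat larger count may hide that latency, which is the trade-off described above.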

I haven't been keeping track of the hardware roadmap, so I'll let someone more versed in those details take a stab at your other question.