16 GB per core suggests your server application is very memory intensive. Given such a memory requirement in a multi-threaded application, the first question is whether your memory allocation is efficient. There are a couple of multithread-aware, multi-platform memory allocators that might benefit your app: Hoard and Intel's own tbb_malloc. Both offer efficient allocation across multiple threads, though I've heard that tbb_malloc edges out Hoard in some memory tests.
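To illustrate, here's a minimal sketch of using TBB's scalable allocator per container (the thread count and buffer size are arbitrary); alternatively, linking against the tbbmalloc proxy library replaces malloc/free program-wide with no source changes:

    #include <thread>
    #include <vector>
    #include <tbb/scalable_allocator.h>  // link with -ltbbmalloc

    // A std::vector whose memory comes from TBB's scalable allocator: each
    // thread allocates from thread-local pools instead of contending on the
    // global heap lock.
    using ScalableVec = std::vector<double, tbb::scalable_allocator<double>>;

    int main() {
        std::vector<std::thread> workers;
        for (int t = 0; t < 8; ++t) {
            workers.emplace_back([] {
                ScalableVec buffer(1 << 20, 0.0);   // per-thread allocation
                for (double& x : buffer) x += 1.0;  // touch the memory
            });
        }
        for (auto& w : workers) w.join();
    }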
Beyond that, it really depends a lot on the nature of your service application. If there's a fixed latency component to your service, back-end database access or intense computational processing for example, you'd want to break any dependency chains between the individual service requests and try to maximize parallelism as a latency-hiding technique. If the amount of computation per request is small, you'd want to increase the chance that such requests get handled immediately, so as to avoid unnecessary thrashing in and out of the caches on multiple cores. But the right strategy really depends on the application. I'd start by doing a basic hot-spot analysis to find the most active areas of your code. Focusing effort on those spots is likely to yield the most benefit. Improvement may come from doing the work more efficiently, parallelizing it better, or, best of all, avoiding it altogether. Have you tried using Intel's VTune Analyzer to study your code yet? That might be a good place to start.
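As a toy illustration of that latency-hiding idea (the 50 ms back-end delay and the request count are made up), launching independent requests with std::async lets their waits overlap instead of accumulating serially:

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    // Stand-in for a fixed-latency back-end call (e.g. a database query).
    std::string fetch_from_backend(int request_id) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return "result-" + std::to_string(request_id);
    }

    int main() {
        std::vector<std::future<std::string>> futures;
        // Independent requests have no dependency chain, so run them
        // concurrently; their 50 ms waits overlap rather than serialize.
        for (int id = 0; id < 8; ++id)
            futures.push_back(std::async(std::launch::async, fetch_from_backend, id));
        for (auto& f : futures)
            std::cout << f.get() << '\n';
    }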
The bottlenecks will depend on the memory-modification requirements of your application. You mention that query threads will filter and aggregate results while other threads are busy preparing other data sets. If the filter and aggregate steps do only private writes (buffering local to each thread), they can proceed without contention. Keeping the threads out of each other's way is the best way to maximize throughput.
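Here's a rough sketch of that pattern, assuming a simple filter-and-sum workload (the even-number filter is just a placeholder): each thread writes only to a private accumulator and performs a single shared write at the end.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Each worker filters and aggregates into a thread-private accumulator,
    // so the hot loop does no shared writes; partials are combined once.
    int64_t parallel_sum_evens(const std::vector<int>& data, unsigned nthreads) {
        std::vector<int64_t> partials(nthreads, 0);
        std::vector<std::thread> workers;
        const size_t chunk = (data.size() + nthreads - 1) / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            workers.emplace_back([&, t] {
                const size_t begin = t * chunk;
                const size_t end = std::min(data.size(), begin + chunk);
                int64_t local = 0;                    // private: no contention
                for (size_t i = begin; i < end; ++i)
                    if (data[i] % 2 == 0)             // "filter"
                        local += data[i];             // "aggregate"
                partials[t] = local;                  // one shared write per thread
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partials.begin(), partials.end(), int64_t{0});
    }

    int main() {
        std::vector<int> data(1'000'000);
        std::iota(data.begin(), data.end(), 0);
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::cout << parallel_sum_evens(data, n) << '\n';
    }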
It also sounds like this might be a memory-bandwidth-bound application. The size of the data sets you describe surely exceeds available cache sizes. Filter/aggregate processes are often data streamers: they visit each datum once and move on to the next. In such apps there's probably not much to be gained from data reuse (taking advantage of data already in cache), but if these data sets are contiguous (or at least piecewise contiguous) in virtual memory, their regular memory-access order may enable the hardware to prefetch the streams in advance. There is a limit to the number of simultaneous streams the hardware can keep track of, and piling everything onto one server is likely to exacerbate the effects of that.
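For what it's worth, here's an illustrative streaming loop. Sequential access over a contiguous array is what lets the hardware prefetcher run ahead; the explicit hint is optional, GCC/Clang-specific, and its look-ahead distance is a made-up tuning parameter:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    double stream_sum(const std::vector<double>& v) {
        constexpr std::size_t kAhead = 16;  // hypothetical prefetch distance
        double sum = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i) {
    #if defined(__GNUC__) || defined(__clang__)
            if (i + kAhead < v.size())
                __builtin_prefetch(&v[i + kAhead], 0 /*read*/, 0 /*non-temporal*/);
    #endif
            sum += v[i];  // visit each datum once and move on: a pure streamer
        }
        return sum;
    }

    int main() {
        std::vector<double> v(1 << 24, 1.0);
        std::cout << stream_sum(v) << '\n';
    }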
For computational parallelism we usually recommend a thread pool no larger than the number of hardware threads (for some applications, this even means turning off SMT, Simultaneous Multi-Threading). Many server apps, though, have lots of latency to hide, and going to multiple threads per CPU can sometimes hide that latency (continue processing a new request while an older request waits for some resource). Still, driving average CPU utilization up to 80% on a server app seems a little high. Programs generally need a little headroom to keep response times below some threshold--the less headroom, the longer the response time. I assume you're using one thread per request?
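As a starting point for sizing such a pool (just a sketch; the fallback value is arbitrary), std::thread::hardware_concurrency() reports the number of hardware threads:

    #include <iostream>
    #include <thread>

    int main() {
        // Counts hardware threads (SMT siblings included); with SMT disabled
        // this equals the physical core count. May return 0 if unknown.
        unsigned hw = std::thread::hardware_concurrency();
        unsigned pool_size = hw != 0 ? hw : 4;  // arbitrary fallback
        std::cout << "compute pool size: " << pool_size << '\n';
    }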
I haven't been keeping track of the hardware roadmap, so I'll let someone more versed in those details take a stab at your other question.