Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Q&A: Sarmiento, Threading Games for High Performance on Intel(R) Processors

Sara Sarmiento
Here is a question Intel Software Network Support received about this article, followed by a response provided by our Application Engineers:
Q. As I was reading about the multi-CPU tech, I could only think of one fact: the PIV is already starving, waiting for memory reads and writes to complete over an 800 MHz FSB. So I don't see 2 CPUs helping once you get down to the large data needs of most applications. Vector math, which has small data load requirements, doesn't match the majority of real-world needs, such as database search or video processing, both of which are bound by memory and disk I/O.

So basically, I cannot keep the current CPU fed with data to crunch.

If you want to speed things up, how about XMM0 to XMM15? Now that would help a bit. Also, byte manipulation is terrible in the PIV MMX/SSE instruction set, and bytes are what most RGB data is processed in.

There is lots more I could say, but how are 2 CPUs going to help me when I cannot keep the one CPU busy?

A. Admittedly, there are some applications which are memory bound; that is, they require a great deal of memory bandwidth in order to keep the CPU busy. This will always be the case, no matter how fast the memory subsystem is. As soon as you build a faster subsystem, someone will come up with a new algorithm or usage model that requires even more. We try to design our CPUs and chipsets with the greatest possible memory bandwidth within the constraints of a mass-produced, low-cost implementation. It is certainly possible to build a system where all the memory is as fast as the cache. Such a system, however, would be exorbitantly expensive. A very fast memory subsystem was one of the features of the old Cray* supercomputers, and they cost at least a million dollars each.

If you are having difficulty with memory bandwidth issues, there are a number of approaches that can be taken. One is to review the algorithm and see if it can be revised to accomplish the same task with a more friendly memory access pattern (i.e. locality of reference). Another possibility is to see if you can spin off a thread to do other work while waiting for memory accesses to complete. You can also try to use prefetching to get the required data in the cache prior to computation. RAID systems can also improve bandwidth for disk I/O, and larger disk caches can also help. There are some algorithms, however, which simply cannot run efficiently on the memory subsystem provided with computers designed for the mass market. They require special-purpose hardware to provide the memory bandwidth needed.
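As a rough illustration of two of the approaches above (a friendlier access pattern and prefetching), here is a minimal C sketch. All function names, the `PREFETCH_READ` macro, and the `PREFETCH_DISTANCE` value are hypothetical and chosen for this example only. The prefetch hint assumes a GCC- or Clang-compatible compiler and compiles to nothing elsewhere. Whether prefetching actually helps depends on the access pattern and the processor, so treat the distance as a value to tune by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Walks a row-major matrix column by column. Consecutive accesses are
   `cols` elements apart, so for large matrices most of them land on a
   different cache line: poor locality of reference. */
double sum_column_order(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double total = 0.0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}

/* Same sum, but walking memory in storage order, so every byte of each
   fetched cache line is used before moving on. */
double sum_row_order(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double total = 0.0;
    for (size_t i = 0; i < rows * cols; ++i)
        total += m[i];
    return total;
}

/* Prefetch hint: assumes GCC/Clang; a no-op on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* How many elements ahead to prefetch. An assumed starting point only. */
enum { PREFETCH_DISTANCE = 64 };

/* Indirect (gather-style) sum. The addresses are irregular, so the code
   requests data a fixed number of iterations before it is needed. */
double sum_indexed_with_prefetch(const double *data, const size_t *idx,
                                 size_t n)
{
    assert(data != NULL && idx != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_READ(&data[idx[i + PREFETCH_DISTANCE]]);
        total += data[idx[i]];
    }
    return total;
}
```

Both matrix functions return the same sum; only the order in which memory is touched differs, which is the kind of algorithm revision the answer describes. The prefetch variant is shown on an indexed access because that is a case where the addresses are known in advance but are not sequential.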

One of the steps we are taking to address the memory bandwidth issue is increasing the size of the L2 cache. Our current offerings include models that have 2 MB of L2 cache. This is a large increase over previous models. You may also expect future models to have even larger caches.

While some applications are memory bound, there are a great number which are not. Numerous benchmarks and mainstream applications have demonstrated the performance benefit of multi-core processors, and the greatest benefit can be realized with a well-balanced threading model for an application. Developing an efficient threaded application, however, is much more difficult than developing a single-threaded one. It requires more thought and careful consideration of the simultaneous execution of multiple code paths.

To help developers in this effort to thread their applications efficiently, we have developed several tools. The Intel Thread Checker and the Thread Profiler will help a developer analyze the performance of their code in a threaded environment, resolve bugs, and improve the efficiency of their threading model. If you have any questions about these tools, please go to and review the product descriptions.


Lexi S.

Intel Software Network Support


Message Edited by on 12-02-2005 08:47 PM

Message Edited by on 04-27-2006 08:19 AM

1 Reply
Honored Contributor III
Where I have seen the P4 limited by memory issues, the underlying problem is usually something like DTLB misses or Write Combine Buffer (WCB) thrashing.
Taking advantage of HyperThreading often helps where there are DTLB miss stalls, including in database operations, or so I've heard.
For WCB thrashing, HyperThreading is not a solution, but blocking writes so as to limit the number of cache lines being written at one time should help. The 64-bit P4 has 2 more WCBs than the 32-bit P4, and those are available in both 32- and 64-bit mode.
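To make the idea of blocking writes a little more concrete, here is a minimal C sketch of one way to do it. It splits an interleaved array into separate output arrays, first naively (every iteration writes to all output streams) and then in blocks, so that only a few output streams are being written at any one time. The function names and the `STREAMS_PER_PASS` and `BLOCK_ELEMS` values are assumptions made for this example, not figures for any particular P4 model; suitable values would have to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

enum {
    STREAMS_PER_PASS = 4,  /* assumed limit on output streams written at once */
    BLOCK_ELEMS = 1024     /* assumed block size, small enough that the input
                              block is likely still cached on later passes */
};

/* Naive split: each iteration stores to all n_streams output arrays, so
   n_streams different cache lines are being filled at the same time. */
void split_naive(const float *in, float *const *out,
                 size_t n_streams, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        for (size_t s = 0; s < n_streams; ++s)
            out[s][i] = in[i * n_streams + s];
}

/* Blocked split: for each block of elements, make several passes over the
   input, each pass writing at most STREAMS_PER_PASS output arrays. */
void split_blocked(const float *in, float *const *out,
                   size_t n_streams, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t base = 0; base < n; base += BLOCK_ELEMS) {
        size_t end = (n - base > BLOCK_ELEMS) ? base + BLOCK_ELEMS : n;
        for (size_t s0 = 0; s0 < n_streams; s0 += STREAMS_PER_PASS) {
            size_t s1 = (n_streams - s0 > STREAMS_PER_PASS)
                            ? s0 + STREAMS_PER_PASS
                            : n_streams;
            for (size_t i = base; i < end; ++i)
                for (size_t s = s0; s < s1; ++s)
                    out[s][i] = in[i * n_streams + s];
        }
    }
}
```

Both functions produce identical output. The blocked version re-reads each input block once per pass, trading some extra reads for fewer cache lines under construction at a time.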
Using the 16 XMM registers is most likely to be helpful where it improves data locality and avoids extra memory traffic. The Intel 64-bit compilers should do this fairly effectively when the source maps to SSE3 instructions. 64-bit mode provides a reasonable number of 64-bit general registers, which should help with byte operations not supported in SSE.