So basically I cannot keep the current CPU fed with data to crunch.
If you want to speed things up, how about XMM0 through XMM15? Now that would help a bit. Also, byte manipulation is terrible in the P4 MMX/SSE instruction set, which is what most RGB color-space work is processed with.
There's lots more I could say, but how are two CPUs going to help me when I cannot keep one CPU busy?
A. Admittedly there are some applications which are memory bound, that is, they require a great deal of memory bandwidth in order to keep the CPU busy. This will always be the case, no matter how fast the memory subsystem is. As soon as you build a faster subsystem, someone will come up with a new algorithm or usage model that requires even more. We try to design our CPUs and chipsets with the greatest possible memory bandwidth within the constraints of a mass-produced, low-cost implementation. It is certainly possible to build a system where all the memory is as fast as the cache. Such a system, however, would be exorbitantly expensive. A very fast memory subsystem was one of the features of the old Cray* supercomputers, and they cost at least a million dollars each.
If you are having difficulty with memory bandwidth issues, there are a number of approaches you can take. One is to review the algorithm and see if it can be revised to accomplish the same task with a more cache-friendly memory access pattern (i.e., better locality of reference). Another possibility is to spin off a thread to do other work while waiting for memory accesses to complete. You can also use prefetching to bring the required data into the cache before it is needed for computation. For disk I/O, RAID systems can improve bandwidth, and larger disk caches can also help. There are some algorithms, however, which simply cannot run efficiently on the memory subsystem provided with computers designed for the mass market. They require special-purpose hardware to provide the memory bandwidth needed.
One of the steps we are taking to address the memory bandwidth issue is increasing the size of the L2 cache. Our current offerings include models that have 2 MB of L2 cache. This is a large increase over previous models. You may also expect future models to have even larger caches.
While some applications are memory bound, a great number are not. Numerous benchmarks and mainstream applications have demonstrated the performance benefit of multi-core processors, and the greatest benefit is realized with a well-balanced threading model for an application. Developing an efficient threaded application, however, is much more difficult than developing a single-threaded one. It requires more thought and careful consideration of the simultaneous execution of multiple code paths.
To help developers thread their applications efficiently, we have developed several tools. The Intel Thread Checker helps a developer find threading bugs such as data races, and the Thread Profiler helps analyze the performance of code in a threaded environment and improve the efficiency of the threading model. If you have any questions about these tools, please go to http://www.intel.com/software/products/threading and review the product descriptions.
Message Edited by intel.software.network.support on 12-02-2005 08:47 PM