I might add to Tim's example of how combining threading with vectorization can benefit performance: when we teach a methodology for threading, we always recommend doing serial optimization (which can include vectorization) before threading, so as to avoid giving false impressions about thread performance scaling. Threading is a great way to hide latency, so poorly optimized programs can show really good scaling.
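A toy illustration of that false impression, as a rough sketch only: the timings below are invented for the example (they are not measurements), and the function and constant names are hypothetical. The point is that scaling measured against an unoptimized serial baseline looks better than the gain over the best serial version.

```c
#include <assert.h>

/* Speedup of a run relative to a chosen baseline (both in seconds). */
double speedup(double baseline_seconds, double run_seconds)
{
    assert(baseline_seconds > 0.0 && run_seconds > 0.0);
    return baseline_seconds / run_seconds;
}

/* Hypothetical timings, invented purely for illustration. */
static const double UNOPT_SERIAL = 100.0; /* unoptimized code, 1 thread   */
static const double UNOPT_4T     = 25.0;  /* unoptimized code, 4 threads  */
static const double OPT_SERIAL   = 50.0;  /* serially optimized, 1 thread */
static const double OPT_4T       = 20.0;  /* serially optimized, 4 threads */

/* Scaling of the unoptimized code against itself: 100 / 25. */
double apparent_scaling(void)
{
    return speedup(UNOPT_SERIAL, UNOPT_4T);
}

/* Scaling of the optimized code against itself: 50 / 20. */
double optimized_scaling(void)
{
    return speedup(OPT_SERIAL, OPT_4T);
}

/* Gain of the threaded unoptimized code over the best serial code: 50 / 25. */
double gain_over_best_serial(void)
{
    return speedup(OPT_SERIAL, UNOPT_4T);
}
```

With these made-up numbers the unoptimized code appears to scale perfectly on four threads, yet it is only twice as fast as the optimized serial code, and the optimized code, despite its less impressive scaling, finishes soonest.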
I haven't tried combining auto-vectorization with auto-parallelization, so I can't offer much advice there, but I can tell you there is a tension between vectorized code and threaded code that varies depending on the architecture on which the program runs. The old Pentium 4 processor with Hyper-Threading Technology could gain up to about 30% on certain applications through the aforementioned latency hiding, but one thread could saturate the floating-point units. Vectorization may increase ALU pressure but also means more memory pressure getting operands into and out of the core. Bus or memory channel bandwidth is another resource that may be under tension between vector processing demands and concurrent thread demands. Given the varying architectural characteristics, the choice of going for one or the other or both can be quite hairy.
Robert/Tim, Thanks.
I did happen to get some ideas from these two articles, "Best Practices for Developing and Optimizing Threaded Applications" Part 1 & Part 2 (
http://software.intel.com/en-us/articles/best-practices-for-developing-and-optimizing-threaded-appli... ). Both Part 1 and Part 2 nicely discuss how to use VTune to analyze sequential code that is being converted to parallel code. But the author of these papers says, "The next paper in this series will discuss techniques used to thread this particular function."
When is this paper supposed to be published or made available to Intel users? I also have another question:
Will these papers make some comparisons between the approaches that can be taken on Nehalem (Core i7) and on some older Intel processors (Intel Xeon CPU X5355, Core 2 Quad, Core 2 Duo, etc.)?
I am looking to apply these approaches to 8,000-10,000 lines of multi-file C/C++ code, probably on Nehalem and on an Intel Xeon X5355, a quad-core server processor. Since Nehalem has Hyper-Threading, would you expect me to get better benefits from both threading and vectorization when using fine-grained parallelism within coarse-grained parallelism?
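To make the question concrete, here is a minimal sketch of the kind of mixed-granularity loop I mean. It assumes OpenMP for the coarse-grained threading over rows and relies on the compiler's auto-vectorizer for the fine-grained inner loop; the function name scale_add_rows and the row-major layout are made up for illustration and do not come from the articles.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[r][c] += a * x[r][c] over a rows x cols matrix stored
   row-major. The name and layout are hypothetical. */
void scale_add_rows(long rows, long cols, float a,
                    const float *restrict x, float *restrict y)
{
    long r;

    assert(x != NULL && y != NULL);
    assert(rows >= 0 && cols >= 0);

    /* Coarse-grained: OpenMP hands each thread a chunk of rows.
       Without OpenMP enabled the pragma is ignored and the loop
       simply runs serially. */
    #pragma omp parallel for
    for (r = 0; r < rows; ++r) {
        const float *xr = x + (size_t)r * (size_t)cols;
        float *yr = y + (size_t)r * (size_t)cols;
        long c;

        /* Fine-grained: unit stride and restrict-qualified pointers,
           so the compiler is free to vectorize this loop. Whether it
           actually does depends on the compiler and its flags. */
        for (c = 0; c < cols; ++c) {
            yr[c] += a * xr[c];
        }
    }
}
```

Each row touches a disjoint part of y, so the threads do not need any synchronization inside the loop. My question is whether this pattern pays off once every thread is also issuing vector loads and stores against the same memory channels.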
Do you have any links or articles that deal with difficult scenarios in which a mix of granularities has to be applied to a section of code?
I know this work will be very tough to implement, at least with any worthwhile performance gain, but it should still be a good learning experience.
~BR
Mukkaysh Srivastav