Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Mix of Granularities

srimks
New Contributor II
Hello All.

The primary goal when programming a multi-core shared-memory processor is to determine the mix of granularities that produces the highest performance.

If that is the case, how is a coarse-grained shared-memory multiprocessor organized so that each individual processor can also exploit fine-grained parallelism through pipelining or multiple instruction issue?


Looking to be better informed!


~BR
gaston-hillar
Valued Contributor I
Quoting - srimks
Hello All.

The primary goal when programming a multi-core shared-memory processor is to determine the mix of granularities that produces the highest performance.

If that is the case, how is a coarse-grained shared-memory multiprocessor organized so that each individual processor can also exploit fine-grained parallelism through pipelining or multiple instruction issue?


Looking to be better informed!


~BR

Hi srimks,

I think the optimal model depends on the kind of application you are working on. I don't quite understand your point. If you provide more specific information, I'll be able to share my experience with parallel computing.

Cheers,

Gastón
srimks
New Contributor II
Quoting - Gastón C. Hillar

Hi srimks,

I think the optimal model depends on the kind of application you are working on. I don't quite understand your point. If you provide more specific information, I'll be able to share my experience with parallel computing.

Cheers,

Gastón
To make it very simple for better understanding: within coarse-grained parallelism (which could be either auto-parallelization or OpenMP threading), how can auto-vectorization or SIMD calls (fine-grained parallelism) be beneficial for a loop?

Do you have some test samples or links that show the performance gains?

~BR


TimP
Honored Contributor III
Quoting - srimks
, "Within Coarse-Grained Parallelism (could be either auto-parallelization or OpenMP threads call), how can call of auto-vectorization or SIMD calls (Fine-Grained) could be beneficial?" for a loop.

Do you have some test samples or links that show the performance gains?

Were you having difficulty finding articles from 20 years ago, when people started churning out papers on this?
ftp.cwi.nl/CWIreports/1992/NM-R9225.pdf

Are you looking for cases where compilers don't do it all automatically, so it gets a bit ugly? Take the netlib vectors benchmark, s126:

      k = 1
      do i = 1, n
        do j = 2, n
          bb(i,j) = bb(i,j-1) + array(k)*cc(i,j)
          k = k + 1
        enddo
        k = k + 1
      enddo

Make it easy for ifort to optimize with OpenMP and registerization of the inner recursion:
!$omp parallel do private(k,tmp) if(n>103)
      do i = 1, n
        k = i*n + 1 - n
        tmp = bb(i,1)
        do j = 2, n
          tmp = tmp + array(k)*cc(i,j)
          bb(i,j) = tmp
          k = k + 1
        enddo
      enddo

The tmp in the inner loop helps Intel compilers perform an optimization which several others (Sun, gnu, ...) apply automatically, so with those compilers you don't need to make it visible.

Add SIMD parallelism by strip mining, taking 4 outer loop iterations at a time in the inner loop:
i__2 = *n;
#pragma omp parallel for if(i__2 > 103)
for (i__ = 1; i__ <= i__2; i__ += 4) {
    int i__3 = *n;
    int k = i__ * i__3 - i__3;
    /* running recurrence values for the 4 strip-mined outer iterations */
    __m128 tmp = _mm_loadu_ps(&bb[i__ + bb_dim1]);
    for (int j = 2; j <= i__3; ++j) {
        /* gather array(k) for each of the 4 outer iterations */
        __m128 tmp1 = _mm_set_ps(cdata_1.array[k + 3*i__3],
                                 cdata_1.array[k + 2*i__3],
                                 cdata_1.array[k + 1*i__3],
                                 cdata_1.array[k + 0*i__3]);
        __m128 tmp2 = _mm_loadu_ps(&cc[i__ + j * cc_dim1]);
        tmp = _mm_add_ps(tmp, _mm_mul_ps(tmp1, tmp2));
        /* unaligned store to match the unaligned loads */
        _mm_storeu_ps(&bb[i__ + j * bb_dim1], tmp);
        ++k;
    }
}

(this shows an advantage for the SSE4 version of _mm_set_ps)
Note that the inner SIMD optimization is good down to smaller problem sizes than the outer OpenMP threading.
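If it helps to see where that threshold bites, here is a minimal timing sketch (a plain C restatement of the registerized loop, with made-up array bounds and repetition counts; not the exact test behind the observations above):

/* build, e.g.: gcc -O3 -fopenmp s126_time.c */
#include <stdio.h>
#include <omp.h>

#define NMAX 1024

static float bb[NMAX][NMAX], cc[NMAX][NMAX], array[NMAX * NMAX];

/* Threaded, registerized form of s126.  The inner j loop is a true
   recurrence through tmp, so SIMD has to come from strip-mining the
   outer i loop as in the intrinsics version; here only the OpenMP
   threshold is shown. */
static void s126_kernel(int n)
{
#pragma omp parallel for if(n > 103)
    for (int i = 0; i < n; ++i) {
        int k = i * n;
        float tmp = bb[i][0];
        for (int j = 1; j < n; ++j) {
            tmp += array[k] * cc[i][j];
            bb[i][j] = tmp;
            ++k;
        }
    }
}

int main(void)
{
    for (int n = 64; n <= NMAX; n *= 2) {
        double t0 = omp_get_wtime();
        for (int rep = 0; rep < 100; ++rep)
            s126_kernel(n);
        /* total seconds / 100 reps, reported in ms, plus a checksum */
        printf("n=%5d  %8.3f ms/call  (bb=%g)\n",
               n, (omp_get_wtime() - t0) * 10.0, bb[0][n-1]);
    }
    return 0;
}

Swapping the strip-mined intrinsics body into s126_kernel lets you compare all three variants across problem sizes.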

The fashion of the last few years has been to avoid the combined inner-vector/outer-parallel scheme that was so fashionable 20 years ago, as the inner vectorization might detract from bragging rights about threaded parallel performance scaling, even though it remains the way to optimize performance, if only by an additional 50% on earlier CPUs. So you will be able to find a few publications which simply ignore the subject.
robert-reed
Valued Contributor II
I might add to Tim's example of how combining threading with vectorization can benefit performance: when we teach a methodology for threading, we always recommend doing serial optimization (which can include vectorization) before threading, just to avoid giving false impressions about thread performance scaling. Threading is a great way to hide latency, so poorly optimized programs can show really good scaling.

I haven't tried combining auto-vectorization with auto-parallelization, so I can't offer much advice there, but I can tell you there is a tension between vectorized code and threaded code that varies with the architecture on which the program runs. The old Pentium 4 processor with Hyper-Threading Technology could gain up to about 30% on certain applications through the aforementioned latency hiding, but one thread could saturate the floating-point units. Vectorization may increase ALU pressure, but it also means more memory pressure getting operands into and out of the core. Bus or memory channel bandwidth is another resource that may be under tension between vector processing demands and concurrent thread demands. Given the varying architectural characteristics, the choice of going for one or the other or both can be quite hairy.
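As a toy illustration of that bandwidth tension (a hypothetical kernel and sizes, not a benchmark from this thread): a streaming loop like the one below is usually limited by memory traffic rather than ALU throughput, so its vectorized and threaded forms can both end up queued behind the same saturated memory channel.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* STREAM-style triad: two loads and one store per multiply-add, so
   the memory channel, not the FP units, is typically the bottleneck */
static void triad(float *a, const float *b, const float *c,
                  float scalar, int n)
{
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];
}

int main(void)
{
    int n = 1 << 24;                      /* 64 MB per array */
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);
    for (int i = 0; i < n; ++i) { b[i] = 1.0f; c[i] = 2.0f; }
    double t0 = omp_get_wtime();
    triad(a, b, c, 3.0f, n);
    printf("%.3f s with %d threads (a[0]=%g)\n",
           omp_get_wtime() - t0, omp_get_max_threads(), a[0]);
    free(a); free(b); free(c);
    return 0;
}

Comparing OMP_NUM_THREADS=1 against all cores on such a loop typically yields a speedup well short of the core count, vectorized or not.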
srimks
New Contributor II
Quoting - robert-reed

I might add to Tim's example of how combining threading with vectorization can benefit performance: when we teach a methodology for threading, we always recommend doing serial optimization (which can include vectorization) before threading, just to avoid giving false impressions about thread performance scaling. Threading is a great way to hide latency, so poorly optimized programs can show really good scaling.

I haven't tried combining auto-vectorization with auto-parallelization, so I can't offer much advice there, but I can tell you there is a tension between vectorized code and threaded code that varies with the architecture on which the program runs. The old Pentium 4 processor with Hyper-Threading Technology could gain up to about 30% on certain applications through the aforementioned latency hiding, but one thread could saturate the floating-point units. Vectorization may increase ALU pressure, but it also means more memory pressure getting operands into and out of the core. Bus or memory channel bandwidth is another resource that may be under tension between vector processing demands and concurrent thread demands. Given the varying architectural characteristics, the choice of going for one or the other or both can be quite hairy.

Robert/Tim, Thanks.

I did happen to get some ideas from the two articles "Best Practices for Developing and Optimizing Threaded Applications", Part 1 & Part 2 ( http://software.intel.com/en-us/articles/best-practices-for-developing-and-optimizing-threaded-applications-part-1/ ). Both Part 1 & 2 discuss nicely how to use VTune for analyzing the parallelization of sequential code... But the author of these papers says, "The next paper in this series will discuss techniques used to thread this particular function." As an Intel user, I am not sure when this paper is supposed to be published or made available. Also, a query: will these papers make some comparisons between the approaches as done on Nehalem (Core i7) and on some older Intel processors (Intel Xeon CPU X5355, Core 2 Quad, Core 2 Duo, etc.)?


I am looking to apply this approach to 8,000-10,000 lines of multi-file C/C++ code, probably on Nehalem and on an Intel Xeon X5355, a quad-core server processor. Since Nehalem has Hyper-Threading, would you suggest that I would get both better threading and better vectorization benefits by using fine-grained parallelism within coarse-grained parallelism? Do you have some links or articles where daunting scenarios have been handled, in which a mix of granularities had to be applied to a section of code?


I know I will have a very tough time implementing this work with even a modest performance gain, but it should still be a good learning experience.


~BR
Mukkaysh Srivastav
TimP
Honored Contributor III
As Hyper-Threading doesn't increase arithmetic unit resources, examples such as the one above, where it is possible to achieve high arithmetic instruction rates, don't tend to gain from HT. This is only one of the ways in which HT complicates the range of possibilities.
In a medium to large application, chances are good that you will see a variety of characteristics in the time-consuming sections. I think you've heard before the recommendation to use profiling tools to identify and characterize those sections where tuning work could pay off.
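A crude way to see that for yourself (a made-up micro-kernel, not measured data): give each thread several independent multiply-add chains so that one thread per core already keeps the FP pipelines close to saturation, then compare one thread per core against two. On a quad-core HT machine, if the 8-thread run takes roughly twice as long as the 4-thread run, HT bought nothing for this kind of code.

#include <stdio.h>
#include <omp.h>

/* Four independent multiply-add chains per thread keep the FP units
   near saturation, the situation where HT tends not to help. */
static double burn(long iters)
{
    double s = 0.0;
#pragma omp parallel reduction(+ : s)
    {
        double x0 = 1.0, x1 = 1.1, x2 = 1.2, x3 = 1.3;
        for (long i = 0; i < iters; ++i) {
            x0 = x0 * 1.0000001 + 1e-9;
            x1 = x1 * 1.0000001 + 1e-9;
            x2 = x2 * 1.0000001 + 1e-9;
            x3 = x3 * 1.0000001 + 1e-9;
        }
        s += x0 + x1 + x2 + x3;
    }
    return s;
}

int main(void)
{
    /* per-thread work is fixed, so flat time from 4 to 8 threads would
       mean HT doubled throughput; doubled time means it added nothing */
    int counts[] = { 1, 4, 8 };
    for (int t = 0; t < 3; ++t) {
        omp_set_num_threads(counts[t]);
        double t0 = omp_get_wtime();
        double r = burn(100000000L);
        printf("%d threads: %6.3f s (checksum %g)\n",
               counts[t], omp_get_wtime() - t0, r);
    }
    return 0;
}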
srimks
New Contributor II
Quoting - tim18
As Hyper-Threading doesn't increase arithmetic unit resources, examples such as the one above, where it is possible to achieve high arithmetic instruction rates, don't tend to gain from HT. This is only one of the ways in which HT complicates the range of possibilities.
In a medium to large application, chances are good that you will see a variety of characteristics in the time-consuming sections. I think you've heard before the recommendation to use profiling tools to identify and characterize those sections where tuning work could pay off.

I do use both Intel VTune and Thread Checker. As NHM (Core i7) has HT, how is thread-level parallelism (TLP) for HT handled within its cores (assuming NHM is a single-node/SMP, quad-core, eight-threaded, 64-bit, 4-issue superscalar, out-of-order MPU with a 16-stage pipeline and 48-bit virtual / 40-bit physical addressing; do correct me)? There must be some data-flow concepts within NHM that use control parallelism among its threads to fully utilize the functional units. How does such a data-flow processor detect which threads to execute in parallel and resolve dependencies on its own so as to achieve effective thread parallelism?

I am asking the above so that I can finally reason conclusively about DLP (data-level parallelism) within TLP for any data-flow processor, and also about the limitations, if any, of ILP (instruction-level parallelism) within TLP.

~BR
