Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

QuickSort OpenMP Parallel

zhangzhe65
Beginner
1,368 Views
why QuickSort OpenMP Parallel
int _tmain(int argc, _TCHAR* argv[])
{


begin=clock();///CLK_TCK;

quicksort(R,0,N-1);

printf("");
for(i=0;i printf("%d ",R);
printf("\n");
end=clock();//CLK_TCK;
printf("time=%lf\n",end-begin);


return 0;
}

int partition(int R[],int l,int h)
{
int i,j,temp;
i=l;j=h;temp=R;
do
{
for(;(R>=temp)&&(i if(i;
for(;(R<=temp)&&(i if(i;
}while(i!=j);
R=temp;
return i;
}



void quicksort(int R[],int s1,int t1)
{
int i;
if(s1 {
i=partition(R,s1,t1);
//#pragma omp parallel sections
{
#pragma omp section
quicksort(R,s1,i-1);
#pragma omp section
quicksort(R,i+1,t1);
}
}
}
why parallel time is slower than serial time ? please tell me whether there are better time funtions than clock() or not ? Thank you
0 Kudos
8 Replies
TimP
Honored Contributor III
1,368 Views
clock(), on many platforms, adds up time spent in all threads, so the usual objective for threaded parallel is to reduce elapsed time at the cost of some increase in the time measured by clock(). Your OpenMP should make a suitable choice of timing function in omp_get_wtime() (typically, gettimeofday()), or, for finer resolution, you can use rdtsc counter with precautions against comparing different timers on a multiple socket platform.
0 Kudos
pvonkaenel
New Contributor III
1,368 Views
Quoting - tim18
clock(), on many platforms, adds up time spent in all threads, so the usual objective for threaded parallel is to reduce elapsed time at the cost of some increase in the time measured by clock(). Your OpenMP should make a suitable choice of timing function in omp_get_wtime() (typically, gettimeofday()), or, for finer resolution, you can use rdtsc counter with precautions against comparing different timers on a multiple socket platform.

On Windows I tend to use QueryPerformanceCounter() with QueryPerformanceFrequency() for fine resolution timings. RDTSC is nice, but you need to use asm to access it, and as Tim mentioned, you need to make sure all calls to it come from the same socket.

Tim, I have been assuming that multiple cores on a single die share the same RDTSC timer. Do you know if this is correct?

Peter
0 Kudos
TimP
Honored Contributor III
1,368 Views
Microsoft and Intel C/C++ compilers for x86/x64 provide an intrinsic __rdtsc() so it is not necessariy to know the asm sequences. I don't know if the SSE4.2 option might in the future switch it over to the new equivalent instruction, or whether the old rdtsc will continue indefinitely to give equivalent results. There is a single timer shared by multiple cores on a CPU.
On Intel 64-bit CPUs (from Prescott on) rdtsc actually counts FSB ticks, and on "Intel microarchitecture code-named Nehalem," QPI ticks, with the reported result scaled in accordance with the nominal core speed. For example, the rdtsc rate doesn't speed up under Turbo mode.
Power-on BIOS logic synchronizes the timers on a multiple socket platform, so they shouldn't differ by more than a few FSB ticks, as long as they share the same FSB clock. So, on ordinary dual CPU platforms, the discrepancy introduced by accidental comparison of the 2 different timers would be within the time delay of switching threads.
0 Kudos
Thomas_W_Intel
Employee
1,368 Views

Concerning the question, why the serial version is faster than the parallel, you might need to have a closer look at the overhead. The first time the parallel section is executed, the OpenMP run-time environment starts the thread pool, which needs a considerable amount of time. For time measurements, it is therefore advisable to havea warm-up call to the parallel routine before the time measurement.

Additionally, you should take into account that parallelization is only beneficial for large data sets. If you sort only 100 integers, you are probably better off using a serial algorithm. If you haven't done so already, you should choose a big data set to time your sort routine. Additionally, you shouldswitch tothe serial variant in your recursion, if the data sets become too small.

You should also be aware that the maximal possible speed-up of a parallel quicksort algorithm is limited by the first partitioning, which is serial. Amdahl's law might bite you, especially if you are running on a machine with a lot of cores.

If you are completely stuck, the Intel Thread Profiler can help you in visualizing the thread behavior.

Kind regards
Thomas

0 Kudos
pramodblackbird
Beginner
1,368 Views
working on a quick sort parallel prgm with increased efficiency...once completed, i'll put it on ISN,...

0 Kudos
teeks99
Beginner
1,368 Views
When I do timing tests I always try to keep the I/O (printf's in this case) out of the area that is timed. They can take a (relatively) large amount of time, and the time they take can be somewhat arbitrary (based on how full buffers are, the status of the console, etc). It looks like you're actually trying to time the quicksort() call, so just put your timer around that.


0 Kudos
robert-reed
Valued Contributor II
1,368 Views
Quoting - pramodblackbird
working on a quick sort parallel prgm with increased efficiency...once completed, i'll put it on ISN,...

Are you aware that Intel already provides a parallel quick sort as part of the Open Source/commercial Intel Threading Building Blocks? The parallel_sort function recursively partitions the sort range and then spawns work on those partitions as separate tasks, scheduled in the Intel TBB thread pool.
0 Kudos
pramodblackbird
Beginner
1,368 Views
thnx a lot...

0 Kudos
Reply