Re: scaling of merge sort algo

Sharada_Kulkarni · ‎06-05-2009

#include

#include
#include
#include

#define MAX 10000000

void mergesort(float a[],int,int);
void merge(float a[],int,int,int);

float a[MAX],b[MAX];
char ar[MAX];

int main()
{
FILE *fp;
int i,n=0,j;
clock_t start,stop;
double startTime,endTime,wt;
int iCPU=omp_get_num_procs();

omp_set_num_threads(iCPU);
printf("\nNo of cpu's=%d\n",iCPU);

printf("Parallel version of Merge sort.\nEnter the data file name :\n");
scanf("%s",ar);
if((fp=fopen(ar,"r"))==NULL)
{
printf("Can't open file.\n");
exit(1);
}
i=fscanf(fp,"%f",&a);
while(i != EOF)
{
//printf("%f,",a);
n++;
i=fscanf(fp,"%f",&a);
}
fclose(fp);

startTime=omp_get_wtime();
start=clock();
mergesort(a,0,n-1);
stop=clock();

endTime=omp_get_wtime();
printf("\nNo of elements=%d\nstart time=%10.9f",n,((float)(start)/ CLOCKS_PER_SEC));
printf("\nstop=%10.9f",((float)(stop)/CLOCKS_PER_SEC));
printf("\nElapsed time=%10.9f",(double)(endTime-startTime));
printf("\nTime taken :%10.9f secs\n",(((float)(stop-start))/CLOCKS_PER_SEC));
}

void mergesort(float a[],int low,int high)
{
int mid;

if (low{
mid=(low+high)/2;
#pragma omp parallel
{
#pragma omp sections nowait
{
#pragma omp section
mergesort(a,low,mid);
#pragma omp section
mergesort(a,mid+1,high);
}}
merge(a,low,mid,high);
}
}
void merge(float a[],int low,int mid,int high)
{
int i,h,j,k;
h=i=low;
j= mid + 1;

while((h<=mid) && (j<=high))
if (a)
b[i++]=a[h++];
else
b[i++]=a[j++];

if (h>mid)
for(k=j;k<=high;k++)
b[i++]=a;
else
for(k=h;k<=mid;k++)
b[i++]=a;

for(k=low;k<=high;k++)
a=b;
}

Can anybody pl. tell me why I am not getting scaling with this code. I am using intel C++ (windows) on dual core machine.
Thanx in advance.

Om_S_Intel · ‎06-05-2009

I think we need some data file to test this. Could you please provide the same? You can attach the data file to this issue.

Thanks,

Om

Om_S_Intel · ‎06-05-2009

You may analyze this code using Intel Thread profiler or Intel Parallel Amplifier.

Sharada_Kulkarni · ‎06-05-2009

Quoting - Om Sachan (Intel)

You may analyze this code using Intel Thread profiler or Intel Parallel Amplifier.

Ya ! as you said I have attached a data file of one lack nos. Just check now
Thanx Om..

jimdempseyatthecove · ‎06-05-2009

Have you verified your setup for OpenMP is creating threads? (options right, environment variable right etc...)
Hint, place a break in the different sections and see if you enter the section from a different thread (first iteration).

Second, your system has two hardwarethreads, you have a recursive algorithm designed to spawn threads as it nests deeper, assuming you have nested OpenMP levels enabled (on your 2 core system). If you have nested levels disabled, then on second level entry to mergesort goes through the OpenMP parallel then sections only to determine it cannot create a new thread pooland must run serial. This is consuming run time to make this decision.

If you have nested levels enabled, and if OpenMP is creating dozens (or more)of threads, then excessive overhead occures in task switching.

Recursive threading algorithims work well only when you take the effort NOT to schedule threads when additional threads are not available. In OpenMP this is problematic.

A reasonably good solution is to enter a parallel region, obtain the number of threads, split the range into that number of pieces, then each thread takes one piece and then calls your recursive routine (without thread splitting) to perform the sort.

Jim Dempsey

TimP · ‎06-05-2009

Quoting - jimdempseyatthecove

Have you verified your setup for OpenMP is creating threads? (options right, environment variable right etc...)
Hint, place a break in the different sections and see if you enter the section from a different thread (first iteration).

Second, your system has two hardwarethreads, you have a recursive algorithm designed to spawn threads as it nests deeper, assuming you have nested OpenMP levels enabled (on your 2 core system). If you have nested levels disabled, then on second level entry to mergesort goes through the OpenMP parallel then sections only to determine it cannot create a new thread pooland must run serial. This is consuming run time to make this decision.

If you have nested levels enabled, and if OpenMP is creating dozens (or more)of threads, then excessive overhead occures in task switching.

Alternate hint (if you don't want to install Thread Profiler):
export LD_PRELOAD=/lib/libiompprof5.so
run
view guide.gvs
It definitely makes threads, each spending as much time as the serial run.
with OMP_NESTED set, it recurses until OMP_STACK is exhausted.

jimdempseyatthecove · ‎06-05-2009

Quoting - tim18

Alternate hint (if you don't want to install Thread Profiler):
export LD_PRELOAD=/lib/libiompprof5.so
run
view guide.gvs
It definitely makes threads, each spending as much time as the serial run.
with OMP_NESTED set, it recurses until OMP_STACK is exhausted.

Certainly

Consider having a paltry 65,536 number of records. The sort (as written) would split to 16 levels and require 32,768 +/- treads! (each thread, and following generation of threads,performing splits until it has 2 records)

Algo's cannot blindly divide to conquer (as they usuallyend up dividing and fail). The division process has to be conditional based on the availability of hardware threads to perform the process.

The merge sort (as well as a parallel recursive Fibonacci series) are good learning examples. These angorithms can be done in parallel, but only when the code is sensitive to the available resources. BTW recursive algorithms tend to be easy to write but also tend to be much slower than looping iterative methods.

The parallel merge sort will work well with a one time divide N-ways (by number of available HW threads) then perform sort without further division.

When you have a means to determine the availability of additional threads after the N-way split (part way through the sort) then each thread could determine if a split is warranted. e.g. split only when additional thread available .and remaining count is .gt. x.

Jim Dempsey

Sharada_Kulkarni · ‎06-05-2009

Quoting - jimdempseyatthecove

Certainly

Consider having a paltry 65,536 number of records. The sort (as written) would split to 16 levels and require 32,768 +/- treads! (each thread, and following generation of threads,performing splits until it has 2 records)

Algo's cannot blindly divide to conquer (as they usuallyend up dividing and fail). The division process has to be conditional based on the availability of hardware threads to perform the process.

The merge sort (as well as a parallel recursive Fibonacci series) are good learning examples. These angorithms can be done in parallel, but only when the code is sensitive to the available resources. BTW recursive algorithms tend to be easy to write but also tend to be much slower than looping iterative methods.

The parallel merge sort will work well with a one time divide N-ways (by number of available HW threads) then perform sort without further division.

When you have a means to determine the availability of additional threads after the N-way split (part way through the sort) then each thread could determine if a split is warranted. e.g. split only when additional thread available .and remaining count is .gt. x.

Jim Dempsey

Thanx Jim.
I wil first find available threads and then sort .