Gregory_G_
Beginner
64 Views

Need urgent help with TBB parallel_for

Hi, I'm having some trouble trying to parallelize a part of my code. It was easy with Cilk++, but I can't manage it with TBB... Maybe someone can help me, please...

Well, I have this code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX 100000

void selection_sort(int *a);

int main(){
    FILE *arq;
    time_t start, stop;
    int    x,i, vet[MAX];

    time(&start);

    arq = fopen("numeros.txt","r");

    i=0;
    while (!feof(arq)){

        fscanf(arq, "%d", &vet[i]);
        i++;
    }
    fclose(arq);

    selection_sort(vet);

    
    printf("Ordem \n");
    for(i=0;i<MAX;i++){
        
        printf("%d \n", vet[i]);
    
    }
    
    time(&stop);

       printf("Finished in about %0.4f seconds. \n", difftime(stop, start));
    return 0;
    
}

void selection_sort(int *a)
{
 int i, j, k, tmp, troca;
 
 for(i = 0; i < MAX-1; i++)
 {
  troca = 0;
  k = i;
  tmp = a[i];
//****** i want to parallel this for loop *******//
  parallel_for(0, MAX, 1, [=](int j) {
   if(a[j] < tmp)
   {
    k = j;
    tmp = a[j];
    troca = 1;
   }
  });
//**********************************************//
  if(troca)
  {
   a[k] = a[i];
   a[i] = tmp;
  }
 }
}

Can anyone help me with how to do it? It was easy with Cilk++, just adding cilk_for on the loop, but with TBB I'm not having any success. I'd be very thankful if you could help me; it's for my final project. Thank you

 

10 Replies
Gregory_G_
Beginner

Sorry, the right for is    for(j = i+1; j < MAX; j++) 

RafSchietekat
Black Belt

How about "parallel_for (i + 1, MAX, [&] (int i) { … });"? You have to add a mutex there, though, and that won't do well for performance at all!

Finding the minimum in a range is a reduction operation, and TBB has a better solution for that: parallel_reduce(). Forget about setting tmp during the reduction, you only need an index. No need for a mutex here.

Hmm, why does parallel_for return Func?

 

Gregory_G_
Beginner

Yeah, I made the parallel_for as you said, and you're right, I didn't get good performance...

I'll try the parallel_reduce, thank you!

Gregory_G_
Beginner

Just so you know, I'm comparing OpenMP, Cilk++ and TBB.

And it was easy with the others, just adding a "cilk_for" or #pragma omp parallel for...

But with TBB it's a bit different.

RafSchietekat
Black Belt

I don't see how just parallelising the loop would make the program work without a mutex, whatever the toolkit. The program probably takes long enough for all threads to get in gear for stealing, and you'll probably still have a long enough inner loop by that time for one of them to succeed, so my first thought would be that you just didn't check, but I do see that output loop, so I'm confused.

And this isn't the most efficient sort anyhow, but I'll assume it's just a little exercise. You'll see the difference if you start and stop time around the sorting itself, and substitute other algorithms. Maybe even serial std::sort() could be faster, let alone tbb::parallel_sort().

Have fun experimenting...

(2013-12-18 Added) Maybe it is all over by the time the workers get launched, after all? I hear it takes longer to launch a thread on Windows...

Gregory_G_
Beginner

Hi, I've been thinking.

I need to compare Cilk, TBB, OpenMP, Pthreads and CUDA, looking at execution time, portability, etc., and I'll run it only on Linux.

And maybe you're right that this isn't the best algorithm for it.

Can you suggest a good algorithm for this comparison?

It must be a simple and easy algorithm.

I hope you can help me, thank you!

jimdempseyatthecove
Black Belt

Why not ask, is there a better parallel sort algorithm?

Is your query based on a lab project for a CS course?

What you have is called an interchange sort, with the optimization that you do not interchange until the completion of the remainder pass.

Parallelizing this type of sort typically involves partitioning the array into number of partitions == number of threads, performing the sort of each partition in parallel, then serially performing an N-way merge.

[cpp]
void SplitSortMerge(int* a, int from, int to, int nThreads)
{
   if(nThreads > 1)
   {
        int halfway = from + ((to-from)/2);
        parallel_invoke(
           [=](){SplitSortMerge( a, from, halfway, nThreads/2);},
           [=](){SplitSortMerge( a, halfway, to, nThreads - nThreads/2);});
           // TBD merge both halves
   }
   else
   {
       // your modified interchange sort here
   }
}
...
int nThreads = getNumberOfThreadsYourself();
SplitSortMerge(a, 0, MAX, nThreads);
[/cpp]

 

jimdempseyatthecove
Black Belt

Note, the above is non-optimal because it is performing more merges than necessary.

To reduce the number of merges, you will need to remember the split points (in a thread-safe manner).

Also, it is non-optimal when the number of threads is not a power of 2. To fix this, you might want to consider a filter with a series of different parallel_invokes:

[cpp]
if(nThreads %2 == 0)
  parallel_invoke( 2-way );
else
if(nThreads %3 == 0)
  parallel_invoke( 3-way );
else
if(nThreads % 5 == 0)
  parallel_invoke( 5-way);
else
...
else
  // default when you are tired of making cases
  parallel_invoke(2-way)
endif
[/cpp]

Jim Dempsey

RafSchietekat
Black Belt

I disagree with these suggestions because exact division does not compose well (other concurrent workloads may cause the parts to be staggered) and because the merge is still a serial bottleneck; see Parallel Stable Sort.

As for Gregory's question, I doubt there's any one algorithm that can realistically be implemented on all these toolkits for a fair comparison, because they all have different strengths and weaknesses. You should try several patterns, probably starting with simple ones (map, reduce), and only then more challenging ones like divide-and-conquer. Modifying existing algorithms by partial parallelisation is not generally a good approach, as shown with the selection sort above.

jimdempseyatthecove
Black Belt

Raf,

If the workload per partition is unequal, you could choose to start the SplitSortMerge with a number larger than the number of threads, or not pass the number of partitions at all and quit splitting when the iteration space (to-from) is less than some cutoff. In this respect it becomes similar to parallel_for with a chunk size. He can do the same thing with Cilk++.

From the "Dirty Harry" movies: "A man's got to know his limitations."

Meaning the requirements of the sort (its limitations) have to be laid out before the choice of method is made. Some of the unstated requirements:

a) Do all the records fit in memory?
b) Are the records variable length?
c) Is this an in-place sort (or has different input and output)?
d) Are duplicates permitted, if so is order in original file required to be maintained in output?
e) Is the output required to be written to file or used in situ?
f) The number of records (no sense in performing a parallel sort on very few records)?
g)...

The "best" sort will depend on the answers to the above questions. (Quicksort, Radixsort, CheckSort, MergeSort, other...)

Jim Dempsey
